Author Archive

SPARQL Hacks: moving query logic into data

There are too many terms that mean the same thing sometimes. Take labels. rdfs:label is perhaps the most obvious choice if you want to label something in RDF, but there are a whole bunch of semantically equivalent predicates in high usage for doing the same thing. For a while, it seems, it was common practice for every vocabulary to define their own equivalent – though very few bother to rdfs:subPropertyOf rdfs:label (and some predate rdfs:label), so even if you can do some reasoning in your query engine, this might not help you much. So when you want to get the label for something, but you don’t know which predicate the data uses, you might end up doing something like this:


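prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix sioc: <http://rdfs.org/sioc/ns#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
construct { ?s rdfs:label ?l }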
where
{
?s ?p ?o
optional
{ ?s rdfs:label ?l }
optional
{ ?s foaf:name ?l }
optional
{ ?s sioc:name ?l }
optional
{ ?s dc:title ?l }
optional
{ ?s dcterms:title ?l }
}

Nasty. And maybe later you find another label predicate in the data somewhere and have to go modify your queries.

But, if I add these triples to my store:


<#a> rdfapp:labelPredicate dc:title, rdfs:label, dcterms:title, foaf:name, sioc:name .

I can instead do:


prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdfapp: <http://kwijibo.talis.com/vocabs/rdfapp#>
construct { ?s rdfs:label ?l }
where
{
<#a> rdfapp:labelPredicate ?labelPredicate .
?s ?labelPredicate ?l .
}
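To make the join explicit, here is a minimal Python sketch of what that query does, using plain tuples as triples instead of a real store. The `CONFIG` URI is a hypothetical stand-in for `<#a>`, and the example resources are made up for illustration.

```python
# Triples are (subject, predicate, object) tuples; the example.com URIs are made up.
RDFAPP_LABEL_PREDICATE = "http://kwijibo.talis.com/vocabs/rdfapp#labelPredicate"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
CONFIG = "http://example.com/config#a"  # hypothetical stand-in for <#a>

triples = {
    # The list of label predicates lives in the data, not in the query logic.
    (CONFIG, RDFAPP_LABEL_PREDICATE, RDFS_LABEL),
    (CONFIG, RDFAPP_LABEL_PREDICATE, FOAF_NAME),
    (CONFIG, RDFAPP_LABEL_PREDICATE, DC_TITLE),
    # Ordinary instance data.
    ("http://example.com/alice", FOAF_NAME, "Alice"),
    ("http://example.com/report", DC_TITLE, "Annual Report"),
    ("http://example.com/report", "http://example.com/pages", "12"),
}


def construct_labels(triples, config):
    """Mimic the CONSTRUCT: join the configured label predicates against the data."""
    label_predicates = {
        o for s, p, o in triples
        if s == config and p == RDFAPP_LABEL_PREDICATE
    }
    return {(s, RDFS_LABEL, o) for s, p, o in triples if p in label_predicates}


labels = construct_labels(triples, CONFIG)
```

Supporting another label predicate then means adding one more triple about `CONFIG`; `construct_labels` itself does not change.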

voiD stores and Interesting Queries

Amongst the best incentives for data authors are applications that use that data. One sort of data that especially interests me is dataset metadata, for which the voiD vocabulary was developed; I think this kind of data has the potential to enable the future generation of web apps to join together the ever-growing web of data in wild and exciting new ways. So I was pretty pleased when I saw the voiD store from RKB Explorer. This store provides a SPARQL endpoint over all the voiD descriptions RKB Explorer have produced about their datasets, plus some descriptions they’ve gathered about other datasets. It also provides a list of source documents, sample queries, and a service that takes a list of URIs, and returns a list of SPARQL endpoints that might be able to return triples about them.

This, together with a rainy weekend, prompted me to try out some simple voiD-related things I’d been thinking of. I’ve also been aggregating voiD data in one of my dev stores. This is done partly by creating templated descriptions from a list of Talis Platform stores and poking at them with some SPARQL queries. The rest of the data I found either manually, or by querying Sindice for a list of void:Dataset URIs found in the documents they’ve crawled.

The Sindice API allows you to specify triple patterns with wildcards, such as * rdf:type void:Dataset, and returns the results as an Atom feed. I page through the results, importing the RDF from each URI into my store.

One of my favourite terms from voiD is void:uriRegexPattern, which can be used to indicate that if a URI matches the pattern, the dataset might contain some triples about that URI. You can find the datasets whose pattern matches a given URI with a bit of SPARQL:

    
prefix void: <http://rdfs.org/ns/void#>
DESCRIBE ?dataset {
     ?dataset void:uriRegexPattern ?regex ; void:sparqlEndpoint ?sparql ; a void:Dataset .

    FILTER(REGEX("http://example.com/my/uri", ?regex))
}

    

The novel thing here is that normally, when you use REGEX() in SPARQL, you put a variable binding in the first parameter position, and hardcode a regular expression into the query in the 2nd position. Here though, the regex is in the data, and it is the string against which it is evaluated which is hardcoded, and the variable binding contains the regex. (Unfortunately, while this works with ARQ, it doesn’t appear to work with 3Store – which is perhaps why the rkbexplorer voiD Store provides this as a separate web service).
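The same inversion can be sketched outside SPARQL. In this minimal Python example the dataset descriptions are hypothetical, and Python's `re` module stands in for SPARQL's REGEX(). The two regex dialects are not identical, so treat it as an illustration of the idea and not as a drop-in replacement.

```python
import re

# Hypothetical voiD descriptions:
# (dataset URI, void:uriRegexPattern, void:sparqlEndpoint)
datasets = [
    ("http://example.com/void#my",
     r"^http://example\.com/my/.+",
     "http://example.com/sparql"),
    ("http://example.org/void#other",
     r"^http://example\.org/.+",
     "http://example.org/sparql"),
]


def endpoints_for(uri, datasets):
    """Return the endpoints of datasets whose stored pattern matches the URI.

    The regex comes from the data; the string it is tested against is the input.
    """
    return [
        endpoint
        for _dataset, pattern, endpoint in datasets
        if re.search(pattern, uri)
    ]


matches = endpoints_for("http://example.com/my/uri", datasets)
```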

So, I’ve used this to create a page that will take a URI, and query my voiD store for void:sparqlEndpoints and void:uriLookupEndpoints, which it will then call to retrieve triples and render them on the page. Here is a query for the URI http://climb.dataincubator.org/dataset .

Another query that interested me, which has become possible since the Platform introduced support for the COUNT() function from SPARQL 1.1, is, which are the most commonly used vocabularies? (SIOC and FOAF so far! – though this is because I generated many of these triples based on scripted prodding of endpoints with ASK queries) But then I wanted to be able to see easily which datasets used which vocabularies, so I created some pages to let me browse datasets by vocabulary.

  1. SIOC Core Ontology Namespace (54)
  2. Friend of a Friend (FOAF) vocabulary (42)
  3. Coreference Ontology (35)
  4. http://www.aktors.org/ontology/portal# (34)
  5. http://www.aktors.org/ontology/support# (30)
  6. http://www.rkbexplorer.com/ontologies/resist# (30)
  7. void (25)
  8. http://purl.org/NET/scovo# (24)
  9. http://acm.rkbexplorer.com/ontologies/acm# (22)
  10. http://courseware.rkbexplorer.com/ontologies/courseware# (21)
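A ranking like this boils down to counting datasets per vocabulary. The sketch below assumes the descriptions use void:vocabulary to link a dataset to the vocabularies it uses, and it counts over a handful of made-up triples and not the real store.

```python
from collections import Counter

VOID_VOCABULARY = "http://rdfs.org/ns/void#vocabulary"

# Hypothetical (dataset, predicate, vocabulary) triples, for illustration only.
triples = [
    ("http://example.com/ds/a", VOID_VOCABULARY, "http://rdfs.org/sioc/ns#"),
    ("http://example.com/ds/a", VOID_VOCABULARY, "http://xmlns.com/foaf/0.1/"),
    ("http://example.com/ds/b", VOID_VOCABULARY, "http://rdfs.org/sioc/ns#"),
    ("http://example.com/ds/c", VOID_VOCABULARY, "http://rdfs.org/sioc/ns#"),
    ("http://example.com/ds/c", VOID_VOCABULARY, "http://xmlns.com/foaf/0.1/"),
    ("http://example.com/ds/c", VOID_VOCABULARY, "http://purl.org/NET/scovo#"),
]


def vocabulary_usage(triples):
    """Count how many distinct datasets declare each vocabulary."""
    pairs = {(s, o) for s, p, o in triples if p == VOID_VOCABULARY}
    return Counter(vocabulary for _dataset, vocabulary in pairs)


def datasets_using(triples, vocabulary):
    """List the datasets that declare a given vocabulary (browse by vocabulary)."""
    return sorted(
        s for s, p, o in triples if p == VOID_VOCABULARY and o == vocabulary
    )


ranking = vocabulary_usage(triples).most_common()
```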

Then I made some pages to do the same thing with dct:subjects. Here, the largest category by some way, is category: online_social_networking. This is because I generated ?dataset dct:subject <http://dbpedia.org/resource/Category:Online_social_networking> . triples automatically for all the platform stores which made a certain use of terms from the SIOC ontology.

These automatically generated voiD descriptions will not, of course, present such a balanced picture of what is out there, and skew the results somewhat. The most interesting descriptions are those which are handcrafted to some extent, describing something of the nature of the dataset’s domains.

I’ve also provided a form for submitting voiD URLs to. My hope is that this simple application, together with the rkbexplorer voiD Store, might encourage more people to describe their linked data datasets with voiD, or perhaps add more detail to the descriptions they already publish, in order to see their dataset come up in the appropriate queries. And I hope that this, in turn, will encourage others to build more sophisticated and exciting applications using that data.

Vocamp Glasgow 2009

This week saw the first Vocamp in Scotland, held at the University of Strathclyde, Glasgow.

Attendees came from a wide range of different and interesting problem-spaces and domains and gave a lot of great presentations on their work. The range was too broad, perhaps, for us to find enough commonality to collaborate on creating/fixing any vocabularies (the focus of the previous vocamps I’ve attended), but it was great to have together so many people with an interest in the semantic web in the locality, and the presentations were all really good.

Jeff Pan and Edward Thomas from Aberdeen University presented some great tutorials that covered a lot of ground, from RDFa and OWL 2 to data-modeling methodology with Protégé.
Jeff Pan on OWL 2. (I especially liked the slide explaining how machines understand markup.)

Norman Gray and Stuart Chalmers presented their work on creating SKOS mappings between astronomy vocabularies.

Norman Gray on vocabulary mapping with SKOS

Jenny Ure from Edinburgh University talked about some of her work on the Socio-technical aspect of collaborative ontologies and knowledge systems.

Jenny Ure

Peter Winstanley talked about some of the data curated by the Scottish Government, and showcased Semantic Mediawiki for ontology development, and some different options for ontology visualisation.

Peter also pointed to the Communities Of Practice for local Government Scottish Group: Shared Representation using Semantic Technologies, inviting anyone with an interest in Semantic technologies to join and contribute to the discussion forums.

Peter Winstanley on Ontology visualisation and Scottish Gov Data

Serge Boucher from Brussels talked about some of the exciting possibilities for location and context-aware semantic web services.

Serge Boucher on Location Based Semantic Services

Gordon Dunsire from the Centre for Digital Library Research presented on vocabularies, standards, and linked data in the library domain, making particular mention of the dramatic tale of the development of the Library of Congress Subject Headings Dataset.

Gordon Dunsire on Linked Data, vocabularies, and library metadata

Martin Dempster from University of Dundee presented his research into Assistive Technologies helping people that have difficulties talking to communicate, his use of ontologies to manage the data in his prototype system, and consuming data from popular social web 2.0 sites to generate conversational choices.

Martin Dempster on Semantic enhanced Assistive Technology

The event was hosted and facilitated by Paola Di Maio from the University of Strathclyde; thanks to Paola for organising the event, the university for laying on wifi and tea and coffee, and Talis for sponsoring the lunches.

Data Migration using SPARQL and Changesets

A tagline sometimes used for RDF is “self-describing data”. Sometimes though, you make your data describe itself badly; perhaps you’ve used a vocabulary term that has since been deprecated, or perhaps you’ve found a term which is more widely supported, or more appropriate to the data; maybe there was a typo in the script you generated your triples with. At any rate, it’s pretty common to have to fix your data, and if you have a live application with fresh ‘bad’ triples being created all the time, and a lot of bad triples to fix anyway, this can get tricky.

We’re having to do this in a project at the moment, and this is the method Nad and I came up with. We separated out adding good triples and removing bad triples into separate stages because our application will continue to function the same with both good and bad triples, but, once we roll out the code that expects the good triples, we need the good triples, whereas the bad triples can be removed at our leisure.

Adding New Good Triples

So, say for example, one of the things we want to fix is using dcterms:creator instead of dc:creator. We can get the good triples by querying for:

    CONSTRUCT {
    # good
                ?s <http://purl.org/dc/terms/creator> ?o
    } WHERE {
    # bad
                ?s <http://purl.org/dc/elements/1.1/creator> ?o
    }

And posting that back into the store. (First make sure that your application won’t do anything too weird if you add in these triples without changing any code).

If there are a lot of triples in the store, you may not be able to retrieve and post them all at once. To scale to large numbers of triples, just page through the results at, say, 1000 triples at a time by adding LIMIT 1000 OFFSET 0 and incrementing the OFFSET by the LIMIT (1000) until you don’t get triples back anymore.
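The paging loop could be sketched like this in Python. The `run_query` callable is an assumption: it stands for whatever sends a query to your store and returns the triples. Here it is faked with an in-memory list so the sketch runs on its own. SPARQL only guarantees a stable order across pages if the query has an ORDER BY, so you may want to add one.

```python
import re


def page_through(run_query, base_query, limit=1000):
    """Yield batches of results, advancing OFFSET by LIMIT until a page is empty."""
    offset = 0
    while True:
        batch = run_query(f"{base_query} LIMIT {limit} OFFSET {offset}")
        if not batch:
            break
        yield batch
        offset += limit


# Stand-in for a real SPARQL endpoint: 2500 made-up results held in memory.
FAKE_RESULTS = [(f"s{i}", f"o{i}") for i in range(2500)]


def fake_run_query(query):
    """Pretend to run the query by slicing the in-memory list."""
    limit = int(re.search(r"LIMIT (\d+)", query).group(1))
    offset = int(re.search(r"OFFSET (\d+)", query).group(1))
    return FAKE_RESULTS[offset:offset + limit]


batch_sizes = [
    len(batch)
    for batch in page_through(fake_run_query, "CONSTRUCT { ... } WHERE { ... }")
]
```

In a real script each batch would be posted back into the store before fetching the next one.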

Wrap this little procedure up in a script because you’ll need to run it again.

Deploy Code

As soon as you have finished adding the new /good/ triples, deploy your new code that uses dcterms:creator instead of dc:creator.

Now run the add good triples script again. This is because, while you were deploying the code, users may have been plugging away at your app, happily creating more bad old triples. Running the script again will add good triples for any of these bad triples that have been created meantime. And because you’ve now deployed your code changes, the application won’t create any more bad triples.

All we have left to do is get rid of the bad old triples. With any luck (and a bit of foresight, and testing), your application will function perfectly well with both bad and good triples in the store, so we can take our time a bit getting rid of the bad triples.

Removing Bad Old Triples

We’ll write a SPARQL query to give us back the triples we want to remove, and then we’ll create a Changeset to remove them:

    CONSTRUCT {
    # bad
             ?s <http://purl.org/dc/elements/1.1/creator> ?o
    } WHERE {

    # bad
            ?s <http://purl.org/dc/elements/1.1/creator> ?o .
    # good
            ?s <http://purl.org/dc/terms/creator> ?o

    }

You should apply a LIMIT here too, but, so long as you wait for each changeset batch to succeed before sending the next one, you don’t need to page: triples that have been removed no longer match the query, so just fetch the first 1000 again until nothing comes back. It’s worth remembering that the number of triples in a changeset document will be about ten times the number of triples you are removing, so you may need to make the limit a bit smaller than before.
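Unlike the add stage, this loop never increments an OFFSET. A minimal sketch follows, with `fetch_bad` and `apply_removal` as hypothetical callables standing for "run the CONSTRUCT with a LIMIT" and "send the changeset and wait for it to succeed". Both are faked against an in-memory set here.

```python
def remove_in_batches(fetch_bad, apply_removal, limit=100):
    """Repeatedly fetch the first `limit` bad triples and remove them.

    Each batch must succeed before the next fetch, otherwise the same
    triples would be returned again and again.
    """
    removed = 0
    while True:
        batch = fetch_bad(limit)
        if not batch:
            return removed
        if not apply_removal(batch):
            raise RuntimeError("changeset batch failed; stopping")
        removed += len(batch)


# In-memory stand-in for the store: 250 made-up bad triples.
store = {(f"s{i}", "http://purl.org/dc/elements/1.1/creator", "o") for i in range(250)}


def fetch_bad(limit):
    return sorted(store)[:limit]


def apply_removal(batch):
    store.difference_update(batch)
    return True


total_removed = remove_in_batches(fetch_bad, apply_removal)
```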

An important point on the platform’s Changesets API: if you send more than 14 changesets in a batch (ie, in the same document), they will be performed asynchronously and you should get back an HTTP 202 Accepted status code. A potential problem is that, if you are trying to remove a statement that doesn’t exist then all the changes in that batch will fail, but you will still get back a 202 Accepted (because the platform hasn’t tried processing them yet). You need a way of knowing if the batch has failed or not. One way to do this is to include in your batch of changes, the addition of a triple you can then poll the store for to see if it exists or not.

If you’re using PHP, you can use Moriarty to create your changesets:

    
#php
define('STORE_URI', 'http://api.talis.com/stores/sandbox1');
$time = time(); // take the timestamp once so the marker URI and its value agree
$markerURI =  STORE_URI.'/items/'.$time;

$rdfToAdd = " <{$markerURI}> <http://purl.org/dc/terms/created> \"{$time}\" . ";

$args = array(
    'before' => $rdfToRemove, // got this from the CONSTRUCT described above
    'after' => $rdfToAdd, // this is the marker triple
);
$cs = new Changeset($args);
$store = new Store(STORE_URI); // you will probably need to add your login credentials - see the moriarty docs
$response = $store->get_metabox()->apply_changeset($cs);

if(!$response->is_success()){
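    //log error, and stop (assumes this runs inside the loop that sends each batch,
    //and that log_error() is a logging helper defined by your application)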
    log_error("Changeset failed: ".$response->status_code ."\n " . $response->body ." \n  Changeset: \n" . $cs->to_rdfxml());
    break;
}

    

At this point, you may also want to poll the store for the existence of your marker triple to see if the batch has been processed. Since we minted the marker URI in the store’s URI space, we can just try to dereference it; as soon as we get back a 200 or 303 response, we can move on, but if we still get back a 404 after say 10 seconds, the changeset has probably failed and we need to log that and investigate.
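The polling step might look something like this. It is only a sketch: `get_status` is a hypothetical callable that dereferences the marker URI and returns the HTTP status code, and it is faked below so the example runs without touching the network.

```python
import time


def wait_for_marker(get_status, marker_uri, timeout=10.0, interval=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll the marker URI until it resolves (200 or 303) or the timeout passes.

    Returns True if the marker appeared, False if we gave up.
    """
    deadline = clock() + timeout
    while True:
        if get_status(marker_uri) in (200, 303):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Fake responses: the marker 404s twice, then the batch has been processed.
responses = iter([404, 404, 200])
processed = wait_for_marker(
    lambda uri: next(responses),
    "http://api.talis.com/stores/sandbox1/items/1234567890",
    sleep=lambda seconds: None,  # skip the real delay in this demonstration
)
```

If `wait_for_marker` returns False, the batch has probably failed, and that is the point to log and investigate.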

If everything goes OK, you can then make double sure you’ve got rid of all the bad triples by running a quick ASK query against your store’s SPARQL service.

    
        ASK {
            # bad
            ?s <http://purl.org/dc/elements/1.1/creator> ?o
        }
    

And if you get back FALSE, you’re finished.

Well done!

Augmenting Last.fm Data with BBC data on the Talis Platform

A short while back, I created a Linked Data wrapper on the Last.FM API for Events and Artists. The artist data links to the BBC’s data about each artist using owl:sameAs.

Now that the BBC RDF is available in a Talis Platform store, I can put some of my Last.FM data into a store (it’s currently generated on the fly from the Last.FM API), search on it, and then augment it with data from the BBC.

So I put some Last.FM data into the Sandbox1 store.

Now I can search on it with the items query endpoint like:

http://api.talis.com/stores/sandbox1/items?query=Black

This gives us the results as RSS 1.0, which is also RDF/XML, and contains a graph with 12 resources in it.

We can now pass the URI of this (or any RSS 1.0) document to the BBC-Backstage store’s Augment Service like this:

http://api.talis.com/stores/bbc-backstage/services/augment?data-uri=http%3A%2F%2Fapi.talis.com%2Fstores%2Fsandbox1%2Fitems%3Fquery%3DBlack

The Augment service will look at the URIs in the RSS results, and add DESCRIBEs for any of those URIs that it finds in its own store, giving you back the RSS augmented with BBC data.
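The only fiddly part of calling the service is percent-encoding the document URI into the data-uri parameter. A minimal Python sketch, using only the URLs shown above:

```python
from urllib.parse import quote

AUGMENT_SERVICE = "http://api.talis.com/stores/bbc-backstage/services/augment"


def augment_url(data_uri):
    """Build an Augment request URL for an RSS 1.0 document.

    safe="" makes quote() encode ':', '/', '?' and '=' as well, so the
    whole document URI travels as a single query parameter value.
    """
    return f"{AUGMENT_SERVICE}?data-uri={quote(data_uri, safe='')}"


url = augment_url("http://api.talis.com/stores/sandbox1/items?query=Black")
```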

So the graph we get back now contains 15 resources, where the BBC-Backstage store has found descriptions for 3 of the URIs in the original RSS.

For further information, see Leigh Dodds’ slides on Getting Started with the Talis Platform.

voiD, datasets, graphs, documents, and dcterms:isPartOf backlinks

One thing that I have heard people asking several times now regarding voiD is to do with how to say that data is part of a dataset.

Frédérick Giasson asked about this recently in #swig, and wondered why the voiD guide recommended using dcterms:isPartOf. I thought, since this is something that has been asked about a few times, I would blog about it and explain the reasoning behind this.

So, it wouldn’t be right to say something like:

<http://lastfm.rdfize.com/artists/Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

… because we don’t want to say that “Black Sabbath is part of the lastfm.rdfize.com dataset”.
We want to say “a description of Black Sabbath (composed of triples) is part of the lastfm.rdfize.com dataset”.

One approach to encapsulating this meaning would be to reify each individual triple and state that the triple is part of the dataset … but we felt that this would be neither practical nor popular.

So, in the voiD guide, we advocate that when you publish Linked Data, and you want to say that the data you are publishing is part of a voiD Dataset, you add a triple linking the document in which the data is published, to the dataset. eg:

<http://lastfm.rdfize.com/?artistName=Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

(where <http://lastfm.rdfize.com/?artistName=Black+Sabbath> is a document containing a description of <http://lastfm.rdfize.com/artists/Black+Sabbath>)

This way, when a Linked Data client dereferences <http://lastfm.rdfize.com/artists/Black+Sabbath> they get redirected to a document, and can follow the dcterms:isPartOf link from the document URI to the voiD Dataset.

What some people don’t like so much, is the implication that their dataset consists of documents, when what they really want to say is that their dataset consists of descriptions of resources.

The conceptual problem, if there is one, is that here the document URI is identifying an RDF/XML document, not the graph of RDF data encoded in that document. So, if you wanted to explicitly state that the graph, rather than the document, is part of the dataset, it could perhaps be done like this:

[ a <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
<http://purl.org/vocab/frbr/core#embodiment> <http://lastfm.rdfize.com/?artistName=Black+Sabbath&output=rdf> ;
dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset>
] .

But I’m really not too sure if that is either semantically correct, or in any way a more practically useful description than simply saying the document is part of the dataset.

We (the voiD guide authors) think that the <document> dcterms:isPartOf <dataset> pattern is the most pragmatic approach to making a dataset discoverable from a LOD document.
But we are also open to suggestions for improvement as we evolve the vocabulary and guide in line with popular usage and the requirements of LOD publishers.

What do you think?

A MalBestPractice with RDF: Making Assumptions

Michael Hausenblas has a new blog post listing some common malpractices when working with RDF.

RDF is a model, not a format

I especially agree with his point about “Thinking of RDF on the serialisation level” (as a malpractice) – grabbing values from RDF/XML or RDFa with XPath or regexes is not wise. It is making an unsafe assumption about the stability of the serialisation. In fact, if you are writing a Linked Data application, there are very few assumptions you can safely make, about either the serialisation, or the model.

RDF isn’t SQL, XML, OO …

So maybe my favourite MalBestPractising is: trying to treat RDF too much like some other software paradigm – too much like a relational database, too much like OO, too much like XML. It’s enticing to try to write software that treats RDF as if it was something that the mainstream of software development are more familiar with, to try to use the same kind of techniques and shortcuts. But these shortcuts often rely on assumptions that can’t be made about RDF data (at least, not proper, organic, free-range RDF from the web). You can’t assume that the same RDF graph will be serialised the same way as last time. You can’t assume that the http://xmlns.com/foaf/0.1/ namespace will always be bound to the foaf prefix. You can’t assume that a resource will, or won’t have a particular property, just because it has another property, or a particular type. If you don’t know that a statement exists, you can’t assume it doesn’t, only that you don’t know about it. et cetera.
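The prefix point is easy to show in a few lines. In this sketch the two prefix maps are made up, standing for the bindings declared by two different documents. Comparing the expanded URIs is safe, while comparing the prefixed names is not.

```python
FOAF_NS = "http://xmlns.com/foaf/0.1/"


def expand(qname, prefixes):
    """Expand a prefixed name using the bindings of the document it came from."""
    prefix, _, local = qname.partition(":")
    return prefixes[prefix] + local


# Two hypothetical documents with different prefix bindings.
doc_a = {"foaf": FOAF_NS}
doc_b = {"f": FOAF_NS, "foaf": "http://example.com/not-foaf#"}

# Different prefixes, same term.
same_term = expand("foaf:name", doc_a) == expand("f:name", doc_b)

# Same prefix, different terms.
different_term = expand("foaf:name", doc_a) != expand("foaf:name", doc_b)
```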

Not making these assumptions can be tedious, and at times problematic, but ultimately, the fewer assumptions you write into your code, the more interesting, open, and ‘webby’ your application can be.

Less assumption, less code, more data, more web

The huge game-changing thing about web development with the Web of Data though, is not the set of assumptions you can’t make, but the assumptions you don’t have to make. Thanks to the Follow Your Nose principle espoused by Linked Data, you don’t need to write assumptions about your data into your code; you can instead let the application “follow its nose” to find out more about the data.

You can follow vocabulary term URIs to find out how they can be used, how they can be labeled, and what inferences can be drawn from their use. You can follow owl:sameAs and rdfs:seeAlso links to find out more about a resource. You can use semantic index services like Sindice to find occurrences of a URI or keyword across the Web of Data. You can follow dcterms:isPartOf links from RDF documents back to voiD Datasets, which will often have links you can follow to licenses that tell you how the data can be used, and to other services (such as SPARQL endpoints).

The more data is published, not just within datasets, but about datasets, and about services, the more we can write applications that open up to the web, and the fewer lines of code we will need to do it!

Vocabify: Instance Data -> Vocab

One thing about writing RDF vocabularies that occurred to me listening to people talk at VoCamps (Oxford and Galway), is that typically what you are trying to do isn’t defining new terms, it’s modeling data, and at some stage in the modeling you discover you need to write a new vocabulary. Vocabulary authors often want to describe how their terms can best be used with existing complementary vocabularies, like FOAF and Dublin Core, but the only commonly practiced way of doing so is to put it in human-readable form in the documentation annotations. In voiD, we wrote a guide, principally because we wanted to describe how the terms ought to be used together with existing vocabulary terms.

In tandem with this thought, when sketching out vocabularies myself, I tend not to start out by defining Classes and Properties, which is both tediously repetitive, and a step removed from the data-modeling (which is what I’m actually trying to do in the first place). Instead, I define a prefix for a new namespace, and pretend a vocabulary already exists at it. Probably quite a lot of people do this. I think of them as “pretend schemas”; I’ve heard ldodds call them “just in time schemas” (only bother to write it when someone actually asks to see it).

So last night I coded up Vocabify, which you can feed some instance data that uses your “just in time vocabulary”, tell it which namespace URI is the pretend one, and it will generate a schema from the instance data, which you can then edit and publish.

The classes and properties are also linked to the instances they are generated from with ov:exampleResource, so it is clear to readers how they can be used together with other properties.
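The core idea can be sketched in a few lines of Python. This is not Vocabify's actual code, just an illustration under simple assumptions: triples are plain tuples, anything in the pretend namespace used as a predicate becomes an rdf:Property, anything used as the object of rdf:type becomes an rdfs:Class, and the ov: prefix is assumed to be the OpenVocab namespace http://open.vocab.org/terms/.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
# Assumed expansion of ov:exampleResource.
OV_EXAMPLE_RESOURCE = "http://open.vocab.org/terms/exampleResource"


def sketch_schema(triples, namespace):
    """Derive schema triples for terms in `namespace` from instance data."""
    schema = set()
    for s, p, o in triples:
        if p.startswith(namespace):
            # A pretend term used in predicate position is a property.
            schema.add((p, RDF_TYPE, RDF_PROPERTY))
            schema.add((p, OV_EXAMPLE_RESOURCE, s))
        if p == RDF_TYPE and o.startswith(namespace):
            # A pretend term used as the object of rdf:type is a class.
            schema.add((o, RDF_TYPE, RDFS_CLASS))
            schema.add((o, OV_EXAMPLE_RESOURCE, s))
    return schema


PRETEND = "http://example.com/pretend#"  # hypothetical "just in time" namespace
instance_data = [
    ("http://example.com/fido", RDF_TYPE, PRETEND + "Dog"),
    ("http://example.com/fido", PRETEND + "chases", "http://example.com/tom"),
    ("http://example.com/fido", "http://xmlns.com/foaf/0.1/name", "Fido"),
]

schema = sketch_schema(instance_data, PRETEND)
```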

Metamorph Open Source project for Semantic Converter Web Service

I’ve published the code behind the Talis Convert Service (production release at stable URL coming soon) as an open source project on Google Code, called Metamorph.

Metamorph is a service aimed at semantic web developers. It is much like triplr, babel, swignition and any23 (please leave a comment pointing to any other similar services).

You give it a(n http) URI, an (optional) input format, and an output format, and it will fetch the document from the web, and convert it into the output format.

Understood input values include:

  • Semantic HTML (RDFa, eRDF, microformats, POSH)
  • RDF (XML, Turtle, JSON)
  • SPARQL-XML
  • Facet XML (the response format of the facets service available on all platform stores)

Output for all input formats can be:

  • JSON
  • JSONP
  • HTML

If the input is some form of RDF, you can also ask for:

  • RDF (XML, Turtle, JSON; the default HTML is rendered as RDFa)
  • RSS 1.0
  • TriX
  • Exhibit (web page, JSON, JSONP)

In addition, if the input is an RDF format, you can specify multiple data URIs, and the results will be merged in the output document. For instance, this conversion merges data from two of my homepages, and a Turtle file.

I’m thinking about removing the TriX output, as I’m not sure it would be used by anyone – the reason I didn’t bother to write a parser for it was because I haven’t seen any data published as TriX in the first place.

I welcome any input on what else would be useful from this web service. I suspect that more output options, while fairly easy to add, would not be very useful. More input options may be useful, but perhaps not significantly so.

I suspect what might be more useful, and more likely to distinguish this from similar RDF converter services, are graph transformation services, which might include:

  • Diffs
  • Intersects
  • Smushing
  • Augmenting on property and class type URIs with labels and comments, perhaps retrieved from SchemaCache
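For graphs without blank nodes, the first two of these are just set operations on triples. The sketch below assumes exactly that simplification (blank nodes would need proper graph matching) and is not part of Metamorph.

```python
def graph_diff(before, after):
    """Return (removed, added) triples between two ground RDF graphs."""
    return before - after, after - before


def graph_intersect(a, b):
    """Return the triples that both ground graphs share."""
    return a & b


# Two small made-up graphs describing the same resource.
NAME = "http://xmlns.com/foaf/0.1/name"
NICK = "http://xmlns.com/foaf/0.1/nick"
ME = "http://example.com/me"

old_graph = {(ME, NAME, "Keith"), (ME, NICK, "kwijibo")}
new_graph = {(ME, NAME, "Keith"), (ME, NICK, "kwijibo2")}

removed, added = graph_diff(old_graph, new_graph)
shared = graph_intersect(old_graph, new_graph)
```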

Metamorph is coded in PHP, and uses ARC for parsing RDF and HTML, and serialising RDF/XML and Turtle.

Please use the issue tracker for raising any bugs or feature requests.

voiD: a Vocabulary of Interlinked Datasets

As technological advances allow the production and dissemination of information to scale out, old methods for navigating the information become inadequate, and we need new means to cope with the greater scale of information available.

With the rise of printing in the 16th century, library collections flourished, making more ideas and information available to more scholars than ever before. Yet to know what books a library contained, scholars had to either physically visit the library (and browse the shelves, or consult a manuscript catalogue), or make enquiries by letter.

Frontispiece of the first printed library catalogue

In 1595, Leiden University innovated by becoming the first institution to make their library’s catalogue available in print. Just as printing had made the editions within a library far more widely available, printing a book about the library’s collection brought awareness of the library and its contents to a greater audience. Now, scholars all across Europe could tell if Leiden University’s library had the information they needed. Scholars had more information about what books were available, and Leiden’s international reputation was bolstered. Other libraries followed suit by printing their own catalogues, and those library catalogues could be collected. Scholars could compare the strengths and purposes of multiple libraries from a single location.

When the Linked Open Data movement began gaining ground in 2007, there were relatively few large RDF datasets available on the web. If you followed the right blogs and mailing lists, you knew which datasets were available. As the LOD Cloud grows (and manually drawing it becomes less and less practical), it becomes apparent that the number of datasets is outgrowing our methods for discovering them. Just as it made sense for libraries in the 16th century to use the technology of print to publish descriptions of their collections, it is natural to use RDF to publish descriptions of datasets available on the web. Just as printed catalogues brought library collections to new audiences, and enabled new uses, RDF descriptions will bring datasets to new audiences (machines!), making them more findable, and enabling new uses. All you need is the vocabulary to describe datasets with.

voiD interlinking dataset diagram

voiD is a vocabulary dataset publishers can use to describe their datasets: their subject areas, their access mechanisms (eg: APIs, SPARQL endpoints, data dumps), their licensing, their provenance, how they link to other datasets, which vocabularies are used within them, and statistics relating to their contents.

As well as the vocabulary, there is the voiD guide, where the authors of voiD (Jun Zhao, Michael Hausenblas, Richard Cyganiak, and myself [Keith Alexander] ) explain how to create voiD descriptions combining terms from voiD with other useful vocabularies, publish voiD, and query voiD.

Feedback on both the vocabulary, and the Guide, will be gratefully received at void-rdfs-internals@googlegroups.com.