
voiD stores and Interesting Queries

Amongst the best incentives for data authors are applications that use that data. One sort of data that especially interests me is dataset metadata, for which the voiD vocabulary was developed; I think this kind of data has the potential to enable the future generation of web apps to join together the ever-growing web of data in wild and exciting new ways. So I was pretty pleased when I saw the voiD store from RKB Explorer. This store provides a SPARQL endpoint over all the voiD descriptions RKB Explorer have produced about their datasets, plus some descriptions they’ve gathered about other datasets. It also provides a list of source documents, sample queries, and a service that takes a list of URIs, and returns a list of SPARQL endpoints that might be able to return triples about them.

This, together with a rainy weekend, prompted me to try out some simple voiD-related things I’d been thinking of. I’ve also been aggregating voiD data in one of my dev stores. This is done partly by creating templated descriptions from a list of Talis Platform stores and poking at them with some SPARQL queries. The rest of the data I found either manually, or by querying Sindice for a list of void:Dataset URIs found in the documents they’ve crawled.

The Sindice API allows you to specify triple patterns with wildcards, such as * rdf:type void:Dataset . and returns the matches as an Atom feed. I page through the results, importing the RDF from the URIs into my store.
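As a rough illustration of the paging step, here is a minimal Python sketch that pulls document URIs out of one page of an Atom feed. The sample feed, the function name `entry_links`, and the assumption that each entry carries its document URI in a `link` element's `href` are all mine for illustration; the real shape of Sindice's feed may differ.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by every element in an Atom 1.0 feed.
ATOM = "{http://www.w3.org/2005/Atom}"


def entry_links(atom_xml):
    """Return the href of the first <link> in each Atom <entry>."""
    root = ET.fromstring(atom_xml)
    links = []
    for entry in root.findall(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        if link is not None and link.get("href"):
            links.append(link.get("href"))
    return links


# Hypothetical one-page result; a real feed would be fetched over HTTP.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>a</title><link href="http://example.com/void.ttl"/></entry>
  <entry><title>b</title><link href="http://example.org/dataset.rdf"/></entry>
</feed>"""

for uri in entry_links(SAMPLE_FEED):
    print(uri)  # each URI would then be fetched and imported into the store
```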

One of my favourite terms from voiD is void:uriRegexPattern, which can be used to indicate that if a URI matches the pattern, the dataset might contain some triples about that URI. You can find the datasets whose pattern matches a given URI with a bit of SPARQL:

PREFIX void: <http://rdfs.org/ns/void#>
DESCRIBE ?dataset {
    ?dataset a void:Dataset ;
        void:uriRegexPattern ?regex ;
        void:sparqlEndpoint ?sparql .

    FILTER(REGEX("http://example.com/my/uri", ?regex))
}

The novel thing here is that normally, when you use REGEX() in SPARQL, you put a variable binding in the first parameter position and hardcode a regular expression into the query in the second position. Here, though, the regex is in the data: the string it is evaluated against is hardcoded, and the variable binding contains the regex. (Unfortunately, while this works with ARQ, it doesn’t appear to work with 3Store, which is perhaps why the rkbexplorer voiD Store provides this as a separate web service.)
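To make the inversion concrete, here is a small Python sketch of the same idea outside SPARQL: the patterns live in the data and the URI is the fixed input. The dataset records and the function name `datasets_matching` are invented for illustration, and Python's `re` dialect is not identical to the XPath regular expressions SPARQL uses, so treat it as an approximation rather than an equivalent.

```python
import re

# Hypothetical voiD data: each record pairs a dataset with its
# void:uriRegexPattern and void:sparqlEndpoint values.
DATASETS = [
    {"dataset": "http://example.com/void#ds",
     "regex": r"^http://example\.com/",
     "sparql": "http://example.com/sparql"},
    {"dataset": "http://example.org/void#ds",
     "regex": r"^http://example\.org/",
     "sparql": "http://example.org/sparql"},
]


def datasets_matching(uri, datasets):
    """Return the records whose stored regex matches the given URI.

    re.search is used because SPARQL's REGEX() also matches anywhere in
    the string unless the pattern itself is anchored.
    """
    return [d for d in datasets if re.search(d["regex"], uri)]


for match in datasets_matching("http://example.com/my/uri", DATASETS):
    print(match["dataset"], match["sparql"])
```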

So, I’ve used this to create a page that will take a URI, and query my voiD store for void:sparqlEndpoints and void:uriLookupEndpoints, which it will then call to retrieve triples and render them on the page. Here is a query for the URI http://climb.dataincubator.org/dataset .
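A sketch of how such a page might build its requests once it has the endpoints, assuming the SPARQL endpoint accepts a `query` parameter over GET (as the SPARQL protocol describes) and that a void:uriLookupEndpoint expects the URL-encoded URI appended to it. The endpoint URLs and function names below are placeholders of mine, not the ones used by the page described above.

```python
from urllib.parse import quote, urlencode


def sparql_describe_url(endpoint, uri):
    """Build a GET URL asking a SPARQL endpoint to DESCRIBE the URI.

    Assumes the endpoint URL has no query string of its own.
    """
    return endpoint + "?" + urlencode({"query": "DESCRIBE <%s>" % uri})


def uri_lookup_url(lookup_endpoint, uri):
    """Append the fully URL-encoded URI to a URI lookup endpoint."""
    return lookup_endpoint + quote(uri, safe="")


uri = "http://climb.dataincubator.org/dataset"
# Both endpoint URLs are hypothetical.
print(sparql_describe_url("http://example.com/sparql", uri))
print(uri_lookup_url("http://example.com/meta?about=", uri))
```

The resulting URLs would then be fetched with an RDF-friendly Accept header and the returned triples rendered on the page.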

Another query that interested me, which has become possible since the Platform introduced support for the COUNT() function from SPARQL 1.1, is: which are the most commonly used vocabularies? (SIOC and FOAF so far, though this is because I generated many of these triples by scripted prodding of endpoints with ASK queries.) But then I wanted to be able to see easily which datasets used which vocabularies, so I created some pages to let me browse datasets by vocabulary.

  1. SIOC Core Ontology Namespace (54)
  2. Friend of a Friend (FOAF) vocabulary (42)
  3. Coreference Ontology (35)
  4. http://www.aktors.org/ontology/portal# (34)
  5. http://www.aktors.org/ontology/support# (30)
  6. http://www.rkbexplorer.com/ontologies/resist# (30)
  7. void (25)
  8. http://purl.org/NET/scovo# (24)
  9. http://acm.rkbexplorer.com/ontologies/acm# (22)
  10. http://courseware.rkbexplorer.com/ontologies/courseware# (21)

Then I made some pages to do the same thing with dct:subjects. Here, the largest category by some way is category: online_social_networking. This is because I automatically generated ?dataset dct:subject <http://dbpedia.org/resource/Category:Online_social_networking> . triples for all the Platform stores that made a certain use of terms from the SIOC ontology.

These automatically generated voiD descriptions will not, of course, present a balanced picture of what is out there, and they skew the results somewhat. The most interesting descriptions are those that are handcrafted to some extent and describe something of the nature of the dataset’s domain.

I’ve also provided a form for submitting voiD URLs. My hope is that this simple application, together with the rkbexplorer voiD Store, might encourage more people to describe their linked data datasets with voiD, or perhaps add more detail to the descriptions they already publish, in order to see their dataset come up in the appropriate queries. And I hope that this, in turn, will encourage others to build more sophisticated and exciting applications using that data.

voiD, datasets, graphs, documents, and dcterms:isPartOf backlinks

One question I have heard several times now regarding voiD is how to say that data is part of a dataset.

Frédérick Giasson asked about this recently in #swig, and wondered why the voiD guide recommended using dcterms:isPartOf. I thought, since this is something that has been asked about a few times, I would blog about it and explain the reasoning behind this.

So, it wouldn’t be right to say something like:

<http://lastfm.rdfize.com/artists/Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

… because we don’t want to say that “Black Sabbath is part of the lastfm.rdfize.com dataset”.
We want to say “a description of Black Sabbath (composed of triples) is part of the lastfm.rdfize.com dataset”.

One approach to encapsulating this meaning would be to reify each individual triple and state that the triple is part of the dataset … but we felt that this would be neither practical nor popular.

So, in the voiD guide, we advocate that when you publish Linked Data, and you want to say that the data you are publishing is part of a voiD Dataset, you add a triple linking the document in which the data is published, to the dataset. eg:

<http://lastfm.rdfize.com/?artistName=Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

(where <http://lastfm.rdfize.com/?artistName=Black+Sabbath> is a document containing a description of <http://lastfm.rdfize.com/artists/Black+Sabbath>)

This way, when a Linked Data client dereferences <http://lastfm.rdfize.com/artists/Black+Sabbath> they get redirected to a document, and can follow the dcterms:isPartOf link from the document URI to the voiD Dataset.
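The client-side lookup can be sketched in a few lines of Python. To keep it self-contained, the redirect and the document's triples are held in memory rather than fetched over HTTP, and the function name `find_datasets` is mine; a real client would dereference the URI and parse the RDF it gets back.

```python
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"

RESOURCE = "http://lastfm.rdfize.com/artists/Black+Sabbath"
DOCUMENT = "http://lastfm.rdfize.com/?artistName=Black+Sabbath"
DATASET = "http://lastfm.rdfize.com/meta.n3#Dataset"

# Stand-in for the HTTP redirect from the resource URI to its document.
REDIRECTS = {RESOURCE: DOCUMENT}

# Stand-in for the triples parsed out of the retrieved document.
TRIPLES = [(DOCUMENT, DCTERMS_IS_PART_OF, DATASET)]


def find_datasets(resource_uri, redirects, triples):
    """Follow the redirect, then the dcterms:isPartOf links from the document."""
    document_uri = redirects.get(resource_uri, resource_uri)
    return [o for s, p, o in triples
            if s == document_uri and p == DCTERMS_IS_PART_OF]


print(find_datasets(RESOURCE, REDIRECTS, TRIPLES))
```

Note that the link is keyed on the document URI: looking for dcterms:isPartOf on the resource URI itself would find nothing, which is exactly the distinction drawn above.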

What some people don’t like so much, is the implication that their dataset consists of documents, when what they really want to say is that their dataset consists of descriptions of resources.

The conceptual problem, if there is one, is that here the document URI is identifying an RDF/XML document, not the graph of RDF data encoded in that document. So, if you wanted to explicitly state that the graph, rather than the document, is part of the dataset, it could perhaps be done like this:

[ a <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
<http://purl.org/vocab/frbr/core#embodiment> <http://lastfm.rdfize.com/?artistName=Black+Sabbath&output=rdf> ;
dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset>
] .

But I’m really not too sure if that is either semantically correct, or in any way a more practically useful description than simply saying the document is part of the dataset.

We (the voiD guide authors) think that the <document> dcterms:isPartOf <dataset> pattern is the most pragmatic approach to making a dataset discoverable from a LOD document.
But we are also open to suggestions for improvement as we evolve the vocabulary and guide in line with popular usage and the requirements of LOD publishers.

What do you think?

voiD: a Vocabulary of Interlinked Datasets

As technological advances allow the production and dissemination of information to scale out, old methods for navigating the information become inadequate, and we need new means to cope with the greater scale of information available.

With the rise of printing in the 16th century, library collections flourished, making more ideas and information available to more scholars than ever before. Yet to know what books a library contained, scholars had to either physically visit the library (and browse the shelves, or consult a manuscript catalogue), or make enquiries by letter.

Frontispiece of the first printed library catalogue

In 1595, Leiden University innovated by becoming the first institution to make its library’s catalogue available in print. Just as printing had made the editions within a library far more widely available, printing a book about the library’s collection brought awareness of the library and its contents to a greater audience. Now, scholars all across Europe could tell if Leiden University’s library had the information they needed. Scholars had more information about what books were available, and Leiden’s international reputation was bolstered. Other libraries followed suit by printing their own catalogues, and those library catalogues could be collected. Scholars could compare the strengths and purposes of multiple libraries from a single location.

When the Linked Open Data movement began gaining ground in 2007, there were relatively few large RDF datasets available on the web. If you followed the right blogs and mailing lists, you knew which datasets were available. As the LOD Cloud grows (and manually drawing it becomes less and less practical), it becomes apparent that the number of datasets is outgrowing our methods for discovering them. Just as it made sense for libraries in the 16th century to use the technology of print to publish descriptions of their collections, it is natural to use RDF to publish descriptions of datasets available on the web. Just as printed catalogues brought library collections to new audiences, and enabled new uses, RDF descriptions will bring datasets to new audiences (machines!), making them more findable, and enabling new uses. All you need is the vocabulary to describe datasets with.

voiD interlinking dataset diagram

voiD is a vocabulary dataset publishers can use to describe their datasets: their subject areas, their access mechanisms (eg: APIs, SPARQL endpoints, data dumps), their licensing, their provenance, how they link to other datasets, which vocabularies are used within them, and statistics relating to their contents.
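To give a feel for what such a description looks like, here is a small Python sketch that renders a minimal voiD description as Turtle. The dataset URI, title, endpoint and function name are invented for illustration; the voiD terms used (void:Dataset, void:sparqlEndpoint, void:vocabulary, void:uriRegexPattern) and dcterms:title are real, but this covers only a fraction of what the vocabulary can express.

```python
def void_description(dataset_uri, title, sparql_endpoint, vocabularies, uri_regex):
    """Render a minimal voiD description as Turtle.

    Only backslashes in the regex are escaped; titles containing quotes
    are not handled in this sketch.
    """
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        "<%s> a void:Dataset ;" % dataset_uri,
        '    dcterms:title "%s" ;' % title,
        "    void:sparqlEndpoint <%s> ;" % sparql_endpoint,
    ]
    for vocab in vocabularies:
        lines.append("    void:vocabulary <%s> ;" % vocab)
    escaped = uri_regex.replace("\\", "\\\\")  # Turtle needs \\ for a literal backslash
    lines.append('    void:uriRegexPattern "%s" .' % escaped)
    return "\n".join(lines)


# All values below are hypothetical.
print(void_description(
    "http://example.com/void#dataset",
    "Example dataset",
    "http://example.com/sparql",
    ["http://xmlns.com/foaf/0.1/"],
    r"^http://example\.com/",
))
```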

As well as the vocabulary, there is the voiD guide, where the authors of voiD (Jun Zhao, Michael Hausenblas, Richard Cyganiak, and myself [Keith Alexander]) explain how to create voiD descriptions combining terms from voiD with other useful vocabularies, publish voiD, and query voiD.

Feedback on both the vocabulary, and the Guide, will be gratefully received at void-rdfs-internals@googlegroups.com.