Author Archive

Augmenting Last.fm Data with BBC data on the Talis Platform

A short while back, I created a Linked Data wrapper on the Last.FM API for Events and Artists. The artist data links to the BBC’s data about each artist using owl:sameAs.

Now that the BBC RDF is available in a Talis Platform store, I can put some of my Last.FM data into a store (it’s currently generated on the fly from the Last.FM API), search on it, and then augment it with data from the BBC.

So I put some Last.FM data into the Sandbox1 store.

Now I can search on it with the items query endpoint like:

http://api.talis.com/stores/sandbox1/items?query=Black

This gives us the results as RSS 1.0, which is also RDF/XML, and contains a graph with 12 resources in it.

We can now pass the URI of this (or any RSS 1.0) document to the BBC-Backstage store’s Augment Service like this:

http://api.talis.com/stores/bbc-backstage/services/augment?data-uri=http%3A%2F%2Fapi.talis.com%2Fstores%2Fsandbox1%2Fitems%3Fquery%3DBlack

The Augment service will look at the URIs in the RSS results, and add DESCRIBEs for any of those URIs that it finds in its own store, giving you back the RSS augmented with BBC data.

So the graph we get back now contains 15 resources, where the BBC-Backstage store has found descriptions for 3 of the URIs in the original RSS.
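The two URLs above can also be built programmatically. This is a minimal sketch using only Python's standard library; the helper names `items_query_url` and `augment_url` are made up for illustration, and the URL patterns are simply copied from the examples in this post.

```python
from urllib.parse import urlencode

API_BASE = "http://api.talis.com/stores"


def items_query_url(store: str, query: str) -> str:
    """URL of a store's items search for the given keyword."""
    return f"{API_BASE}/{store}/items?{urlencode({'query': query})}"


def augment_url(store: str, data_uri: str) -> str:
    """URL asking a store's Augment service to augment the RSS 1.0 document at data_uri."""
    # urlencode percent-encodes ':', '/', '?' and '=' in the nested URL,
    # so the whole search URL travels as a single query parameter value.
    return f"{API_BASE}/{store}/services/augment?{urlencode({'data-uri': data_uri})}"


search = items_query_url("sandbox1", "Black")
augmented = augment_url("bbc-backstage", search)
print(search)
print(augmented)
```

The important detail is that the search URL is percent-encoded before being passed as the `data-uri` value; otherwise its own `?` and `=` would be read as part of the outer query string.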

For further information, see Leigh Dodds’s slides on Getting Started with the Talis Platform.

voiD, datasets, graphs, documents, and dcterms:isPartOf backlinks

One thing that I have heard people asking several times now regarding voiD is to do with how to say that data is part of a dataset.

Frédérick Giasson asked about this recently in #swig, and wondered why the voiD guide recommended using dcterms:isPartOf. I thought, since this is something that has been asked about a few times, I would blog about it and explain the reasoning behind this.

So, it wouldn’t be right to say something like:

<http://lastfm.rdfize.com/artists/Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

… because we don’t want to say that “Black Sabbath is part of the lastfm.rdfize.com dataset”.
We want to say “a description of Black Sabbath (composed of triples) is part of the lastfm.rdfize.com dataset”.

One approach to encapsulating this meaning would be to reify each individual triple and state that the triple is part of the dataset … but we felt that this would be neither practical nor popular.

So, in the voiD guide, we advocate that when you publish Linked Data, and you want to say that the data you are publishing is part of a voiD Dataset, you add a triple linking the document in which the data is published, to the dataset. eg:

<http://lastfm.rdfize.com/?artistName=Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

(where <http://lastfm.rdfize.com/?artistName=Black+Sabbath> is a document containing a description of <http://lastfm.rdfize.com/artists/Black+Sabbath>)

This way, when a Linked Data client dereferences <http://lastfm.rdfize.com/artists/Black+Sabbath> they get redirected to a document, and can follow the dcterms:isPartOf link from the document URI to the voiD Dataset.
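As a rough sketch of that lookup from the client's side: the code below uses two in-memory dictionaries as stand-ins for the HTTP redirect and for the parsed document (both are assumptions for illustration; a real client would dereference the URI and parse the RDF it gets back). The function name `datasets_for` is hypothetical.

```python
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"

# Hypothetical stand-in for HTTP: resource URI -> document URI it redirects to.
REDIRECTS = {
    "http://lastfm.rdfize.com/artists/Black+Sabbath":
        "http://lastfm.rdfize.com/?artistName=Black+Sabbath",
}

# Hypothetical stand-in for parsed documents: document URI -> set of triples.
DOCUMENTS = {
    "http://lastfm.rdfize.com/?artistName=Black+Sabbath": {
        ("http://lastfm.rdfize.com/?artistName=Black+Sabbath",
         DCTERMS_IS_PART_OF,
         "http://lastfm.rdfize.com/meta.n3#Dataset"),
    },
}


def datasets_for(resource_uri):
    """Follow the redirect to the document, then its dcterms:isPartOf links."""
    document_uri = REDIRECTS.get(resource_uri, resource_uri)
    triples = DOCUMENTS.get(document_uri, set())
    return sorted(
        o for s, p, o in triples
        if s == document_uri and p == DCTERMS_IS_PART_OF
    )


print(datasets_for("http://lastfm.rdfize.com/artists/Black+Sabbath"))
```

Note that the link is looked up with the document URI as subject, not the URI of the thing being described, which is exactly the distinction drawn above.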

What some people don’t like so much, is the implication that their dataset consists of documents, when what they really want to say is that their dataset consists of descriptions of resources.

The conceptual problem, if there is one, is that here the document URI is identifying an RDF/XML document, not the graph of RDF data encoded in that document. So, if you wanted to explicitly state that the graph, rather than the document, is part of the dataset, it could perhaps be done like this:

[ a <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
  <http://purl.org/vocab/frbr/core#embodiment> <http://lastfm.rdfize.com/?artistName=Black+Sabbath&output=rdf> ;
  dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset>
] .

But I’m really not too sure if that is either semantically correct, or in any way a more practically useful description than simply saying the document is part of the dataset.

We (the voiD guide authors) think that the <document> dcterms:isPartOf <dataset> pattern is the most pragmatic approach to making a dataset discoverable from a LOD document.
But we are also open to suggestions for improvement as we evolve the vocabulary and guide in line with popular usage and the requirements of LOD publishers.

What do you think?

A MalBestPractice with RDF: Making Assumptions

Michael Hausenblas has a new blog post listing some common malpractices when working with RDF.

RDF is a model, not a format

I especially agree with his point about “Thinking of RDF on the serialisation level” (as a malpractice) - grabbing values from RDF/XML or RDFa with XPath or regexes is not wise. It is making an unsafe assumption about the stability of the serialisation. In fact, if you are writing a Linked Data application, there are very few assumptions you can safely make, about either the serialisation, or the model.

RDF isn’t SQL, XML, OO …

So maybe my favourite MalBestPractising is trying to treat RDF too much like some other software paradigm: too much like a relational database, too much like OO, too much like XML. It’s enticing to write software that treats RDF as if it were something mainstream software developers are more familiar with, and to use the same kinds of techniques and shortcuts. But these shortcuts often rely on assumptions that can’t be made about RDF data (at least, not proper, organic, free-range RDF from the web). You can’t assume that the same RDF graph will be serialised the same way as last time. You can’t assume that the http://xmlns.com/foaf/0.1/ namespace will always be bound to the foaf prefix. You can’t assume that a resource will, or won’t, have a particular property just because it has another property or a particular type. If you don’t know that a statement exists, you can’t assume it doesn’t, only that you don’t know about it. Et cetera.

Not making these assumptions can be tedious, and at times problematic, but ultimately, the fewer assumptions you write into your code, the more interesting, open, and ‘webby’ your application can be.
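Two of these assumptions are easy to illustrate. In the sketch below (hypothetical helper names, plain tuples standing in for triples), terms are compared by their full URIs rather than by whatever prefix a document happens to use, and a missing statement is reported as unknown rather than false.

```python
def expand(qname, prefixes):
    """Expand a prefixed name using the bindings of the document it came from."""
    prefix, _, local = qname.partition(":")
    return prefixes[prefix] + local


def is_stated(triples, triple):
    """True if the triple is stated; None (unknown, not False) otherwise."""
    return True if triple in triples else None


# Two documents binding prefixes differently: doc_b uses 'f' for FOAF and,
# perversely but legally, binds 'foaf' to something else entirely.
doc_a = {"foaf": "http://xmlns.com/foaf/0.1/"}
doc_b = {"f": "http://xmlns.com/foaf/0.1/", "foaf": "http://example.org/not-foaf#"}

same_term = expand("foaf:name", doc_a) == expand("f:name", doc_b)
same_prefix_same_term = expand("foaf:name", doc_a) == expand("foaf:name", doc_b)

graph = {("http://example.org/Jimmy", expand("foaf:name", doc_a), "Jimmy")}
known = is_stated(graph, ("http://example.org/Jimmy", expand("f:name", doc_b), "Jimmy"))
unknown = is_stated(graph, ("http://example.org/Jimmy", expand("f:nick", doc_b), "Jim"))

print(same_term, same_prefix_same_term, known, unknown)
```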

Less assumption, less code, more data, more web

The huge game-changing thing about web development with the Web of Data though, is not the set of assumptions you can’t make, but the assumptions you don’t have to make. Thanks to the Follow Your Nose principle espoused by Linked Data, you don’t need to write assumptions about your data into your code; you can instead let the application “follow its nose” to find out more about the data.

You can follow vocabulary term URIs to find out how they can be used, how they can be labeled, and what inferences can be drawn from their use. You can follow owl:sameAs and rdfs:seeAlso links to find out more about a resource. You can use semantic index services like Sindice to find occurrences of a URI or keyword across the Web of Data. You can follow dcterms:isPartOf links from RDF documents back to voiD Datasets, which will often have links you can follow to licenses that tell you how the data can be used, and to other services (such as SPARQL endpoints).

The more data is published, not just within datasets, but about datasets, and about services, the more we can write applications that open up to the web, and the fewer lines of code we will need to do it!

Vocabify: Instance Data -> Vocab

One thing about writing RDF vocabularies that occurred to me while listening to people talk at VoCamps (Oxford and Galway) is that typically what you are trying to do isn’t defining new terms, it’s modeling data, and at some stage in the modeling you discover you need to write a new vocabulary. Vocabulary authors often want to describe how their terms can best be used with existing complementary vocabularies, like FOAF and Dublin Core, but the only commonly practiced way of doing so is to put it in human-readable form in the documentation annotations. In voiD, we wrote a guide, principally because we wanted to describe how the terms ought to be used together with existing vocabulary terms.

In tandem with this thought, when sketching out vocabularies myself, I tend not to start out by defining Classes and Properties, which is both tediously repetitive, and a step removed from the data-modeling (which is what I’m actually trying to do in the first place). Instead, I define a prefix for a new namespace, and pretend a vocabulary already exists at it. Probably quite a lot of people do this. I think of them as “pretend schemas”; I’ve heard ldodds call them “just in time schemas” (only bother to write it when someone actually asks to see it).

So last night I coded up Vocabify, which you can feed some instance data that uses your “just in time vocabulary”, tell it which namespace URI is the pretend one, and it will generate a schema from the instance data, which you can then edit and publish.

The classes and properties are also linked to the instances they are generated from with ov:exampleResource, so it is clear to readers how they can be used together with other properties.
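The general idea can be sketched in a few lines. This is not Vocabify's actual code, just an illustration of the technique under some assumptions: triples are plain tuples of strings, the function name `sketch_schema` is made up, and `ov:` is assumed to expand to the OpenVocab namespace.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_IS_DEFINED_BY = "http://www.w3.org/2000/01/rdf-schema#isDefinedBy"
# Assumption: the ov: prefix is bound to the OpenVocab namespace.
OV_EXAMPLE_RESOURCE = "http://open.vocab.org/terms/exampleResource"


def sketch_schema(instance_triples, namespace):
    """Derive schema triples for every term used from the pretend namespace."""
    schema = set()
    for s, p, o in instance_triples:
        # Anything used in predicate position is a property.
        if p.startswith(namespace):
            schema.add((p, RDF_TYPE, RDF_PROPERTY))
            schema.add((p, RDFS_IS_DEFINED_BY, namespace))
            schema.add((p, OV_EXAMPLE_RESOURCE, s))
        # Anything used as the object of rdf:type is a class.
        if p == RDF_TYPE and o.startswith(namespace):
            schema.add((o, RDF_TYPE, RDFS_CLASS))
            schema.add((o, RDFS_IS_DEFINED_BY, namespace))
            schema.add((o, OV_EXAMPLE_RESOURCE, s))
    return schema


EX = "http://example.org/schema/"  # the pretend namespace (hypothetical)
data = {
    ("http://example.org/things/WOD-PD", RDF_TYPE, EX + "Conference"),
    ("http://example.org/things/WOD-PD", EX + "startDate", "2008-10-22"),
    ("http://example.org/things/WOD-PD", RDFS_LABEL, "Web of Data"),
}
schema = sketch_schema(data, EX)
for triple in sorted(schema):
    print(triple)
```

Terms from other namespaces (here rdfs:label) are left alone, since they already have a schema of their own.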

Metamorph Open Source project for Semantic Converter Web Service

I’ve published the code behind the Talis Convert Service (production release at stable URL coming soon) as an open source project on Google Code, called Metamorph.

Metamorph is a service aimed at semantic web developers. It is much like triplr, babel, swignition and any23 (please leave a comment pointing to any other similar services).

You give it a(n http) URI, an (optional) input format, and an output format, and it will fetch the document from the web, and convert it into the output format.

Understood input values include:

  • Semantic HTML (RDFa, eRDF, microformats, POSH)
  • RDF (XML, Turtle, JSON)
  • SPARQL-XML
  • Facet XML (the response format of the facets service available on all platform stores)

Output for all input formats can be:

  • JSON
  • JSONP
  • HTML

If the input is some form of RDF, you can also ask for:

  • RDF (XML, Turtle, JSON; the default HTML output is rendered as RDFa)
  • RSS 1.0
  • TriX
  • Exhibit (web page, JSON, JSONP)

In addition, if the input is an RDF format, you can specify multiple data URIs, and the results will be merged in the output document. For instance, this conversion merges data from two of my homepages, and a Turtle file.

I’m thinking about removing the TriX output, as I’m not sure it would be used by anyone - the reason I didn’t bother to write a parser for it is that I haven’t seen any data published as TriX in the first place.

I welcome any input on what else would be useful from this web service. I suspect that more output options, while fairly easy to add, would not be very useful. More input options may be useful, but perhaps not significantly so.

I suspect what might be more useful, and more likely to distinguish this from similar RDF converter services, are graph transformation services, which might include:

  • Diffs
  • Intersects
  • Smushing
  • Augmenting on property and class type URIs with labels and comments, perhaps retrieved from SchemaCache

Metamorph is coded in PHP, and uses ARC for parsing RDF and HTML, and serialising RDF/XML and Turtle.

Please use the issue tracker for raising any bugs or feature requests.

voiD: a Vocabulary of Interlinked Datasets

As technological advances allow the production and dissemination of information to scale out, old methods for navigating the information become inadequate, and we need new means to cope with the greater scale of information available.

With the rise of printing in the 16th century, library collections flourished, making more ideas and information available to more scholars than ever before. Yet to know what books a library contained, scholars had to either physically visit the library (and browse the shelves, or consult a manuscript catalogue), or make enquiries by letter.

Frontispiece of the first printed library catalogue

In 1595, Leiden University innovated by becoming the first institution to make their library’s catalogue available in print. Just as printing had made the editions within a library far more widely available, printing a book about the library’s collection, brought awareness of the library and its contents to a greater audience. Now, scholars all across Europe could tell if Leiden University’s library had the information they needed. Scholars had more information about what books were available, and Leiden’s international reputation was bolstered. Other libraries followed suit by printing their own catalogues, and those library catalogues could be collected. Scholars could compare the strengths and purposes of multiple libraries from a single location.

When the Linked Open Data movement began gaining ground in 2007, there were relatively few large RDF datasets available on the web. If you followed the right blogs and mailing lists, you knew which datasets were available. As the LOD Cloud grows (and manually drawing it becomes less and less practical), it becomes apparent that the number of datasets is outgrowing our methods for discovering them. Just as it made sense for libraries in the 16th century to use the technology of print to publish descriptions of their collections, it is natural to use RDF to publish descriptions of datasets available on the web. Just as printed catalogues brought library collections to new audiences, and enabled new uses, RDF descriptions will bring datasets to new audiences (machines!), making them more findable, and enabling new uses. All you need is the vocabulary to describe datasets with.

voiD interlinking dataset diagram

voiD is a vocabulary dataset publishers can use to describe their datasets: their subject areas, their access mechanisms (eg: APIs, SPARQL endpoints, data dumps), their licensing, their provenance, how they link to other datasets, which vocabularies are used within them, and statistics relating to their contents.

As well as the vocabulary, there is the voiD guide, where the authors of voiD (Jun Zhao, Michael Hausenblas, Richard Cyganiak, and myself [Keith Alexander]) explain how to create voiD descriptions combining terms from voiD with other useful vocabularies, publish voiD, and query voiD.

Feedback on both the vocabulary, and the Guide, will be gratefully received at void-rdfs-internals@googlegroups.com.

paggr wins at ISWC

Benjamin Nowack, Semantic Web developer and innovator par excellence, and author of the ARC RDF library for PHP (which we have mentioned on this blog more than once), deservedly won the ISWC2008 Semantic Web Challenge for his application: paggr.

Paggr uses Benji’s scripting language extension to SPARQL, “SPARQL SCRIPT”, to define widgets which can pull in semantic data from sources across the web, mesh it up, and render it on the page.

Well done Benji! We’re all looking forward to the public beta :)

paggr wins the semantic web challenge 2008

Authoring RDF data with SPARQL

Yesterday Yves Raimond and I presented a tutorial at WOD-PD where we created some Turtle data and used my online semantic converter tool to convert the data to RDF/XML and POST it to the platform store we set up for the tutorial (wod-pd-sandbox).

In fact though, every SPARQL endpoint that supports CONSTRUCT is already a Turtle -> RDF/XML converter. You can write Turtle with no variables in the CONSTRUCT graph, leave the WHERE graph pattern empty, and you will get back RDF/XML.

eg:

PREFIX ex: <http://example.org/>
CONSTRUCT {
  ex:Jimmy ex:eat ex:World .
}
 WHERE {}

returns

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/" >
  <rdf:Description rdf:about="http://example.org/Jimmy">
    <ex:eat rdf:resource="http://example.org/World"/>
  </rdf:Description>
</rdf:RDF>
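Wrapping some Turtle triples up as a query like this is easy to automate. The sketch below only builds the query text and a SPARQL protocol GET URL; it does not send anything. The function names are made up, and the endpoint URL is an assumption about where the tutorial store's SPARQL service lives. It also assumes the triples stick to syntax that Turtle and SPARQL share (so no @prefix lines inside the triples).

```python
from urllib.parse import urlencode


def construct_query(turtle_triples, prefixes):
    """Wrap variable-free triples in a CONSTRUCT with an empty WHERE pattern."""
    prefix_lines = [f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items()]
    body = ["CONSTRUCT {", f"  {turtle_triples.strip()}", "}", "WHERE {}"]
    return "\n".join(prefix_lines + body)


def sparql_get_url(endpoint, query):
    """SPARQL protocol GET request URL: the query goes in the 'query' parameter."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Assumption: hypothetical location of the store's SPARQL service.
ENDPOINT = "http://api.talis.com/stores/wod-pd-sandbox/services/sparql"

query = construct_query("ex:Jimmy ex:eat ex:World .", {"ex": "http://example.org/"})
url = sparql_get_url(ENDPOINT, query)
print(query)
print(url)
```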

You can also use CONSTRUCT to create new data inferred from existing data. For instance, I wanted to add some triples about the conference, and I knew that everyone in the store with a URI in the store’s own namespace had been following the tutorial, and so was also attending the conference. So I made this query, and then POSTed the results into the store:

PREFIX schema: <http://api.talis.com/stores/wod-pd-sandbox/items/Schema/>
PREFIX sandbox: <http://api.talis.com/stores/wod-pd-sandbox/items/Things/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

CONSTRUCT {
  schema:Conference a rdfs:Class ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "Conference" .

  schema:startDate a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "start date" .

  schema:endDate a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "end date" .

  schema:attendee a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "attendee" ;
    owl:inverseOf schema:attended .

  schema:attended a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "attended" ;
    owl:inverseOf schema:attendee .

  sandbox:WOD-PD a schema:Conference ;
    rdfs:label "Web of Data" ;
    schema:startDate "2008-10-22" ;
    schema:endDate "2008-10-23" ;
    schema:attendee ?person .

  ?person schema:attended sandbox:WOD-PD .
}
WHERE {
  ?person a <http://xmlns.com/foaf/0.1/Person> .
  FILTER(REGEX(STR(?person), "sandbox/items/People/"))
}

I used PREFIX to declare a prefix for a couple of namespaces with the store’s contentbox URIs - this meant that these URIs would dereference and work as Linked Data - 303ing to their RDF descriptions. This is a really nice feature of the platform, and makes it easy to mint new URIs that will play nice on the semantic web.

You might also have noticed that there are some new properties and classes defined there in the CONSTRUCT. This isn’t absolutely ideal - there is no documentation, and the terms are unlikely to be used again - but on the other hand, the descriptions are dereferenceable according to the principles of Linked Data, and just as persistent as the data they describe. Moreover, as Richard Cyganiak said today - if you worry about doing RDF ‘right’ to the extent that it stops you doing RDF, you’re not doing it right.

Store Admin Interface

If you have a Talis store, or even if you’re just interested in browsing around existing Talis stores, you might be interested in an admin interface I’ve been working on.

Once you have selected a store, you can browse resources by type (rdf:type), search across the contentbox index, edit resources, view pending jobs and send new ones, import data, and configure the field-predicate mapping for your stores.

Please send bug reports and feature requests to keith dot alexander at talis.com

If you do want a Talis store, just ask in #talis on irc.freenode.net, or email danny dot ayers at talis.com

Batch Changesets ARC Plugin

Platform Release 12 included a very useful new feature: the ability to send more than one changeset in a single POST to your store.

To generate a batch changeset from 2 versions of an RDF graph, you can use an ARC plugin called Talis_ChangeSetBuilderPlugin.

To use it:


// $before and $after hold the two versions of the graph to compare.
$args = array(
    'before' => $before, // can be RDF/XML, Turtle, or an ARC simpleIndex array
    'after'  => $after,  // can be RDF/XML, Turtle, or an ARC simpleIndex array
);
$cs = ARC2::getComponent('Talis_ChangeSetBuilderPlugin', $args);
// $store is assumed to be an already configured store object.
$cs_response = $store->get_metabox()->apply_versioned_changeset($cs);

The plugin also relies upon the IndexUtils Plugin. The easiest way to get them all set up is to change to your arc directory and do:


svn co http://n2.talis.com/svn/playground/kwijibo/PHP/arc/plugins/trunk/ plugins
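
Conceptually, building a batch changeset from two versions of a graph comes down to set differences. The sketch below illustrates only that idea and is not the plugin's code: triples are plain tuples, blank nodes are ignored, the function names are made up, and it assumes a batch holds one changeset per changed subject.

```python
def diff_graphs(before, after):
    """Triples to remove and triples to add to turn 'before' into 'after'."""
    before, after = set(before), set(after)
    return {"removals": before - after, "additions": after - before}


def batch_by_subject(before, after):
    """Group the differences by subject: one entry per changed resource."""
    diff = diff_graphs(before, after)
    batch = {}
    for kind in ("removals", "additions"):
        for s, p, o in diff[kind]:
            entry = batch.setdefault(s, {"removals": set(), "additions": set()})
            entry[kind].add((s, p, o))
    return batch


LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
before = {
    ("http://example.org/a", LABEL, "Old label"),
    ("http://example.org/c", LABEL, "Unchanged"),
}
after = {
    ("http://example.org/a", LABEL, "New label"),
    ("http://example.org/b", LABEL, "Brand new"),
    ("http://example.org/c", LABEL, "Unchanged"),
}
batch = batch_by_subject(before, after)
for subject in sorted(batch):
    print(subject, batch[subject])
```

Resources whose descriptions are identical in both versions (here http://example.org/c) produce no entry at all.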