
Searching the BBC Data in the Talis Platform

I’ve previously blogged about how easy it is to create a custom search index using the Platform. So obviously during the process of loading the BBC programmes and music data into the Platform we’ve used this feature to build a search engine across their data.

In this post I wanted to show a few example queries and then review how we’ve configured the search indexes so you can not only get the most from the feature, but also see how it can be used against real-world data.

Sample Queries

Here are some sample queries. The Platform is more of a search engine tool-kit than a search engine per se: the results aren’t a human-readable web page, they’re an RSS 1.0 document that contains enough structured metadata about each item in order to build a presentation of the results. And where additional metadata is required, this can be extracted using the describe service, additional searches, augmentation or a SPARQL query.

However, for the purposes of this article it’s enough to view the examples in your browser. Application developers will want to dig into the underlying markup to see what extra data is included.

  • A search for “Banksy”
  • A search for “The Prodigy” — returning the artist, the dbpedia entry, and episode titles and descriptions in which they are mentioned
  • A search for “Terry Pratchett” — again produces a mixture of different types.
  • A search for “Prodigy” limited to things that are of type “http://purl.org/stuff/rev#Review” — Results.
  • A facetted search for “Prodigy” grouping the results based on their RDF type — Results. This shows us that we have results not only in episodes but in a variety of other types too. We can drill down into these to form the following search:
  • A search for “Prodigy” limited to Music Segments. Results.

If you want to try out your own queries, then use this simple form.

The Configuration

To show how we’ve configured the Field Predicate Map and Query Profile for the BBC Backstage store, I’ve uploaded them to our public SVN: fmap.rdf and queryprofile.rdf

Looking at the Field Predicate Map, you can see we’ve configured the Platform store to index the key predicates in the BBC data, including titles, labels, descriptions and synopses. You can use any of the named fields in the configuration to refine searches to specific predicates in the data, allowing construction of an “advanced search form”. E.g. we can search for name:"Stephen Fry" to find a person called Stephen Fry (results).
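To make the fielded search a little more concrete, here is a minimal Ruby sketch that builds search URLs. It assumes the store is named bbc-backstage and that searches go to the store's items endpoint with a query parameter; search_url is a helper name of my own, not part of any client library.

```ruby
require 'uri'

# Build the URL of a keyword search against a store's items endpoint.
# When `field` is given, the search is restricted to that named field from
# the Field Predicate Map, using the field:"phrase" form described above.
# ASSUMPTION: the /items?query= URL pattern and the store name.
def search_url(store, terms, field = nil)
  phrase = terms.include?(' ') ? %("#{terms}") : terms
  query = field ? "#{field}:#{phrase}" : phrase
  "http://api.talis.com/stores/#{store}/items?" + URI.encode_www_form(query: query)
end

puts search_url('bbc-backstage', 'Banksy')
puts search_url('bbc-backstage', 'Stephen Fry', 'name')
```

An advanced search form would only need to map each form input onto one of the field names in the Field Predicate Map and call something like this.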

The RDF type property is also included in the Field Predicate Map to allow us to limit searches to particular types of resource. It also enables us to do facetted searches based on type, giving us an alternate view of the data. It’s easy to see how that functionality could be used to build additional options for restricting the search results presented in a user interface.

To configure the relevance ranking we’ve chosen to boost hits in “labels” (names, labels, titles) over “descriptions” (description, synopses, review). We could easily change the boosting to favour one or other type of predicate to further tweak the results. But this configuration provides a reasonable set of search results for the tests we’ve done. Let us know how you get on and whether you think any of this should be changed. We’re happy to alter the configuration to make sure that people can get the most from the BBC data.

Fishing for BBC Data using Augmentation

In some of my recent talks I’ve used the metaphor of streams, pools and reservoirs for describing the flow and collection of data across the web. I usually refer to some of the different forms of data extraction that we support on the Platform, which cover keyword searching as well as more structured queries.

Another form of data extraction is the Augmentation Service, which might be described as “fishing for data, using URIs as bait”. I thought I’d put together a little illustration that shows the potential for this kind of data extraction, as it’s both powerful and simple to use: so simple that you don’t need to write any queries at all.

Let’s look at a sample RSS 1.0 feed that contains a review of an episode of Dr Who. For brevity, I’ll only include the metadata for the single item in the feed:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
  <rev:text>A classic episode of Dr Who...</rev:text>
  <foaf:primaryTopic rdf:resource="http://www.bbc.co.uk/programmes/b0074gpl#programme"/>
</item>

The item has the standard RSS 1.0 elements for title and link, but as the item is also a review, it includes some additional metadata using the review vocabulary. The relationship between the review item and the Episode that is being reviewed is made using the foaf:primaryTopic property. The precise vocabularies don’t really matter; the important thing is that there is a reference to a BBC /programmes URI: this is our bait.

The Augmentation Service allows the URL of an RSS 1.0 feed to be passed in as a parameter. You can use the form provided by the augment service on the BBC Backstage store and paste in the URL of the sample RSS 1.0 feed, or click here to review the results. Within the browser you won’t see that a great deal has changed, although you should see that the results are themselves an RSS 1.0 feed. What the Augmentation Service does is process an RSS feed, augmenting the metadata in the feed items with data present in the Platform store.

Here’s the same RSS item after it’s been augmented; the additional metadata is everything nested inside the foaf:primaryTopic element:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <foaf:primaryTopic>
 <ns.0:Episode rdf:about="http://www.bbc.co.uk/programmes/b0074gpl#programme">
  <ns.0:medium_synopsis>In an old, abandoned house, the Weeping Angels wait.
  Only the Doctor can stop them, but he's lost in time.</ns.0:medium_synopsis>
  <rdf:type>
    <rdf:Description rdf:about="http://purl.org/ontology/po/Episode"/>
  </rdf:type>
  <ns.0:position>10</ns.0:position>
  <ns.0:short_synopsis>Only the Doctor can stop the Weeping Angels, but he's lost in time.</ns.0:short_synopsis>
  <ns.0:genre>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/genres/drama/scifiandfantasy#genre"/>
  </ns.0:genre>
  <ns.0:microsite>
    <rdf:Description rdf:about="http://www.bbc.co.uk/doctorwho/"/>
  </ns.0:microsite>
  <ns.0:version>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b0073km9#programme"/>
  </ns.0:version>
  <foaf:depiction>
    <rdf:Description rdf:about="http://www.bbc.co.uk/iplayer/images/episode/b0074gpl_512_288.jpg"/>
  </foaf:depiction>
  <ns.1:label>Blink</ns.1:label>
  <ns.0:masterbrand>
    <rdf:Description rdf:about="http://www.bbc.co.uk/bbcone#service"/>
  </ns.0:masterbrand>
  <dc:title>Blink</dc:title>
 </ns.0:Episode>
</foaf:primaryTopic>
<rev:text>A classic episode of Dr Who...</rev:text>
<rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
</item>

As you can see the feed now includes all of the key metadata about the episode, including its title, a synopsis, a link to a depiction of the episode, and to the Dr Who microsite on the BBC. All without writing any queries.

The trigger for the augmentation is simply the presence of a URI in the feed that is also present in the RDF in the Platform store. If the URI is not found then it is ignored. But if the URI is present then a description of that resource is automatically added to the RSS feed. In formal RDF terms that description is the Concise Bounded Description of the resource. More simply, it will be all simple literal properties associated with the resource (e.g. the title and the synopsis) plus links to any related resources (e.g. the microsite, the genre). The end result is a feed that has been either completely or partially enriched against the data.
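The lookup rule can be mimicked in a few lines of Ruby. This is only a toy sketch of the idea, not how the Platform implements it: the "store" is an in-memory hash, feed items are plain hashes, and the augment helper and the property keys are my own labels for illustration.

```ruby
# Augment each item with the description of any URI the store knows about.
# Items are hashes of property => value. A value that is a key in the store
# is replaced by that resource's description; anything else is left alone,
# mirroring "if the URI is not found then it is ignored".
def augment(items, store)
  items.map do |item|
    item.transform_values do |value|
      store.key?(value) ? { 'about' => value }.merge(store[value]) : value
    end
  end
end

# A toy store holding two properties of the "Blink" episode from the feed above.
store = {
  'http://www.bbc.co.uk/programmes/b0074gpl#programme' => {
    'title' => 'Blink',
    'position' => '10'
  }
}

feed = [
  { 'title' => 'Review of "Blink"',
    'foaf:primaryTopic' => 'http://www.bbc.co.uk/programmes/b0074gpl#programme' },
  # Hypothetical second item whose URI the store has never heard of.
  { 'title' => 'Some other review',
    'foaf:primaryTopic' => 'http://www.example.org/unknown' }
]

augment(feed, store).each { |item| p item['foaf:primaryTopic'] }
```

The first item comes back with the episode description in place of the bare URI; the second is returned unchanged, which is the "partially enriched" case.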

This kind of data augmentation is uniquely possible with RDF because of its reliance on URIs for global identifiers. It makes dipping into a pool of data very easy to do. It’s also possible to augment a feed against multiple stores, pipelining the augmentation across multiple datasets to gather up all of the relevant data. As the output of a search against a Platform store is also RSS 1.0, you can enrich search results against multiple stores starting from an initial keyword search.

You can also see how this kind of enrichment can be used as part of, e.g. a Yahoo Pipeline. This is the primary reason why the service has been initially designed to work on RSS 1.0 feeds: it’s well supported; easy to generate; and of all the varieties of RSS, RSS 1.0 is processable as both an RDF and an XML vocabulary, making it easy to process in this context. We are intending to expand the support to cover generic RDF input and output, and other flavours of RSS.

In the meantime, happy data fishing!

Augmenting Last.fm Data with BBC data on the Talis Platform

A short while back, I created a Linked Data wrapper on the Last.FM API for Events and Artists. The artist data links to the BBC’s data about each artist using owl:sameAs.

Now that the BBC RDF is available in a Talis Platform store, I can put some of my Last.FM data into a store (it’s currently generated on the fly from the Last.FM API), search on it, and then augment it with data from the BBC.

So I put some Last.FM data into the Sandbox1 store.

Now I can search on it with the items query endpoint like:

http://api.talis.com/stores/sandbox1/items?query=Black

This gives us the results as RSS 1.0, which is also RDF/XML, and contains a graph with 12 resources in it.

We can now pass the URI of this (or any RSS 1.0) document to the BBC-Backstage store’s Augment Service like this:

http://api.talis.com/stores/bbc-backstage/services/augment?data-uri=http%3A%2F%2Fapi.talis.com%2Fstores%2Fsandbox1%2Fitems%3Fquery%3DBlack

The Augment service will look at the URIs in the RSS results, and add DESCRIBEs for any of those URIs that it finds in its own store, giving you back the RSS augmented with BBC data.
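For anyone scripting this, the augment URL is just the feed URL form-encoded into the data-uri parameter. A minimal Ruby sketch, where augment_url is a helper name I've made up and the second store name is a placeholder:

```ruby
require 'uri'

# Build the URL that asks a store's Augment service to enrich the RSS 1.0
# document found at feed_url.
def augment_url(store, feed_url)
  "http://api.talis.com/stores/#{store}/services/augment?" +
    URI.encode_www_form('data-uri' => feed_url)
end

search = 'http://api.talis.com/stores/sandbox1/items?query=Black'
once = augment_url('bbc-backstage', search)
puts once

# Because the output is itself RSS 1.0, the result can be fed to another
# store's Augment service. 'another-store' is a hypothetical store name.
puts augment_url('another-store', once)
```

The first URL printed is the same as the one shown above; the second shows how augmentation could be pipelined across stores.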

So the graph we get back now contains 15 resources, where the BBC-Backstage store has found descriptions for 3 of the URIs in the original RSS.

For further information, see Leigh Dodds’s slides on Getting Started with the Talis Platform.

Understanding the Big BBC Graph

In the lead up to the announcement of the BBC SPARQL endpoint trials I’ve spent quite a bit of time working with and exploring the BBC /programmes and /music dataset. I thought it would be useful to write up some of this to help out those of you looking to explore the data using the Talis Platform SPARQL endpoint. (Tip: use the newer SPARQL form for a better user experience when exploring the data.)

What’s in the Store?

Currently the Platform store includes metadata for over 360,000 Radio and TV programme Episodes along with information on which Versions of those programmes have been broadcast, including the time and channel on which they were shown. Information is also available for 6,500 Series and 5,500 Brands, and their relationships; for more on that see below.

For the music data, the endpoint includes all of the artist and album metadata currently available from the BBC Music website, which comprises over 23,000 solo artists, 11,000 groups, and 25,000 albums. There are also nearly 4,500 album reviews.

This core dataset is approximately 20 million triples, and this is obviously growing as new episodes and broadcasts are made, and as we crawl that additional data. But that’s not all…

The artist metadata refers to dbpedia entries via owl:sameAs links, and this immediate context has also been included, providing a single location to query and find all the additional metadata about a recording artist. As the metadata on the BBC programmes website gets updated to include dbpedia links, then this will also get included. We’re working with the BBC to get some of these links in place as soon as possible.

The /programmes team recently updated the website to begin exporting “segment” data. This describes what artist was being played in a specific segment of a broadcast (currently limited to Radio 2 & 6), providing links between the programmes and music datasets. Increasingly it really is just one large graph that the BBC are producing.

What Ontologies are Used?

The core of the dataset is modelled using the Programmes and Music ontologies. There is also the usual sprinkling of Dublin Core and FOAF terms to capture titles, describe people, provide images for episodes, etc. The RDF Review vocabulary has been used to model the album reviews.

The programmes website includes some content categories for genres and formats. These are modelled in the dataset as SKOS concepts. There seems to be some nascent support in the data for capturing metadata about people and places appearing in programmes. At the moment these are also modelled using SKOS.

That comprises the core data. Beyond that there are a number of different terms used in the dbpedia portions of the dataset. Check the dbpedia documentation for more information.

Understanding Brands, Series, Episodes

To get the most from the BBC programmes data you’ll need some understanding of some of the variations in the graph to ensure that you don’t accidentally exclude data in your queries. And if you’re a modelling geek like me it’s interesting in its own right! Any mistakes in the following are all my own, apologies to the BBC folk.

A Brand is a top-level concept that defines a collection of works. It’s the resource that ties together Series and Episodes. Dr Who is a brand, as is the BBC News, and The Catherine Tate Show. A Series, as you’d expect, is a run of Episodes, e.g. “Series 1 of The Wire”. And an Episode is similarly intuitively named.

We’re all already familiar with the basic relationships between these concepts. A Brand (“Red Dwarf”) may be related to a number of Series (“Red Dwarf Series 1”) and a Series is composed of Episodes (“Red Dwarf, Series 1, Episode 1”). But there are a few wrinkles that are worth pointing out, as they can impact the way you write your SPARQL queries. Thanks to Michael Smethurst for giving me a run-down of some of these!

Firstly a Brand may not be broken down into Series at all. The BBC News, for example, is simply a continuous stream of Episodes. Radio shows are similar.

Similarly a Series of Episodes may not necessarily be associated with a Brand. It may be a one-off run of Episodes, e.g. a short documentary series like Incredible Animal Journeys.

Some Episodes are not associated with either a Series or a Brand, e.g. films like Lady In the Water.

And there’s also the more interesting relationship in which one Series is associated with another. For example “Waking the Dead” is divided up into Series (e.g. Series 5), which themselves contain other Series (covering a specific story line, e.g. Towers of Silence) and then individual Episodes (Part 1).

(As an aside, this is the kind of flexibility that makes RDF such a great tool for modelling real-world data. I’ve used similar approaches in the past to model bibliographic metadata, throwing out hierarchies and simply connecting together chunks of content in whatever structure is most suitable.)

An Episode may also have more than one Version. It is at the Version level that information such as the sound format or duration of the show is captured; after all, there may be many different manifestations of the same episode. Versions are also associated with Broadcasts, which capture the date, time and channel (“masterbrand” in the Programmes ontology) on which the programme is aired. A Version of an Episode may be broadcast several times.

Finally at the most fine-grained level, there are Timelines that describe the start and end time of a specific broadcast.
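To show how these wrinkles affect a query, here is a hedged Ruby sketch that builds a SPARQL query using OPTIONAL so that Episodes without a Series or Brand are not excluded. Several things here are assumptions on my part: the po:episode property name and the dc: namespace are written from memory of the Programmes Ontology and Dublin Core, the services/sparql path is guessed by analogy with the augment service URL, and the helper names are my own. Check them before relying on this.

```ruby
require 'uri'

# SPARQL listing Episodes, with their Series and Brand where those exist.
# The OPTIONAL blocks keep Episodes that have no Series and/or no Brand.
# Only direct links are followed: a Brand reached via a Series, or a Series
# nested inside another Series, would need further patterns.
# ASSUMPTIONS: po:episode and the dc: namespace URI.
def episode_query(limit = 10)
  <<~SPARQL
    PREFIX po: <http://purl.org/ontology/po/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?episode ?title ?series ?brand WHERE {
      ?episode a po:Episode ;
               dc:title ?title .
      OPTIONAL { ?series a po:Series ; po:episode ?episode . }
      OPTIONAL { ?brand a po:Brand ; po:episode ?episode . }
    }
    LIMIT #{limit}
  SPARQL
end

# ASSUMPTION: the SPARQL service lives at /services/sparql on the store.
def sparql_url(store, query)
  "http://api.talis.com/stores/#{store}/services/sparql?" +
    URI.encode_www_form(query: query)
end

puts sparql_url('bbc-backstage', episode_query)
```

Writing the Series and Brand patterns as required rather than OPTIONAL would silently drop films and other one-off Episodes from the results.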

Application Ideas

During my expeditions through the Big BBC Graph (“you’re in a maze of twisty little predicates, all alike…”) I’ve come up with a few application ideas that it would be interesting to put together. I thought I’d throw these out and see if anyone wants to pick them up.

Programme Reviews. It’d be easy to build a mashup of the BBC programmes data and something like Revyu (which also has a SPARQL endpoint) to allow someone to review a programme that they watched last night. Note, that as our crawling will be lagging behind the live site until we’ve implemented real-time updates, there will be a lead time between something being aired and in the Platform for reviewing.

PVR Integration. There are a number of open source PVR solutions out there. Could some of these be updated to automatically pull in additional data from the endpoint to improve electronic programme guides?

Geographic Overlays. The interconnections between radio programmes, artists and their locations offer an opportunity to build some mapping mashups, using either Google Maps or Earth. For example it ought to be possible to lay out the geographic spread of artists played by different BBC radio programmes and stations. Interested in music from a particular country or region? (Maybe you’re planning a trip there and want to pick up on the local vibe.) Then use a map to home in on radio programmes that are most likely to play those artists.

Fan Widgets. The ability to extract data from the endpoint using SPARQL and JSON means that it’s really easy to create little widgets to include programme data on external web pages. Could something like the Doctor Who Tardis Index File be enriched by widgets that came straight from the BBC database? Throw in additional annotations from the community and you could make some really interesting embeddable gadgets. Of course there’s also the other direction: if fan communities start using BBC identifiers then the BBC may be able to feed this crowd-sourced data back into their site, just as they’re doing with Wikipedia (via dbpedia).

Under the Talis Connected Commons scheme anyone can have free hosting on the Platform for public domain data, so if a fan community wanted to organize itself around creating additional annotations for BBC programmes (how about character lists? mood assessment? scene breakdowns?) then these can be stored in the Platform for free, and then mashed up with the BBC data on the server-side using features like the Augmentation service, or on the client-side using SPARQL and JSON. Lots of potential there.

Summary

Hopefully that provides a good overview of the BBC linked data graph that we’re now hosting in the Talis Platform. There should be sufficient pointers here, and in some of the example queries and demos we’ve put together, to get you started. If not, then feel free to ask questions on the BBC Backstage mailing list, the n2-dev mailing list, or on IRC in #talis on irc.freenode.net.

SPARQL AJAX Client Library and Example

Over the past few years I’ve tinkered with a number of different implementations of an AJAX client library for SPARQL. Before a standard format for SPARQL JSON results was created, this involved having to jump through the extra hoops of parsing the XML format. But things are much easier now, especially when the JSON support is extended to include the results of CONSTRUCT and DESCRIBE queries.

My personal favourite SPARQL client library though is the one produced by Lee Feigenbaum, Elias Torres, and Wing Yung as part of their work on the SPARQL Calendar Demo.

While the sparql.js library only supports JSON it does have a few convenience features which I like, including global PREFIX bindings and some functions for automatically processing the JSON results to produce some simpler javascript objects (e.g. arrays and hashes) that simplify some scripting tasks and make code more readable.

Using this on the Platform is quite straight-forward, as you can upload this library, and any other related Javascript files directly into the Contentbox of your store. This not only avoids any cross-domain issues, but also means that you can deploy simple AJAX applications directly from a store.

I’ve put together a super simple demo that uses the NASA spaceflight data. The source code is here, and I’ve uploaded the two files into the n2-examples store contentbox, so you can play with the running application.

The demo simply fetches the name, homepage, description and launch date for every spacecraft launched in a particular year, also retrieving a link to a photo if there’s one available. The results are dropped into an HTML table for viewing.

The code is well commented so rather than repeat that here, you can look through the Javascript file that does the actual interaction. I’ve used JQuery to help with the DOM manipulation, etc. This is delivered through the Google JQuery CDN rather than the Platform. But the rest of the application is served directly from the Platform.

A rather easy and trivial example, but sometimes it’s useful to reiterate the basics. And if you want to incorporate the NASA spaceflight data in your own mashups, then you can do so easily by simply using the version of sparql.js in the space data store.

In my view, SPARQL + JSON + scripting languages like JS and Ruby hit a nice sweet spot for working with RDF, especially with the ability to bring together data from multiple sources using a single standard API.
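The kind of result simplification that sparql.js offers is easy to reproduce in other scripting languages. Here is a minimal Ruby sketch that flattens the standard SPARQL JSON results format into an array of plain hashes. The simplify helper is my own name, and the sample document is hand-written for illustration rather than taken from the spaceflight store.

```ruby
require 'json'

# Flatten SPARQL JSON results into an array of { variable => value } hashes,
# discarding the type, datatype and language detail of each bound term.
# Variables that are unbound in a row are simply absent from that row's hash.
def simplify(results_json)
  doc = JSON.parse(results_json)
  doc['results']['bindings'].map do |row|
    row.transform_values { |term| term['value'] }
  end
end

# Hand-written sample in the SPARQL JSON results format.
sample = <<~RESULTS
  {
    "head": { "vars": ["name", "launch"] },
    "results": { "bindings": [
      { "name":   { "type": "literal", "value": "Sputnik 1" },
        "launch": { "type": "literal", "value": "1957-10-04" } },
      { "name":   { "type": "literal", "value": "Explorer 1" } }
    ] }
  }
RESULTS

simplify(sample).each { |row| puts row.inspect }
```

The trade-off is the usual one: the flattened rows are pleasant to script against, but you lose the ability to tell a URI from a literal, so keep the raw bindings around if that matters.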

Note: Keith Alexander has written up some of his own experiments with playing with JQuery against the platform here and here. His JQuery plugin provides some additional Platform specific functionality.

Quick OpenCalais Hack

I’ve been doing some more work on the Ruby client for the Platform recently, and one of my main goals is to provide functionality that makes it easier to copy, merge, interlink and relate together datasets. So far I’ve been concentrating on providing some framework code to make it easier to mash-up data across SPARQL endpoints, but there are many more services that one might want to use when enriching a dataset.

One of those services is OpenCalais. I’ve played with the service on and off, and have previously built a Java client to the service to explore similar functionality using Java and Jena. But as I’m primarily working with Ruby at the moment, I thought I’d look for a Ruby client for Calais. Happily there is one on Github and it’s available as a Ruby gem.

Documentation is a bit light, and I had to jump through a few hoops to get it working, needing to manually install the curb gem and some native libraries. The following worked for me:

sudo apt-get install libcurl3-dev
sudo gem install curb
sudo gem install calais

With that installed it was a breeze to run a document through the OpenCalais service, and then store the resulting RDF in the Platform:


# Use OpenCalais to find entities in a document specified on the command-line, then store the results
# in a Platform store
#
# Set the following environment variables:
#
# TALIS_USER:: username on Platform
# TALIS_PASS:: password
# TALIS_STORE:: store in which data will be stored
# CALAIS_KEY:: Calais license key
require 'rubygems'
require 'pho'
require 'calais'

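store = Pho::Store.new(ENV["TALIS_STORE"], ENV["TALIS_USER"], ENV["TALIS_PASS"])
# Read the whole document named on the command line
content = File.read(ARGV[0])
# Ask OpenCalais for the entities in the text; the result is RDF
rdf = Calais.enlighten( :content => content, :content_type => :text, :license_id => ENV["CALAIS_KEY"])
# Store the RDF in the Platform and report the HTTP status of the request
resp = store.store_data(rdf)
puts resp.status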

The code is here, and here’s some sample input and sample output.

The code is pretty trivial and error handling is non-existent, but I was pleased with how easy it was to get some data out of OpenCalais and pushed into the Platform. A bit of SPARQL can then be used to do some analysis or further processing of the results.

So how do I plan to use this?

As a personal project I’m building out a dataset of NASA space-flight data, which will also include some metadata about astronauts and their roles on each mission. What I want to do is take some documents from the web and then store additional data to state relationships like “Buzz Aldrin is the foaf:primaryTopic of this document”.

The workflow I’m considering is using a Google custom search to give me a high-level index of content, e.g. selecting only the NASA websites. I can then run some representative searches to find documents and use OpenCalais to do entity extraction on each result. I can then store the OpenCalais RDF data in the store in a private graph, as I don’t want the raw data in the main dataset: I want to assert triples using my own ids and preferred vocabularies.

If the data is in a private graph then I can use the store’s multisparql service to do some SPARQL queries to match up the resources and CONSTRUCT new triples to store in the public graph.

I’ll post again with some more details on this as I progress, but I thought I’d start out by showing just how simple it is to mashup OpenCalais and the Talis Platform.

Don’t forget, if you want a Platform store to play with for development purposes then drop us a line.

Pho 0.5

I’ve just sent a quick announcement to the n2-dev mailing list to announce a new release of Pho, the Ruby client for the Talis Platform.

The CHANGES document includes a complete summary of the changes made since the 0.4.1 release. There have been quite a few of these as I’ve added more code for handling submission of directories of files to the Content Box; initial support for Changesets to support updates of RDF graphs, including support for “diffing” two graphs to generate the appropriate changesets for submission to the Platform; as well as a SPARQL protocol client and a number of utility classes to make working with SPARQL queries slightly easier.
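The idea behind “diffing” two graphs is simple set arithmetic. The following is a minimal sketch of that idea in plain Ruby, not Pho's implementation or API: triples are three-element arrays, a graph is a Set of them, diff_graphs is a name I've made up, and blank nodes (which make real graph diffing harder) are ignored.

```ruby
require 'set'

# Compare two graphs and report which triples a changeset would need to
# remove and which it would need to add.
# Triples are [subject, predicate, object] arrays; a graph is a Set of them.
def diff_graphs(before, after)
  { removals: before - after, additions: after - before }
end

episode = 'http://www.bbc.co.uk/programmes/b0074gpl#programme'
old_graph = Set[[episode, 'title', 'Blnk'], [episode, 'position', '10']]
new_graph = Set[[episode, 'title', 'Blink'], [episode, 'position', '10']]

changes = diff_graphs(old_graph, new_graph)
puts "remove: #{changes[:removals].to_a.inspect}"
puts "add:    #{changes[:additions].to_a.inspect}"
```

Only the mistyped title shows up in the removals and only its correction in the additions; the unchanged position triple is left out of both.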

The addition of RDF/JSON for serializing results of CONSTRUCT and DESCRIBE queries makes it much easier to work with query results as they’re much easier to parse.
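To illustrate why RDF/JSON is so easy to work with, here is a small Ruby sketch that walks an RDF/JSON document and yields its triples using nothing but the standard JSON library. It assumes the usual RDF/JSON shape of subject, then predicate, then an array of objects each carrying a value and a type; each_triple is my own helper rather than anything in Pho, and the sample reuses two properties of the “Blink” episode.

```ruby
require 'json'

# Yield every triple in an RDF/JSON document as
# (subject, predicate, object value, object type).
# Returns an Enumerator when called without a block.
def each_triple(rdf_json)
  return enum_for(:each_triple, rdf_json) unless block_given?

  JSON.parse(rdf_json).each do |subject, properties|
    properties.each do |predicate, objects|
      objects.each do |object|
        yield subject, predicate, object['value'], object['type']
      end
    end
  end
end

sample = <<~RDFJSON
  {
    "http://www.bbc.co.uk/programmes/b0074gpl#programme": {
      "http://purl.org/ontology/po/position": [
        { "value": "10", "type": "literal" }
      ],
      "http://purl.org/ontology/po/microsite": [
        { "value": "http://www.bbc.co.uk/doctorwho/", "type": "uri" }
      ]
    }
  }
RDFJSON

each_triple(sample) do |s, p, o, type|
  puts "#{s} #{p} #{o} (#{type})"
end
```

Compared with parsing RDF/XML, there is no parser to install and no namespace handling: the document is already the index you want.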

The latest release does add a dependency on the Redland Ruby language bindings to provide support for parsing of RDF/XML, Turtle and N-Triples. So you’ll need to install the relevant Redland packages to make use of the latest release.

Moriarty Progress Report

It’s been a while since I wrote about Moriarty, the PHP library I created for working with the Talis Platform. That’s not to say that there have been no changes: on the contrary, there have been lots of improvements and some major new areas of functionality. I’m going to summarise them in this post and then, time permitting, follow up with more detailed posts on particular areas.

  • Fresnel Selector Language — This is a major new addition to Moriarty. A new class called GraphPath has been added which implements almost all of the Fresnel Selector Language specification. I’ve been interested in RDF path languages for a long time and FSL now appears to be the strongest contender. Currently this is a stand-alone class, but after a few more cycles of testing it would be nice to add a convenience method to SimpleGraph to allow selection of resources using paths.
  • Zend-compatible caching — I did a substantial refactoring a while back to convert Moriarty’s HTTP caching implementation to be compatible with Zend’s cache interfaces. Whereas before you were limited to caching HTTP responses to disk, you can now supply any of Zend’s built-in cache classes to enable caching in databases, memcached and many other systems. You do this by creating an instance of HttpRequestFactory with your cache class and then pass the factory to the Store class.
  • JSON usage — Release 24 of the Talis Platform introduced RDF/JSON serialisation for describe and construct SPARQL queries. Moriarty now requests this format where it can and, because the SimpleGraph class uses an RDF/JSON structure as its index, often there is no result parsing involved at all.
  • OAI Service Support — OAIService is a new class that represents a store’s OAI-PMH Service. It provides simple access to the OAI service allowing all resources in a store to be listed.
  • Automated Builds — We have added Moriarty to a Hudson server which monitors the subversion repository and runs the unit test suite after every checkin. It’s not ideal because the server is not accessible to users outside of Talis (come on Google Code - we need Hudson support!). However it adds an extra level of confidence to checkins because test failures are emailed out to moriarty-dev@googlegroups.com which is open for anyone to join. A few times in the past we have run the unit tests locally and then forgotten to check in some critical dependency so the subversion trunk contains a broken build. Hudson will alert us to these kinds of errors much more quickly.
  • Extended Describes — The Sparql classes now accept an extra parameter to their describe methods that allows you to specify the type of description you want. By default you get the Platform’s default graph which is a list of triples that have the subject you specify (no bnodes remember!). Moriarty allows you to easily request other types such as symmetric bounded description (triples where the URI being described is subject or object), labelled bounded description (like the default description plus the addition of label properties for URIs in the description set) and symmetric labelled bounded description (a combination of the previous two). See the Bounded Description page on the n2 wiki for more information.
  • Richer Store Interface — Up until now Moriarty has used an object model that closely follows the Platform’s separation of services. That tended to make code using Moriarty quite verbose. We’re now gradually introducing convenience methods onto the Store class so common operations can be accessed with less code.
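The description types mentioned under Extended Describes are easier to compare with a toy example. This Ruby sketch works over an in-memory array of triples and is not Moriarty's or the Platform's code. The function names are mine, rdfs:label is assumed as the only label property, and the genre label and the review triple are invented sample data.

```ruby
# Triples are [subject, predicate, object]; URIs and literals are plain strings.

# Default description: triples that have the URI as their subject.
def default_description(graph, uri)
  graph.select { |s, _p, _o| s == uri }
end

# Symmetric bounded description: the URI is the subject or the object.
def symmetric_description(graph, uri)
  graph.select { |s, _p, o| s == uri || o == uri }
end

# Labelled bounded description: the default description plus labels for
# any resources it mentions. ASSUMPTION: rdfs:label is the label property.
def labelled_description(graph, uri, label = 'http://www.w3.org/2000/01/rdf-schema#label')
  base = default_description(graph, uri)
  mentioned = base.map { |_s, _p, o| o }
  labels = graph.select { |s, p, _o| p == label && mentioned.include?(s) }
  (base + labels).uniq
end

episode = 'http://www.bbc.co.uk/programmes/b0074gpl#programme'
genre   = 'http://www.bbc.co.uk/programmes/genres/drama/scifiandfantasy#genre'
graph = [
  [episode, 'genre', genre],
  [episode, 'title', 'Blink'],
  [genre, 'http://www.w3.org/2000/01/rdf-schema#label', 'Sci-fi and fantasy'],
  ['http://www.example.org/reviews/1', 'primaryTopic', episode]
]

puts default_description(graph, episode).size
puts symmetric_description(graph, episode).size
puts labelled_description(graph, episode).size
```

The symmetric form picks up the review that points at the episode, while the labelled form picks up the genre's label; the symmetric labelled description would be the union of the two.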

The Moriarty Google Code project now has several committers, although Keith and I are still the most prolific. However, having multiple committers is one more step away from this being a personal project and towards it being community owned. Moriarty is being used in lots of small projects in and around Talis, but significantly it is also in the core of two of our most important products: Talis Prism and Talis Aspire. That’s great validation for Moriarty, although it brings a lot more responsibility in terms of quality of testing. I now consider Moriarty to be out of continual alpha and into continual beta!

About Moriarty… Moriarty is a simple PHP library for accessing the Talis Platform. It follows the Platform API very closely and wraps up many common tasks into convenient classes while remaining very lightweight. It also provides some simple RDF classes that are based on the excellent ARC2 class library. Moriarty is being developed by a small community of developers and is in continual beta, subject to a slow stream of updates. You can read more about Moriarty on the n² wiki or visit its Google Code project.

voiD, datasets, graphs, documents, and dcterms:isPartOf backlinks

One thing that I have heard people ask several times now regarding voiD is how to say that data is part of a dataset.

Frédérick Giasson asked about this recently in #swig, and wondered why the voiD guide recommended using dcterms:isPartOf. I thought, since this is something that has been asked about a few times, I would blog about it and explain the reasoning behind this.

So, it wouldn’t be right to say something like:

<http://lastfm.rdfize.com/artists/Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

… because we don’t want to say that “Black Sabbath is part of the lastfm.rdfize.com dataset”.
We want to say “a description of Black Sabbath (composed of triples) is part of the lastfm.rdfize.com dataset“.

One approach to encapsulating this meaning would be to reify each individual triple and state that the triple is part of the dataset … but we felt that this would be neither practical nor popular.

So, in the voiD guide, we advocate that when you publish Linked Data and want to say that the data you are publishing is part of a voiD Dataset, you add a triple linking the document in which the data is published to the dataset, e.g.:

<http://lastfm.rdfize.com/?artistName=Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

(where <http://lastfm.rdfize.com/?artistName=Black+Sabbath> is a document containing a description of <http://lastfm.rdfize.com/artists/Black+Sabbath>)

This way, when a Linked Data client dereferences <http://lastfm.rdfize.com/artists/Black+Sabbath> they get redirected to a document, and can follow the dcterms:isPartOf link from the document URI to the voiD Dataset.
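As a rough illustration of that lookup, here is a minimal Python sketch. It makes no real HTTP requests: the REDIRECTS dictionary is a stand-in for the redirect a server would issue, TRIPLES is a stand-in for the parsed contents of the document, and the function name datasets_for is hypothetical rather than part of any library.

```python
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"

# Assumption: stands in for the HTTP redirect from resource URI to document URI.
REDIRECTS = {
    "http://lastfm.rdfize.com/artists/Black+Sabbath":
        "http://lastfm.rdfize.com/?artistName=Black+Sabbath",
}

# Assumption: stands in for the triples parsed out of the retrieved document.
TRIPLES = {
    (
        "http://lastfm.rdfize.com/?artistName=Black+Sabbath",
        DCTERMS_IS_PART_OF,
        "http://lastfm.rdfize.com/meta.n3#Dataset",
    ),
}


def datasets_for(resource_uri, redirects, triples):
    """Return the datasets that the document describing resource_uri is part of."""
    # Follow the redirect if there is one; otherwise treat the URI as the document.
    document_uri = redirects.get(resource_uri, resource_uri)
    # The dataset link hangs off the document URI, not the resource URI.
    return sorted(
        o for (s, p, o) in triples
        if s == document_uri and p == DCTERMS_IS_PART_OF
    )


found = datasets_for(
    "http://lastfm.rdfize.com/artists/Black+Sabbath", REDIRECTS, TRIPLES
)
```

The point of the sketch is only that the client matches on the document URI it was redirected to, which is why the link is asserted about the document.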

What some people don’t like so much is the implication that their dataset consists of documents, when what they really want to say is that their dataset consists of descriptions of resources.

The conceptual problem, if there is one, is that here the document URI is identifying an RDF/XML document, not the graph of RDF data encoded in that document. So, if you wanted to explicitly state that the graph, rather than the document, is part of the dataset, it could perhaps be done like this:

[ a <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
  <http://purl.org/vocab/frbr/core#embodiment> <http://lastfm.rdfize.com/?artistName=Black+Sabbath&output=rdf> ;
  dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset>
] .

But I’m really not too sure if that is either semantically correct, or in any way a more practically useful description than simply saying the document is part of the dataset.

We (the voiD guide authors) think that the <document> dcterms:isPartOf <dataset> pattern is the most pragmatic approach to making a dataset discoverable from a LOD document.
But we are also open to suggestions for improvement as we evolve the vocabulary and guide in line with popular usage and the requirements of LOD publishers.

What do you think?

A MalBestPractice with RDF: Making Assumptions

Michael Hausenblas has a new blog post listing some common malpractices when working with RDF.

RDF is a model, not a format

I especially agree with his point about “Thinking of RDF on the serialisation level” (as a malpractice): grabbing values from RDF/XML or RDFa with XPath or regexes is not wise. It is making an unsafe assumption about the stability of the serialisation. In fact, if you are writing a Linked Data application, there are very few assumptions you can safely make about either the serialisation or the model.

RDF isn’t SQL, XML, OO …

So maybe my favourite MalBestPractice is: trying to treat RDF too much like some other software paradigm, whether that is a relational database, OO, or XML. It’s enticing to write software that treats RDF as if it were something that mainstream software developers are more familiar with, and to use the same kind of techniques and shortcuts. But these shortcuts often rely on assumptions that can’t be made about RDF data (at least, not proper, organic, free-range RDF from the web). You can’t assume that the same RDF graph will be serialised the same way as last time. You can’t assume that the http://xmlns.com/foaf/0.1/ namespace will always be bound to the foaf prefix. You can’t assume that a resource will, or won’t, have a particular property just because it has another property or a particular type. If you don’t know that a statement exists, you can’t assume it doesn’t, only that you don’t know about it. And so on.
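To make the prefix point concrete, here is a small Python sketch. The expand helper and the two prefix maps are hypothetical, made up for illustration; the idea is simply to compare full URIs, using whatever prefix bindings each document declares, rather than matching on the literal string "foaf:name".

```python
FOAF = "http://xmlns.com/foaf/0.1/"


def expand(qname, prefixes):
    """Expand a prefixed name such as 'foaf:name' into a full URI."""
    prefix, _, local = qname.partition(":")
    # Use the bindings declared by the document the name came from.
    return prefixes[prefix] + local


# Two documents that bind the same namespace to different prefixes.
doc_a_prefixes = {"foaf": FOAF}
doc_b_prefixes = {"f": FOAF}

uri_a = expand("foaf:name", doc_a_prefixes)
uri_b = expand("f:name", doc_b_prefixes)

# The prefixed names differ as strings, but they identify the same property.
same_property = uri_a == uri_b
```

Code that searched the second document for the string "foaf:name" would find nothing, even though the property is there.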

Not making these assumptions can be tedious, and at times problematic, but ultimately, the fewer assumptions you write into your code, the more interesting, open, and ‘webby’ your application can be.

Less assumption, less code, more data, more web

The huge game-changing thing about web development with the Web of Data, though, is not the set of assumptions you can’t make, but the assumptions you don’t have to make. Thanks to the Follow Your Nose principle espoused by Linked Data, you don’t need to write assumptions about your data into your code; you can instead let the application “follow its nose” to find out more about the data.

You can follow vocabulary term URIs to find out how they can be used, how they can be labeled, and what inferences can be drawn from their use. You can follow owl:sameAs and rdfs:seeAlso links to find out more about a resource. You can use semantic index services like Sindice to find occurrences of a URI or keyword across the Web of Data. You can follow dcterms:isPartOf links from RDF documents back to voiD Datasets, which will often have links you can follow to licenses that tell you how the data can be used, and to other services (such as SPARQL endpoints).
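A toy version of that behaviour might look like the Python sketch below. It is only a sketch under stated assumptions: FAKE_WEB is an invented in-memory stand-in for dereferencing URIs over HTTP, the example.org URIs are made up, and follow_your_nose is a hypothetical function name. It does a breadth-first walk that follows just the three link types mentioned above, up to a hop limit.

```python
from collections import deque

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"
FOLLOW = {OWL_SAME_AS, RDFS_SEE_ALSO, DCTERMS_IS_PART_OF}

# Assumption: invented data standing in for what each URI returns when fetched.
FAKE_WEB = {
    "http://example.org/doc/a": [
        ("http://example.org/doc/a", DCTERMS_IS_PART_OF, "http://example.org/void"),
    ],
    "http://example.org/void": [
        ("http://example.org/void", RDFS_SEE_ALSO, "http://example.org/sparql-info"),
    ],
}


def fetch(uri):
    """Return the triples found at uri (empty if nothing is known)."""
    return FAKE_WEB.get(uri, [])


def follow_your_nose(start, fetch, max_hops=3):
    """Breadth-first walk over followable links, collecting triples as it goes."""
    seen = {start}
    queue = deque([(start, 0)])
    triples = set()
    while queue:
        uri, hops = queue.popleft()
        for s, p, o in fetch(uri):
            triples.add((s, p, o))
            # Only queue link targets we have not visited, within the hop limit.
            if p in FOLLOW and o not in seen and hops < max_hops:
                seen.add(o)
                queue.append((o, hops + 1))
    return seen, triples


visited, collected = follow_your_nose("http://example.org/doc/a", fetch)
```

Nothing in the walk knows in advance that a dataset description or a SPARQL information page exists; it discovers them from the links in the data.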

The more data is published, not just within datasets, but about datasets, and about services, the more we can write applications that open up to the web, and the fewer lines of code we will need to do it!