Understanding the Big BBC Graph

In the lead-up to the announcement of the BBC SPARQL endpoint trials I’ve spent quite a bit of time working with and exploring the BBC /programmes and /music datasets. I thought it would be useful to write up some of this to help those of you looking to explore the data using the Talis Platform SPARQL endpoint. (Tip: use the newer SPARQL form for a better user experience when exploring the data.)

What’s in the Store?

Currently the Platform store includes metadata for over 360,000 Radio and TV programme Episodes along with information on which Versions of those programmes have been broadcast, including the time and channel on which they were shown. Information is also available for 6,500 Series and 5,500 Brands, along with their relationships; for more on that, see below.

For the music data, the endpoint includes all of the artist and album metadata currently available from the BBC Music website, which comprises over 23,000 solo artists, 11,000 groups, and 25,000 albums. There are also nearly 4,500 album reviews.

This core dataset is approximately 20 million triples, and it is growing as new episodes and broadcasts are made and as we crawl that additional data. But that’s not all…

The artist metadata refers to dbpedia entries via owl:sameAs links, and this immediate context has also been included, providing a single location to query and find all the additional metadata about a recording artist. As the metadata on the BBC programmes website gets updated to include dbpedia links, then this will also get included. We’re working with the BBC to get some of these links in place as soon as possible.

The /programmes team recently updated the website to begin exporting “segment” data. This describes what artist was being played in a specific segment of a broadcast (currently limited to Radio 2 & 6), providing links between the programmes and music datasets. Increasingly it really is just one large graph that the BBC are producing.

What Ontologies are Used?

The core of the dataset is modelled using the Programmes and Music ontologies. There is also the usual sprinkling of Dublin Core and FOAF terms to capture titles, describe people, provide images for episodes, etc. The RDF Review vocabulary has been used to model the album reviews.

The programmes website includes some content categories for genres and formats. These are modelled in the dataset as SKOS concepts. There seems to be some nascent support in the data for capturing metadata about people and places appearing in programmes. At the moment these are also modelled using SKOS.

That comprises the core data. Beyond that there are a number of different terms used in the dbpedia portions of the dataset; check the dbpedia documentation for more information.

Understanding Brands, Series, Episodes

To get the most from the BBC programmes data you’ll need some understanding of some of the variations in the graph to ensure that you don’t accidentally exclude data in your queries. And if you’re a modelling geek like me it’s interesting in its own right! Any mistakes in the following are all my own, apologies to the BBC folk.

A Brand is a top-level concept that defines a collection of works. It’s the resource that ties together Series and Episodes. Dr Who is a brand, as is the BBC News, and The Catherine Tate Show. A Series, as you’d expect, is a run of Episodes, e.g. “Series 1 of The Wire”. And an Episode is similarly intuitively named.

We’re all already familiar with the basic relationships between these concepts. A Brand (“Red Dwarf”) may be related to a number of Series (“Red Dwarf Series 1”) and a Series is composed of Episodes (“Red Dwarf, Series 1, Episode 1”). But there are a few wrinkles that are worth pointing out, as they can impact the way you write your SPARQL queries. Thanks to Michael Smethurst for giving me a run-down of some of these!

Firstly a Brand may not be broken down into Series at all. The BBC News, for example, is simply a continuous stream of Episodes. Radio shows are similar.

Similarly a Series of Episodes may not necessarily be associated with a Brand. It may be a one-off run of Episodes, e.g. a short documentary series like Incredible Animal Journeys.

Some Episodes are not associated with either a Series or a Brand, e.g. films like Lady In the Water.

And there’s also the more interesting relationship that sees two Series being associated with one another. For example “Waking the Dead” is divided up into Series (e.g. Series 5), which themselves contain other Series (covering a specific story line, e.g. Towers of Silence) and then individual Episodes (Part 1).

(As an aside, this is the kind of flexibility that makes RDF such a great tool for modelling real-world data. I’ve used similar approaches in the past to model bibliographic metadata, throwing out hierarchies and simply connecting together chunks of content in whatever structure is most suitable.)

Finally an Episode may have more than one Version. It is at the Version level that information such as the sound format or duration of the show is captured; after all, there may be many different manifestations of the same episode. Versions are also associated with Broadcasts which capture the date, time and channel (“masterbrand” in the Programmes ontology) on which the programme is aired. A Version of an Episode may be broadcast several times.

Finally at the most fine-grained level, there are Timelines that describe the start and end time of a specific broadcast.
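
To make those variations concrete, here is a minimal Python sketch. It is not the BBC's model or the Programmes ontology itself: it simply stores the examples above as child-to-parent links and walks upwards from an Episode. The titles used as identifiers and the PARENT table are hypothetical, made up for illustration. The point is that a query or script cannot assume a fixed Episode, Series, Brand depth.

```python
# Hypothetical child -> parent links based on the examples in this post.
# A missing entry means the item has no container at all.
PARENT = {
    "Waking the Dead, Series 5": "Waking the Dead",            # Series -> Brand
    "Towers of Silence": "Waking the Dead, Series 5",          # Series -> Series
    "Towers of Silence, Part 1": "Towers of Silence",          # Episode -> Series
    "BBC News (one episode)": "BBC News",                      # Episode -> Brand, no Series
    "Incredible Animal Journeys, Episode 1": "Incredible Animal Journeys",  # Series with no Brand
    # "Lady In the Water" has no entry: a film is a free-standing Episode.
}


def containers(item, parent=PARENT):
    """Return every container above `item`, nearest first."""
    chain = []
    seen = {item}
    while item in parent:
        item = parent[item]
        if item in seen:  # guard against accidental cycles in the data
            break
        seen.add(item)
        chain.append(item)
    return chain


if __name__ == "__main__":
    for title in ("Towers of Silence, Part 1", "BBC News (one episode)", "Lady In the Water"):
        print(title, "->", containers(title))
```

In SPARQL terms the same caution applies: the links to a Series or a Brand are best treated as optional rather than required parts of a graph pattern.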

Application Ideas

During my expeditions through the Big BBC Graph (“you’re in a maze of twisty little predicates, all alike…”) I’ve come up with a few application ideas that it would be interesting to put together. I thought I’d throw these out and see if anyone wants to pick them up.

Programme Reviews. It’d be easy to build a mashup of the BBC programmes data and something like Revyu (which also has a SPARQL endpoint) to allow someone to review a programme that they watched last night. Note that, as our crawling will lag behind the live site until we’ve implemented real-time updates, there will be a delay between something being aired and it appearing in the Platform for reviewing.

PVR Integration. There are a number of open source PVR solutions out there; could some of these be updated to automatically pull in additional data from the endpoint to improve electronic programme guides?

Geographic Overlays. The interconnections between radio programmes, artists and their locations offer an opportunity to build some mapping mashups, using either Google Maps or Earth. For example it ought to be possible to lay out the geographic spread of artists played by different BBC radio programmes and stations. Interested in music from a particular country or region? (Maybe you’re planning a trip there and want to pick up on the local vibe.) Then use a map to home in on radio programmes that are most likely to play those artists.

Fan Widgets. The ability to extract data from the endpoint using SPARQL and JSON means that it’s really easy to create little widgets to include programme data on external web pages. Could something like the Doctor Who Tardis Index File be enriched by widgets that came straight from the BBC database? Throw in additional annotations from the community and you could make some really interesting embeddable gadgets. Of course there’s also the other direction: if fan communities start using BBC identifiers then the BBC may be able to feed this crowd-sourced data back into their site, just as they’re doing with Wikipedia (via dbpedia).

Under the Talis Connected Commons scheme anyone can have free hosting on the Platform for public domain data, so if a fan community wanted to organize itself around creating additional annotations for BBC programmes (how about character lists? mood assessment? scene breakdowns?) then these can be stored in the Platform for free, and then mashed up with the BBC data on the server-side using features like the Augmentation service, or on the client-side using SPARQL and JSON. Lots of potential there.

Summary

Hopefully that provides a good overview of the BBC linked data graph that we’re now hosting in the Talis Platform. There should be sufficient pointers here, and in some of the example queries and demos we’ve put together to get you started. If not, then feel free to ask questions on the BBC Backstage mailing list, or the n2-dev mailing list or on IRC in #talis on irc.freenode.net.

SPARQL AJAX Client Library and Example

Over the past few years I’ve tinkered with a number of different implementations of an AJAX client library for SPARQL. Before a standard format for SPARQL JSON results was created, this involved having to jump through the extra hoops of parsing the XML format. But things are much easier now, especially when the JSON support is extended to include the results of CONSTRUCT and DESCRIBE queries.

My personal favourite SPARQL client library though is the one produced by Lee Feigenbaum, Elias Torres, and Wing Yung as part of their work on the SPARQL Calendar Demo.

While the sparql.js library only supports JSON it does have a few convenience features which I like, including global PREFIX bindings and some functions for automatically processing the JSON results to produce some simpler javascript objects (e.g. arrays and hashes) that simplify some scripting tasks and make code more readable.
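
As a rough illustration of that kind of convenience, the Python sketch below flattens a SPARQL JSON result set into a list of plain dictionaries. It shows the idea only and is not the sparql.js API: the function name and the sample data are made up, while the result structure (head/vars and results/bindings) follows the W3C SPARQL JSON results format.

```python
import json


def simplify(results_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts.

    Unbound variables (e.g. from OPTIONAL patterns) are set to None.
    """
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        rows.append({v: binding[v]["value"] if v in binding else None for v in variables})
    return rows


# Made-up sample data in the standard result shape.
SAMPLE = """
{
  "head": {"vars": ["name", "image"]},
  "results": {"bindings": [
    {"name": {"type": "literal", "value": "Spacecraft A"},
     "image": {"type": "uri", "value": "http://example.org/a.jpg"}},
    {"name": {"type": "literal", "value": "Spacecraft B"}}
  ]}
}
"""

if __name__ == "__main__":
    for row in simplify(SAMPLE):
        print(row)
```

Dropping the type and datatype information loses detail, but for simple display tasks such as filling a table the flattened rows are usually all a script needs.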

Using this on the Platform is quite straightforward, as you can upload this library, and any other related JavaScript files, directly into the Contentbox of your store. This not only avoids any cross-domain issues, but also means that you can deploy simple AJAX applications directly from a store.

I’ve put together a super simple demo that uses the NASA spaceflight data. The source code is here, and I’ve uploaded the two files into the n2-examples store contentbox, so you can play with the running application.

The demo simply fetches the name, homepage, description and launch date for every spacecraft launched in a particular year, also retrieving a link to a photo if there’s one available. The results are dropped into an HTML table for viewing.

The code is well commented so rather than repeat that here, you can look through the Javascript file that does the actual interaction. I’ve used JQuery to help with the DOM manipulation, etc. This is delivered through the Google JQuery CDN rather than the Platform. But the rest of the application is served directly from the Platform.

A rather easy and trivial example, but sometimes it’s useful to reiterate the basics. And if you want to incorporate the NASA spaceflight data in your own mashups, then you can do so easily by simply using the version of sparql.js in the space data store.

In my view, SPARQL + JSON + scripting languages like JS and Ruby hit a nice sweet spot for working with RDF, especially with the ability to bring together data from multiple sources using a single standard API.

Note: Keith Alexander has written up some of his own experiments with playing with JQuery against the platform here and here. His JQuery plugin provides some additional Platform specific functionality.

Using Twinkle to SPARQL the Platform

A few years ago I wrote Twinkle, a simple GUI interface for working with SPARQL. While it’s not the most polished of user interfaces and it’s in sore need of an update, it’s still serviceable and has been successfully used as a development tool by teams of engineers I’ve worked with in the past.

I gave a short talk on Twinkle at an Oxford SWIG meeting, so you can flick through the slides if you want a quick overview of the functionality. I also moved the code to a Google Code project to start the process of updating it.

Twinkle has the capability to work with a range of different data sources and includes a full SPARQL client, so you can use it to work with any SPARQL endpoint that is accessible from your desktop. Out of the box Twinkle is already configured to work with the Govtrack and DbPedia endpoints, but you can easily add more by changing the configuration.

If you download and unzip the distribution into a directory you should end up with an etc/config.n3 file. This file contains all of the configuration that drives the user interface, including a section that configures remote SPARQL endpoints, e.g:


<http://dbpedia.org/sparql> a sources:Endpoint
    ; sources:defaultGraph "http://dbpedia.org"
    ; rdfs:label "DBpedia.org".

<http://www.rdfabout.com/sparql> a sources:Endpoint
    ; rdfs:label "GovTrack.us".

The above snippet configures two remote endpoints, and applies labels to them so that they appear in the Twinkle UI, under the “Remote Services” section on the left-hand menu. Because some endpoints, such as DbPedia, require you to specify a default graph in the SPARQL protocol request, you can also specify that in the configuration if necessary.
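
For reference, this is roughly what such a request looks like on the wire. The Python sketch below only builds the URL (it sends nothing), using the query and default-graph-uri parameters defined by the SPARQL protocol. The helper name is hypothetical and this is not Twinkle's own code.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint, query, default_graph=None):
    """Build a SPARQL protocol GET URL; the default graph is optional."""
    params = [("query", query)]
    if default_graph is not None:
        # Only endpoints that need it (DBpedia in the config above) get this parameter.
        params.append(("default-graph-uri", default_graph))
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(sparql_get_url(
        "http://dbpedia.org/sparql",
        "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1",
        default_graph="http://dbpedia.org",
    ))
```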

If you have a Platform Store, or just want to access some data held in the Platform, then you can use Twinkle to perform your SPARQL queries. For example I have a store containing NASA space flight data. The SPARQL endpoint for this store is at:

http://api.talis.com/stores/space/services/sparql

So to register this in Twinkle, I can edit the configuration file and include the following snippet:


<http://api.talis.com/stores/space/services/sparql> a sources:Endpoint
    ; rdfs:label "NASA Space Data".

Once you’ve restarted the UI you should now be able to click on the Remote “NASA Space Data” service and open up a window into which you can start executing SPARQL queries.

If you’re new to SPARQL, or are interested in playing with the above space data, then you can look over the following slides from a recent SPARQL training session that I ran:


The slides contain a number of sample queries that should help get you started. Unfortunately some of the diagrams don’t look great in slideshare, but you should be able to download them for a closer look.

Authoring RDF data with SPARQL

Yesterday Yves Raimond and I presented a tutorial at WOD-PD where we created some turtle data and used my online semantic converter tool to convert the data to RDF/XML and POST it to the platform store we set up for the tutorial (wod-pd-sandbox).

In fact though, every SPARQL endpoint that supports CONSTRUCT is already a turtle -> rdf/xml converter. You can write Turtle with no variables in the CONSTRUCT graph, leave the WHERE graph pattern empty, and you will get back RDF/XML.

eg:

PREFIX ex: <http://example.org/>
CONSTRUCT {
  ex:Jimmy ex:eat ex:World .
}
 WHERE {}

returns

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/" >
  <rdf:Description rdf:about="http://example.org/Jimmy">
    <ex:eat rdf:resource="http://example.org/World"/>
  </rdf:Description>
</rdf:RDF>
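
The trick is mechanical enough to script. As a minimal sketch, the hypothetical Python helper below wraps variable-free Turtle triples in a CONSTRUCT query with an empty WHERE clause. It assumes prefixes are passed separately (Turtle's @prefix lines are not valid SPARQL) and that the triples use only syntax that SPARQL shares with Turtle.

```python
def turtle_to_construct(prefixes, triples):
    """Wrap ground Turtle triples in a CONSTRUCT query with an empty WHERE.

    `prefixes` maps prefix names to namespace IRIs; `triples` is Turtle text
    containing no variables.
    """
    lines = ["PREFIX %s: <%s>" % (name, iri) for name, iri in prefixes.items()]
    lines.append("CONSTRUCT {")
    lines.extend("  " + t.strip() for t in triples.strip().splitlines())
    lines.append("}")
    lines.append("WHERE {}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(turtle_to_construct({"ex": "http://example.org/"}, "ex:Jimmy ex:eat ex:World ."))
```

Sending the resulting query to any endpoint that supports CONSTRUCT should then return the triples in whatever RDF serialization the endpoint produces.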

You can also use CONSTRUCT to create new data inferred from existing data. For instance, I wanted to add some triples about the conference, and I knew that everyone in the store with a URI in the store’s own namespace had been following the tutorial, and so was also attending the conference. So I made this query, and then POSTed the results into the store:

PREFIX schema: <http://api.talis.com/stores/wod-pd-sandbox/items/Schema/>
PREFIX sandbox: <http://api.talis.com/stores/wod-pd-sandbox/items/Things/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

CONSTRUCT {
  schema:Conference a rdfs:Class ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "Conference" .

  schema:startDate a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "start date" .

  schema:endDate a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "end date" .

  schema:attendee a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "attendee" ;
    owl:inverseOf schema:attended .

  schema:attended a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "attended" ;
    owl:inverseOf schema:attendee .

  sandbox:WOD-PD a schema:Conference ;
    rdfs:label "Web of Data" ;
    schema:startDate "2008-10-22" ;
    schema:endDate "2008-10-23" ;
    schema:attendee ?person .

  ?person schema:attended sandbox:WOD-PD .
}
WHERE {
  ?person a <http://xmlns.com/foaf/0.1/Person> .
  FILTER(REGEX(STR(?person), "sandbox/items/People/"))
}

I used PREFIX to declare a prefix for a couple of namespaces with the store’s contentbox URIs - this meant that these URIs would dereference and work as Linked Data - 303ing to their RDF descriptions. This is a really nice feature of the platform, and makes it easy to mint new URIs that will play nice on the semantic web.

You might also have noticed that there are some new properties and classes defined there in the CONSTRUCT. This isn’t absolutely ideal - there is no documentation, and the terms are unlikely to be used again - but on the other hand, the descriptions are dereferenceable according to the principles of linked data, and just as persistent as the data they describe. Moreover, as Richard Cyganiak said today - if you worry about doing RDF ‘right’ to the extent that it stops you doing RDF, you’re not doing it right.

GRDDLing DeWitt’s Friends

DeWitt Clinton has a great write-up of Creating a HTML “friends” page from a Google Reader subscription list, a bit of hackery which leads to a hCard microformat-enriched friends list. A little tweak to the HTML can make it more machine-friendly, just adding a HTML Meta Data profile URI:

<head profile="http://www.w3.org/2006/03/hcard">

That profile is GRDDL-enabled, so any GRDDL-aware agent can interpret the source document as RDF. This part’s easy to demonstrate, thanks to the online W3C GRDDL service. So I’ve put a tweaked version of the HTML online, and here’s DeWitt’s friends page as RDF (in Turtle syntax, rendered a little verbosely).

Having set this up I realised the data wasn’t actually expressing the friend relationship, so went on to put together some SPARQL to sort that out - below. But afterwards I realised that DeWitt’s HTML was actually expressing the relationships using XFN class names, but again without the profile URI to make it machine-friendly. So another tweak:

<head profile="http://www.w3.org/2006/03/hcard http://www.w3.org/2003/g/td/xfn-workalike">

- the corresponding service output (scroll down to see the extra bits). I suppose I should mention that you can have as many space-separated profiles as you like, and the GRDDL-aware agent will interpret them independently, just accumulating all the triples. The second profile URI adds xfn:friend relationships; I think it would have been more useful with foaf:knows as well, but it is only a demo. One of these days the microformats folks might get around to tweaking the official profile appropriately…

The SPARQL I mentioned looks like this:

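prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# the vcard namespace is assumed here to be the W3C 2006 vCard RDF namespace
prefix vcard: <http://www.w3.org/2006/vcard/ns#>
prefix foaf: <http://xmlns.com/foaf/0.1/>

CONSTRUCT
{
  [ a foaf:Person ;
    # the homepage IRI is missing from this listing
    foaf:homepage ;
    foaf:name "DeWitt Clinton" ;
  ]
  foaf:knows
  [ a foaf:Person ;
    foaf:homepage ?homepage ;
    foaf:name ?name ] .
}
WHERE
{
  [ a vcard:VCard ;
    vcard:url ?homepage ;
    vcard:fn ?name ]
}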

- when applied to DeWitt’s data (as RDF), this will map it across from the vCard vocabulary - finding the appropriate ?variables by matching the pattern in the WHERE clause, inserting those ?variables into the CONSTRUCT clause to produce some new RDF.

I tried this on the Redland SPARQL demo, and I think it’s producing the RDF I wanted. Unfortunately the serialization is really ugly - lots of bnodes, and it’s hard to check visually. It appears to confuse Tabulator too, and the W3C RDF Validator which is handy for this kind of visualization appears to be down. (Here’s a copy of the RDF/XML). Still, it was only a workaround - with the right profiles in place it’s not needed.

I’m not sure if there’s a microformat way of expressing that the source data was a subscription/reading list. To get the richest RDF out it might be easier to do what DeWitt did, but to a full RDF serialization rather than microformatted HTML (which is effectively a CustomRdfDialect), producing something like Planet RDF’s blogroll.

Drupal and the opportunity of RDF

At the start of this week, Dries Buytaert presented the keynote at DrupalCon 2008. The most exciting revelation came at the end: Drupal’s future is in the semantic web.

While Dries talks about the semantic web and RDF, you don’t hear much reaction from the crowd; but then he says “Let me show you a video of the future” and proceeds to demonstrate SPARQLing on linked data from sources like dbpedia, dbtunes, geodata, events, friends lists, and Google spreadsheets, mashed up in Exhibit.

This gets a lot of applause :)

In the keynote, he puts emphasis on data interoperability, decentralisation, remote querying, and how having a lot of data is great fun :)

It’s a really great talk, with a lot of excellent quotes about the value of RDF for Drupal. Here are some of my favourites:

Web 3.0 (much as I hate to use the term) is all about infinite interoperability

We have the opportunity to be mentioned in the history books of the web … This is where the web is going. And this right time, and the right place, to make it happen.

Using RDF you can connect all these different parts of data, that live in different parts of the web.

RDF turns the web into a database

The real opportunity we have here is to start sprinkling this map [of linked open data sources] with Drupal. Every single Drupal site can be an RDF repository that people can query

Google are trying to build a world social graph, connecting people … but what we are doing with RDF is connecting not just people, but everything

With RDF, the import/export problem we have in Drupal just goes away. It just works, without having to describe database schemas… It just works. It’s a problem that is already solved.

You can listen to the audio of the presentation at archive.org (~45MB - the RDF stuff starts at around 53 minutes), and view a video of the RDF demonstration.

You can also read more about Drupal and RDF here.

Ask Moriarty?

Another day, another incremental improvement to Moriarty (svn revision 490)! After my last set of changes I thought I’d better hurry up and add the copy_to function to the FieldPredicateMap too. You can now clone Field/Predicate Maps from one store to another:

  $fp = new FieldPredicateMap("http://api.talis.com/stores/mystore/config/fpmaps/1");
  $response = $fp->get_from_network();
  if ( $response->is_success() ) {
    $new_fp = $fp->copy_to("http://api.talis.com/stores/otherstore/config/fpmaps/1");
    $new_fp->put_to_network();
  }

I then set about thinking through my plan for adding HTTP caching support to Moriarty. I want this to work automatically and transparently, taking advantage of conditional GETs on the Platform. I’ll let it be switched off by defining a constant but I want it to be there by default so the developer gets the benefit without any effort.
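
To show what transparent caching with conditional GETs involves, here is a minimal sketch in Python. It is not Moriarty's HttpCache design: the class name, the injected fetch function and the ETag-only validation are all assumptions made for illustration.

```python
class ConditionalGetCache:
    """Tiny in-memory cache keyed by URL, revalidated with ETags.

    `fetch(url, headers)` is a caller-supplied function returning a
    (status, headers, body) tuple, so the sketch stays independent of any
    particular HTTP library.
    """

    def __init__(self, fetch, enabled=True):
        self.fetch = fetch
        self.enabled = enabled  # stands in for the "switch it off" constant
        self.entries = {}       # url -> (etag, body)

    def get(self, url):
        headers = {}
        if self.enabled and url in self.entries:
            headers["If-None-Match"] = self.entries[url][0]
        status, response_headers, body = self.fetch(url, headers)
        if status == 304 and url in self.entries:
            return self.entries[url][1]       # not modified: reuse cached body
        etag = response_headers.get("ETag")
        if self.enabled and status == 200 and etag:
            self.entries[url] = (etag, body)  # remember for next time
        return body


def fake_fetch(url, headers):
    """Stand-in server for the demo: one fixed resource with ETag "v1"."""
    if headers.get("If-None-Match") == '"v1"':
        return 304, {}, None
    return 200, {"ETag": '"v1"'}, "<rdf/>"


if __name__ == "__main__":
    cache = ConditionalGetCache(fake_fetch)
    print(cache.get("http://example.org/x"))  # fetched in full
    print(cache.get("http://example.org/x"))  # served from cache after a 304
```

The calling code never sees the 304: it asks for a URL and gets a body back either way, which is what makes this kind of caching transparent.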

I stubbed out some initial ideas for the HttpCache class on the train this morning. Then at lunchtime today, Danny pinged me on IRC wondering why Moriarty didn’t have SPARQL ASK support. “Not by design”, I said, “more by lack of time. But it should be easy to add, give me 15 minutes”. Then I promptly went into a series of meetings that ate the rest of my day. In the end the code did only take 15 minutes, but I finished it 11 hours later than I expected. Hopefully Danny didn’t spend all that time waiting for me to respond on IRC :-)

You can perform an ASK query on a store like this:

  $store = new Store("http://api.talis.com/stores/mystore");
  $sparql = $store->get_sparql_service();
  $response = $sparql->ask( "ASK WHERE {?s a .}" );
  if ($response->is_success()) {
    $result = $sparql->parse_ask_results( $response->body);
  }
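
For readers not using PHP, the parsing step is small. The Python sketch below assumes the endpoint answers an ASK query in the W3C SPARQL XML results format, where the answer sits in a single boolean element. It illustrates the format only and makes no claim about what Moriarty's parse_ask_results does internally; the function name and sample document are made up.

```python
import xml.etree.ElementTree as ET

SPARQL_RESULTS_NS = "{http://www.w3.org/2005/sparql-results#}"

# Made-up sample response in the SPARQL XML results format.
SAMPLE_ASK = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head/>
  <boolean>true</boolean>
</sparql>"""


def parse_ask_xml(body):
    """Return True/False from a SPARQL XML results document for an ASK query."""
    root = ET.fromstring(body)
    node = root.find(SPARQL_RESULTS_NS + "boolean")
    if node is None or node.text is None:
        raise ValueError("not an ASK result: no <boolean> element")
    return node.text.strip().lower() == "true"


if __name__ == "__main__":
    print(parse_ask_xml(SAMPLE_ASK))
```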

Enjoy, Danny!

About Moriarty… Moriarty is a simple PHP library for accessing the Talis Platform. It follows the Platform API very closely and wraps up many common tasks into convenient classes while remaining very lightweight. It also provides some simple RDF classes that are based on the excellent ARC2 class library. Moriarty is primarily being developed by Ian Davis and is in continual alpha, subject to occasional rapid bursts of change. You can read more about Moriarty on the n² wiki and get its source from the n² subversion repository.

Which Store to SPARQL?

We’ve got quite a lot of different stores in the Talis Platform, some of which have some pretty interesting data. The question is, what’s in them? A while ago, I polled all the stores in the platform (you can get a list as HTML or RDF at http://api.talis.com/stores) for some basic stats on the rdf:types and predicates in each store, and saved them in the silkworm-dev store.

It just occurred to me that, using (for example), ARC’s standalone SPARQL parser, I ought to be able to parse a query, and generate another query for the silkworm-dev store, to find a list of stores that you could run that query on and get some data back.
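
A rough sketch of that idea in Python, under loud assumptions: it uses a naive regular expression instead of a real SPARQL parser such as ARC's, it expects one PREFIX declaration per line, it treats every full IRI in the query as a type or predicate, and it checks against an in-memory table rather than querying silkworm-dev. All names and data are hypothetical.

```python
import re

IRI = re.compile(r"<([^<>\s]+)>")
DECLARATION = re.compile(r"^\s*(PREFIX|BASE)\b.*$", re.IGNORECASE | re.MULTILINE)

# Hypothetical per-store stats: store name -> rdf:types and predicates seen in it.
STORE_TERMS = {
    "space": {"http://xmlns.com/foaf/0.1/name", "http://example.org/ns/Spacecraft"},
    "other": {"http://xmlns.com/foaf/0.1/name"},
}

QUERY = """SELECT ?n WHERE {
  ?s a <http://example.org/ns/Spacecraft> ;
     <http://xmlns.com/foaf/0.1/name> ?n .
}"""


def iris_in_query(query):
    """Naively collect full IRIs, ignoring PREFIX/BASE declaration lines."""
    body = DECLARATION.sub("", query)
    return set(IRI.findall(body))


def candidate_stores(query, store_terms):
    """Return the stores whose known terms cover every IRI in the query."""
    wanted = iris_in_query(query)
    return sorted(name for name, terms in store_terms.items() if wanted <= terms)


if __name__ == "__main__":
    print(candidate_stores(QUERY, STORE_TERMS))
```

A proper implementation would expand prefixed names and distinguish predicates from subject or object IRIs, which is exactly what a real parser provides.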

I guess this will get even more interesting when we add Store Groups into the mix (a coming-feature, where you can query a group of stores at once).

I’ll have to try it sometime soon :)