Archive for the 'Uncategorized' Category

Understanding the Big BBC Graph

In the lead-up to the announcement of the BBC SPARQL endpoint trials I’ve spent quite a bit of time working with and exploring the BBC /programmes and /music dataset. I thought it would be useful to write up some of this to help out those of you looking to explore the data using the Talis Platform SPARQL endpoint. (Tip: use the newer SPARQL form for a better user experience when exploring the data.)

What’s in the Store?

Currently the Platform store includes metadata for over 360,000 Radio and TV programme Episodes along with information on which Versions of those programmes have been broadcast, including the time and channel on which they were shown. Information is also available for 6,500 Series and 5,500 Brands, along with their relationships; for more on that, see below.

For the music data, the endpoint includes all of the artist and album metadata currently available from the BBC Music website, which comprises over 23,000 solo artists, 11,000 groups, and 25,000 albums. There are also nearly 4,500 album reviews.

This core dataset is approximately 20 million triples, and it is growing as new episodes and broadcasts are made and as we crawl that additional data. But that’s not all…

The artist metadata refers to dbpedia entries via owl:sameAs links, and this immediate context has also been included, providing a single location to query and find all the additional metadata about a recording artist. As the metadata on the BBC programmes website gets updated to include dbpedia links, then this will also get included. We’re working with the BBC to get some of these links in place as soon as possible.

The /programmes team recently updated the website to begin exporting “segment” data. This describes what artist was being played in a specific segment of a broadcast (currently limited to Radio 2 & 6), providing links between the programmes and music datasets. Increasingly it really is just one large graph that the BBC are producing.

What Ontologies are Used?

The core of the dataset is modelled using the Programmes and Music ontologies. There is also the usual sprinkling of Dublin Core and FOAF terms to capture titles, describe people, provide images for episodes, etc. The RDF Review vocabulary has been used to model the album reviews.
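
As a rough illustration of how these vocabularies come together in a query, here is a minimal Python sketch that builds a SPARQL protocol GET URL for a query listing a few Episodes with their titles. The endpoint URL is a placeholder, and the `po:` and `dc:` namespace URIs and the use of `dc:title` on Episodes are my assumptions about the data, so check them against the ontology documentation before relying on them.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the real SPARQL endpoint for the store.
ENDPOINT = "http://example.org/sparql"

# Assumed namespaces and terms; verify against the ontology documentation.
QUERY = """
PREFIX po: <http://purl.org/ontology/po/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?episode ?title
WHERE {
  ?episode a po:Episode ;
           dc:title ?title .
}
LIMIT 10
"""

def build_query_url(endpoint, query):
    """Return a GET URL carrying the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = build_query_url(ENDPOINT, QUERY)
```

Fetching `url` with any HTTP client should then return the result set, in whichever result format the endpoint supports.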

The programmes website includes some content categories for genres and formats. These are modelled in the dataset as SKOS concepts. There seems to be some nascent support in the data for capturing metadata about people and places appearing in programmes. At the moment these are also modelled using SKOS.

That comprises the core data. Beyond that, there are a number of different terms used in the dbpedia portions of the dataset; check the dbpedia documentation for more information.

Understanding Brands, Series, Episodes

To get the most from the BBC programmes data you’ll need some understanding of some of the variations in the graph to ensure that you don’t accidentally exclude data in your queries. And if you’re a modelling geek like me it’s interesting in its own right! Any mistakes in the following are all my own, apologies to the BBC folk.

A Brand is a top-level concept that defines a collection of works. It’s the resource that ties together Series and Episodes. Dr Who is a brand, as is the BBC News, and The Catherine Tate show. A Series, as you’d expect, is a run of Episodes, e.g. “Series 1 of The Wire”. And an Episode is similarly intuitively named.

We’re all already familiar with the basic relationships between these concepts. A Brand (“Red Dwarf”) may be related to a number of Series (“Red Dwarf Series 1”) and a Series is composed of Episodes (“Red Dwarf, Series 1, Episode 1”). But there are a few wrinkles that are worth pointing out, as they can impact the way you write your SPARQL queries. Thanks to Michael Smethurst for giving me a run-down of some of these!

Firstly a Brand may not be broken down into Series at all. The BBC News, for example, is simply a continuous stream of Episodes. Radio shows are similar.

Similarly a Series of Episodes may not necessarily be associated with a Brand. It may be a one-off run of Episodes, e.g. a short documentary series like Incredible Animal Journeys.

Some Episodes are not associated with either a Series or a Brand: films like Lady In the Water, for example.

And there’s also the more interesting relationship that sees two Series being associated with one another. For example “Waking the Dead” is divided up into Series (e.g. Series 5), which themselves contain other Series (covering a specific story line, e.g. Towers of Silence) and then individual Episodes (Part 1).

(As an aside, this is the kind of flexibility that makes RDF such a great tool for modelling real-world data. I’ve used similar approaches in the past to model bibliographic metadata, throwing out hierarchies and simply connecting together chunks of content in whatever structure is most suitable.)
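
To make the variations described above concrete, here is a small Python sketch that stores the containment links as plain triples and walks upwards from an Episode to whatever top-level container it has, if any. The identifiers and the `series`/`episode` relation names are invented for illustration; they are not the Programmes ontology’s actual terms.

```python
# (parent, relation, child) links; all identifiers here are made up.
TRIPLES = [
    ("brand:red-dwarf", "series", "series:red-dwarf-1"),
    ("series:red-dwarf-1", "episode", "ep:red-dwarf-1-1"),
    ("brand:bbc-news", "episode", "ep:news-today"),            # Brand with no Series
    ("series:animal-journeys", "episode", "ep:journeys-1"),    # Series with no Brand
    ("brand:waking-the-dead", "series", "series:wtd-5"),
    ("series:wtd-5", "series", "series:towers-of-silence"),    # Series within a Series
    ("series:towers-of-silence", "episode", "ep:towers-part-1"),
]

def parent_index(triples):
    """Map each child to its parent container."""
    return {child: parent for parent, _relation, child in triples}

def ancestry(node, index):
    """Return the chain from the top-level container down to the node itself."""
    chain = [node]
    while chain[-1] in index:
        chain.append(index[chain[-1]])
    return list(reversed(chain))

index = parent_index(TRIPLES)
nested = ancestry("ep:towers-part-1", index)
film = ancestry("ep:lady-in-the-water", index)   # no Series, no Brand
```

The point is that the walk cannot assume a fixed depth: a query that hard-codes Brand, then Series, then Episode would miss the news Episodes, the Brand-less Series, the film, and the nested story-line Series.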

Finally an Episode may have more than one Version. It is at the Version level that information such as the sound format or duration of the show is captured; after all, there may be many different manifestations of the same episode. Versions are also associated with Broadcasts, which capture the date, time and channel (“masterbrand” in the Programmes ontology) on which the programme is aired. A Version of an Episode may be broadcast several times.

Finally at the most fine-grained level, there are Timelines that describe the start and end time of a specific broadcast.
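
The Episode, Version and Broadcast layering can be sketched with a few Python dataclasses. This is only an illustration of the shape of the data: the class and field names, channels and times are invented, and for brevity the Timeline’s start and end times are folded directly into the broadcast record.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Broadcast:
    channel: str        # the "masterbrand" on which the Version aired
    start: datetime     # start and end would come from the Timeline
    end: datetime

@dataclass
class Version:
    label: str
    duration_seconds: int
    broadcasts: list = field(default_factory=list)

@dataclass
class Episode:
    title: str
    versions: list = field(default_factory=list)

def all_broadcasts(episode):
    """Collect every broadcast of every Version of an Episode, earliest first."""
    found = [b for v in episode.versions for b in v.broadcasts]
    return sorted(found, key=lambda b: b.start)

# Invented sample data: one Version shown twice, another shown once.
original = Version("original", 1740, [
    Broadcast("bbc-two", datetime(2009, 1, 5, 21, 0), datetime(2009, 1, 5, 21, 29)),
    Broadcast("bbc-two", datetime(2009, 1, 9, 23, 30), datetime(2009, 1, 9, 23, 59)),
])
signed = Version("signed", 1740, [
    Broadcast("bbc-one", datetime(2009, 1, 7, 1, 15), datetime(2009, 1, 7, 1, 44)),
])
episode = Episode("Example Episode", [original, signed])
```

So a question like “when was this Episode on?” has to go through its Versions first, which is worth remembering when writing queries against the real graph.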

Application Ideas

During my expeditions through the Big BBC Graph (“you’re in a maze of twisty little predicates, all alike…”) I’ve come up with a few application ideas that it would be interesting to put together. I thought I’d throw these out and see if anyone wants to pick them up.

Programme Reviews. It’d be easy to build a mashup of the BBC programmes data and something like Revyu (which also has a SPARQL endpoint) to allow someone to review a programme that they watched last night. Note that, as our crawling will be lagging behind the live site until we’ve implemented real-time updates, there will be a lead time between something being aired and it being in the Platform for reviewing.

PVR Integration. There are a number of open source PVR solutions out there; could some of these be updated to automatically pull in additional data from the endpoint to improve electronic programme guides?

Geographic Overlays. The interconnections between radio programmes, artists and their locations offer an opportunity to build some mapping mashups, using either Google Maps or Earth. For example it ought to be possible to lay out the geographic spread of artists played by different BBC radio programmes and stations. Interested in music from a particular country or region? (Maybe you’re planning a trip there and want to pick up on the local vibe.) Then use a map to home in on radio programmes that are most likely to play those artists.

Fan Widgets. The ability to extract data from the endpoint using SPARQL and JSON means that it’s really easy to create little widgets to include programme data on external web pages. Could something like the Doctor Who Tardis Index File be enriched by widgets that came straight from the BBC database? Throw in additional annotations from the community and you could make some really interesting embeddable gadgets. Of course there’s also the other direction: if fan communities start using BBC identifiers then the BBC may be able to feed this crowd-sourced data back into their site, just as they’re doing with Wikipedia (via dbpedia).

Under the Talis Connected Commons scheme anyone can have free hosting on the Platform for public domain data, so if a fan community wanted to organize itself around creating additional annotations for BBC programmes (how about character lists? mood assessment? scene breakdowns?) then these can be stored in the Platform for free, and then mashed up with the BBC data on the server-side using features like the Augmentation service, or on the client-side using SPARQL and JSON. Lots of potential there.
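
For the client-side route, most of the work in a widget is turning the SPARQL JSON results into simple rows. Here is a minimal Python sketch of that step, assuming the endpoint returns the standard SPARQL Query Results JSON layout; the sample document and its URIs are invented.

```python
import json

# Invented sample in the SPARQL Query Results JSON layout.
SAMPLE = """
{
  "head": {"vars": ["episode", "title"]},
  "results": {"bindings": [
    {"episode": {"type": "uri", "value": "http://example.org/episodes/1"},
     "title": {"type": "literal", "value": "Example Episode One"}},
    {"episode": {"type": "uri", "value": "http://example.org/episodes/2"}}
  ]}
}
"""

def rows(results_json):
    """Flatten each binding to a {variable: value} dict; unbound variables become None."""
    doc = json.loads(results_json)
    names = doc["head"]["vars"]
    return [
        {name: (binding[name]["value"] if name in binding else None) for name in names}
        for binding in doc["results"]["bindings"]
    ]

table = rows(SAMPLE)
```

Note that a variable can be missing from a binding (for instance when it comes from an OPTIONAL pattern), so the widget should not assume every column is present in every row.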

Summary

Hopefully that provides a good overview of the BBC linked data graph that we’re now hosting in the Talis Platform. There should be sufficient pointers here, and in some of the example queries and demos we’ve put together to get you started. If not, then feel free to ask questions on the BBC Backstage mailing list, or the n2-dev mailing list or on IRC in #talis on irc.freenode.net.

Pho 0.5

I’ve just sent a quick announcement to the n2-dev mailing list to announce a new release of Pho, the Ruby client for the Talis Platform.

The CHANGES document includes a complete summary of the changes made since the 0.4.1 release. There have been quite a few of these as I’ve added more code for handling submission of directories of files to the Content Box; initial support for Changesets to support updates of RDF graphs, including support for “diffing” two graphs to generate the appropriate changesets for submission to the Platform; as well as a SPARQL protocol client and a number of utility classes to make working with SPARQL queries slightly easier.
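
The idea behind “diffing” two graphs can be shown in a few lines. This is a Python sketch of the general technique only, not Pho’s Ruby implementation and not the Platform’s Changeset format: it treats each graph as a set of triples and ignores the complications that blank nodes introduce.

```python
def diff_graphs(before, after):
    """Return the triples to remove and to add to turn `before` into `after`."""
    before, after = set(before), set(after)
    return {"removals": before - after, "additions": after - before}

# Invented sample data: the title of one resource changes.
OLD = [
    ("http://example.org/doc", "dc:title", "Draft"),
    ("http://example.org/doc", "dc:creator", "rob"),
]
NEW = [
    ("http://example.org/doc", "dc:title", "Final"),
    ("http://example.org/doc", "dc:creator", "rob"),
]

changes = diff_graphs(OLD, NEW)
```

Each removal and addition would then need to be expressed in whatever update format the target store accepts.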

The addition of RDF/JSON for serializing results of CONSTRUCT and DESCRIBE queries makes it much easier to work with query results as they’re much easier to parse.
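
To show why this is easy to work with, here is a minimal Python sketch that walks an RDF/JSON document and yields its triples. It assumes the usual RDF/JSON layout of subject, then predicate, then a list of value objects carrying `type` and `value` keys; the sample data is invented.

```python
import json

# Invented sample: subject -> predicate -> list of value objects.
SAMPLE = """
{
  "http://example.org/artists/1": {
    "http://xmlns.com/foaf/0.1/name": [
      {"type": "literal", "value": "Example Artist", "lang": "en"}
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {"type": "uri", "value": "http://example.org/other/1"}
    ]
  }
}
"""

def triples(rdf_json):
    """Yield (subject, predicate, object value, object type) for each statement."""
    doc = json.loads(rdf_json)
    for subject, properties in doc.items():
        for predicate, values in properties.items():
            for obj in values:
                yield subject, predicate, obj["value"], obj["type"]

statements = list(triples(SAMPLE))
```

No RDF parser is needed on the client, just a JSON parser and three nested loops.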

The latest release does add a dependency on the Redland Ruby language bindings to provide support for parsing of RDF/XML, Turtle and N-Triples. So you’ll need to install the relevant Redland packages to make use of the latest release.

Metamorph Open Source project for Semantic Converter Web Service

I’ve published the code behind the Talis Convert Service (production release at stable URL coming soon) as an open source project on Google Code, called Metamorph.

Metamorph is a service aimed at semantic web developers. It is much like triplr, babel, swignition and any23 (please leave a comment pointing to any other similar services).

You give it a(n http) URI, an (optional) input format, and an output format, and it will fetch the document from the web, and convert it into the output format.

Understood input values include:

  • Semantic HTML (RDFa, eRDF, microformats, POSH)
  • RDF (XML, Turtle, JSON)
  • SPARQL-XML
  • Facet XML (the response format of the facets service available on all platform stores)

Output for all input formats can be:

  • JSON
  • JSONP
  • HTML

If the input is some form of RDF, you can also ask for:

  • RDF (XML, Turtle, JSON; the default HTML output is rendered as RDFa)
  • RSS 1.0
  • TriX
  • Exhibit (web page, JSON, JSONP)

In addition, if the input is an RDF format, you can specify multiple data URIs, and the results will be merged in the output document. For instance, this conversion merges data from two of my homepages, and a Turtle file.

I’m thinking about removing the TriX output, as I’m not sure it would be used by anyone - the reason I didn’t bother to write a parser for it was because I haven’t seen any data published as TriX in the first place.

I welcome any input on what else would be useful from this web service. I suspect that more output options, while fairly easy to add, would not be very useful. More input options may be useful, but perhaps not significantly so.

I suspect what might be more useful, and more likely to distinguish this from similar RDF converter services, are graph transformation services, which might include:

  • Diffs
  • Intersects
  • Smushing
  • Augmenting on property and class type URIs with labels and comments, perhaps retrieved from SchemaCache
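
As an example of what one of these transformations involves, here is a rough Python sketch of smushing: merging nodes that share a value for an inverse functional property. It is not Metamorph code (which is PHP), the data is invented, and it is deliberately simplified: it handles a single property and does not follow chains of merges.

```python
def smush(triples, ifp):
    """Merge subjects that share a value for the inverse functional property `ifp`."""
    canonical = {}   # ifp value -> first subject seen with that value
    rename = {}      # subject -> the subject it should be merged into
    for s, p, o in triples:
        if p == ifp:
            rename[s] = canonical.setdefault(o, s)
    # Rewrite both subject and object positions using the merge table.
    return {(rename.get(s, s), p, rename.get(o, o)) for s, p, o in triples}

# Invented sample: two blank nodes describing the same mailbox owner.
DATA = [
    ("_:a", "foaf:mbox", "mailto:someone@example.org"),
    ("_:a", "foaf:name", "Someone"),
    ("_:b", "foaf:mbox", "mailto:someone@example.org"),
    ("_:b", "foaf:nick", "sm1"),
    ("_:c", "foaf:knows", "_:b"),
]

merged = smush(DATA, "foaf:mbox")
```

After smushing, the name and nick hang off a single node, and the `foaf:knows` link points at that node too.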

Metamorph is coded in PHP, and uses ARC for parsing RDF and HTML, and serialising RDF/XML and Turtle.

Please use the issue tracker for raising any bugs or feature requests.

Getting Started With the Talis Platform Presentation

We ran a training workshop at the Talis offices last week with a small group of developers looking at using the Talis Platform for a community information project with which we’re involved. I thought it would be useful to share the slides from the session I ran.

The session was intended to provide a walk-through of the main concepts, technologies and features of the Platform, filling a gap between the “What is the Talis Platform” presentations we’ve given in the past and the detailed API documentation.

The slides can be found up on slideshare.


By rob

There is a consistent set of examples used throughout the presentation. These draw on some data I’ve been compiling about spaceflights. You can find the Platform store here, including the SPARQL endpoint (for testing the example queries), or look at some of the below URLs:

The data and schema are very much a work in progress and are likely to change. However there’s sufficient data there if you want to follow along with the presentation and explore some of the Platform features.

I plan to keep the presentation up to date with the data as it evolves and also hope to use the Slideshare “slidecast” features to add a voiceover to add in the missing context.

VRM with FOAF + OpenID

A quick note-to-self. I’m currently working on some other FOAF + OpenID stuff, so this is nearby enough that I might well put together a demo in the near future…but not today.

Tim Bray discusses Changing your address in the context of Vendor Relationship Management, prompted by “Feeds-Based VRM”: A Web-Centric Approach to VRM Implementation. The question is how you keep a vendor (or other contact) aware of your current address.

I came to the same conclusion as Tim, that feeds aren’t really necessary for this kind of thing, the data can be put directly on the Web and the contact given the appropriate URI. In comments over there I pointed to Tim Berners-Lee’s Give yourself a URI - an online FOAF profile solves most of the problem. The part it doesn’t solve is access control - you might not want to make your address public. But with the help of linked data, off-the-shelf tools and a little scripting, this is pretty easy to fix.

First of all, looking at how you might represent this information, vCard is the dominant model for this kind of info. Whether that’s expressed in the original vCard format or hCard or RDFa or RDF/XML doesn’t really matter. These can all be mapped to the RDF model, which is key to what follows… Here’s the relevant bit of a vCard in Turtle syntax (first pass, probably not 100% correct):

@prefix : <http://www.w3.org/2006/vcard/ns#> .
[ a :VCard ;
  :agent <#me> ;
  :homeAdr [
    a :Address ;
    :street-address "7, Mozzanella" ;
    :country-name "Italy"
  ]
] .

Now I could just dump this in my public FOAF profile at, say http://example.org/public/me. But because I want the address to be restricted, I’ll separate the information (following the principles of linked data) like this -

in http://example.org/public/me -

@prefix : <http://www.w3.org/2006/vcard/ns#> .
[ a :VCard ;
  :agent <#me> ;
  :homeAdr <http://example.org/restricted/myaddress>
] .

and in <http://example.org/restricted/myaddress> :

@prefix : <http://www.w3.org/2006/vcard/ns#> .
<> a :Address ;
  :street-address "7, Mozzanella" ;
  :country-name "Italy" .

Now I need to wrap the latter part in authentication/authorization. Traditionally I might hard-code a list of who can see this data, but there’s a neater way. Somewhere I’ll put statements like the following (with proper URIs as appropriate):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix x: <http://example.org/terms#> .  # placeholder namespace
<#me> foaf:knows <personA> ;
  x:businessContact <personB>, <personC>, <personD> .
<personA> foaf:openid <personAopenID> .
<personB> foaf:openid <personBopenID> .
<personC> foaf:openid <personCopenID> .
<personD> foaf:openid <personDopenID> .

Anyone wishing to see the restricted info will be asked for their OpenID URI. Whether they can see a particular resource can be governed by simple rules, for example expressed through string-templated SPARQL queries:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX x: <http://example.org/terms#>   # placeholder namespace
SELECT ?person
WHERE {
  ?person foaf:openid $openid$ .
  # the person must be linked to <#me> by at least one relationship
  { <#me> foaf:knows ?person }
  UNION
  { <#me> x:businessContact ?person }
}

Ok, that’s very sketchy, but hopefully gives the idea. To be properly declarative in practice you’d probably want to put the access rules in a separate chunk of RDF, and query across the whole lot. But given decent libraries (e.g. the OpenID PHP lib worked pretty much out of the box for me, and ARC is a really straightforward PHP RDF toolkit), we’re talking about maybe a day’s work to write and deploy the scripts, which could be used by anyone else with regular PHP-capable hosting.
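
To make the two moving parts a little less sketchy, here is a Python illustration (rather than the PHP I’d actually deploy) of filling in a `$openid$` placeholder safely and of the access decision itself, run against in-memory triples instead of a real store. The function names, the sample data and the `x:businessContact` term are all placeholders of mine.

```python
import re

# Characters that may not appear inside a SPARQL IRI reference, plus whitespace.
_UNSAFE = re.compile(r'[<>"{}|^`\\\s]')

def render(template, openid):
    """Substitute an OpenID URI for the $openid$ placeholder as an IRI reference."""
    if not openid or _UNSAFE.search(openid):
        raise ValueError("not a safe IRI: %r" % openid)
    return template.replace("$openid$", "<" + openid + ">")

def may_view(openid, triples, owner="#me",
             relations=("foaf:knows", "x:businessContact")):
    """True if someone with this OpenID is linked to `owner` by an allowed relation."""
    people = {s for s, p, o in triples if p == "foaf:openid" and o == openid}
    return any((owner, rel, person) in triples
               for rel in relations for person in people)

# Invented sample data.
TRIPLES = {
    ("#me", "foaf:knows", "personA"),
    ("personA", "foaf:openid", "http://a.example.org/"),
    ("personE", "foaf:openid", "http://e.example.org/"),  # known OpenID, no relationship
}

pattern = render("?person foaf:openid $openid$ .", "http://a.example.org/")
```

Two things are worth noticing: the OpenID comes from the visitor, so it has to be checked before it goes anywhere near a query string, and merely having an OpenID in the data is not enough; the relationship to `<#me>` has to be required, not optional.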

A Web-centric approach to VRM should use the Web, and as Berners-Lee himself recently put it:

Linked Data is the Semantic Web done as it should be. It is the Web done as it should be.