Visualising BBC Programme Categories

Whilst I was exploring the BBC programmes data looking for possible demonstration applications, I thought it might be interesting to try to create a visualisation of the relationships between different categories of BBC programmes. The BBC datasets use SKOS as a categorisation scheme, with separate taxonomies for formats (e.g. documentaries, animation, etc.) and genres (e.g. children’s programmes, science fiction, etc.). If you poke around a little, you can also see a nascent category system for places and people, although there doesn’t seem to be much data there at present (and what is there seems to change regularly).

For my purposes, the genre classifications looked most interesting. Episodes are associated with their genre category via the po:category property. As I was interested in finding relationships between genres, what I was looking for was a way to relate together individual categories, other than by the obvious super/sub-category relationship.

It occurred to me that if two categories were associated with the same episode, then this could be viewed as a declaration of some implicit relationship between the categories. Extracting this in SPARQL is straightforward, as we just need to match episodes that have more than one category:


PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?categoryLabel ?relatedLabel WHERE
{
  ?episode a po:Episode;
    po:category ?category;
    po:category ?related. 

  ?category a po:Genre;
    rdfs:label ?categoryLabel. 

  ?related a po:Genre;
    rdfs:label ?relatedLabel. 

  FILTER (?category != ?related)
}
ORDER BY ?categoryLabel

In the above SPARQL query we match any episode that has at least two categories (because we use two po:category patterns), and where those categories are different (in the FILTER). This excludes the unwanted result where the ?category and ?related variables are bound to the same value. I didn’t bother with pruning out duplicates as this could easily be done on the client-side.
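As a rough sketch of that client-side pruning, the following Python collapses repeated rows and mirrored rows (A related to B, and B related to A) into a single pair. The function name and the sample rows are illustrative assumptions, not part of the original demonstration.

```python
from typing import Iterable


def unique_pairs(rows: Iterable[tuple[str, str]]) -> list[tuple[str, str]]:
    """Collapse repeated and mirrored (category, related) rows into unique pairs."""
    seen = set()
    for category, related in rows:
        # Sort each pair so ("Drama", "Comedy") and ("Comedy", "Drama") match.
        seen.add(tuple(sorted((category, related))))
    return sorted(seen)


# Hypothetical rows, shaped like the (?categoryLabel, ?relatedLabel) results.
rows = [
    ("Comedy", "Drama"),
    ("Drama", "Comedy"),
    ("Comedy", "Drama"),
    ("Drama", "Science Fiction"),
]
print(unique_pairs(rows))
```

Sorting each pair before adding it to the set is what removes the mirrored rows; if the direction of the relationship mattered, you would skip that step.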

In order to visualise the results, I decided to use MooWheel. This provides a simple JavaScript visualisation toolkit for presenting connections between a set of resources. MooWheel can be configured using a JSON data structure, so generating a MooWheel visualisation from a SPARQL query is relatively straightforward: the query results can be retrieved as SPARQL/JSON, which can then be massaged into the appropriate JSON structure to generate the MooWheel visualisation. Check out the source code of the demonstration for sample code to do this (look at the success callback).
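To give a flavour of that massaging step, here is a minimal Python sketch (the demonstration itself does this in JavaScript). It reads the standard SPARQL/JSON layout of `results.bindings` and groups the labels into a list of nodes with their connections. The `id`/`text`/`connections` keys are my assumption about the shape MooWheel expects, so check them against the MooWheel documentation before relying on them.

```python
import json
from collections import defaultdict


def bindings_to_wheel(sparql_json: dict) -> list[dict]:
    """Group SPARQL/JSON bindings into one node per category label."""
    connections = defaultdict(set)
    for binding in sparql_json["results"]["bindings"]:
        category = binding["categoryLabel"]["value"]
        related = binding["relatedLabel"]["value"]
        # Record the link in both directions so every label becomes a node.
        connections[category].add(related)
        connections[related].add(category)
    # Assumed node shape: id, display text, and a list of connected ids.
    return [
        {"id": label, "text": label, "connections": sorted(linked)}
        for label, linked in sorted(connections.items())
    ]


# Hypothetical response, shaped like the results of the query above.
sample = {
    "head": {"vars": ["categoryLabel", "relatedLabel"]},
    "results": {"bindings": [
        {"categoryLabel": {"type": "literal", "value": "Comedy"},
         "relatedLabel": {"type": "literal", "value": "Drama"}},
        {"categoryLabel": {"type": "literal", "value": "Drama"},
         "relatedLabel": {"type": "literal", "value": "Comedy"}},
        {"categoryLabel": {"type": "literal", "value": "Drama"},
         "relatedLabel": {"type": "literal", "value": "Science Fiction"}},
    ]},
}
wheel = bindings_to_wheel(sample)
print(json.dumps(wheel, indent=2))
```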

My first attempt at a visualisation simply executed the above query across the entire BBC dataset. This generated a huge wheel of connections between the categories, but ultimately the visualisation wasn’t that useful. So I decided to refine the visualisation to generate separate category wheels for each of the main BBC TV channels. This involved refining the SPARQL query to include an extra triple pattern to limit Episodes to just those associated with a specific channel (po:masterbrand). The following revised query restricts results to BBC 1:


PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?categoryLabel ?relatedLabel WHERE
{
  ?episode a po:Episode;
    po:masterbrand <http://www.bbc.co.uk/bbcone#service>;
    po:category ?category;
    po:category ?related. 

  ?category a po:Genre;
    rdfs:label ?categoryLabel. 

  ?related a po:Genre;
    rdfs:label ?relatedLabel. 

  FILTER (?category != ?related)
}
ORDER BY ?categoryLabel

The results of this visualisation are much more interesting.

Each of the BBC channels has a different range of programming and this emphasis is really clear in the visualisation. Compare for example BBC 1 and BBC 3, or either with BBC 4. For those of us in the UK who have already internalised this, there may not be a great deal of new information here, but it’s nice to see how this feature of the dataset can be easily surfaced with very little effort. There’s more analysis that could be done here though, particularly if the BBC open up their programme archives. For example, how does the range of programme categories for a channel change over time? Which programmes actually link the different categories together? Could other visualisations provide more insight into the programming than a simple relationship wheel? For example, could a treemap style visualisation give some indication of the amount of schedule time devoted to a particular category of programme?

Why not see what you can come up with?

Presenting BBC and NASA data using Freemix and SPARQL

One of the most interesting applications I saw at the recent Semantic Technology conference was Freemix. The application, which is currently in limited beta, allows anyone to easily create customized views over data that they upload into the system. There’s also the usual networking features providing an additional social dimension to data sharing and publishing. As I understand it Zepheira have plans for expanding the range of features in all sorts of ways, including new visualisations, the ability to merge and remix data from several sources, and naturally enough a commercial version that can be deployed within the enterprise.

The core of Freemix is Simile Exhibit and a drag and drop interface for building up an Exhibit presentation over data that the user has uploaded. Data can be presented in several different ways, including simple tabular spreadsheets and in the Exhibit JSON format. Exhibit provides a number of different existing views suitable for presenting data, including lists, tables, maps, timelines, etc. As a web developer it’s straightforward to build up your own Exhibits; Freemix takes this to the next level, making it trivial to build a presentation in just a few clicks, without the need to learn any markup: you just have to understand your own data.

Naturally enough I was curious to know whether Freemix could be used to build presentations of Linked Data, and specifically whether I could feed it with data from the Talis Platform. I’ve been working with the BBC data quite extensively recently, and have been compiling a space flight dataset. So I thought I’d use those as my test cases. Both of these datasets are in Platform stores, so I explored the options for extracting data using a SPARQL query in order to build a presentation in Freemix. It turns out it’s really easy.

Freemix supports importing JSON data from a URL, so I knew that in theory I could write a SPARQL query against a Platform store and use the SPARQL protocol request URL as the import target. As I didn’t want to extract the whole dataset, just some interesting subset for my presentation, a SPARQL CONSTRUCT query seemed like the best option. Like Exhibit, Freemix requires a relatively flat data structure — i.e. resources with properties, rather than a true directed graph. This means that within the CONSTRUCT query I would need to simplify the graph structure, removing some of the richer modelling, to re-shape the data to fit Freemix’s expectations.
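To illustrate the idea of using a SPARQL protocol request URL as an import target, here is a small Python sketch. The SPARQL protocol carries the query text in a `query` parameter on a GET request; the store name and the `/services/sparql` path below are assumptions for illustration, and the query is a placeholder rather than one of the queries in this post.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical store name; the /services/sparql path is also an assumption.
ENDPOINT = "http://api.talis.com/stores/example-store/services/sparql"

# Placeholder query, standing in for a real CONSTRUCT query.
QUERY = """
CONSTRUCT { ?s ?p ?o }
WHERE { ?s ?p ?o }
LIMIT 10
"""


def sparql_request_url(endpoint: str, query: str) -> str:
    """Return a SPARQL protocol GET URL carrying the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = sparql_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL is a single string that can be pasted into any tool that imports data from a URL.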

Here’s a query I came up with for my NASA data:


PREFIX rdfs:
PREFIX dc:
PREFIX space:
PREFIX xsd:
PREFIX foaf: 

CONSTRUCT {
?spacecraft foaf:name ?name;
space:agency ?agency;
space:mass ?mass;
foaf:homepage ?homepage;
space:launched ?launched;
dc:description ?description;
space:discipline ?label.
}
WHERE {
?launch space:launched ?launched.

?spacecraft foaf:name ?name;
space:agency ?agency;
space:mass ?mass;
foaf:homepage ?homepage;
space:launch ?launch;
dc:description ?description;
space:discipline ?discipline.

?discipline rdfs:label ?label.

FILTER (?launched > "2005-01-01"^^xsd:date)
}

The query finds all spacecraft launched since 2005, extracting the name, agency, mass, etc. The labels of the disciplines (subject categories) and the launch dates, which are originally associated with separate resources in the underlying graph, are re-presented here as properties of the spacecraft itself. Not ideal from a modelling or data interchange perspective, but a reasonable trade-off for shaping data for presentation purposes.

So far so good. The Talis Platform supports a range of output options from CONSTRUCT queries including both RDF/XML and RDF/JSON. Unfortunately Freemix doesn't support RDF/JSON as an input option, although this would make a nice addition to the range of import options. In order to convert from the RDF/XML to the Exhibit/JSON format for Freemix I used the Talis Morph service. Morph is a simple service that provides a number of options for converting between semantic web formats. RDF/XML to Exhibit/JSON is one of those options, so all I needed to do was pipe the original SPARQL query URL through the Morph service to get my final import target for Freemix.
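Morph did this conversion for me, but to show roughly what the reshaping involves, here is a minimal Python sketch that groups flat (subject, property, value) triples into Exhibit-style items. As far as I understand the Exhibit JSON format, it is an object with an `items` list whose entries carry a `label`; the function name, the property names and the sample spacecraft are all made up for illustration, and this is not how Morph itself is implemented.

```python
import json


def triples_to_exhibit(triples):
    """Group flat (subject, property, value) triples into Exhibit-style items."""
    items = {}
    for subject, prop, value in triples:
        item = items.setdefault(subject, {"id": subject})
        if prop not in item:
            item[prop] = value
        elif isinstance(item[prop], list):
            item[prop].append(value)
        else:
            # A second value for the same property turns it into a list.
            item[prop] = [item[prop], value]
    return {"items": list(items.values())}


# Hypothetical, already flattened data in the shape the CONSTRUCT produces.
triples = [
    ("http://example.org/spacecraft/1", "label", "Example Probe"),
    ("http://example.org/spacecraft/1", "agency", "Example Agency"),
    ("http://example.org/spacecraft/1", "discipline", "Astronomy"),
    ("http://example.org/spacecraft/1", "discipline", "Planetary Science"),
]
exhibit = triples_to_exhibit(triples)
print(json.dumps(exhibit, indent=2))
```

This only works because the CONSTRUCT query has already flattened the graph: every value is a literal hanging directly off the resource being presented.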

You can view the imported data on my Freemix homepage. And here's a presentation of that same data. As you can see the presentation provides list and table views, piecharts that break down launches by agency and discipline, and also a timeline view of the launches. This was incredibly quick to put together.

I tried the same approach with some BBC data. So here's a simple Dr Who episode guide as a Freemix. The presentation options are a little more limited here, partly because there aren't as many natural facets to the BBC data, but also because Freemix doesn't (yet?) offer the ability to, e.g. create a coverflow presentation of images, or a tag cloud over blocks of text. The ability to mark fields as numbers and sort tables by multiple fields would also be useful. Having said that, try searching for "Rose" in the search box to see which episode descriptions mention her; note that the series facet on the left also automatically updates.

As you can see from the SPARQL query, some massaging of the graph structure was required to include series titles against each episode.


PREFIX foaf:
PREFIX rdfs:
PREFIX po:
PREFIX dc:
PREFIX freemix: 

CONSTRUCT {

?episode a po:Episode;
foaf:depiction ?depiction;
freemix:seriesTitle ?seriestitle;
dc:title ?title;

po:position ?position;
po:short_synopsis ?syn.
}
WHERE
{
po:series ?series.

?series dc:title ?seriestitle;
po:episode ?episode.

?episode a po:Episode;
foaf:depiction ?depiction;
dc:title ?title;
po:position ?position;
po:short_synopsis ?syn.

}

My only other issue with Freemix is the live-ness of the data. Ideally instead of having to import data directly into the system, it should instead be fetched from source either on demand or on a regular basis. I suspect this is the kind of feature that will end up in a commercial version of the product.

Overall though I was quite pleased with how easy it was to create these kinds of presentations. I'm convinced that for Linked Data to truly hit the mainstream we need simple tools like Freemix that let all of us easily compile and create custom presentations of data. Obviously, we also need to be able to easily select the data that we want to display, and very few people will want to bother with SPARQL queries. So I think there is some interesting work to be done to create SPARQL query builders that tie into browsers, e.g. so I can select the data facets I'm interested in as I browse and then choose to represent those facets in different ways.

Searching the BBC Data in the Talis Platform

I’ve previously blogged about how easy it is to create a custom search index using the Platform. So obviously during the process of loading the BBC programmes and music data into the Platform we’ve used this feature to build a search engine across their data.

In this post I wanted to show a few example queries and then review how we’ve configured the search indexes so you can not only get the most from the feature, but also see how it can be used against real-world data.

Sample Queries

Here are some sample queries. The Platform is more of a search engine tool-kit than a search engine per se: the results aren’t a human-readable web page, they’re an RSS 1.0 document that contains enough structured metadata about each item in order to build a presentation of the results. And where additional metadata is required, this can be extracted using the describe service, additional searches, augmentation or a SPARQL query.
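As a sketch of what building a presentation from the results might start with, the following Python pulls the item URI, title and link out of an RSS 1.0 document using only the standard library. The sample feed is a hand-written stand-in, not real output from the Platform, and a real application would probably want the extra metadata on each item too.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}


def parse_items(rss_xml: str) -> list[dict]:
    """Extract the URI, title and link of each item in an RSS 1.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    # In RSS 1.0 the items are direct children of the rdf:RDF root element.
    for item in root.findall("rss:item", NS):
        items.append({
            "uri": item.get("{%s}about" % NS["rdf"]),
            "title": item.findtext("rss:title", default="", namespaces=NS),
            "link": item.findtext("rss:link", default="", namespaces=NS),
        })
    return items


# Hand-written sample feed, for illustration only.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.example.org/search">
    <title>Search results</title>
    <link>http://www.example.org/search</link>
    <description>Example results</description>
  </channel>
  <item rdf:about="http://www.example.org/item/1">
    <title>Example item</title>
    <link>http://www.example.org/item/1</link>
  </item>
</rdf:RDF>"""

print(parse_items(SAMPLE))
```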

However for the purposes of this article it’s enough to view the examples in your browser. Application developers will want to dig into the underlying markup to see what extra data is included.

  • A search for “Banksy”
  • A search for “The Prodigy”, returning the artist, the dbpedia entry, and episode titles and descriptions in which they are mentioned
  • A search for “Terry Pratchett”, which again produces a mixture of different types
  • A search for “Prodigy”, limited to things that are of type “http://purl.org/stuff/rev#Review” (results)
  • A facetted search for “Prodigy”, grouping the results based on their RDF type (results). This shows us that we have results not only in episodes but in a variety of other types too. We can drill down into these to form the following search:
  • A search for “Prodigy”, limited to Music Segments (results)

If you want to try out your own queries, then use this simple form.

The Configuration

To show how we’ve configured the Field Predicate Map and Query Profile for the BBC Backstage store, I’ve uploaded them to our public SVN: fmap.rdf and queryprofile.rdf

Looking at the Field Predicate Map, you can see we’ve configured the Platform store to index the key predicates in the BBC data, including titles, labels, descriptions and synopses. You can use any of the named fields in the configuration to refine searches to specific predicates in the data, allowing construction of an “advanced search form”. E.g. we can search for name:”Stephen Fry” to search for a person called Stephen Fry (results).
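If you would rather build these search URLs in code than type them into a form, a minimal Python sketch looks like this. It assumes the `items?query=` URL pattern used by the Platform's search endpoint; the helper name is my own.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def search_url(store: str, query: str) -> str:
    """Build an items search URL for a store, URL-encoding the query text."""
    return "http://api.talis.com/stores/%s/items?%s" % (
        store, urlencode({"query": query}))


# A field-qualified query like the name:"Stephen Fry" example above.
fry = search_url("bbc-backstage", 'name:"Stephen Fry"')
print(fry)
```

The encoding step matters for field queries, because the colon, quotes and space all need escaping before they can travel in a URL.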

The RDF type property is also included in the Field Predicate Map to allow us to limit searches to particular types of resource. It also enables us to do facetted searches based on type, giving us an alternate view of the data. It’s easy to see how that functionality could be used to help build some useful additional options to restrict the search results presented in a user interface.

To configure the relevance ranking we have chosen to boost hits in “labels” (names, labels, titles) over “descriptions” (descriptions, synopses, reviews). We could easily change the boosting to favour one or other type of predicate to further tweak the results. But this configuration provides a reasonable set of search results for the tests we’ve done. Let us know how you get on and whether you think any of this should be changed. We’re happy to alter the configuration to make sure that people can get the most from the BBC data.

Fishing for BBC Data using Augmentation

In some of my recent talks I’ve used the metaphor of streams, pools and reservoirs for describing the flow and collection of data across the web. I usually refer to some of the different forms of data extraction that we support on the Platform, which cover keyword searching as well as more structured queries.

Another form of data extraction is the Augmentation Service, which might be described as “fishing for data, using URIs as bait”. I thought I’d put together a little illustration that shows the potential for this kind of data extraction, as it’s both powerful and simple to use: so simple that you don’t need to write any queries at all.

Let’s look at a sample RSS 1.0 feed that contains a review of an episode of Dr Who. For brevity, I’ll only include the metadata for the single item in the feed:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
  <rev:text>A classic episode of Dr Who...</rev:text>
  <foaf:primaryTopic rdf:resource="http://www.bbc.co.uk/programmes/b0074gpl#programme"/>
</item>

The item has the standard RSS 1.0 elements for title and link, but as the item is also a review, it also includes some additional metadata using the review vocabulary. The relationship between the review item and the Episode that is being reviewed is made using the foaf:primaryTopic property. The precise vocabularies don’t really matter; the important thing is that there is a reference to a BBC /programmes URI: this is our bait.

The Augmentation Service allows the URL of an RSS 1.0 feed to be passed in as a parameter. You can use the form provided by the augment service on the BBC Backstage store and paste in the URL of the sample RSS 1.0 feed, or click here to review the results. Within the browser you won’t see that a great deal has changed, although you should see that the results are themselves an RSS 1.0 feed. What the Augmentation Service does is process an RSS feed to augment the metadata in the feed items against data present in the Platform Store.

Here’s the same RSS item after it’s been augmented. The additional metadata is the description of the episode that is now nested inside the foaf:primaryTopic element:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <foaf:primaryTopic>
 <ns.0:Episode rdf:about="http://www.bbc.co.uk/programmes/b0074gpl#programme">
  <ns.0:medium_synopsis>In an old, abandoned house, the Weeping Angels wait.
  Only the Doctor can stop them, but he's lost in time.</ns.0:medium_synopsis>
  <rdf:type>
    <rdf:Description rdf:about="http://purl.org/ontology/po/Episode"/>
  </rdf:type>
  <ns.0:position>10</ns.0:position>
  <ns.0:short_synopsis>Only the Doctor can stop the Weeping Angels, but he's lost in time.</ns.0:short_synopsis>
  <ns.0:genre>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/genres/drama/scifiandfantasy#genre"/>
  </ns.0:genre>
  <ns.0:microsite>
    <rdf:Description rdf:about="http://www.bbc.co.uk/doctorwho/"/>
  </ns.0:microsite>
  <ns.0:version>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b0073km9#programme"/>
  </ns.0:version>
  <foaf:depiction>
    <rdf:Description rdf:about="http://www.bbc.co.uk/iplayer/images/episode/b0074gpl_512_288.jpg"/>
  </foaf:depiction>
  <ns.1:label>Blink</ns.1:label>
  <ns.0:masterbrand>
    <rdf:Description rdf:about="http://www.bbc.co.uk/bbcone#service"/>
  </ns.0:masterbrand>
  <dc:title>Blink</dc:title>
 </ns.0:Episode>
</foaf:primaryTopic>
<rev:text>A classic episode of Dr Who...</rev:text>
<rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
</item>

As you can see the feed now includes all of the key metadata about the episode, including its title, a synopsis, a link to a depiction of the episode, and to the Dr Who microsite on the BBC. All without writing any queries.

The trigger for the augmentation to look up the data is simply the presence of a URI in the feed that is also present in the RDF in the Platform Store. If the URI is not found then it is ignored. But if the URI is present then a description of that resource is automatically added to the RSS feed. In formal RDF terms that description is the Concise Bounded Description of the resource. More simplistically it will be all simple literal properties associated with the resource (e.g. the title and the synopsis) plus links to any related resources (e.g. the microsite, the genre). The end result is a feed that has been either completely or partially enriched against the data.
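The lookup logic can be modelled in a few lines of Python. This is a toy, in-memory sketch of the behaviour described above, not the service's actual implementation: the store is a plain dictionary keyed by URI, and the item is a flat dictionary of properties. The title and position values are taken from the augmented item shown earlier.

```python
def augment(item: dict, store: dict) -> dict:
    """Return a copy of the item with known URIs replaced by their descriptions."""
    enriched = {}
    for prop, value in item.items():
        description = store.get(value) if isinstance(value, str) else None
        if description is not None:
            # Known URI: embed its description alongside the URI itself.
            enriched[prop] = {"uri": value, **description}
        else:
            # Literal or unknown URI: copy it through unchanged.
            enriched[prop] = value
    return enriched


# Toy store holding a cut-down description of one episode.
store = {
    "http://www.bbc.co.uk/programmes/b0074gpl#programme": {
        "title": "Blink",
        "position": 10,
    },
}
item = {
    "about": "http://www.example.org/reviews/1",
    "title": 'Review of "Blink"',
    "primaryTopic": "http://www.bbc.co.uk/programmes/b0074gpl#programme",
}
result = augment(item, store)
print(result)
```

Note how the item's own URI is left alone because the store knows nothing about it, which matches the "if the URI is not found then it is ignored" rule.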

This kind of data augmentation is uniquely possible with RDF because of its reliance on URIs for global identifiers. It makes dipping into a pool of data very easy to do. It’s also possible to augment a feed against multiple stores, pipelining the augmentation across multiple datasets to gather up all of the relevant data. As the output of a search against a Platform store is also RSS 1.0, you can enrich search results against multiple stores starting from an initial keyword search.

You can also see how this kind of enrichment can be used as part of, e.g. a Yahoo Pipeline. This is the primary reason why the service has been initially designed to work on RSS 1.0 feeds: it’s well supported and easy to generate, and of all the varieties of RSS, RSS 1.0 is processable as both an RDF and an XML vocabulary, making it easy to process in this context. We are intending to expand the support to cover generic RDF input and output, and other flavours of RSS.

In the meantime, happy data fishing!

Augmenting Last.fm Data with BBC data on the Talis Platform

A short while back, I created a Linked Data wrapper on the Last.FM API for Events and Artists. The artist data links to the BBC’s data about each artist using owl:sameAs.

Now that the BBC RDF is available in a Talis Platform store, I can put some of my Last.FM data into a store (it’s currently generated on the fly from the Last.FM API), search on it, and then augment it with data from the BBC.

So I put some Last.FM data into the Sandbox1 store.

Now I can search on it with the items query endpoint like:

http://api.talis.com/stores/sandbox1/items?query=Black

This gives us the results as RSS 1.0, which is also RDF/XML, and contains a graph with 12 resources in it.

We can now pass the URI of this (or any RSS 1.0) document to the BBC-Backstage store’s Augment Service like this:

http://api.talis.com/stores/bbc-backstage/services/augment?data-uri=http%3A%2F%2Fapi.talis.com%2Fstores%2Fsandbox1%2Fitems%3Fquery%3DBlack

The Augment service will look at the URIs in the RSS results, and add DESCRIBEs for any of those URIs that it finds in its own store, giving you back the RSS augmented with BBC data.
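The only fiddly part of building these URLs by hand is the percent-encoding of the `data-uri` parameter, so here is a small Python sketch that does it. The helper name is mine, and the second store in the chained example is hypothetical; it is there only to show how one augmented feed could be passed on to another store.

```python
from urllib.parse import quote


def augment_url(store: str, data_uri: str) -> str:
    """Build an Augment Service URL, percent-encoding the feed URL it wraps."""
    encoded = quote(data_uri, safe="")
    return f"http://api.talis.com/stores/{store}/services/augment?data-uri={encoded}"


search = "http://api.talis.com/stores/sandbox1/items?query=Black"

# Augment the search results against the BBC store, as in the URL above.
once = augment_url("bbc-backstage", search)
print(once)

# Chain a second augmentation by wrapping the first URL ("another-store" is made up).
twice = augment_url("another-store", once)
print(twice)
```

Each level of wrapping encodes the previous URL again, which is why the inner percent signs turn into `%25` in the chained version.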

So the graph we get back now contains 15 resources, where the BBC-Backstage store has found descriptions for 3 of the URIs in the original RSS.

For further information, see Leigh Dodds’s slides on Getting Started with the Talis Platform.

Understanding the Big BBC Graph

In the lead up to the announcement of the BBC SPARQL endpoint trials I’ve spent quite a bit of time working with and exploring the BBC /programmes and /music datasets. I thought it would be useful to write up some of this to help out those of you looking to explore the data using the Talis Platform SPARQL endpoint. (Tip: use the newer SPARQL form for a better user experience when exploring the data.)

What’s in the Store?

Currently the Platform store includes metadata for over 360,000 Radio and TV programme Episodes along with information on which Versions of those programmes have been broadcast, including the time and channel on which they were shown. Information is also available for 6,500 Series and 5,500 Brands, and their relationships; for more on that see below.

For the music data, the endpoint includes all of the artist and album metadata currently available from the BBC Music website, which comprises over 23,000 solo artists, 11,000 groups, and 25,000 albums. There are also nearly 4,500 album reviews.

This core dataset is approximately 20 million triples, and this is obviously growing as new episodes and broadcasts are made, and as we crawl that additional data. But that’s not all…

The artist metadata refers to dbpedia entries via owl:sameAs links, and this immediate context has also been included, providing a single location to query and find all the additional metadata about a recording artist. As the metadata on the BBC programmes website gets updated to include dbpedia links, then this will also get included. We’re working with the BBC to get some of these links in place as soon as possible.

The /programmes team recently updated the website to begin exporting “segment” data. This describes what artist was being played in a specific segment of a broadcast (currently limited to Radio 2 & 6), providing links between the programmes and music datasets. Increasingly it really is just one large graph that the BBC are producing.

What Ontologies are Used?

The core of the dataset is modelled using the Programmes and Music ontologies. There is also the usual sprinkling of Dublin Core and FOAF terms to capture titles, describe people, provide images for episodes, etc. The RDF Review vocabulary has been used to model the album reviews.

The programmes website includes some content categories for genres and formats. These are modelled in the dataset as SKOS concepts. There seems to be some nascent support in the data for capturing metadata about people and places appearing in programmes. At the moment these are also modelled using SKOS.

That comprises the core data. Beyond that there are a number of different terms used in the dbpedia portions of the dataset. Check the dbpedia documentation for more information.

Understanding Brands, Series, Episodes

To get the most from the BBC programmes data you’ll need some understanding of some of the variations in the graph to ensure that you don’t accidentally exclude data in your queries. And if you’re a modelling geek like me it’s interesting in its own right! Any mistakes in the following are all my own, apologies to the BBC folk.

A Brand is a top-level concept that defines a collection of works. It’s the resource that ties together Series and Episodes. Dr Who is a brand, as is the BBC News, and The Catherine Tate Show. A Series, as you’d expect, is a run of Episodes, e.g. “Series 1 of The Wire”. And an Episode is similarly intuitively named.

We’re all already familiar with the basic relationships between these concepts. A Brand (“Red Dwarf”) may be related to a number of Series (“Red Dwarf Series 1”) and a Series is composed of Episodes (“Red Dwarf, Series 1, Episode 1”). But there are a few wrinkles that are worth pointing out, as they can impact the way you write your SPARQL queries. Thanks to Michael Smethurst for giving me a run-down of some of these!

Firstly a Brand may not be broken down into Series at all. The BBC News, for example, is simply a continuous stream of Episodes. Radio shows are similar.

Similarly a Series of Episodes may not necessarily be associated with a Brand. It may be a one-off run of Episodes, e.g. a short documentary series like Incredible Animal Journeys.

Some Episodes are not associated with either a Series or a Brand, e.g. films like Lady In the Water.

And there’s also the more interesting relationship that sees two Series being associated with one another. For example “Waking the Dead” is divided up into Series (e.g. Series 5), which themselves contain other Series (covering a specific story line, e.g. Towers of Silence) and then individual Episodes (Part 1).

(As an aside, this is the kind of flexibility that makes RDF such a great tool for modelling real-world data. I’ve used similar approaches in the past to model bibliographic metadata, throwing out hierarchies and simply connecting together chunks of content in whatever structure is most suitable.)

An Episode may also have more than one Version. It is at the Version level that information such as the sound format or duration of the show is captured; after all, there may be many different manifestations of the same episode. Versions are also associated with Broadcasts, which capture the date, time and channel (“masterbrand” in the Programmes ontology) on which the programme is aired. A Version of an Episode may be broadcast several times.

Finally at the most fine-grained level, there are Timelines that describe the start and end time of a specific broadcast.
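One way to keep these variations straight is to model them with optional links. The Python sketch below is my own simplification for illustration, not the Programmes ontology itself (it leaves out Versions, Broadcasts and Timelines): an Episode may or may not have a Series or Brand, and a Series may or may not have a Brand or a parent Series.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Brand:
    title: str


@dataclass
class Series:
    title: str
    brand: Optional[Brand] = None       # a one-off Series has no Brand
    parent: Optional["Series"] = None   # a Series may sit inside another Series


@dataclass
class Episode:
    title: str
    series: Optional[Series] = None     # e.g. news bulletins have no Series
    brand: Optional[Brand] = None       # e.g. films have neither


def lineage(episode: Episode) -> list[str]:
    """Return titles from the outermost container down to the episode."""
    chain = [episode.title]
    brand = episode.brand
    series = episode.series
    while series is not None:
        chain.append(series.title)
        brand = brand or series.brand
        series = series.parent
    if brand is not None:
        chain.append(brand.title)
    return list(reversed(chain))


waking = Brand("Waking the Dead")
series5 = Series("Series 5", brand=waking)
towers = Series("Towers of Silence", parent=series5)
part1 = Episode("Part 1", series=towers)
film = Episode("Lady In the Water")

print(lineage(part1))
print(lineage(film))
```

The practical consequence for queries is the same as for this code: any link to a Series or Brand should be treated as optional, otherwise films and one-off series silently drop out of the results.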

Application Ideas

During my expeditions through the Big BBC Graph (“you’re in a maze of twisty little predicates, all alike…“) I’ve come up with a few application ideas that it would be interesting to put together. I thought I’d throw these out and see if anyone wants to pick them up.

Programme Reviews. It’d be easy to build a mashup of the BBC programmes data and something like Revyu (which also has a SPARQL endpoint) to allow someone to review a programme that they watched last night. Note that as our crawling will be lagging behind the live site until we’ve implemented real-time updates, there will be a delay between something being aired and it appearing in the Platform for reviewing.

PVR Integration. There are a number of open source PVR solutions out there. Could some of these be updated to automatically pull in additional data from the endpoint to improve electronic programme guides?

Geographic Overlays. The interconnections between radio programmes, artists and their locations offer an opportunity to build some mapping mashups, using either Google Maps or Earth. For example it ought to be possible to lay out the geographic spread of artists played by different BBC radio programmes and stations. Interested in music from a particular country or region? (Maybe you’re planning a trip there and want to pick up on the local vibe.) Then use a map to home in on radio programmes that are most likely to play those artists.

Fan Widgets. The ability to extract data from the endpoint using SPARQL and JSON means that it’s really easy to create little widgets to include programme data on external web pages. Could something like the Doctor Who Tardis Index File be enriched by widgets that came straight from the BBC database? Throw in additional annotations from the community and you could make some really interesting embeddable gadgets. Of course there’s also the other direction: if fan communities start using BBC identifiers then the BBC may be able to feed this crowd-sourced data back into their site, just as they’re doing with Wikipedia (via dbpedia).

Under the Talis Connected Commons scheme anyone can have free hosting on the Platform for public domain data, so if a fan community wanted to organize itself around creating additional annotations for BBC programmes (how about character lists? mood assessment? scene breakdowns?) then these can be stored in the Platform for free, and then mashed up with the BBC data on the server-side using features like the Augmentation service, or on the client-side using SPARQL and JSON. Lots of potential there.

Summary

Hopefully that provides a good overview of the BBC linked data graph that we’re now hosting in the Talis Platform. There should be sufficient pointers here, and in some of the example queries and demos we’ve put together, to get you started. If not, then feel free to ask questions on the BBC Backstage mailing list, the n2-dev mailing list, or on IRC in #talis on irc.freenode.net.