Archive for the 'Uncategorized' Category

Surge 2010

Having now finally gotten over my jetlag, I’ve had a few minutes to write up my notes from Surge 2010, which was a really great couple of days, perfectly filling its niche. It also had probably the best lineup of speakers at any conference I’ve attended. Aside from the content, the whole thing was brilliantly organised and run by OmniTI, who deserve a massive amount of credit for initiating such an awesome event. Mostly for my own benefit, I’ve collected a few writeups from other folk who attended, and videos & slides from pretty much all of the sessions are due to be published any day now.

The main message coming through was read more, learn more, share more. This theme ran through a number of talks, from John Allspaw & Bryan Cantrill’s opening keynotes to Theo’s closing plenary where he delivered the 11 Commandments of Scaling. There’s a huge body of literature out there constantly being produced by the academic and research communities. In general, we in industry are not particularly good at putting it to use and building on top of it – all too often we’re found re-inventing the wheel, making the same mistakes over and over, and then perpetuating this vicious circle by not sharing our experiences with our peers.

Standout sessions for me included Allspaw’s keynote, delivered with customary insight and aplomb, where he talked of the absolute immaturity of Web Operations as a discipline, and of the huge amount that we can learn from more established disciplines like civil & mechanical engineering, and from the aerospace and utilities industries, which have been tackling similar-shaped problems for decades, if not centuries.

Another highlight for me was Basho CTO Justin Sheehy’s session on concurrency in distributed systems. Here, we got right to the nub of the issue – in any complex system, both in the real universe and in computer systems, it’s usually not correct to think of time as a single linear flow of events occurring in lockstep. Any software system, particularly any distributed system, that attempts to hide the underlying asynchronicity that this entails is fundamentally flawed. There are no strong guarantees of consistency in the physical world and certain domains, like banking for example, have long recognised this and built compensating mechanisms into their systems. A great soundbite is that we shouldn’t aim to build reliable systems (i.e. ones that do not fail), but that we should aim to make our systems resilient to the failures that they will inevitably encounter.

There were also some great case studies and war stories including Artur Bergman’s deep dive into operations at Wikia, Ruslan Belkin’s ‘Going 0 to 60: Scaling LinkedIn’ and Geir Magnusson’s detailed walkthrough of how gilt.com scaled up from a typical n-tier application by building out a loosely coupled, service oriented back end.

I definitely learned a lot, had a bunch of things reaffirmed, and also found a lot of great validation for the stuff we’re doing on our Platform. Can’t wait for next year.

Royal Society Web Science

I’ve just spent the last couple of days at the Royal Society Web Science discussion meeting, which I felt was a very special event for the following reasons. Web Science (the internet/www as an object of scientific study) is emerging as a new interdisciplinary field of activity with collaborators from both science and the humanities. This crossover of ideas from many different disciplines (physics, mathematics, computer science, politics, philosophy, sociology) could prove fruitful, and indeed there were speakers at the event from all these disciplines. All of the speakers were very good indeed, some excellent, and all with high calibre backgrounds and good credentials; people who have obviously paid their dues with years of hard work and good research.

Some common themes, ideas mentioned by more than one speaker, were as follows. More than one person mentioned Frigyes Karinthy and the 6 degrees of separation concept. Another theme was the value to researchers of having at their disposal unprecedentedly vast amounts of rich data, the “digital traces” (Kleinberg) of all of our interactions on the web. With this kind of data sociologists and other students of humanity have the ability to examine human behaviour, and may be able to prove and disprove theories by empirical studies at a scale not possible before. Another common theme was the value of the internet and the web: the value of maintaining the structure of the internet and ensuring its security and scalability, and the value of keeping the web democratic and open.

The presentations should be available to view from http://royalsociety.org/Web-science-a-new-frontier/.

Heading out to Surge 2010

A couple of us will be flying out to Baltimore tomorrow for the inaugural Surge Conference. Billed as “more than an event, it’s a chance to identify emerging trends and meet the architects behind established technologies”, the speaker list includes some real heavyweights and it’s hard (really hard) to pick which sessions to miss.

If you’re going to be there and fancy meeting up, feel free to ping either of us @beobal & @daveiw

Putting Structure Into Application Logs

Most of us know that application logs can be a fantastically rich source of data, and if you can mine them effectively they can be an extremely valuable resource in lots of contexts. We make use of the app logs from the various Platform components in a number of ways – for trend analysis, monitoring, tracking deployments of new features, fault diagnosis and post-mortem analysis.

Unfortunately for us ops peeps, application logging is a big, moving target. We might dark deploy a version of the software with super verbose output, or we might introduce a new feature which renders our previous perl + regex home brew tool to pull out important values pretty useless. This means that the important data that we want to trend on in March 2009 needs a different regex from the one needed in June 2010.

Standards! I hear you cry! It seems (to me) that existing standards are good at getting the message into the “right” place in the log, which makes them easy to parse and pass around the network, but they rely on the fact that you will be saying the same thing in the logs over and over again – they rely on rigid structure, and a change to logs is the same as adding a new column to an SQL database. This means that if you are trying to match values on their position, sometimes adding a new variable will render your previous match useless – and worse still you might not know about it until it is too late!

Basically, string matching just isn’t robust or portable enough for what we want to do so we’ve been adding structure to the data in our logs. We’re currently outputting logs from some of our services as json – this means that we can dynamically add and remove variables as we wish, we can potentially send the logs straight to an indexer, load them into a db, convert them to RDF, or process them using Hadoop-based tools. We can still do our perl based graph building – by converting the json to an array or a hash and pulling out the right fields – irrespective of where they appear in the output.

To get a fully jsonised log line we need to work with 3 main areas: the application itself, the application’s logging handler, and finally the centralised logging tool. Here is some real output:

{
    "syslogDate" : "2010-06-28T15:24:45+01:00",
    "syslogFacility" : "local0",
    "syslogPriority" : "info",
    "host" : "somehost.talis",
    "Message" : {
        "Process": "WS",
        "Date": "2010-06-28 15:24:45,834",
        "Priority": "INFO",
        "Category": "MemoryUsageLogger",
        "Thread": "Memory-Usage-Logger-Thread",
        "id": "",
        "Message": {
            "Memory Usage" : {
                "Heap" : {
                    "init" : "60",
                    "committed" : "73",
                    "used" : "26",
                    "max" : "864"
                },
                "Non-Heap" : {
                    "init" : "23",
                    "committed" : "24",
                    "used" : "22",
                    "max" : "216"
                }
            }
        }
    }
}

The first part of the json comes from syslog-ng, which puts in the details of how syslog sees the log message. The first sub “Message” comes from Log4J, and is how Log4J sees the message. The final “Message” is from the application itself, which in this instance gives us some general information about memory usage.
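As a rough illustration of why the nesting helps, here is a minimal Python sketch that pulls values out of a log line by name at each of the three layers. The log line is abbreviated from the sample output; the function name `heap_usage` is made up for this example and is not part of any of our tools.

```python
import json

# One log line, abbreviated from the sample output in this post.
LOG_LINE = """
{
    "syslogDate": "2010-06-28T15:24:45+01:00",
    "host": "somehost.talis",
    "Message": {
        "Date": "2010-06-28 15:24:45,834",
        "Category": "MemoryUsageLogger",
        "Message": {
            "Memory Usage": {
                "Heap": {"init": "60", "committed": "73", "used": "26", "max": "864"}
            }
        }
    }
}
"""


def heap_usage(line):
    """Return (host, used, max) for a memory usage log line.

    Fields are looked up by name at each layer, so adding or reordering
    other fields in the log output does not affect the result.
    """
    record = json.loads(line)   # outer layer, written by syslog-ng
    log4j = record["Message"]   # middle layer, written by Log4J
    app = log4j["Message"]      # inner layer, written by the application
    heap = app["Memory Usage"]["Heap"]
    # The sample output carries the numbers as strings, so convert them.
    return record["host"], int(heap["used"]), int(heap["max"])


host, used, maximum = heap_usage(LOG_LINE)
print(host, used, maximum)
```

Because nothing here depends on position, a new field added by any of the three layers leaves the lookup untouched.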

As well as making our lives easier, it also means we can correlate events much more quickly and easily. From the simple JSON above, we could write a rule that identifies discrepancies between “syslogDate” and “Date”, which could indicate a problem with the system clock on the application server, or even identify a problem with the syslog server unexpectedly slowing down.
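A rule like that could be sketched in Python roughly as follows. This is only an illustration: the 5 second threshold and the function names are invented for the example, and it assumes the Log4J “Date” (which carries no time zone) was written in the same UTC offset as “syslogDate”.

```python
from datetime import datetime

# Hypothetical threshold; pick whatever suits your environment.
MAX_SKEW_SECONDS = 5.0


def clock_skew_seconds(record):
    """Absolute difference between the syslog and application timestamps."""
    syslog_time = datetime.fromisoformat(record["syslogDate"])
    app_time = datetime.strptime(record["Message"]["Date"],
                                 "%Y-%m-%d %H:%M:%S,%f")
    # Assumption: the Log4J date has no zone, so treat it as being in
    # the same offset that syslog reported.
    app_time = app_time.replace(tzinfo=syslog_time.tzinfo)
    return abs((syslog_time - app_time).total_seconds())


def is_suspicious(record, limit=MAX_SKEW_SECONDS):
    """True when the two timestamps disagree by more than the limit."""
    return clock_skew_seconds(record) > limit


sample = {
    "syslogDate": "2010-06-28T15:24:45+01:00",
    "Message": {"Date": "2010-06-28 15:24:45,834"},
}
print(clock_skew_seconds(sample), is_suspicious(sample))
```

For the sample line the two timestamps are under a second apart, so the rule stays quiet.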

One of the uses we’re putting this to is to build real-time reporting of coarse-grained application events to complement the views we get from ganglia etc.

Moriarty DataTables: Active Record for RDF

DataTables are a new addition to the Moriarty PHP library. They are an implementation of the ActiveRecord pattern for use with RDF data in Talis Platform stores. They draw inspiration from the active record implementation in CodeIgniter.

The intention is to allow querying of RDF data in a natural way for most PHP coders. For example:

$dt->select('firstname')->from('person')->where('surname','Evans');
$dt->get();

In a relational database that kind of code would select the firstname column for every record in the person table that has a surname column with a value of Evans. With RDF we have two problems:

  1. there are no columns or tables, instead we have properties and classes.
  2. URIs are used to name things and URIs are long, ugly and easy to get wrong.

Moriarty’s DataTable class attempts to solve these two problems. It solves the first by treating properties as columns and classes as tables. The second problem it solves by allowing the user to specify short names for URIs. So we can write:

$dt->map('http://xmlns.com/foaf/0.1/firstName', 'firstname');
$dt->map('http://xmlns.com/foaf/0.1/surname', 'surname');
$dt->map('http://xmlns.com/foaf/0.1/Person', 'person');
$dt->select('firstname')->from('person')->where('surname','Evans');
$dt->get();

We can read that as selecting all values of the foaf:firstName property for resources of type foaf:Person that also have a foaf:surname property with a value of Evans. The DataTable class converts that into a SPARQL select query behind the scenes.
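To make the translation concrete, here is a small Python sketch of the idea: short names map to URIs, the “table” becomes an rdf:type pattern and each “column” becomes a property pattern. The class `TinyDataTable` is purely illustrative; it is not Moriarty’s implementation and the query it produces is not necessarily the one Moriarty generates.

```python
class TinyDataTable:
    """Hypothetical, simplified query builder (not Moriarty's code)."""

    def __init__(self):
        self._uris = {}      # short name -> URI
        self._fields = []    # selected "columns" (properties)
        self._type = None    # the "table" (class)
        self._filters = []   # (field, literal value) pairs

    def map(self, uri, short_name):
        self._uris[short_name] = uri
        return self

    def select(self, fields):
        self._fields = [f.strip() for f in fields.split(",")]
        return self

    def from_(self, type_name):  # 'from' is a reserved word in Python
        self._type = type_name
        return self

    def where(self, field, value):
        self._filters.append((field, value))
        return self

    def to_sparql(self):
        patterns = []
        if self._type:
            patterns.append("?s a <%s> ." % self._uris[self._type])
        for field in self._fields:
            patterns.append("?s <%s> ?%s ." % (self._uris[field], field))
        for field, value in self._filters:
            # A real implementation would escape the literal value.
            patterns.append('?s <%s> "%s" .' % (self._uris[field], value))
        variables = " ".join("?" + f for f in self._fields)
        return "SELECT %s WHERE { %s }" % (variables, " ".join(patterns))


dt = TinyDataTable()
dt.map('http://xmlns.com/foaf/0.1/firstName', 'firstname')
dt.map('http://xmlns.com/foaf/0.1/surname', 'surname')
dt.map('http://xmlns.com/foaf/0.1/Person', 'person')
dt.select('firstname').from_('person').where('surname', 'Evans')
print(dt.to_sparql())
```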

This means you can very simply query and use RDF data from a Talis Platform store. To get the first 10 names and nicknames from a store:

$dt = new DataTable('http://api.talis.com/stores/mystore');
$dt->map('http://xmlns.com/foaf/0.1/name', 'name');
$dt->map('http://xmlns.com/foaf/0.1/nick', 'nick');

$dt->select('name,nick')->limit(10);
$res = $dt->get();

foreach ($res->result() as $row) {
   echo $row->name;
   echo $row->nick;
}

I’ve written up a collection of example queries based on the education data held in the data.gov.uk service.

When I was thinking about how to map the ideas from active record into RDF I was stumped at how to implement table joins. This bothered me because if there is one thing RDF excels at it’s links between resources. Here’s an example of how CodeIgniter implements the join syntax:

$this->db->select('firstname, blog.title');
$this->db->from('person');
$this->db->join('blog', 'person.id = blog.id');

It turned out that the answer was incredibly simple and elegant: you don’t need them! The whole concept of the join method in most active record implementations is to compensate for the fact that relational databases don’t name their relationships (some do but it is very rarely used in practice and not commonly supported in SQL). If you think about the RDF equivalent of that query it becomes clearer: select the name of each resource of type person and for each of their blogs select its title. That join is just the property relating the person resource to the blog resource, probably foaf:weblog.

When you use a DataTable you specify a join simply by including a dotted property path in the select method, e.g. blog.title where blog and title both map to properties. That lets us write our query like this:

$dt->map('http://xmlns.com/foaf/0.1/firstName', 'firstname');
$dt->map('http://xmlns.com/foaf/0.1/weblog', 'blog');
$dt->map('http://purl.org/dc/elements/1.1/title', 'title');
$dt->map('http://xmlns.com/foaf/0.1/Person', 'person');

$dt->select('firstname, blog.title');
$dt->from('person');

Ignoring the mappings this is much simpler than the relational database equivalent! Here’s a good example of using these joinless queries.
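One way to picture what a dotted path means is to expand it into a chain of triple patterns, where each hop’s object becomes the next hop’s subject. The Python function below is a sketch of that idea only; the variable naming scheme is my own invention for the example and may differ from what DataTable does internally.

```python
def path_patterns(path, uris, subject="?s"):
    """Expand a dotted path such as 'blog.title' into triple patterns."""
    names = path.split(".")
    patterns = []
    current = subject
    for i, name in enumerate(names):
        # Each hop binds a new variable named after the path so far.
        target = "?" + "_".join(names[: i + 1])
        patterns.append("%s <%s> %s ." % (current, uris[name], target))
        current = target
    return patterns


uris = {
    "blog": "http://xmlns.com/foaf/0.1/weblog",
    "title": "http://purl.org/dc/elements/1.1/title",
}
for pattern in path_patterns("blog.title", uris):
    print(pattern)
```

The “join” is nothing more than the shared variable between the two patterns.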

DataTables aren’t just for querying. They also support insert and update. To insert a new description of a resource:

$dt = new DataTable('http://api.talis.com/stores/mystore');
$dt->map('http://xmlns.com/foaf/0.1/name', 'name');
$dt->map('http://xmlns.com/foaf/0.1/Person', 'person');
$dt->set('name', 'scooby');
$response = $dt->insert('person');

This translates to submitting a description of a blank node, with a foaf:name property having a value of scooby and an rdf:type of foaf:Person. If you want to submit a description about a resource with a known URI, then you need to set the special _uri field like this:

$dt->set('_uri', 'http://example.com/people/1');
$dt->set('name', 'scooby');
$response = $dt->insert('person');

Behind the scenes the insert method generates the RDF and POSTs it into the store’s metabox. Updates work in a similar way:

$dt->set('_uri', 'http://example.com/people/1');
$dt->set('name', 'scooby');
$dt->where('nick', 'scoob');
$response = $dt->update();

Here the update method queries the store for the current value of the name property for the specified resource and generates a changeset which it then submits to the store’s metabox. This also works for multiple resources, so to update the resource description for anything with a name of shaggy to have a name of scooby:

$dt->set('name', 'scooby');
$dt->where('name', 'shaggy');
$response = $dt->update();
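The heart of an update like this is a diff between what the store currently holds and what you asked for. How Moriarty actually builds its changesets isn’t shown in this post, so the following Python function is only a sketch of the diffing step, with a made-up name and a plain tuple representation of triples.

```python
def plan_changes(subject, prop, current_values, new_value):
    """Return (removals, additions) as lists of (s, p, o) triples.

    Values that already match are left alone, so an update that changes
    nothing produces two empty lists.
    """
    removals = [(subject, prop, v) for v in current_values if v != new_value]
    additions = []
    if new_value not in current_values:
        additions.append((subject, prop, new_value))
    return removals, additions


NAME = "http://xmlns.com/foaf/0.1/name"
PERSON = "http://example.com/people/1"
print(plan_changes(PERSON, NAME, ["shaggy"], "scooby"))
```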

Full documentation can be found here: DataTable and DataTableResult

About Moriarty… Moriarty is a simple PHP library for accessing the Talis Platform. It follows the Platform API very closely and wraps up many common tasks into convenient classes while remaining very lightweight. It also provides some simple RDF classes that are based on the excellent ARC2 class library. Moriarty is being developed by a small community of developers and is in continual beta, subject to a slow stream of updates. To find out more visit its Google Code project.

voiD stores and Interesting Queries

Amongst the best incentives for data authors are applications that use that data. One sort of data that especially interests me is dataset metadata, for which the voiD vocabulary was developed; I think this kind of data has the potential to enable the future generation of web apps to join together the ever-growing web of data in wild and exciting new ways. So I was pretty pleased when I saw the voiD store from RKB Explorer. This store provides a SPARQL endpoint over all the voiD descriptions RKB Explorer have produced about their datasets, plus some descriptions they’ve gathered about other datasets. It also provides a list of source documents, sample queries, and a service that takes a list of URIs, and returns a list of SPARQL endpoints that might be able to return triples about them.

This, together with a rainy weekend, prompted me to try out some simple voiD-related things I’d been thinking of. I’ve also been aggregating voiD data in one of my dev stores. This is done partly by creating templated descriptions from a list of Talis Platform stores and poking at them with some SPARQL queries. The rest of the data I found either manually, or by querying Sindice for a list of void:Dataset URIs found in the documents they’ve crawled.

The Sindice API allows you to specify triple patterns with wildcards, such as * rdf:type void:Dataset ., and returns the matches as an Atom feed. I page through the results, importing the RDF from the URIs into my store.

One of my favourite terms from voiD is void:uriRegexPattern, which can be used to indicate that if a URI matches the pattern, the dataset might contain some triples about that URI. You can do this with a bit of SPARQL:

    
prefix void: <http://rdfs.org/ns/void#>
DESCRIBE ?dataset {
     ?dataset void:uriRegexPattern ?regex ; void:sparqlEndpoint ?sparql ; a void:Dataset .

    FILTER(REGEX("http://example.com/my/uri", ?regex))
}

    

The novel thing here is that normally, when you use REGEX() in SPARQL, you put a variable binding in the first parameter position, and hardcode a regular expression into the query in the 2nd position. Here though, the regex is in the data, and it is the string against which it is evaluated which is hardcoded, and the variable binding contains the regex. (Unfortunately, while this works with ARQ, it doesn’t appear to work with 3Store – which is perhaps why the rkbexplorer voiD Store provides this as a separate web service).
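The same inversion is easy to picture outside SPARQL. In this Python sketch the patterns live in the data and the URI is the fixed input; the dataset URIs and patterns are invented for illustration, and note that Python’s regex flavour differs in some details from the one SPARQL’s REGEX() uses.

```python
import re

# Illustrative descriptions only; not taken from any real voiD store.
DATASETS = [
    {"dataset": "http://example.com/datasets/a",
     "uriRegexPattern": r"^http://example\.com/my/.+"},
    {"dataset": "http://example.org/datasets/b",
     "uriRegexPattern": r"^http://example\.org/.+"},
]


def datasets_for(uri, descriptions):
    """Return the datasets whose pattern matches the given URI."""
    # re.search, like SPARQL's REGEX(), matches anywhere in the string
    # unless the pattern itself is anchored.
    return [d["dataset"] for d in descriptions
            if re.search(d["uriRegexPattern"], uri)]


print(datasets_for("http://example.com/my/uri", DATASETS))
```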

So, I’ve used this to create a page that will take a URI, and query my voiD store for void:sparqlEndpoints and void:uriLookupEndpoints, which it will then call to retrieve triples and render them on the page. Here is a query for the URI http://climb.dataincubator.org/dataset .

Another query that interested me, which has become possible since the Platform introduced support for the COUNT() function from SPARQL 1.1, is: which are the most commonly used vocabularies? (SIOC and FOAF so far, though this is because I generated many of these triples based on scripted prodding of endpoints with ASK queries.) But then I wanted to be able to see easily which datasets used which vocabularies, so I created some pages to let me browse datasets by vocabulary.

  1. SIOC Core Ontology Namespace (54)
  2. Friend of a Friend (FOAF) vocabulary (42)
  3. Coreference Ontology (35)
  4. http://www.aktors.org/ontology/portal# (34)
  5. http://www.aktors.org/ontology/support# (30)
  6. http://www.rkbexplorer.com/ontologies/resist# (30)
  7. void (25)
  8. http://purl.org/NET/scovo# (24)
  9. http://acm.rkbexplorer.com/ontologies/acm# (22)
  10. http://courseware.rkbexplorer.com/ontologies/courseware# (21)

Then I made some pages to do the same thing with dct:subjects. Here, the largest category by some way, is category: online_social_networking. This is because I generated ?dataset dct:subject <http://dbpedia.org/resource/Category:Online_social_networking> . triples automatically for all the platform stores which made a certain use of terms from the SIOC ontology.

These automatically generated voiD descriptions will not, of course, present such a balanced picture of what is out there, and skew the results somewhat. The most interesting descriptions are those which are handcrafted to some extent, describing something of the nature of the dataset’s domains.

I’ve also provided a form for submitting voiD URLs to. My hope is that this simple application, together with the rkbexplorer voiD Store, might encourage more people to describe their linked data datasets with voiD, or perhaps add more detail to the descriptions they already publish, in order to see their dataset come up in the appropriate queries. And I hope that this, in turn, will encourage others to build more sophisticated and exciting applications using that data.

Introducing Pynappl

Over the summer I spent some time working on a Python library for working with the Talis Platform. I’ve spent a lot of time developing the PHP-based Moriarty library and I’ve been wanting to apply that experience to other languages. Leigh has made good progress on the Ruby front with Pho and we have a nascent Java-based client: Penry. Considering Python’s excellent RDF support it seemed the natural choice to tackle next.

Pynappl is the resulting library. It’s still very early in its evolution and so has lots of rough edges, gaps and rather dubious design choices. So far Pynappl’s feature set has been driven by the real applications I have been working with so there is a distinct bias towards data loading and management of stores. The Store class is the workhorse of the library and contains methods for loading RDF, running SPARQL queries, scheduling jobs and reading and writing of field/predicate maps and query profiles.

In keeping with my general philosophy for building RESTful applications, the HTTP based methods on the Store class make it very obvious that you are working with a fallible network by returning a tuple containing the HTTP response and the body of the response. It’s up to you to use or ignore the response as you see fit for your application. Many methods attempt to interpret the results of the method call but this can be switched off using an argument called “raw”. For example this code takes advantage of the interpretation and parsing of the SPARQL results:

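import pynappl

# store_uri, username and password are assumed to be defined elsewhere
store = pynappl.Store(store_uri, username, password)
(response, body) = store.select("select * where {?s a ?o} limit 10")
# body is the parsed SPARQL result: a header plus a list of result rows
(sparql_header, results) = body
for result in results:
    print("%s (a %s)" % (str(result['s']), str(result['o'])))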

This can be switched off to get at the raw response body:

# Passing True as the second argument ("raw") skips the result parsing
(response, body) = store.select("select * where {?s a ?o} limit 10", True)
print(body)
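The shape of that design can be sketched independently of Pynappl. In the Python below, `select` and `fake_fetch` are hypothetical stand-ins (this is not Pynappl’s code, and it uses a bare status code where Pynappl returns a response object): the caller always gets the HTTP outcome back alongside the body, and the body is only parsed when the request succeeded and raw output wasn’t requested.

```python
import json


def select(fetch, query, raw=False):
    """Run a query and return (status, body).

    fetch is any callable that takes a query string and returns a
    (status, text) pair; it stands in for the real HTTP call.
    """
    status, text = fetch(query)
    if raw or status != 200:
        # Hand back the unparsed body; the caller decides what to do.
        return status, text
    return status, json.loads(text)


def fake_fetch(query):
    # Stand-in transport so the sketch runs without a network.
    return 200, '{"results": {"bindings": []}}'


print(select(fake_fetch, "select * where {?s a ?o} limit 10"))
print(select(fake_fetch, "select * where {?s a ?o} limit 10", raw=True))
```

Returning the status every time means a failed request can’t be mistaken for an empty result.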

Also included is a command line application called tstore that wraps up a lot of these operations, including waiting for batch operations to complete. For example, to reset a store and load data into it takes just two lines:

./tstore --store mystore --user username --password xxxx reset --wait
./tstore --store mystore --user username --password xxxx store -f data.rdf

Please take a look at Pynappl and let me know what you think or if you’d like to get involved and help out.

About Pynappl… Pynappl is a simple open source Python library for working with the Talis Platform. Currently it is focussed mainly on managing data loading and manipulation of Talis Platform stores. Pynappl is an early alpha and is substantially incomplete (we’re looking for interested contributors). You can read more about Pynappl at its Google Code project page.

SPARQL 1.1 Early Access Features

In yesterday’s monthly Talis Platform release we started rolling out some early access support for the SPARQL 1.1 query language. We’ve been monitoring the activity around the development of SPARQL extensions for some time and have been watching the Working Group’s activity to get a feel for which new features are to be included in the forthcoming revision to the language. For those of you interested in some background on that, Lee Feigenbaum has a nice presentation that summarizes the Working Group’s current thinking.

One major missing feature from SPARQL 1.0 was support for aggregates, i.e. the ability to count, sum and group results. These features have already been implemented by a number of triple stores and this work will get standardised as part of SPARQL 1.1. Because of our confidence that this feature will be added to the specification, the existing implementation experience, and customer feedback, we have decided to release early access support for these specific features as an experimental enhancement to the Platform SPARQL endpoint.

The documentation on the developer wiki has been updated to start to itemize the supported SPARQL extensions.

Users should be aware that the syntax of the extensions may be subject to change as we’ll be attempting to track the progress of the working group as they clarify the specification of these features for inclusion in the standard. We’ll provide notice of any expected changes.

Users should also be aware that while the basic functionality of aggregates is supported in a number of other implementations, care should be taken if queries are intended to be portable across different triplestores and/or services. For example, the Talis Platform contains some mirrors of other datasets so queries written to use the new functionality may not be portable across other services due to the basic feature not being supported or due to minor syntactic differences.

With the warnings out of the way, here are some simple examples of the extensions in practice. The first query uses the BBC programmes and music data hosted in the platform, and asks for the number of albums released by The Prodigy. The query uses the count() function to count up the number of album titles. The results of the count are assigned to a variable called ?count in the SELECT clause using the new “SELECT expression” syntax.


#How many albums have been released by The Prodigy?
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rev: <http://purl.org/stuff/rev#>
SELECT (count(?title) as ?count) WHERE {
  ?group a mo:MusicGroup;
      foaf:name "The Prodigy";
       foaf:made ?album.
   ?album dc:title ?title.
}

Results.

The second example is a variant of one of the example queries that can be used against the Edubase data. In this case the query retrieves the number of schools closed in each parliamentary constituency in 2008, ordering the results in descending order. The new GROUP BY keyword is used to group the results by the label of the constituency.


#How many schools closed in each parliamentary constituency in 2008?
#In descending order of number of closures
prefix sch-ont:  <http://education.data.gov.uk/ontology/school#>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label (count(?school) as ?count) WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ;
     sch-ont:closeDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date > "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
GROUP BY ?label
ORDER BY DESC(?count)

Results.

We can revise this query to only include those constituencies in which at least 10 schools have closed. To do this we need to filter the results to just those where the count is equal to or greater than 10. The new HAVING keyword allows an expression to be applied to the result set before it is returned:


prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label (count(?school) as ?count) WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ;
     sch-ont:closeDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date > "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
GROUP BY ?label
HAVING (?count >= 10)
ORDER BY DESC(?count)

Results.
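If the aggregate keywords are new to you, it may help to see the same group, count, filter and sort steps written out by hand. The Python below is a sketch over invented rows (they are not Edubase data) that mirrors what GROUP BY, count(), HAVING and ORDER BY DESC do in the query above.

```python
from collections import Counter

# Illustrative (school, constituency label) rows; not real Edubase data.
ROWS = (
    [("school-a%d" % i, "Constituency A") for i in range(12)]
    + [("school-b%d" % i, "Constituency B") for i in range(3)]
    + [("school-c%d" % i, "Constituency C") for i in range(10)]
)


def closures_by_constituency(rows, minimum=10):
    """Group rows by label, count them, keep the big groups, sort."""
    # GROUP BY ?label with count(?school)
    counts = Counter(label for _school, label in rows)
    # HAVING (?count >= minimum)
    kept = [(label, n) for label, n in counts.items() if n >= minimum]
    # ORDER BY DESC(?count)
    return sorted(kept, key=lambda item: item[1], reverse=True)


print(closures_by_constituency(ROWS))
```

Note that the HAVING step runs on the grouped counts, whereas the FILTER in the query runs on individual rows before grouping.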

The SPARQL extensions page includes a few more examples of the syntax and a list of the operators now supported in the extended query language. If you have any feedback or questions, please leave a comment below.

SPARQLing data.gov.uk: Transport Data

This is the second in my series of posts about using SPARQL to access the Linked Data being published from data.gov.uk. In the first article I looked at the Edubase data. In this second post I wanted to briefly look at some of the data from the Department for Transport. This dataset, which consists of around 45 million triples, provides data about traffic counts on UK roads. Jeni Tennison has previously written up how she approached the dataset conversion and published it online as part of the data.gov.uk initiative, so her blog post is a useful starting point for background on the structure and content of the dataset.

The SPARQL endpoint for the transport data in data.gov.uk is at: http://services.data.gov.uk/transport/sparql.

Each of the road traffic monitoring points in the dataset has latitude and longitude details available, so it is possible to ask for all collection points that occur on a particular road. Here’s how to do that for the M5:


#List the uri, latitude and longitude for road traffic monitoring points on the M5
PREFIX road: <http://transport.data.gov.uk/0/ontology/roads#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX geo: <http://geo.data.gov.uk/0/ontology/geo#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?point ?lat ?long WHERE {
  ?x a road:Road.
  ?x road:number "M5"^^xsd:NCName.
  ?x geo:point ?point.
  ?point wgs84:lat ?lat.
  ?point wgs84:long ?long.
}

Results.

To modify the query to look at a different road, just change the query to refer to another road name, e.g. the B237 or the A4.

If you’d prefer not to deal with the SPARQL XML Results format, then you can add a parameter to the url to request the results in the SPARQL JSON results format (output=json). Here are the points on the A4 as JSON.
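Building such a URL by hand is fiddly because the query has to be URL-encoded, so here is a small Python sketch that does it. The endpoint and the output=json parameter come from this post and `query` is the standard SPARQL protocol parameter name; the function name `query_url` is just for illustration, and the sketch only builds the URL rather than fetching it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://services.data.gov.uk/transport/sparql"


def query_url(query, endpoint=ENDPOINT, as_json=True):
    """Return a GET URL for the query, optionally asking for JSON."""
    params = {"query": query}
    if as_json:
        params["output"] = "json"
    # urlencode takes care of escaping spaces, braces, quotes etc.
    return endpoint + "?" + urlencode(params)


QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
url = query_url(QUERY)
print(url)
# Decoding the query string again gives back what was put in.
print(parse_qs(urlsplit(url).query))
```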

If you query further you can find all of the traffic counts associated with a particular location, each of these has a timestamp, the direction the traffic was travelling, etc. The data is ripe for visualisation, e.g. plotting the points on a map, building an animation to show traffic changes over time, etc.
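As a starting point for that kind of visualisation, the SPARQL JSON results format is simple to flatten into tuples that a mapping library could consume. The Python below is a sketch against a hand-written sample document: the point URI and coordinates are invented, and the variable names match those used in the M5 query above.

```python
import json

# Hand-written sample in the SPARQL JSON results format; the values
# are invented for illustration.
SAMPLE = """
{
  "head": {"vars": ["point", "lat", "long"]},
  "results": {"bindings": [
    {"point": {"type": "uri", "value": "http://example.com/point/1"},
     "lat": {"type": "literal", "value": "51.5"},
     "long": {"type": "literal", "value": "-2.5"}}
  ]}
}
"""


def points(results_json):
    """Return a list of (uri, latitude, longitude) tuples."""
    doc = json.loads(results_json)
    rows = []
    for binding in doc["results"]["bindings"]:
        # Every bound variable is an object with a "value" string.
        rows.append((binding["point"]["value"],
                     float(binding["lat"]["value"]),
                     float(binding["long"]["value"])))
    return rows


print(points(SAMPLE))
```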

The dataset also includes identifiers for different types of road and motor vehicle. These are published as SKOS concept schemes (i.e. a category of stuff). SKOS concept schemes are hierarchical, so let’s see what schemes are in the data, and what their top concepts are:


#List SKOS concept schemes, their top concepts and labels
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?scheme ?topconcept ?label WHERE {
  ?scheme a skos:ConceptScheme;
    skos:hasTopConcept ?topconcept.
  ?topconcept skos:prefLabel ?label.
}

Results.

The above query will work on any dataset as it just uses generic SKOS vocabulary. You could run it on any SPARQL endpoint to see if it contains some SKOS concept schemes.

One of the schemes in the dataset is a categorization of roads. Let’s retrieve the concepts in that scheme:


PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?category ?label WHERE {
  ?category skos:inScheme ;
   skos:prefLabel ?label.
}

Results.

If we wanted to look at the concepts in the vehicle scheme (http://transport.data.gov.uk/0/category/vehicle), then we can just change the relevant URI in the query and retrieve the results.

Based on that information it should be possible to find traffic counts for specific types of vehicle on specific roads. I’ll leave that as an exercise for the reader!

Understanding the Big BBC Graph

In the lead up to the announcement of the BBC SPARQL endpoint trials I’ve spent quite a bit of time working with and exploring the BBC /programmes and /music dataset. I thought it would be useful to write up some of this to help out those of you looking to explore the data using the Talis Platform SPARQL endpoint. (Tip: use the newer SPARQL form for a better user experience when exploring the data.)

What’s in the Store?

Currently the Platform store includes metadata for over 360,000 Radio and TV programme Episodes along with information on which Versions of those programmes have been broadcast, including the time and channel on which they were shown. Information is also available for 6,500 Series, and 5,500 Brands and their relationships, for more on that see below.

For the music data, the endpoint includes all of the artist and albums metadata currently available from the BBC Music website, which comprises over 23,000 solo artists, 11,000 groups, and 25,000 albums. There are also nearly 4,500 album reviews.

This core dataset is approximately 20 million triples, and it is growing as new episodes and broadcasts are made and as we crawl that additional data. But that’s not all…

The artist metadata refers to dbpedia entries via owl:sameAs links, and this immediate context has also been included, providing a single location to query and find all the additional metadata about a recording artist. As the metadata on the BBC programmes website gets updated to include dbpedia links, then this will also get included. We’re working with the BBC to get some of these links in place as soon as possible.

The /programmes team recently updated the website to begin exporting “segment” data. This describes what artist was being played in a specific segment of a broadcast (currently limited to Radio 2 & 6), providing links between the programmes and music datasets. Increasingly it really is just one large graph that the BBC are producing.

What Ontologies are Used?

The core of the dataset is modelled using the Programmes and Music ontologies. There is also the usual sprinkling of Dublin Core and FOAF terms to capture titles, describe people, provide images for episodes, etc. The RDF Review vocabulary has been used to model the album reviews.

The programmes website includes some content categories for genres and formats. These are modelled in the dataset as SKOS concepts. There seems to be some nascent support in the data for capturing metadata about people and places appearing in programmes. At the moment these are also modelled using SKOS.

That comprises the core data. Beyond that, there are a number of different terms used in the dbpedia portions of the dataset. Check the dbpedia documentation for more information.

Understanding Brands, Series, Episodes

To get the most from the BBC programmes data you’ll need some understanding of some of the variations in the graph to ensure that you don’t accidentally exclude data in your queries. And if you’re a modelling geek like me it’s interesting in its own right! Any mistakes in the following are all my own, apologies to the BBC folk.

A Brand is a top-level concept that defines a collection of works. It’s the resource that ties together Series and Episodes. Dr Who is a brand, as is the BBC News, and The Catherine Tate show. A Series, as you’d expect, is a run of Episodes, e.g. “Series 1 of The Wire”. And an Episode is similarly intuitively named.

We’re all already familiar with the basic relationships between these concepts. A Brand (“Red Dwarf”) may be related to a number of Series (“Red Dwarf Series 1”) and a Series is composed of Episodes (“Red Dwarf, Series 1, Episode 1”). But there are a few wrinkles that are worth pointing out, as they can impact the way you write your SPARQL queries. Thanks to Michael Smethurst for giving me a run-down of some of these!

Firstly a Brand may not be broken down into Series at all. The BBC News, for example, is simply a continuous stream of Episodes. Radio shows are similar.

Similarly a Series of Episodes may not necessarily be associated with a Brand. It may be a one-off run of Episodes, e.g. a short documentary series like Incredible Animal Journeys.

Some Episodes are not associated with either a Series or a Brand, e.g. films like Lady In the Water.

And there’s also the more interesting relationship that sees two Series being associated with one another. For example “Waking the Dead” is divided up into Series (e.g. Series 5), which themselves contain other Series (covering a specific story line, e.g. Towers of Silence) and then individual Episodes (Part 1).

(As an aside, this is the kind of flexibility that makes RDF such a great tool for modelling real-world data. I’ve used similar approaches in the past to model bibliographic metadata, throwing out hierarchies and simply connecting together chunks of content in whatever structure is most suitable.)

Next, an Episode may have more than one Version. It is at the Version level that information such as the sound format or duration of the show is captured; after all, there may be many different manifestations of the same episode. Versions are also associated with Broadcasts which capture the date, time and channel (“masterbrand” in the Programmes ontology) on which the programme is aired. A Version of an Episode may be broadcast several times.

Finally, at the most fine-grained level, there are Timelines that describe the start and end time of a specific broadcast.
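To see why queries shouldn’t assume a fixed Brand, Series, Episode hierarchy, it can help to sketch the containment rules as a tiny data structure. The following Python is purely illustrative: the `Programme` class and `ancestry` function are my own names, not terms from the Programmes ontology, and the only rule encoded is that any item may or may not have a parent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Programme:
    """A Brand, Series or Episode; `parent` is None for a top-level item."""
    kind: str                              # "Brand", "Series" or "Episode"
    title: str
    parent: Optional["Programme"] = None


def ancestry(item: Programme) -> List[str]:
    """Return titles from the top-level container down to `item`."""
    chain = []
    node: Optional[Programme] = item
    while node is not None:
        chain.append(node.title)
        node = node.parent
    return chain[::-1]


# A film: an Episode with neither Series nor Brand.
film = Programme("Episode", "Lady In the Water")

# A Series with no Brand above it.
journeys = Programme("Series", "Incredible Animal Journeys")

# Series nested inside Series, using the Waking the Dead example.
waking = Programme("Brand", "Waking the Dead")
series5 = Programme("Series", "Series 5", waking)
towers = Programme("Series", "Towers of Silence", series5)
part1 = Programme("Episode", "Part 1", towers)

print(ancestry(film))
print(ancestry(part1))
```

A query that only walks Brand to Series to Episode would miss both the film and the nested story-line Series, which is exactly the kind of accidental exclusion to watch out for.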

Application Ideas

During my expeditions through the Big BBC Graph (“you’re in a maze of twisty little predicates, all alike…“) I’ve come up with a few application ideas that it would be interesting to put together. I thought I’d throw these out and see if anyone wants to pick them up.

Programme Reviews. It’d be easy to build a mashup of the BBC programmes data and something like Revyu (which also has a SPARQL endpoint) to allow someone to review a programme that they watched last night. Note that, as our crawling will lag behind the live site until we’ve implemented real-time updates, there will be a delay between something being aired and it being available in the Platform for review.

PVR Integration. There are a number of open source PVR solutions out there. Could some of these be updated to automatically pull in additional data from the endpoint to improve electronic programme guides?

Geographic Overlays. The interconnections between radio programmes, artists and their locations offer an opportunity to build some mapping mashups, using either Google Maps or Earth. For example it ought to be possible to lay out the geographic spread of artists played by different BBC radio programmes and stations. Interested in music from a particular country or region? (Maybe you’re planning a trip there and want to pick up on the local vibe.) Then use a map to home in on radio programmes that are most likely to play those artists.

Fan Widgets. The ability to extract data from the endpoint using SPARQL and JSON means that it’s really easy to create little widgets to include programme data on external web pages. Could something like the Doctor Who Tardis Index File be enriched by widgets that came straight from the BBC database? Throw in additional annotations from the community and you could make some really interesting embeddable gadgets. Of course there’s also the other direction: if fan communities start using BBC identifiers then the BBC may be able to feed this crowd-sourced data back into their site, just as they’re doing with Wikipedia (via dbpedia).

Under the Talis Connected Commons scheme anyone can have free hosting on the Platform for public domain data, so if a fan community wanted to organize itself around creating additional annotations for BBC programmes (how about character lists? mood assessment? scene breakdowns?) then these can be stored in the Platform for free, and then mashed up with the BBC data on the server-side using features like the Augmentation service, or on the client-side using SPARQL and JSON. Lots of potential there.
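For anyone tempted by the widget or mashup ideas, here is a small Python sketch of the "SPARQL and JSON" half of the job: flattening a result set in the standard SPARQL JSON results format into simple rows. The sample response, its example.org URIs and the `rows` helper are all made up for illustration; a real response would come from the endpoint.

```python
import json
from typing import Dict, List, Optional

# Hypothetical response body in the SPARQL JSON results format.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["episode", "title"]},
  "results": {"bindings": [
    {"episode": {"type": "uri", "value": "http://example.org/episode/1"},
     "title": {"type": "literal", "value": "Example Episode"}},
    {"episode": {"type": "uri", "value": "http://example.org/episode/2"}}
  ]}
}
"""


def rows(response_text: str) -> List[Dict[str, Optional[str]]]:
    """Flatten a SPARQL JSON result set into one plain dict per row.

    Variables that are unbound in a row (e.g. from OPTIONAL) map to None.
    """
    data = json.loads(response_text)
    variables = data["head"]["vars"]
    return [
        {var: (binding[var]["value"] if var in binding else None)
         for var in variables}
        for binding in data["results"]["bindings"]
    ]


widget_rows = rows(SAMPLE_RESPONSE)
```

Handling unbound variables explicitly matters here, given how many optional relationships there are in the programmes data.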

Summary

Hopefully that provides a good overview of the BBC linked data graph that we’re now hosting in the Talis Platform. There should be sufficient pointers here, and in some of the example queries and demos we’ve put together, to get you started. If not, then feel free to ask questions on the BBC Backstage mailing list, the n2-dev mailing list, or on IRC in #talis on irc.freenode.net.