SPARQL 1.1 Early Access Features

In yesterday’s monthly Talis Platform release we started rolling out early access support for the SPARQL 1.1 query language. We’ve been monitoring the development of SPARQL extensions for some time, and have been watching the Working Group’s activity to get a feel for which new features are to be included in the forthcoming revision of the language. If you’re interested in some background, Lee Feigenbaum has a nice presentation that summarizes the Working Group’s current thinking.

One major feature missing from SPARQL 1.0 was support for aggregates, i.e. the ability to count, sum and group results. These features have already been implemented by a number of triple stores and this work will be standardised as part of SPARQL 1.1. Given our confidence that this feature will be added to the specification, the existing implementation experience, and customer feedback, we have decided to release early access support for these specific features as an experimental enhancement to the Platform SPARQL endpoint.

The documentation on the developer wiki has been updated to start to itemize the supported SPARQL extensions.

Users should be aware that the syntax of the extensions may be subject to change as we’ll be attempting to track the progress of the working group as they clarify the specification of these features for inclusion in the standard. We’ll provide notice of any expected changes.

Users should also be aware that while the basic functionality of aggregates is supported in a number of other implementations, care should be taken if queries are intended to be portable across different triplestores and/or services. For example, the Talis Platform contains some mirrors of other datasets so queries written to use the new functionality may not be portable across other services due to the basic feature not being supported or due to minor syntactic differences.

With the warnings out of the way, here are some simple examples of the extensions in practice. The first query uses the BBC programmes and music data hosted in the Platform, and asks for the number of albums released by The Prodigy. The query uses the count() function to count up the number of album titles. The result of the count is assigned to a variable called ?count in the SELECT clause using the new “SELECT expression” syntax.


#How many albums have been released by The Prodigy?
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rev: <http://purl.org/stuff/rev#>
SELECT (count(?title) as ?count) WHERE {
  ?group a mo:MusicGroup;
      foaf:name "The Prodigy";
       foaf:made ?album.
   ?album dc:title ?title.
}

Results.

The second example is a variant of one of the example queries that can be used against the Edubase data. In this case the query retrieves the number of schools closed in each parliamentary constituency in 2008, ordering the results in descending order. The new GROUP BY keyword is used to group the results by the label of the constituency.


#How many schools closed in each parliamentary constituency in 2008?
#In descending order of number of closures
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label (count(?school) as ?count) WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ;
     sch-ont:closeDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
GROUP BY ?label
ORDER BY DESC(?count)

Results.

We can revise this query to include only those constituencies in which at least 10 schools have closed. To do this we need to filter the results to just those where the count is greater than or equal to 10. The new HAVING keyword applies an expression to each group once the aggregates have been computed, so groups that fail the test are dropped before the results are returned:


prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label (count(?school) as ?count) WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ;
     sch-ont:closeDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
GROUP BY ?label
HAVING (?count >= 10)
ORDER BY DESC(?count)

Results.
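
If the interplay of GROUP BY, HAVING and ORDER BY in the query above isn’t obvious, it may help to see roughly the same steps written out by hand. This is only a sketch in Python over a handful of invented (school, constituency) pairs, not Edubase data, and the function name is made up for the illustration:

```python
from collections import Counter

# Invented (school, constituency label) pairs standing in for the rows the
# WHERE clause would match. These are NOT taken from the Edubase data.
rows = [
    ("school-1", "Bath"),
    ("school-2", "Bath"),
    ("school-3", "Bath"),
    ("school-4", "Wells"),
    ("school-5", "Wells"),
    ("school-6", "Yeovil"),
]


def closures_by_constituency(rows, minimum):
    # GROUP BY ?label with count(?school): tally one per matched row
    counts = Counter(label for _school, label in rows)
    # HAVING: drop whole groups whose count is below the threshold
    kept = [(label, n) for label, n in counts.items() if n >= minimum]
    # ORDER BY DESC(?count)
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


print(closures_by_constituency(rows, 2))
```

The point to notice is the ordering of the steps: the threshold is applied to whole groups after counting, which is something a FILTER inside the WHERE clause cannot do because it only ever sees individual rows.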

The SPARQL extensions page includes a few more examples of the syntax and a list of the operators now supported in the extended query language. If you have any feedback or questions, please leave a comment below.

SPARQLing data.gov.uk: Transport Data

This is the second in my series of posts about using SPARQL to access the Linked Data being published from data.gov.uk. In the first article I looked at the Edubase data. In this second post I want to look briefly at some of the data from the Department for Transport. This dataset, which consists of around 45 million triples, provides data about traffic counts on UK roads. Jeni Tennison has previously written up how she approached the dataset conversion and published it online as part of the data.gov.uk initiative, so her blog post is a useful starting point for background on the structure and content of the dataset.

The SPARQL endpoint for the transport data in data.gov.uk is at: http://services.data.gov.uk/transport/sparql.

Each of the road traffic monitoring points in the dataset has latitude and longitude details available, so it is possible to ask for all collection points that occur on a particular road. Here’s how to do that for the M5:


#List the uri, latitude and longitude for road traffic monitoring points on the M5
PREFIX road: <http://transport.data.gov.uk/0/ontology/roads#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX geo: <http://geo.data.gov.uk/0/ontology/geo#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?point ?lat ?long WHERE {
  ?x a road:Road.
  ?x road:number "M5"^^xsd:NCName.
  ?x geo:point ?point.
  ?point wgs84:lat ?lat.
  ?point wgs84:long ?long.
}

Results.

To look at a different road, just change the road number in the query, e.g. to the B237 or the A4.

If you’d prefer not to deal with the SPARQL XML Results format, then you can add a parameter to the URL to request the results in the SPARQL JSON results format (output=json). Here are the points on the A4 as JSON.
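
For anyone scripting this rather than clicking links, here’s a minimal sketch of how such a request URL could be put together in Python. The endpoint and the output=json parameter come from this post and query is the standard SPARQL protocol parameter; the sparql_url helper is just a name I’ve made up, and the endpoint may of course not be live when you read this:

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Endpoint as given in the post; it may no longer be available.
ENDPOINT = "http://services.data.gov.uk/transport/sparql"

# The M5 query from above, pointed at the A4 instead.
QUERY = """PREFIX road: <http://transport.data.gov.uk/0/ontology/roads#>
PREFIX geo: <http://geo.data.gov.uk/0/ontology/geo#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?point ?lat ?long WHERE {
  ?x a road:Road.
  ?x road:number "A4"^^xsd:NCName.
  ?x geo:point ?point.
  ?point wgs84:lat ?lat.
  ?point wgs84:long ?long.
}"""


def sparql_url(endpoint, query, output=None):
    """Build a GET request URL for a SPARQL endpoint (hypothetical helper)."""
    params = {"query": query}
    if output is not None:
        # 'output=json' is the parameter described in the post
        params["output"] = output
    # urlencode percent-encodes the query, including '#', '?' and newlines
    return endpoint + "?" + urlencode(params)


url = sparql_url(ENDPOINT, QUERY, output="json")
print(url)
```

Encoding matters here because the prefixes contain # characters, which would otherwise be read as the start of a URL fragment and silently truncate the query.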

If you query further you can find all of the traffic counts associated with a particular location. Each of these has a timestamp, the direction the traffic was travelling, etc. The data is ripe for visualisation, e.g. plotting the points on a map, building an animation to show traffic changes over time, etc.

The dataset also includes identifiers for different types of road and motor vehicle. These are published as SKOS concept schemes (i.e. a category of stuff). SKOS concept schemes are hierarchical, so let’s see what schemes are in the data, and what their top concepts are:


#List SKOS concept schemes, their top concepts and labels
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?scheme ?topconcept ?label WHERE {
  ?scheme a skos:ConceptScheme;
    skos:hasTopConcept ?topconcept.
  ?topconcept skos:prefLabel ?label.
}

Results.

The above query will work on any dataset as it just uses generic SKOS vocabulary. You could run it on any SPARQL endpoint to see if it contains some SKOS concept schemes.

One of the schemes in the dataset is a categorization of roads. Let’s retrieve the concepts in that scheme:


PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?category ?label WHERE {
  ?category skos:inScheme ;
   skos:prefLabel ?label.
}

Results.

If we wanted to look at the concepts in the vehicle scheme (http://transport.data.gov.uk/0/category/vehicle), then we can just change the relevant URI in the query and retrieve the results.

Based on that information it should be possible to find traffic counts for specific types of vehicle on specific roads. I’ll leave that as an exercise for the reader!

SPARQLing data.gov.uk: Edubase Data

Last week the Cabinet Office issued a call for Open Data Developers to sign-up to get a preview of the forthcoming UK Government public data website. The site includes a directory of existing datasets plus a growing number of datasets that have been converted to RDF and which will shortly be available as Linked Data. This data is being stored in the Talis Platform providing developers with access to SPARQL endpoints as a means to query the data; we’ll also be including search and other access mechanisms at a later date.

In this series of postings I wanted to show some example SPARQL queries that can be used to access the data. If you’re new to SPARQL then you might want to look at Lee Feigenbaum’s SPARQL by Example tutorial, or my own short slide deck that covers all the basic syntax.

The first dataset I wanted to highlight is an extract of the Edubase dataset available from the Department for Children, Schools and Families. The conversion was carried out by the team at HP Labs and has been loaded into a Talis Platform store. The public facing SPARQL endpoint is available from: http://services.data.gov.uk/education/sparql.

Here are some sample SPARQL queries you can use against the data:


#1. Select the names of schools in the Administrative District of the City of London
# Ordering results by name of the school
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?name WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00AA> ;
}
ORDER BY ?name

Results


#2. Which schools in the BANES area have a nursery?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
SELECT ?name WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00HA> ;
     sch-ont:nurseryProvision "true"^^xsd:boolean
}
ORDER BY ?name

Results


#3. Select the names and addresses of schools in the Administrative District of the City of London
# Ordering results by name of the school
# Note: we use OPTIONAL here as not every school has an address listed in the data
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?name ?address1 ?address2 ?postcode ?town WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00AA> .

  OPTIONAL {
    ?school sch-ont:address ?address .
    ?address sch-ont:address1 ?address1 ;
       sch-ont:address2 ?address2 ;
       sch-ont:postcode ?postcode ;
       sch-ont:town ?town .
  }
}
ORDER BY ?name

Results


#4. Select the name, lowest and highest age ranges, capacity and pupil:teacher ratio
# for all schools in the Bath & North East Somerset district
# Again we use OPTIONAL to allow for missing data items.
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?name ?lowage ?highage ?capacity ?ratio WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00HA> .
     OPTIONAL {
       ?school sch-ont:statutoryLowAge ?lowage ;
     }

     OPTIONAL {
       ?school sch-ont:statutoryHighAge ?highage ;
     }

     OPTIONAL {
       ?school sch-ont:schoolCapacity ?capacity ;
     }

     OPTIONAL {
       ?school sch-ont:pupilTeacherRatio ?ratio
     }
}
ORDER BY ?name

Results


#5. What is the uri, name, and opening date of the oldest school in the UK?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?school ?name ?date WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:openDate ?date.
}
ORDER BY ASC(?date)
LIMIT 1

Results


#6. Select the name, easting and northing for the 100 newest schools in the UK.
# Can be used to plot them on a map
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?school ?name ?date ?easting ?northing WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:openDate ?date ;
     sch-ont:easting ?easting ;
     sch-ont:northing ?northing .
}
ORDER BY DESC(?date)
LIMIT 100

Results


#7. Select the uri, name, easting and northing for all schools opened in 2008
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
SELECT ?school ?name ?date ?easting ?northing WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:openDate ?date ;
     sch-ont:easting ?easting ;
     sch-ont:northing ?northing .
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}

Results


#8. Select the uri, name, and the reason for closing for all schools that are currently
# scheduled for closure. The reason is a URI from a controlled vocabulary in the ontology.
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
SELECT ?school ?name ?reason WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Open_but_proposed_to_close ;
     sch-ont:reasonEstablishmentClosed ?reason .
}

Results


#9. In which parliamentary constituencies did schools close in 2008?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?cons ?label WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ;
     sch-ont:closeDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
ORDER BY ?cons

Results


#10. In which parliamentary constituencies did schools open in 2008?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?cons ?label WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:openDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
ORDER BY ?cons

Results

Hopefully that’s enough to get you started. If you want a bit more background on the modelling and a look at the ontology, then read this posting to the uk-government-data mailing list by Stuart Williams.

Note: updated 16 Nov 2009 to reflect changes to the EduBase data. The first version of this dataset was created before the proposed guidelines for public sector URIs were published. The school ontology used in that first dataset had a URI of http://education.data.gov.uk/ontology/school# which has now been replaced with http://education.data.gov.uk/def/school/. Also, the URIs for administrative districts were temporary placeholders containing the phrase “placeholder-id” in their path. These have now been updated to URIs based on the Office for National Statistics district codes, for example http://statistics.data.gov.uk/id/local-authority-district/00AA

Visualising BBC Programme Categories

Whilst I was exploring the BBC programmes data looking for possible demonstration applications, I thought it might be interesting to try to create a visualisation of the relationships between different categories of BBC programmes. The BBC datasets use SKOS as a categorization scheme, with separate taxonomies for formats (e.g. documentaries, animation, etc) and genres (e.g. children’s programmes, science fiction, etc). If you poke around a little, you can also see a nascent category system for places and people, although there doesn’t seem to be much data there at present (and what is there seems to change regularly).

For my purposes, the genre classifications looked most interesting. Episodes are associated with their genre category via the po:category property. As I was interested in finding relationships between genres, what I was looking for was a way to relate together individual categories, other than by the obvious super/sub-category relationship.

It occurred to me that if two categories were associated with the same episode, then this could be viewed as a declaration of some implicit relationship between the categories. Extracting this in SPARQL is straightforward, as we just need to match episodes that have more than one category:


PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?categoryLabel ?relatedLabel WHERE
{
  ?episode a po:Episode;
    po:category ?category;
    po:category ?related. 

  ?category a po:Genre;
    rdfs:label ?categoryLabel. 

  ?related a po:Genre;
    rdfs:label ?relatedLabel. 

  FILTER (?category != ?related)
}
ORDER BY ?categoryLabel

In the above SPARQL query we match any episode that has at least two categories (because we use two po:category patterns), and where those categories are different (in the FILTER). This excludes the unwanted result where the ?category and ?related variables are bound to the same value. I didn’t bother with pruning out duplicates as this could easily be done on the client-side.

In order to visualise the results, I decided to use MooWheel. This provides a simple JavaScript visualisation toolkit for presenting connections between a set of resources. MooWheel can be configured using a JSON data structure, so generating a MooWheel visualisation from a SPARQL query is relatively straightforward: the query results can be retrieved as SPARQL/JSON, which can then be massaged into the appropriate JSON structure to generate the visualisation. Check out the source code of the demonstration for sample code to do this (look at the success callback).
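
The demonstration does this massaging in JavaScript, but the reshaping step is easy to sketch in any language. Here’s roughly what it looks like in Python, including the duplicate pruning I skipped in the query. The sample document follows the SPARQL JSON results format with the variable names from the query above; the labels are invented, and the id and connections field names are placeholders of my own rather than MooWheel’s actual input format, which you should check against its documentation:

```python
import json

# A tiny hand-written document in the SPARQL JSON results format, using the
# variable names from the query above. The labels are invented.
sample = json.loads("""
{
  "head": {"vars": ["categoryLabel", "relatedLabel"]},
  "results": {"bindings": [
    {"categoryLabel": {"type": "literal", "value": "Comedy"},
     "relatedLabel": {"type": "literal", "value": "Drama"}},
    {"categoryLabel": {"type": "literal", "value": "Comedy"},
     "relatedLabel": {"type": "literal", "value": "Drama"}},
    {"categoryLabel": {"type": "literal", "value": "Drama"},
     "relatedLabel": {"type": "literal", "value": "Comedy"}},
    {"categoryLabel": {"type": "literal", "value": "Comedy"},
     "relatedLabel": {"type": "literal", "value": "Music"}}
  ]}
}
""")


def to_connections(results):
    """Collapse result rows into one entry per category, without duplicates."""
    related = {}
    for row in results["results"]["bindings"]:
        category = row["categoryLabel"]["value"]
        other = row["relatedLabel"]["value"]
        # a set per category discards repeated (category, related) pairs
        related.setdefault(category, set()).add(other)
    # 'id' and 'connections' are illustrative field names, not MooWheel's
    return [
        {"id": category, "connections": sorted(others)}
        for category, others in sorted(related.items())
    ]


print(json.dumps(to_connections(sample), indent=2))
```

Because the query returns one row per episode that shares two categories, the same pair turns up many times; collecting into a set per category is what prunes those repeats.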

My first attempt at a visualisation simply executed the above query across the entire BBC dataset. This generated a huge wheel of connections between the categories, but ultimately the visualisation wasn’t that useful. So I decided to refine the visualisation to generate separate category wheels for each of the main BBC TV channels. This involved refining the SPARQL query to include an extra triple pattern to limit Episodes to just those associated with a specific channel (po:masterbrand). The following revised query restricts results to BBC 1:


PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?categoryLabel ?relatedLabel WHERE
{
  ?episode a po:Episode;
    po:masterbrand ;
    po:category ?category;
    po:category ?related. 

  ?category a po:Genre;
    rdfs:label ?categoryLabel. 

  ?related a po:Genre;
    rdfs:label ?relatedLabel. 

  FILTER (?category != ?related)
}
ORDER BY ?categoryLabel

The result of this visualisation is much more interesting.

Each of the BBC channels has a different range of programming and this emphasis is really clear in the visualisation. Compare for example BBC 1 and BBC 3, or either with BBC 4. For those of us in the UK who have already internalised this, there may not be a great deal of new information here, but it’s nice to see how this feature of the dataset can be surfaced with very little effort. There’s more analysis that could be done here though, particularly if the BBC open up their programme archives. For example, how does the range of programme categories for a channel change over time? Which programmes actually link the different categories together? Could other visualisations provide more insight into the programming than a simple relationship wheel? For example, could a treemap style visualisation give some indication of the amount of schedule time devoted to a particular category of programme?

Why not see what you can come up with?

Presenting BBC and NASA data using Freemix and SPARQL

One of the most interesting applications I saw at the recent Semantic Technology conference was Freemix. The application, which is currently in limited beta, allows anyone to easily create customized views over data that they upload into the system. There are also the usual networking features, providing an additional social dimension to data sharing and publishing. As I understand it, Zepheira have plans for expanding the range of features in all sorts of ways, including new visualisations, the ability to merge and remix data from several sources, and naturally enough a commercial version that can be deployed within the enterprise.

The core of Freemix is Simile Exhibit and a drag and drop interface for building up an Exhibit presentation over data that the user has uploaded. Data can be presented in several different ways, including simple tabular spreadsheets and in the Exhibit JSON format. Exhibit provides a number of different existing views suitable for presenting data, including lists, tables, maps, timelines, etc. As a web developer it’s straightforward to build up your own Exhibits; Freemix takes this to the next level, making it trivial to build a presentation in just a few clicks, without the need to learn any markup: you just have to understand your own data.

Naturally enough I was curious to know whether Freemix could be used to build presentations of Linked Data, and specifically whether I could feed it with data from the Talis Platform. I’ve been working with the BBC data quite extensively recently, and have been compiling a space flight dataset. So I thought I’d use those as my test cases. Both of these datasets are in Platform stores, so I explored the options for extracting data using a SPARQL query in order to build a presentation in Freemix. It turns out it’s really easy.

Freemix supports importing JSON data from a URL, so I knew that in theory I could write a SPARQL query against a Platform store and use the SPARQL protocol request URL as the import target. As I didn’t want to extract the whole dataset, just some interesting subset for my presentation, a SPARQL CONSTRUCT query seemed like the best option. Like Exhibit, Freemix requires a relatively flat data structure — i.e. resources with properties, rather than a true directed graph. This means that within the CONSTRUCT query I would need to simplify the graph structure, removing some of the richer modelling, to re-shape the data to fit Freemix’s expectations.

Here’s a query I came up with for my NASA data:


PREFIX rdfs:
PREFIX dc:
PREFIX space:
PREFIX xsd:
PREFIX foaf: 

CONSTRUCT {
?spacecraft foaf:name ?name;
space:agency ?agency;
space:mass ?mass;
foaf:homepage ?homepage;
space:launched ?launched;
dc:description ?description;
space:discipline ?label.
}
WHERE {
?launch space:launched ?launched.

?spacecraft foaf:name ?name;
space:agency ?agency;
space:mass ?mass;
foaf:homepage ?homepage;
space:launch ?launch;
dc:description ?description;
space:discipline ?discipline.

?discipline rdfs:label ?label.

FILTER (?launched > "2005-01-01"^^xsd:date)
}

The query finds all spacecraft launched since 2005, extracting the name, agency, mass, etc. The labels of the disciplines (subject categories) and the launch dates, which are originally associated with separate resources in the underlying graph, are re-presented here as properties of the spacecraft itself. Not ideal from a modelling or data interchange perspective, but a reasonable trade-off for shaping data for presentation purposes.

So far so good. The Talis Platform supports a range of output options from CONSTRUCT queries including both RDF/XML and RDF/JSON. Unfortunately Freemix doesn't support RDF/JSON as an input option although this would make a nice addition to the range of import options. In order to convert from the RDF/XML to the Exhibit/JSON format for Freemix I used the Talis Morph service. Morph is a simple service that provides a number of options for converting between semantic web formats. RDF/XML to Exhibit/JSON is one of those options, so all I needed to do was pipe the original SPARQL query URL through the morph service to get my final import target for Freemix.
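
The one thing to watch when piping one URL through another service like this is encoding: the whole SPARQL request URL has to be percent-encoded as a single parameter value, otherwise its own ? and & characters get mixed up with the outer request. Here's a sketch of the idea in Python. Everything specific in it is a placeholder: the store and converter addresses are example.org stand-ins, and the source and to parameter names are invented for the illustration rather than Morph's real interface:

```python
from urllib.parse import urlencode, urlparse, parse_qs

# All of these values are placeholders, not real service addresses.
STORE_SPARQL = "http://store.example.org/services/sparql"
CONVERTER = "http://morph.example.org/"
QUERY = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10"


def sparql_url(endpoint, query):
    """Request URL for a query; 'query' is the SPARQL protocol parameter."""
    return endpoint + "?" + urlencode({"query": query})


def via_converter(converter, data_url, target_format):
    """Wrap a data URL in a conversion request.

    The 'source' and 'to' parameter names are hypothetical.
    """
    # urlencode escapes the inner URL's '?', '&' and '=' so that it
    # travels as one opaque value
    return converter + "?" + urlencode({"source": data_url, "to": target_format})


import_target = via_converter(CONVERTER, sparql_url(STORE_SPARQL, QUERY), "exhibit-json")
print(import_target)
```

The resulting single URL is what would be handed to an importer that only knows how to fetch JSON from an address.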

You can view the imported data on my Freemix homepage. And here's a presentation of that same data. As you can see the presentation provides a list and table views, piecharts that break down launches by agency and discipline, and also a timeline view of the launches. This was incredibly quick to put together.

I tried the same approach with some BBC data. So here's a simple Dr Who episode guide as a Freemix. The presentation options are a little more limited here, partly because there aren't as many natural facets to the BBC data, but also because Freemix doesn't (yet?) offer the ability to, e.g. create a coverflow presentation of images, or a tag cloud over blocks of text. The ability to mark fields as numbers and sort tables by multiple fields would also be useful. Having said that, try searching for "Rose" in the search box to see which episode descriptions mention her; note that the series facet on the left also automatically updates.

As you can see from the SPARQL query, some massaging of the graph structure was required to include series titles against each episode.


PREFIX foaf:
PREFIX rdfs:
PREFIX po:
PREFIX dc:
PREFIX freemix: 

CONSTRUCT {

?episode a po:Episode;
foaf:depiction ?depiction;
freemix:seriesTitle ?seriestitle;
dc:title ?title;

po:position ?position;
po:short_synopsis ?syn.
}
WHERE
{
po:series ?series.

?series dc:title ?seriestitle;
po:episode ?episode.

?episode a po:Episode;
foaf:depiction ?depiction;
dc:title ?title;
po:position ?position;
po:short_synopsis ?syn.

}

My only other issue with Freemix is the live-ness of the data. Ideally instead of having to import data directly into the system, it should instead be fetched from source either on demand or on a regular basis. I suspect this is the kind of feature that will end up in a commercial version of the product.

Overall though I was quite pleased with how easy it was to create these kinds of presentations. I'm convinced that for Linked Data to truly hit the mainstream we need simple tools like Freemix that let all of us easily compile and create custom presentations of data. Obviously, we also need to be able to easily select the data that we want to display, and very few people will want to bother with SPARQL queries. So I think there is some interesting work to be done to create SPARQL query builders that tie into browsers, e.g. so I can select the data facets I'm interested in as I browse and then choose to represent those facets in different ways.

Searching the BBC Data in the Talis Platform

I’ve previously blogged about how easy it is to create a custom search index using the Platform. So obviously during the process of loading the BBC programmes and music data into the Platform we’ve used this feature to build a search engine across their data.

In this post I wanted to show a few example queries and then review how we’ve configured the search indexes so you can not only get the most from the feature, but also see how it can be used against real-world data.

Sample Queries

Here are some sample queries. The Platform is more of a search engine tool-kit than a search engine per se: the results aren’t a human-readable web page, they’re an RSS 1.0 document that contains enough structured metadata about each item in order to build a presentation of the results. And where additional metadata is required, this can be extracted using the describe service, additional searches, augmentation or a SPARQL query.

However, for the purposes of this article it’s enough to view the examples in your browser. Application developers will want to dig into the underlying markup to see what extra data is included.

  • A search for “Banksy”
  • A search for “The Prodigy”: returns the artist, the dbpedia entry, and episode titles and descriptions in which they are mentioned
  • A search for “Terry Pratchett”: again produces a mixture of different types.
  • A search for “Prodigy” limited to things of type “http://purl.org/stuff/rev#Review”. Results.
  • A facetted search for “Prodigy” grouping the results based on their RDF type. Results. This shows us that we have results not only in episodes but in a variety of other types too. We can drill down into these to form the following search:
  • A search for “Prodigy” limited to Music Segments. Results.

If you want to try out your own queries, then use this simple form.

The Configuration

To show how we’ve configured the Field Predicate Map and Query Profile for the BBC Backstage store, I’ve uploaded them to our public SVN: fmap.rdf and queryprofile.rdf

Looking at the Field Predicate Map, you can see we’ve configured the Platform store to index the key predicates in the BBC data, including titles, labels, descriptions and synopses. You can use any of the named fields in the configuration to refine searches to specific predicates in the data, allowing construction of an “advanced search form”. E.g. we can search for name:”Stephen Fry” to search for a person called Stephen Fry (results).

The RDF type property is also included in the Field Predicate Map to allow us to limit searches to particular types of resource. It also enables us to do facetted searches based on type, giving us an alternate view of the data. It’s easy to see how that functionality could be used to help build some useful additional options to restrict the search results presented in a user interface.

To configure the relevance ranking we’ve chosen to boost hits in “labels” (names, labels, titles) over “descriptions” (description, synopses, review). We could easily change the boosting to favour one or other type of predicate to further tweak the results. But this configuration provides a reasonable set of search results for the tests we’ve done. Let us know how you get on and whether you think any of this should be changed. We’re happy to alter the configuration to make sure that people can get the most from the BBC data.

Fishing for BBC Data using Augmentation

In some of my recent talks I’ve used the metaphor of streams, pools and reservoirs for describing the flow and collection of data across the web. I usually refer to some of the different forms of data extraction that we support on the Platform, which cover keyword searching as well as more structured queries.

Another form of data extraction is the Augmentation Service, which might be described as “fishing for data, using URIs as bait”. I thought I’d put together a little illustration that shows the potential for this kind of data extraction, as it’s both powerful and simple to use: so simple that you don’t need to write any queries at all.

Let’s look at a sample RSS 1.0 feed that contains a review of an episode of Dr Who. For brevity, I’ll only include the metadata for the single item in the feed:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
  <rev:text>A classic episode of Dr Who...</rev:text>
  <foaf:primaryTopic rdf:resource="http://www.bbc.co.uk/programmes/b0074gpl#programme"/>
</item>

The item has the standard RSS 1.0 elements for title and link, but as the item is also a review, it includes some additional metadata using the review vocabulary. The relationship between the review item and the Episode that is being reviewed is made using the foaf:primaryTopic property. The precise vocabularies don’t really matter; the important thing is that there is a reference to a BBC /programmes URI: this is our bait.
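
To make the “bait” idea concrete, here’s a small Python sketch that pulls the referenced resource URIs out of a feed like the one above. I’ve wrapped the item in an rdf:RDF element with the usual RSS 1.0, FOAF and review namespace declarations so that it parses on its own, and referenced_uris is a name I’ve invented. It only illustrates which URIs are present to be looked up; it says nothing about how the Augmentation Service itself selects them:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The sample item, wrapped in a root element with namespace declarations
# (added here so the fragment is well-formed on its own).
feed = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:rev="http://purl.org/stuff/rev#">
  <item rdf:about="http://www.example.org/reviews/1">
    <title>Review of "Blink"</title>
    <link>http://www.example.org/reviews/1</link>
    <rev:text>A classic episode of Dr Who...</rev:text>
    <foaf:primaryTopic rdf:resource="http://www.bbc.co.uk/programmes/b0074gpl#programme"/>
  </item>
</rdf:RDF>"""


def referenced_uris(rss_xml):
    """List every URI that an element points at via rdf:resource."""
    root = ET.fromstring(rss_xml)
    uris = []
    for element in root.iter():
        # ElementTree exposes namespaced attributes as '{namespace}name'
        uri = element.get("{%s}resource" % RDF)
        if uri is not None:
            uris.append(uri)
    return uris


print(referenced_uris(feed))
```

For this feed the only such reference is the foaf:primaryTopic link to the episode, which is exactly the URI we want more data about.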

The Augmentation Service allows the URL of an RSS 1.0 feed to be passed in as a parameter. You can use the form provided by the augment service on the BBC Backstage store and paste in the URL of the sample RSS 1.0 feed, or click here to review the results. Within the browser you won’t see that a great deal has changed, although you should see that the results are themselves an RSS 1.0 feed. What the Augmentation Service does is process an RSS feed to augment the metadata in the feed items with data present in the Platform store.

Here’s the same RSS item after it’s been augmented; the additional metadata is everything nested inside the foaf:primaryTopic element:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <foaf:primaryTopic>
    <ns.0:Episode rdf:about="http://www.bbc.co.uk/programmes/b0074gpl#programme">
      <ns.0:medium_synopsis>In an old, abandoned house, the Weeping Angels wait.
      Only the Doctor can stop them, but he's lost in time.</ns.0:medium_synopsis>
      <rdf:type>
        <rdf:Description rdf:about="http://purl.org/ontology/po/Episode"/>
      </rdf:type>
      <ns.0:position>10</ns.0:position>
      <ns.0:short_synopsis>Only the Doctor can stop the Weeping Angels, but he's lost in time.</ns.0:short_synopsis>
      <ns.0:genre>
        <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/genres/drama/scifiandfantasy#genre"/>
      </ns.0:genre>
      <ns.0:microsite>
        <rdf:Description rdf:about="http://www.bbc.co.uk/doctorwho/"/>
      </ns.0:microsite>
      <ns.0:version>
        <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b0073km9#programme"/>
      </ns.0:version>
      <foaf:depiction>
        <rdf:Description rdf:about="http://www.bbc.co.uk/iplayer/images/episode/b0074gpl_512_288.jpg"/>
      </foaf:depiction>
      <ns.1:label>Blink</ns.1:label>
      <ns.0:masterbrand>
        <rdf:Description rdf:about="http://www.bbc.co.uk/bbcone#service"/>
      </ns.0:masterbrand>
      <dc:title>Blink</dc:title>
    </ns.0:Episode>
  </foaf:primaryTopic>
  <rev:text>A classic episode of Dr Who...</rev:text>
  <rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
</item>

As you can see the feed now includes all of the key metadata about the episode, including its title, a synopsis, a link to a depiction of the episode, and to the Dr Who microsite on the BBC. All without writing any queries.

The trigger for the augmentation is simply the presence of a URI in the feed that is also present in the RDF in the Platform store. If the URI is not found then it is ignored. But if the URI is present then a description of that resource is automatically added to the RSS feed. In formal RDF terms that description is the Concise Bounded Description of the resource. More simply, it will be all the literal properties associated with the resource (e.g. the title and the synopsis) plus links to any related resources (e.g. the microsite, the genre). The end result is a feed that has been either completely or partially enriched against the data.
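The lookup rule can be sketched in a few lines of Ruby over an in-memory list of triples. This is a simplified illustration, not the service’s implementation: the describe and augment helpers, the abbreviated predicate names, the ex:note blank node and the example.org resources are all hypothetical, and every term is treated as a plain string.

```ruby
require 'set'

EPISODE = "http://www.bbc.co.uk/programmes/b0074gpl#programme"

# Toy store: each triple is [subject, predicate, object].
STORE = [
  [EPISODE, "dc:title", "Blink"],
  [EPISODE, "po:position", "10"],
  [EPISODE, "po:microsite", "http://www.bbc.co.uk/doctorwho/"],
  [EPISODE, "ex:note", "_:b1"],                      # hypothetical blank node
  ["_:b1", "ex:text", "hypothetical nested value"],
  ["http://www.example.org/other", "dc:title", "Unrelated"]
].freeze

def blank_node?(term)
  term.start_with?("_:")
end

# Rough approximation of a Concise Bounded Description: every triple with the
# resource as subject, following blank node objects but not named resources.
def describe(triples, uri, seen = Set.new)
  return [] if seen.include?(uri)
  seen << uri
  direct = triples.select { |s, _p, _o| s == uri }
  nested = direct.flat_map { |_s, _p, o| blank_node?(o) ? describe(triples, o, seen) : [] }
  direct + nested
end

# URIs that the store knows nothing about are silently ignored.
def augment(triples, uris)
  uris.each_with_object({}) do |uri, found|
    description = describe(triples, uri)
    found[uri] = description unless description.empty?
  end
end
```

Calling augment(STORE, [EPISODE, "http://www.example.org/missing"]) returns a description for the episode only, which mirrors the partial enrichment described above. The microsite is included as a link, but its own properties are not pulled in.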

This kind of data augmentation is uniquely possible with RDF because of its reliance on URIs for global identifiers. It makes dipping into a pool of data very easy to do. It’s also possible to augment a feed against multiple stores, pipelining the augmentation across multiple datasets to gather up all of the relevant data. As the output of a search against a Platform store is also RSS 1.0, you can enrich search results against multiple stores starting from an initial keyword search.

You can also see how this kind of enrichment can be used as part of, say, a Yahoo Pipeline. This is the primary reason why the service has been initially designed to work on RSS 1.0 feeds: it’s well supported; easy to generate; and of all the varieties of RSS, RSS 1.0 is processable as both an RDF and an XML vocabulary, making it easy to process in this context. We are intending to expand the support to cover generic RDF input and output, and other flavours of RSS.

In the meantime, happy data fishing!

Understanding the Big BBC Graph

In the lead up to the announcement of the BBC SPARQL endpoint trials I’ve spent quite a bit of time working with and exploring the BBC /programmes and /music datasets. I thought it would be useful to write up some of this to help out those of you looking to explore the data using the Talis Platform SPARQL endpoint. (Tip: use the newer SPARQL form for a better user experience when exploring the data.)

What’s in the Store?

Currently the Platform store includes metadata for over 360,000 Radio and TV programme Episodes along with information on which Versions of those programmes have been broadcast, including the time and channel on which they were shown. Information is also available for 6,500 Series and 5,500 Brands, together with their relationships; for more on that see below.

For the music data, the endpoint includes all of the artist and album metadata currently available from the BBC Music website, which comprises over 23,000 solo artists, 11,000 groups, and 25,000 albums. There are also nearly 4,500 album reviews.

This core dataset is approximately 20 million triples, and this is obviously growing as new episodes and broadcasts are made, and as we crawl that additional data. But that’s not all…

The artist metadata refers to dbpedia entries via owl:sameAs links, and this immediate context has also been included, providing a single location to query and find all the additional metadata about a recording artist. As the metadata on the BBC programmes website gets updated to include dbpedia links, then this will also get included. We’re working with the BBC to get some of these links in place as soon as possible.

The /programmes team recently updated the website to begin exporting “segment” data. This describes what artist was being played in a specific segment of a broadcast (currently limited to Radio 2 & 6), providing links between the programmes and music datasets. Increasingly it really is just one large graph that the BBC are producing.

What Ontologies are Used?

The core of the dataset is modelled using the Programmes and Music ontologies. There is also the usual sprinkling of Dublin Core and FOAF terms to capture titles, describe people, provide images for episodes, etc. The RDF Review vocabulary has been used to model the album reviews.

The programmes website includes some content categories for genres and formats. These are modelled in the dataset as SKOS concepts. There seems to be some nascent support in the data for capturing metadata about people and places appearing in programmes. At the moment these are also modelled using SKOS.

That comprises the core data. Beyond that there are a number of different terms used in the dbpedia portions of the dataset. Check the dbpedia documentation for more information.

Understanding Brands, Series, Episodes

To get the most from the BBC programmes data you’ll need some understanding of some of the variations in the graph to ensure that you don’t accidentally exclude data in your queries. And if you’re a modelling geek like me it’s interesting in its own right! Any mistakes in the following are all my own, apologies to the BBC folk.

A Brand is a top-level concept that defines a collection of works. It’s the resource that ties together Series and Episodes. Dr Who is a brand, as is the BBC News, and The Catherine Tate Show. A Series, as you’d expect, is a run of Episodes, e.g. “Series 1 of The Wire”. And an Episode is similarly intuitively named.

We’re all already familiar with the basic relationships between these concepts. A Brand (“Red Dwarf”) may be related to a number of Series (“Red Dwarf Series 1”) and a Series is composed of Episodes (“Red Dwarf, Series 1, Episode 1”). But there are a few wrinkles that are worth pointing out, as they can impact the way you write your SPARQL queries. Thanks to Michael Smethurst for giving me a run-down of some of these!

Firstly a Brand may not be broken down into Series at all. The BBC News, for example, is simply a continuous stream of Episodes. Radio shows are similar.

Similarly a Series of Episodes may not necessarily be associated with a Brand. It may be a one-off run of Episodes, e.g. a short documentary series like Incredible Animal Journeys.

Some Episodes are not associated with either a Series or a Brand: films such as Lady In the Water, for example.

And there’s also the more interesting relationship in which one Series contains another. For example “Waking the Dead” is divided up into Series (e.g. Series 5), which themselves contain other Series (covering a specific story line, e.g. Towers of Silence) and then individual Episodes (Part 1).
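These variations are easier to see with a small model. The Ruby sketch below is only an illustration of the shapes described above, using a hypothetical Programme struct with an optional parent; it does not use the actual Programmes ontology terms.

```ruby
# Hypothetical model: a programme of some type with an optional parent.
Programme = Struct.new(:type, :title, :parent)

# Walk up the parent links, returning the chain from the top-level
# container down to the programme itself.
def ancestry(programme)
  chain = []
  current = programme
  while current
    chain.unshift("#{current.type}: #{current.title}")
    current = current.parent
  end
  chain
end

# A Series nested inside another Series, as with Waking the Dead.
brand  = Programme.new("Brand", "Waking the Dead", nil)
series = Programme.new("Series", "Series 5", brand)
story  = Programme.new("Series", "Towers of Silence", series)
part   = Programme.new("Episode", "Part 1", story)

# A film: an Episode with neither a Series nor a Brand.
film = Programme.new("Episode", "Lady In the Water", nil)

puts ancestry(part).join(" > ")
puts ancestry(film).join(" > ")
```

The point for query writers is that nothing here assumes a fixed Brand, Series, Episode depth. A query that requires every Episode to have exactly one Series and one Brand would silently drop both of these examples.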

(As an aside, this is the kind of flexibility that makes RDF such a great tool for modelling real-world data. I’ve used similar approaches in the past to model bibliographic metadata, throwing out hierarchies and simply connecting together chunks of content in whatever structure is most suitable.)

An Episode may also have more than one Version. It is at the Version level that information such as the sound format or duration of the show is captured; after all, there may be many different manifestations of the same episode. Versions are also associated with Broadcasts, which capture the date, time and channel (“masterbrand” in the Programmes ontology) on which the programme is aired. A Version of an Episode may be broadcast several times.

Finally at the most fine-grained level, there are Timelines that describe the start and end time of a specific broadcast.

Application Ideas

During my expeditions through the Big BBC Graph (“you’re in a maze of twisty little predicates, all alike…”) I’ve come up with a few application ideas that it would be interesting to put together. I thought I’d throw these out and see if anyone wants to pick them up.

Programme Reviews. It’d be easy to build a mashup of the BBC programmes data and something like Revyu (which also has a SPARQL endpoint) to allow someone to review a programme that they watched last night. Note that, as our crawling will be lagging behind the live site until we’ve implemented real-time updates, there will be a delay between something being aired and it appearing in the Platform for reviewing.

PVR Integration. There are a number of open source PVR solutions out there. Could some of these be updated to automatically pull in additional data from the endpoint to improve electronic programme guides?

Geographic Overlays. The interconnections between radio programmes, artists and their locations offer an opportunity to build some mapping mashups, using either Google Maps or Earth. For example it ought to be possible to lay out the geographic spread of artists played by different BBC radio programmes and stations. Interested in music from a particular country or region? (Maybe you’re planning a trip there and want to pick up on the local vibe.) Then use a map to home in on radio programmes that are most likely to play those artists.

Fan Widgets. The ability to extract data from the endpoint using SPARQL and JSON means that it’s really easy to create little widgets to include programme data on external web pages. How could something like the Doctor Who Tardis Index File be enriched by widgets that came straight from the BBC database? Throw in additional annotations from the community and you could make some really interesting embeddable gadgets. Of course there’s also the other direction: if fan communities start using BBC identifiers then the BBC may be able to feed this crowd-sourced data back into their site, just as they’re doing with Wikipedia (via dbpedia).

Under the Talis Connected Commons scheme anyone can have free hosting on the Platform for public domain data, so if a fan community wanted to organize itself around creating additional annotations for BBC programmes (how about character lists? mood assessment? scene breakdowns?) then these can be stored in the Platform for free, and then mashed up with the BBC data on the server-side using features like the Augmentation service, or on the client-side using SPARQL and JSON. Lots of potential there.

Summary

Hopefully that provides a good overview of the BBC linked data graph that we’re now hosting in the Talis Platform. There should be sufficient pointers here, and in some of the example queries and demos we’ve put together, to get you started. If not, then feel free to ask questions on the BBC Backstage mailing list, the n2-dev mailing list, or on IRC in #talis on irc.freenode.net.

SPARQL AJAX Client Library and Example

Over the past few years I’ve tinkered with a number of different implementations of an AJAX client library for SPARQL. Before a standard format for SPARQL JSON results was created, this involved having to jump through the extra hoops of parsing the XML format. But things are much easier now, especially when the JSON support is extended to include the results of CONSTRUCT and DESCRIBE queries.

My personal favourite SPARQL client library though is the one produced by Lee Feigenbaum, Elias Torres, and Wing Yung as part of their work on the SPARQL Calendar Demo.

While the sparql.js library only supports JSON it does have a few convenience features which I like, including global PREFIX bindings and some functions for automatically processing the JSON results to produce some simpler javascript objects (e.g. arrays and hashes) that simplify some scripting tasks and make code more readable.
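The same kind of simplification is easy to sketch in other scripting languages too. Here is a minimal Ruby version that flattens the standard SPARQL JSON results format into an array of plain hashes. The simplify_bindings name and the sample values are made up for illustration; this is not a port of sparql.js.

```ruby
require 'json'

# Turn a SPARQL JSON results document into an array of { variable => value }
# hashes, dropping the type information and skipping unbound variables.
def simplify_bindings(json_text)
  results = JSON.parse(json_text)
  vars = results["head"]["vars"]
  results["results"]["bindings"].map do |binding|
    vars.each_with_object({}) do |var, row|
      row[var] = binding[var]["value"] if binding.key?(var)
    end
  end
end

# Made-up sample response with one optional variable left unbound.
SAMPLE_RESULTS = <<~'JSON'
  {
    "head": { "vars": ["name", "image"] },
    "results": { "bindings": [
      { "name":  { "type": "literal", "value": "Example Craft 1" },
        "image": { "type": "uri", "value": "http://www.example.org/craft1.jpg" } },
      { "name":  { "type": "literal", "value": "Example Craft 2" } }
    ] }
  }
JSON

simplify_bindings(SAMPLE_RESULTS).each { |row| puts row.inspect }
```

Throwing away the type and datatype information is a deliberate trade-off: it keeps the calling code short, at the cost of no longer being able to tell a URI from a literal.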

Using this on the Platform is quite straightforward, as you can upload this library, and any other related Javascript files, directly into the Contentbox of your store. This not only avoids any cross-domain issues, but also means that you can deploy simple AJAX applications directly from a store.

I’ve put together a super simple demo that uses the NASA spaceflight data. The source code is here, and I’ve uploaded the two files into the n2-examples store contentbox, so you can play with the running application.

The demo simply fetches the name, homepage, description and launch date for every spacecraft launched in a particular year, also retrieving a link to a photo if there’s one available. The results are dropped into an HTML table for viewing.

The code is well commented so rather than repeat that here, you can look through the Javascript file that does the actual interaction. I’ve used JQuery to help with the DOM manipulation, etc. This is delivered through the Google JQuery CDN rather than the Platform. But the rest of the application is served directly from the Platform.

A rather easy and trivial example, but sometimes it’s useful to reiterate the basics. And if you want to incorporate the NASA spaceflight data in your own mashups, then you can do so easily by simply using the version of sparql.js in the space data store.

In my view, SPARQL + JSON + scripting languages like JS and Ruby hit a nice sweet spot for working with RDF, especially with the ability to bring together data from multiple sources using a single standard API.

Note: Keith Alexander has written up some of his own experiments with playing with JQuery against the platform here and here. His JQuery plugin provides some additional Platform specific functionality.

Quick OpenCalais Hack

I’ve been doing some more work on the Ruby client for the Platform recently, and one of my main goals is to provide functionality that makes it easier to copy, merge, interlink and relate together datasets. So far I’ve been concentrating on providing some framework code to make it easier to mash-up data across SPARQL endpoints, but there are many more services that one might want to use when enriching a dataset.

One of those services is OpenCalais. I’ve played with the service on and off, and have previously built a Java client to explore similar functionality using Java and Jena. But as I’m primarily working with Ruby at the moment, I thought I’d look for a Ruby client for Calais. Happily there is one on Github and it’s available as a Ruby gem.

Documentation is a bit light, and I had to jump through a few hoops to get it working, needing to manually install the curb gem and some native libraries. The following worked for me:

sudo apt-get install libcurl3-dev
sudo gem install curb
sudo gem install calais

With that installed it was a breeze to run a document through the OpenCalais service, and then store the resulting RDF in the Platform:


# Use OpenCalais to find entities in a document specified on the command-line, then store the results
# in a Platform store
#
# Set the following environment variables:
#
# TALIS_USER:: username on Platform
# TALIS_PASS:: password
# TALIS_STORE:: store in which data will be stored
# CALAIS_KEY:: Calais license key
require 'rubygems'
require 'pho'
require 'calais'

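store = Pho::Store.new(ENV["TALIS_STORE"], ENV["TALIS_USER"], ENV["TALIS_PASS"])
# Read the whole input document; File.read closes the file handle for us
content = File.read(ARGV[0])
# Ask OpenCalais for an RDF description of the entities found in the text
rdf = Calais.enlighten( :content => content, :content_type => :text, :license_id => ENV["CALAIS_KEY"])
# Store the RDF in the Platform store and report the HTTP status of the response
resp = store.store_data(rdf)
puts resp.status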

The code is here, and here’s some sample input and sample output.

The code is pretty trivial and error handling is non-existent, but I was pleased with how easy it was to get some data out of OpenCalais and pushed into the Platform. A bit of SPARQL can then be used to do some analysis or further processing of the results.

So how do I plan to use this?

As a personal project I’m building out a dataset of NASA space-flight data, which will also include some metadata about astronauts and their roles on each mission. What I want to do is take some documents from the web and then store additional data to state relationships like “Buzz Aldrin is the foaf:primaryTopic of this document”.

The workflow I’m considering is to use a Google custom search to give me a high-level index of content, e.g. selecting only the NASA websites. I can then run some representative searches to find documents and use OpenCalais to do entity extraction on each result. I can then store the OpenCalais RDF data in a private graph in the store: I don’t want the raw data in the main dataset, as I want to assert triples using my own ids and preferred vocabularies.

If the data is in a private graph then I can use the store’s multisparql service to do some SPARQL queries to match up the resources and CONSTRUCT new triples to store in the public graph.
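To illustrate the matching step, here is a rough Ruby sketch that produces the kind of triples I’m after. It is not the SPARQL CONSTRUCT route described above, just the same idea in plain Ruby: the primary_topic_triples helper, the shape of the input hashes and all of the example.org URIs are hypothetical.

```ruby
FOAF_PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic"

# extracted: document URI => name of the main entity found in that document
# my_ids:    entity name  => the URI used for that entity in my own dataset
# Returns N-Triples lines, skipping any entity without a matching id.
def primary_topic_triples(extracted, my_ids)
  extracted.each_with_object([]) do |(document_uri, entity_name), triples|
    topic_uri = my_ids[entity_name]
    next unless topic_uri
    triples << "<#{document_uri}> <#{FOAF_PRIMARY_TOPIC}> <#{topic_uri}> ."
  end
end

extracted = {
  "http://www.example.org/docs/1" => "Buzz Aldrin",
  "http://www.example.org/docs/2" => "Someone Unmatched"
}
my_ids = { "Buzz Aldrin" => "http://www.example.org/people/buzz-aldrin" }

puts primary_topic_triples(extracted, my_ids)
```

Only the first document yields a triple, because the second entity has no id in my dataset. Matching on exact names is the weakest part of this sketch; a real workflow would need something more forgiving.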

I’ll post again with some more details on this as I progress, but I thought I’d start out by showing just how simple it is to mashup OpenCalais and the Talis Platform.

Don’t forget, if you want a Platform store to play with for development purposes then drop us a line.