
SPARQL Hacks: moving query logic into data

There are too many terms that mean the same thing sometimes. Take labels. rdfs:label is perhaps the most obvious choice if you want to label something in RDF, but there are a whole bunch of semantically equivalent predicates in high usage for doing the same thing. For a while, it seems, it was common practice for every vocabulary to define their own equivalent – though very few bother to rdfs:subPropertyOf rdfs:label (and some predate rdfs:label), so even if you can do some reasoning in your query engine, this might not help you much. So when you want to get the label for something, but you don’t know which predicate the data uses, you might end up doing something like this:


construct { ?s rdfs:label ?l }
where
{
?s ?p ?o
optional
{ ?s rdfs:label ?l }
optional
{ ?s foaf:name ?l }
optional
{ ?s sioc:name ?l }
optional
{ ?s dc:title ?l }
optional
{ ?s dcterms:title ?l }
}

Nasty. And maybe later you find another label predicate in the data somewhere and have to go modify your queries.

But, if I add these triples to my store:


<#a> rdfapp:labelPredicate dc:title, rdfs:label, dcterms:title, foaf:name, sioc:name .

I can instead do:


prefix rdfapp: <http://kwijibo.talis.com/vocabs/rdfapp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
construct { ?s rdfs:label ?l }
where
{
<#a> rdfapp:labelPredicate ?labelPredicate .
?s ?labelPredicate ?l .
}
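
To make the idea concrete outside of SPARQL, here is a minimal Python sketch of the same two-step join. It is only an illustration: triples are modelled as plain tuples, and the resources `<#alice>` and `<#report>` are made-up sample data, not anything from a real store.

```python
# Triples are modelled as plain (subject, predicate, object) tuples.
LABEL_PREDICATE = "rdfapp:labelPredicate"

def labels(triples, config_node="<#a>"):
    """Return {subject: [labels]} using whichever label predicates the data declares."""
    # Step 1: read the list of label predicates out of the data itself.
    label_predicates = {o for s, p, o in triples
                        if s == config_node and p == LABEL_PREDICATE}
    # Step 2: collect every value attached via one of those predicates.
    found = {}
    for s, p, o in triples:
        if p in label_predicates:
            found.setdefault(s, []).append(o)
    return found

# Made-up sample data for illustration only.
triples = [
    ("<#a>", LABEL_PREDICATE, "rdfs:label"),
    ("<#a>", LABEL_PREDICATE, "foaf:name"),
    ("<#a>", LABEL_PREDICATE, "dc:title"),
    ("<#alice>", "foaf:name", "Alice"),
    ("<#report>", "dc:title", "Annual Report"),
    ("<#report>", "dc:creator", "<#alice>"),
]

print(labels(triples))
# {'<#alice>': ['Alice'], '<#report>': ['Annual Report']}
```

Supporting a newly discovered label predicate then means adding one more `rdfapp:labelPredicate` triple to the data; the function, like the query, stays unchanged.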

voiD stores and Interesting Queries

Amongst the best incentives for data authors are applications that use that data. One sort of data that especially interests me is dataset metadata, for which the voiD vocabulary was developed; I think this kind of data has the potential to enable the future generation of web apps to join together the ever-growing web of data in wild and exciting new ways. So I was pretty pleased when I saw the voiD store from RKB Explorer. This store provides a SPARQL endpoint over all the voiD descriptions RKB Explorer have produced about their datasets, plus some descriptions they’ve gathered about other datasets. It also provides a list of source documents, sample queries, and a service that takes a list of URIs, and returns a list of SPARQL endpoints that might be able to return triples about them.

This, together with a rainy weekend, prompted me to try out some simple voiD-related things I’d been thinking of. I’ve also been aggregating voiD data in one of my dev stores. This is done partly by creating templated descriptions from a list of Talis Platform stores and poking at them with some SPARQL queries. The rest of the data I found either manually, or by querying Sindice for a list of void:Dataset URIs found in the documents they’ve crawled.

The Sindice API allows you to specify triple patterns with wildcards (for example, * rdf:type void:Dataset .) and returns the results as an Atom feed. I page through the results, importing the RDF from the URIs into my store.

One of my favourite terms from voiD is void:uriRegexPattern, which can be used to indicate that if a URI matches the pattern, the dataset might contain some triples about that URI. You can do this with a bit of SPARQL:

    
prefix void: <http://rdfs.org/ns/void#>
DESCRIBE ?dataset {
     ?dataset void:uriRegexPattern ?regex ; void:sparqlEndpoint ?sparql ; a void:Dataset .

    FILTER(REGEX("http://example.com/my/uri", ?regex))
}

    

The novel thing here is that normally, when you use REGEX() in SPARQL, you put a variable binding in the first parameter position, and hardcode a regular expression into the query in the 2nd position. Here though, the regex is in the data, and it is the string against which it is evaluated which is hardcoded, and the variable binding contains the regex. (Unfortunately, while this works with ARQ, it doesn’t appear to work with 3Store – which is perhaps why the rkbexplorer voiD Store provides this as a separate web service).
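
The same inversion can be sketched in Python: the patterns live in the data and the URI being tested is the fixed input. This is only an illustration under some assumptions: the dataset records below are hypothetical, and Python's `re` syntax is not identical to the XPath-style regular expressions that SPARQL's REGEX() uses, so real patterns may need care.

```python
import re

# Hypothetical voiD descriptions: each dataset carries its own URI pattern.
DATASETS = [
    {"dataset": "http://example.com/void#climbing",
     "uriRegexPattern": r"^http://example\.com/my/.*",
     "sparqlEndpoint": "http://example.com/sparql"},
    {"dataset": "http://example.org/void#other",
     "uriRegexPattern": r"^http://example\.org/id/.*",
     "sparqlEndpoint": "http://example.org/sparql"},
]

def endpoints_for(uri, datasets=DATASETS):
    """Return the SPARQL endpoints of datasets whose stored pattern matches `uri`."""
    # The regex comes from each record; the string it is tested against is fixed.
    return [d["sparqlEndpoint"] for d in datasets
            if re.search(d["uriRegexPattern"], uri)]

print(endpoints_for("http://example.com/my/uri"))
# ['http://example.com/sparql']
```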

So, I’ve used this to create a page that will take a URI, and query my voiD store for void:sparqlEndpoints and void:uriLookupEndpoints, which it will then call to retrieve triples and render them on the page. Here is a query for the URI http://climb.dataincubator.org/dataset .

Another query that interested me, which has become possible since the Platform introduced support for the COUNT() function from SPARQL 1.1, is: which are the most commonly used vocabularies? (SIOC and FOAF so far, though this is because I generated many of these triples based on scripted prodding of endpoints with ASK queries.) But then I wanted to be able to see easily which datasets used which vocabularies, so I created some pages to let me browse datasets by vocabulary.

  1. SIOC Core Ontology Namespace (54)
  2. Friend of a Friend (FOAF) vocabulary (42)
  3. Coreference Ontology (35)
  4. http://www.aktors.org/ontology/portal# (34)
  5. http://www.aktors.org/ontology/support# (30)
  6. http://www.rkbexplorer.com/ontologies/resist# (30)
  7. void (25)
  8. http://purl.org/NET/scovo# (24)
  9. http://acm.rkbexplorer.com/ontologies/acm# (22)
  10. http://courseware.rkbexplorer.com/ontologies/courseware# (21)

Then I made some pages to do the same thing with dct:subjects. Here, the largest category by some way, is category: online_social_networking. This is because I generated ?dataset dct:subject <http://dbpedia.org/resource/Category:Online_social_networking> . triples automatically for all the platform stores which made a certain use of terms from the SIOC ontology.

These automatically generated voiD descriptions will not, of course, present a balanced picture of what is out there, and they skew the results somewhat. The most interesting descriptions are those which are handcrafted to some extent, describing something of the nature of the dataset’s domains.

I’ve also provided a form for submitting voiD URLs. My hope is that this simple application, together with the rkbexplorer voiD Store, might encourage more people to describe their linked data datasets with voiD, or perhaps add more detail to the descriptions they already publish, in order to see their dataset come up in the appropriate queries. And I hope that this, in turn, will encourage others to build more sophisticated and exciting applications using that data.

SPARQL 1.1 Early Access Features

In yesterday’s monthly Talis Platform release we started rolling out some early access support for the SPARQL 1.1 query language. We’ve been monitoring the activity around the development of SPARQL extensions for some time and have been watching the Working Group’s activity to get a feel for which new features are to be included in the forthcoming revision to the language. For those of you interested in some background, Lee Feigenbaum has a nice presentation that summarizes the Working Group’s current thinking.

One major missing feature from SPARQL 1.0 was support for aggregates, i.e. the ability to count, sum and group results. These features have already been implemented by a number of triple stores and this work will get standardised as part of SPARQL 1.1. Because of our confidence in this feature being added to the specification; the existing implementation experience; and in response to customer feedback we have decided to release early access support for these specific features as an experimental enhancement to the Platform SPARQL endpoint.

The documentation on the developer wiki has been updated to start to itemize the supported SPARQL extensions.

Users should be aware that the syntax of the extensions may be subject to change as we’ll be attempting to track the progress of the working group as they clarify the specification of these features for inclusion in the standard. We’ll provide notice of any expected changes.

Users should also be aware that while the basic functionality of aggregates is supported in a number of other implementations, care should be taken if queries are intended to be portable across different triplestores and/or services. For example, the Talis Platform contains some mirrors of other datasets so queries written to use the new functionality may not be portable across other services due to the basic feature not being supported or due to minor syntactic differences.

With the warnings out of the way, here are some simple examples of the extensions in practice. The first query uses the BBC programmes and music data hosted in the platform, and asks for the number of albums released by The Prodigy. The query uses the count() function to count up the number of album titles. The results of the count are assigned to a variable called ?count in the SELECT clause using the new “SELECT expression” syntax.


#How many albums have been released by The Prodigy?
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rev: <http://purl.org/stuff/rev#>
SELECT (count(?title) as ?count) WHERE {
  ?group a mo:MusicGroup;
      foaf:name "The Prodigy";
       foaf:made ?album.
   ?album dc:title ?title.
}

Results.

The second example is a variant of one of the example queries that can be used against the Edubase data. In this case the query retrieves the number of schools closed in each parliamentary constituency in 2008, ordering the results in descending order. The new GROUP BY keyword is used to group the results by the label of the constituency.


#How many schools closed in each parliamentary constituency in 2008?
#In descending order of number of closures
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label (count(?school) as ?count) WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ;
     sch-ont:closeDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
GROUP BY ?label
ORDER BY DESC(?count)

Results.

We can revise this query to only include those constituencies in which at least 10 schools have closed. To do this we need to filter the results to just those where the count is equal to or greater than 10. The new HAVING keyword allows an expression to be applied to the result set before it is returned:


prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label (count(?school) as ?count) WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ;
     sch-ont:closeDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
GROUP BY ?label
HAVING (?count >= 10)
ORDER BY DESC(?count)

Results.
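
For readers who have not met aggregates before, the grouping, filtering and ordering steps can be mimicked in a few lines of Python. This is a rough sketch of the semantics only, not of how the Platform evaluates the query, and the rows are made-up sample data rather than Edubase results.

```python
from collections import Counter

def closures_by_constituency(rows, minimum=1):
    """rows: iterable of (school, constituency_label) pairs, one per closed school."""
    # GROUP BY ?label with count(?school)
    counts = Counter(label for _school, label in rows)
    # HAVING: keep only the groups whose count reaches the threshold
    kept = [(label, n) for label, n in counts.items() if n >= minimum]
    # ORDER BY DESC(?count)
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Made-up sample rows for illustration only.
rows = [("s1", "Bath"), ("s2", "Bath"), ("s3", "Leeds East"),
        ("s4", "Bath"), ("s5", "Leeds East"), ("s6", "Ealing North")]

print(closures_by_constituency(rows, minimum=2))
# [('Bath', 3), ('Leeds East', 2)]
```

The point to notice is the order of operations: the threshold is applied to whole groups after counting, which is why it cannot be expressed as an ordinary FILTER on individual schools.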

The SPARQL extensions page includes a few more examples of the syntax and a list of the operators now supported in the extended query language. If you have any feedback or questions, please leave a comment below.

SPARQLing data.gov.uk: Edubase Data

Last week the Cabinet Office issued a call for Open Data Developers to sign up to get a preview of the forthcoming UK Government public data website. The site includes a directory of existing datasets plus a growing number of datasets that have been converted to RDF and which will shortly be available as Linked Data. This data is being stored in the Talis Platform, providing developers with access to SPARQL endpoints as a means to query the data; we’ll also be including search and other access mechanisms at a later date.

In this series of postings I wanted to show some example SPARQL queries that can be used to access the data. If you’re new to SPARQL then you might want to look at Lee Feigenbaum’s SPARQL by Example tutorial, or my own short slide deck that covers all the basic syntax.

The first dataset I wanted to highlight is an extract of the Edubase dataset available from the Department for Children, Schools and Families. The conversion was carried out by the team at HP Labs and the result has been loaded into a Talis Platform store. The public-facing SPARQL endpoint is available from: http://services.data.gov.uk/education/sparql.

Here are some sample SPARQL queries you can use against the data:


#1. Select the names of schools in the Administrative District of the City of London
# Ordering results by name of the school
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?name WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00AA> ;
}
ORDER BY ?name

Results


#2. Which schools in the BANES area have a nursery?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
SELECT ?name WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00HA> ;
     sch-ont:nurseryProvision "true"^^xsd:boolean
}
ORDER BY ?name

Results


#3. Select the names and addresses of schools in the Administrative District of the City of London
# Ordering results by name of the school
# Note: we use OPTIONAL here as not every school has an address listed in the data
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?name ?address1 ?address2 ?postcode ?town WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00AA> .

  OPTIONAL {
   ?school sch-ont:address ?address .
  ?address sch-ont:address1 ?address1 ;
      sch-ont:address2 ?address2 ;
      sch-ont:postcode ?postcode ;
      sch-ont:town ?town .
  }
}
ORDER BY ?name

Results


#4. Select the name, lowest and highest age ranges, capacity and pupil:teacher ratio
# for all schools in the Bath & North East Somerset district
# Again we use OPTIONAL to allow for missing data items.
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?name ?lowage ?highage ?capacity ?ratio WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00HA> .
     OPTIONAL {
       ?school sch-ont:statutoryLowAge ?lowage ;
     }

     OPTIONAL {
       ?school sch-ont:statutoryHighAge ?highage ;
     }

     OPTIONAL {
       ?school sch-ont:schoolCapacity ?capacity ;
     }

     OPTIONAL {
       ?school sch-ont:pupilTeacherRatio ?ratio
     }
}
ORDER BY ?name

Results


#5. What is the uri, name, and opening date of the oldest school in the UK?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?school ?name ?date WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:openDate ?date.
}
ORDER BY ASC(?date)
LIMIT 1

Results


#6. Select the name, easting and northing for the 100 newest schools in the UK.
# Can be used to plot them on a map
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?school ?name ?date ?easting ?northing WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:openDate ?date ;
     sch-ont:easting ?easting ;
     sch-ont:northing ?northing .
}
ORDER BY DESC(?date)
LIMIT 100

Results


#7. Select the uri, name, easting and northing for all schools opened in 2008
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
SELECT ?school ?name ?date ?easting ?northing WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:openDate ?date ;
     sch-ont:easting ?easting ;
     sch-ont:northing ?northing .
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}

Results


#8. Select the uri, name, and the reason for closing for all schools that are currently
# scheduled for closure. The reason is a URI from a controlled vocabulary in the ontology.
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
SELECT ?school ?name ?reason WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Open_but_proposed_to_close ;
     sch-ont:reasonEstablishmentClosed ?reason .
}

Results


#9. In which parliamentary constituencies did schools close in 2008?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?cons ?label WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ;
     sch-ont:closeDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
ORDER BY ?cons

Results


#10. In which parliamentary constituencies did schools open in 2008?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?cons ?label WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:openDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
ORDER BY ?cons

Results

Hopefully that’s enough to get you started. If you want a bit more background on the modelling and a look at the ontology, then read this posting to the uk-government-data mailing list by Stuart Williams.

Note: updated 16 Nov 2009 to reflect changes to the EduBase data. The first version of this dataset was created before the proposed guidelines for public sector URIs were published. The school ontology used in that first dataset had a URI of http://education.data.gov.uk/ontology/school# which has now been replaced with http://education.data.gov.uk/def/school/. Also, the URIs for administrative districts were temporary placeholders containing the phrase “placeholder-id” in their path. These have now been updated to URIs based on the Office for National Statistics district codes, for example http://statistics.data.gov.uk/id/local-authority-district/00AA

Visualising BBC Programme Categories

Whilst I was exploring the BBC programmes data looking for possible demonstration applications I thought it might be interesting to try and create a visualisation of the relationships between different categories of BBC programmes. The BBC datasets use SKOS as a categorization scheme, with separate taxonomies for formats (e.g. documentaries, animation, etc) and genres (e.g. children’s programmes, science fiction, etc). If you poke around a little, you can also see a nascent category system for places and people, although there doesn’t seem to be much data there at present (and what is there seems to change regularly).

For my purposes, the genre classifications looked most interesting. Episodes are associated with their genre category via the po:category property. As I was interested in finding relationships between genres, what I was looking for was a way to relate together individual categories, other than by the obvious super/sub-category relationship.

It occurred to me that if two categories were associated with the same episode, then this could be viewed as a declaration of some implicit relationship between the categories. Extracting this in SPARQL is straightforward, as we just need to match episodes that have more than one category:


SELECT ?categoryLabel ?relatedLabel WHERE
{
  ?episode a po:Episode;
    po:category ?category;
    po:category ?related. 

  ?category a po:Genre;
    rdfs:label ?categoryLabel. 

  ?related a po:Genre;
    rdfs:label ?relatedLabel. 

  FILTER (?category != ?related)
}
ORDER BY ?categoryLabel

In the above SPARQL query we match any episode that has at least two categories (because we use two po:category patterns), and where those categories are different (in the FILTER). This excludes the unwanted result where the ?category and ?related variables are bound to the same value. I didn’t bother with pruning out duplicates as this could easily be done on the client-side.

In order to visualise the results, I decided to use MooWheel. This provides a simple Javascript visualisation toolkit for presenting connections between a set of resources. MooWheel can be configured using a JSON data structure, so generating a MooWheel visualisation from a SPARQL query is relatively straightforward: the query results can be retrieved as SPARQL/JSON which can then be massaged into the appropriate JSON structure to generate the MooWheel visualisation. Check out the source code of the demonstration for sample code to do this (look at the success callback).
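
As a rough illustration of that massaging step, here is a small Python sketch that turns a SPARQL JSON result document into a map from each category label to its related labels, dropping duplicate pairs along the way. The output shape is a generic adjacency map chosen for this example; it is an assumption and not MooWheel's actual input format, and the sample bindings are made up.

```python
import json

def connections(sparql_json, source="categoryLabel", target="relatedLabel"):
    """Turn SPARQL JSON results into {label: sorted list of related labels}."""
    data = json.loads(sparql_json)
    wheel = {}
    for binding in data["results"]["bindings"]:
        a = binding[source]["value"]
        b = binding[target]["value"]
        wheel.setdefault(a, set()).add(b)  # sets drop duplicate pairs
    return {label: sorted(related) for label, related in wheel.items()}

# Made-up result document in the standard SPARQL JSON results layout.
sample = json.dumps({
    "head": {"vars": ["categoryLabel", "relatedLabel"]},
    "results": {"bindings": [
        {"categoryLabel": {"type": "literal", "value": "Comedy"},
         "relatedLabel": {"type": "literal", "value": "Drama"}},
        {"categoryLabel": {"type": "literal", "value": "Comedy"},
         "relatedLabel": {"type": "literal", "value": "Drama"}},
        {"categoryLabel": {"type": "literal", "value": "Drama"},
         "relatedLabel": {"type": "literal", "value": "Comedy"}},
    ]},
})

print(connections(sample))
# {'Comedy': ['Drama'], 'Drama': ['Comedy']}
```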

My first attempt at a visualisation simply executed the above query across the entire BBC dataset. This generated a huge wheel of connections between the categories, but ultimately the visualisation wasn’t that useful. So I decided to refine the visualisation to generate separate category wheels for each of the main BBC TV channels. This involved refining the SPARQL query to include an extra triple pattern to limit Episodes to just those associated with a specific channel (po:masterbrand). The following revised query restricts results to BBC 1:


SELECT ?categoryLabel ?relatedLabel WHERE
{
  ?episode a po:Episode;
    po:masterbrand ;
    po:category ?category;
    po:category ?related. 

  ?category a po:Genre;
    rdfs:label ?categoryLabel. 

  ?related a po:Genre;
    rdfs:label ?relatedLabel. 

  FILTER (?category != ?related)
}
ORDER BY ?categoryLabel

The results of this visualisation are much more interesting.

Each of the BBC channels has a different range of programming and this emphasis is really clear in the visualisation. Compare for example BBC 1 and BBC 3, or either with BBC 4. For those of us in the UK who have already internalised this, there may not be a great deal of new information here, but it’s nice to see how this feature of the dataset can be easily surfaced with very little effort. There’s more analysis that could be done here though, particularly if the BBC open up their programme archives. For example, how does the range of programme categories for a channel change over time? Which programmes actually link the different categories together? Could other visualisations provide more insight into the programming than a simple relationship wheel? For example, could a treemap style visualisation give some indication of the amount of schedule time devoted to a particular category of programme?

Why not see what you can come up with?

Presenting BBC and NASA data using Freemix and SPARQL

One of the most interesting applications I saw at the recent Semantic Technology conference was Freemix. The application, which is currently in limited beta, allows anyone to easily create customized views over data that they upload into the system. There’s also the usual networking features providing an additional social dimension to data sharing and publishing. As I understand it Zepheira have plans for expanding the range of features in all sorts of ways, including new visualisations, the ability to merge and remix data from several sources, and naturally enough a commercial version that can be deployed within the enterprise.

The core of Freemix is Simile Exhibit and a drag and drop interface for building up an Exhibit presentation over data that the user has uploaded. Data can be presented in several different ways, including simple tabular spreadsheets and in the Exhibit JSON format. Exhibit provides a number of different existing views suitable for presenting data, including lists, tables, maps, timelines, etc. As a web developer it’s straightforward to build up your own Exhibits; Freemix takes this to the next level, making it trivial to build a presentation in just a few clicks, without the need to learn any markup: you just have to understand your own data.

Naturally enough I was curious to know whether Freemix could be used to build presentations of Linked Data, and specifically whether I could feed it with data from the Talis Platform. I’ve been working with the BBC data quite extensively recently, and have been compiling a space flight dataset. So I thought I’d use those as my test cases. Both of these datasets are in Platform stores, so I explored the options for extracting data using a SPARQL query in order to build a presentation in Freemix. It turns out it’s really easy.

Freemix supports importing JSON data from a URL, so I knew that in theory I could write a SPARQL query against a Platform store and use the SPARQL protocol request URL as the import target. As I didn’t want to extract the whole dataset, just some interesting subset for my presentation, a SPARQL CONSTRUCT query seemed like the best option. Like Exhibit, Freemix requires a relatively flat data structure — i.e. resources with properties, rather than a true directed graph. This means that within the CONSTRUCT query I would need to simplify the graph structure, removing some of the richer modelling, to re-shape the data to fit Freemix’s expectations.
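
For anyone unfamiliar with SPARQL protocol request URLs, the following Python sketch shows how one is put together: the query text is URL-encoded and passed in the `query` parameter of a GET request. The endpoint address is a placeholder and the query is a trivial stand-in; neither is the actual store or query used for this presentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def sparql_request_url(endpoint, query):
    """Build a SPARQL protocol GET URL; `query` is the standard parameter name."""
    return endpoint + "?" + urlencode({"query": query})

# Placeholder endpoint and a trivial stand-in query, for illustration only.
ENDPOINT = "http://example.org/store/services/sparql"
QUERY = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10"

url = sparql_request_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again recovers the original query text.
print(parse_qs(urlsplit(url).query)["query"][0] == QUERY)
```

A URL built this way can be handed to anything that imports data from a web address, which is what makes it usable as an import target.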

Here’s a query I came up with for my NASA data:


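PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX space:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>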

CONSTRUCT {
?spacecraft foaf:name ?name;
space:agency ?agency;
space:mass ?mass;
foaf:homepage ?homepage;
space:launched ?launched;
dc:description ?description;
space:discipline ?label.
}
WHERE {
?launch space:launched ?launched.

?spacecraft foaf:name ?name;
space:agency ?agency;
space:mass ?mass;
foaf:homepage ?homepage;
space:launch ?launch;
dc:description ?description;
space:discipline ?discipline.

?discipline rdfs:label ?label.

FILTER (?launched > "2005-01-01"^^xsd:date)
}

The query finds all spacecraft launched since 2005, extracting the name, agency, mass, etc. The labels of the disciplines (subject categories) and the launch dates, which are originally associated with separate resources in the underlying graph, are re-presented here as properties of the spacecraft itself. Not ideal from a modelling or data interchange perspective, but a reasonable trade-off for shaping data for presentation purposes.
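
The reshaping that the CONSTRUCT template performs can be pictured with a small Python sketch: values held on linked resources are copied onto the spacecraft record, and the links themselves are dropped. The identifiers and values below are invented for illustration and do not come from the space flight dataset.

```python
def flatten(spacecraft, launches, disciplines):
    """Copy values from linked resources onto each spacecraft record."""
    flat = []
    for craft in spacecraft:
        # Keep the simple properties, drop the links to other resources.
        record = {k: v for k, v in craft.items()
                  if k not in ("launch", "discipline")}
        # Pull the linked values up so they become direct properties.
        record["launched"] = launches[craft["launch"]]["launched"]
        record["discipline"] = disciplines[craft["discipline"]]["label"]
        flat.append(record)
    return flat

# Invented sample data for illustration only.
launches = {"launch/1": {"launched": "2006-01-19"}}
disciplines = {"discipline/astronomy": {"label": "Astronomy"}}
spacecraft = [{"id": "spacecraft/1", "name": "Example Probe",
               "launch": "launch/1", "discipline": "discipline/astronomy"}]

print(flatten(spacecraft, launches, disciplines))
```

The flattened records have lost the identity of the launch and discipline resources, which is the modelling trade-off described above.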

So far so good. The Talis Platform supports a range of output options from CONSTRUCT queries including both RDF/XML and RDF/JSON. Unfortunately Freemix doesn't support RDF/JSON as an input option although this would make a nice addition to the range of import options. In order to convert from the RDF/XML to the Exhibit/JSON format for Freemix I used the Talis Morph service. Morph is a simple service that provides a number of options for converting between semantic web formats. RDF/XML to Exhibit/JSON is one of those options, so all I needed to do was pipe the original SPARQL query URL through the morph service to get my final import target for Freemix.

You can view the imported data on my Freemix homepage. And here's a presentation of that same data. As you can see the presentation provides a list and table views, piecharts that break down launches by agency and discipline, and also a timeline view of the launches. This was incredibly quick to put together.

I tried the same approach with some BBC data. So here's a simple Dr Who episode guide as a Freemix. The presentation options are a little more limited here, partly because there aren't as many natural facets to the BBC data, but also because Freemix doesn't (yet?) offer the ability to, e.g. create a coverflow presentation of images, or a tag cloud over blocks of text. The ability to mark fields as numbers and sort tables by multiple fields would also be useful. Having said that, try searching for "Rose" in the search box to see which episode descriptions mention her; note that the series facet on the left also automatically updates.

As you can see from the SPARQL query, some massaging of the graph structure was required to include series titles against each episode.


PREFIX foaf:
PREFIX rdfs:
PREFIX po:
PREFIX dc:
PREFIX freemix: 

CONSTRUCT {

?episode a po:Episode;
foaf:depiction ?depiction;
freemix:seriesTitle ?seriestitle;
dc:title ?title;

po:position ?position;
po:short_synopsis ?syn.
}
WHERE
{
po:series ?series.

?series dc:title ?seriestitle;
po:episode ?episode.

?episode a po:Episode;
foaf:depiction ?depiction;
dc:title ?title;
po:position ?position;
po:short_synopsis ?syn.

}

My only other issue with Freemix is the live-ness of the data. Ideally instead of having to import data directly into the system, it should instead be fetched from source either on demand or on a regular basis. I suspect this is the kind of feature that will end up in a commercial version of the product.

Overall though I was quite pleased with how easy it was to create these kinds of presentations. I'm convinced that for Linked Data to truly hit the mainstream we need simple tools like Freemix that let all of us easily compile and create custom presentations of data. Obviously, we also need to be able to easily select the data that we want to display, and very few people will want to bother with SPARQL queries. So I think there is some interesting work to be done to create SPARQL query builders that tie into browsers, e.g. so I can select the data facets I'm interested in as I browse and then choose to represent those facets in different ways.

Understanding the Big BBC Graph

In the lead up to the announcement of the BBC SPARQL endpoint trials I’ve spent quite a bit of time working with and exploring the BBC /programmes and /music datasets. I thought it would be useful to write up some of this to help out those of you looking to explore the data using the Talis Platform SPARQL endpoint. (Tip: use the newer SPARQL form for a better user experience when exploring the data.)

What’s in the Store?

Currently the Platform store includes metadata for over 360,000 Radio and TV programme Episodes along with information on which Versions of those programmes have been broadcast, including the time and channel on which they were shown. Information is also available for 6,500 Series and 5,500 Brands, and for their relationships; for more on that, see below.

For the music data, the endpoint includes all of the artist and albums metadata currently available from the BBC Music website, which comprises over 23,000 solo artists, 11,000 groups, and 25,000 albums. There are also nearly 4,500 album reviews.

This core dataset is approximately 20 million triples, and this is obviously growing as new episodes and broadcasts are made, and as we crawl that additional data. But that’s not all…

The artist metadata refers to dbpedia entries via owl:sameAs links, and this immediate context has also been included, providing a single location to query and find all the additional metadata about a recording artist. As the metadata on the BBC programmes website gets updated to include dbpedia links, then this will also get included. We’re working with the BBC to get some of these links in place as soon as possible.

The /programmes team recently updated the website to begin exporting “segment” data. This describes what artist was being played in a specific segment of a broadcast (currently limited to Radio 2 & 6), providing links between the programmes and music datasets. Increasingly it really is just one large graph that the BBC are producing.

What Ontologies are Used?

The core of the dataset is modelled using the Programmes and Music ontologies. There is also the usual sprinkling of Dublin Core and FOAF terms to capture titles, describe people, provide images for episodes, etc. The RDF Review vocabulary has been used to model the album reviews.

The programmes website includes some content categories for genres and formats. These are modelled in the dataset as SKOS concepts. There seems to be some nascent support in the data for capturing metadata about people and places appearing in programmes. At the moment these are also modelled using SKOS.

That comprises the core data. Beyond that there are a number of different terms used in the dbpedia portions of the dataset; check the dbpedia documentation for more information.

Understanding Brands, Series, Episodes

To get the most from the BBC programmes data you’ll need some understanding of some of the variations in the graph to ensure that you don’t accidentally exclude data in your queries. And if you’re a modelling geek like me it’s interesting in its own right! Any mistakes in the following are all my own, apologies to the BBC folk.

A Brand is a top-level concept that defines a collection of works. It’s the resource that ties together Series and Episodes. Dr Who is a brand, as is the BBC News, and The Catherine Tate Show. A Series, as you’d expect, is a run of Episodes, e.g. “Series 1 of The Wire”. And an Episode is similarly intuitively named.

We’re all already familiar with the basic relationships between these concepts. A Brand (“Red Dwarf”) may be related to a number of Series (“Red Dwarf Series 1”) and a Series is composed of Episodes (“Red Dwarf, Series 1, Episode 1”). But there are a few wrinkles that are worth pointing out, as they can impact the way you write your SPARQL queries. Thanks to Michael Smethurst for giving me a run-down of some of these!

Firstly a Brand may not be broken down into Series at all. The BBC News, for example, is simply a continuous stream of Episodes. Radio shows are similar.

Similarly a Series of Episodes may not necessarily be associated with a Brand. It may be a one-off run of Episodes, e.g. a short documentary series like Incredible Animal Journeys.

Some Episodes are not associated with either a Series or a Brand: films like Lady In the Water, for example.

And there’s also the more interesting relationship that sees two Series being associated with one another. For example “Waking the Dead” is divided up into Series (e.g. Series 5), which themselves contain other Series (covering a specific story line, e.g. Towers of Silence) and then individual Episodes (Part 1).

(As an aside, this is the kind of flexibility that makes RDF such a great tool for modelling real-world data. I’ve used similar approaches in the past to model bibliographic metadata, throwing out hierarchies and simply connecting together chunks of content in whatever structure is most suitable.)

An Episode may also have more than one Version. It is at the Version level that information such as the sound format or duration of the show is captured; after all, there may be many different manifestations of the same episode. Versions are also associated with Broadcasts, which capture the date, time and channel (“masterbrand” in the Programmes ontology) on which the programme is aired. A Version of an Episode may be broadcast several times.

Finally at the most fine-grained level, there are Timelines that describe the start and end time of a specific broadcast.
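To make the point about not accidentally excluding data a little more concrete, here is a minimal Python sketch that builds a query in which the Series and Brand are optional. The `po:` and `dc:` namespaces and the terms `po:Episode`, `po:Series`, `po:Brand` and `po:episode` are my assumptions about how the data is modelled, and `episodes_query` is a hypothetical helper, so check them against the data before relying on this.

```python
# Sketch only: the namespaces and po:/dc: terms below are assumptions,
# not something confirmed by the text above.
PO = "http://purl.org/ontology/po/"
DC = "http://purl.org/dc/elements/1.1/"


def episodes_query(limit=10):
    """Build a SPARQL query listing Episodes with optional Series/Brand."""
    return f"""PREFIX po: <{PO}>
PREFIX dc: <{DC}>
SELECT ?episode ?title ?series ?brand
WHERE {{
  ?episode a po:Episode ;
           dc:title ?title .
  # An Episode may belong to a Series, a Brand, both, or neither,
  # so neither parent is required for a match.
  OPTIONAL {{ ?series a po:Series ; po:episode ?episode . }}
  OPTIONAL {{ ?brand a po:Brand ; po:episode ?episode . }}
}}
LIMIT {int(limit)}"""


query = episodes_query(5)
print(query)
```

Had the Series and Brand patterns been required rather than optional, one-off films and Brands without Series would silently drop out of the results. Note that this sketch only looks for a Brand linked directly to the Episode; reaching a Brand through its Series would need a further pattern.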

Application Ideas

During my expeditions through the Big BBC Graph (“you’re in a maze of twisty little predicates, all alike…”) I’ve come up with a few application ideas that it would be interesting to put together. I thought I’d throw these out and see if anyone wants to pick them up.

Programme Reviews. It’d be easy to build a mashup of the BBC programmes data and something like Revyu (which also has a SPARQL endpoint) to allow someone to review a programme that they watched last night. Note that, as our crawling will be lagging behind the live site until we’ve implemented real-time updates, there will be a lead time between something being aired and being available in the Platform for reviewing.

PVR Integration. There are a number of open source PVR solutions out there. Could some of these be updated to automatically pull in additional data from the endpoint to improve electronic programme guides?

Geographic Overlays. The interconnections between radio programmes, artists and their locations offer an opportunity to build some mapping mashups, using either Google Maps or Earth. For example it ought to be possible to lay out the geographic spread of artists played by different BBC radio programmes and stations. Interested in music from a particular country or region? (Maybe you’re planning a trip there and want to pick up on the local vibe.) Then use a map to home in on radio programmes that are most likely to play those artists.

Fan Widgets. The ability to extract data from the endpoint using SPARQL and JSON means that it’s really easy to create little widgets to include programme data on external web pages. Could something like the Doctor Who Tardis Index File be enriched by widgets that came straight from the BBC database? Throw in additional annotations from the community and you could make some really interesting embeddable gadgets. Of course there’s also the other direction: if fan communities start using BBC identifiers then the BBC may be able to feed this crowd-sourced data back into their site, just as they’re doing with Wikipedia (via dbpedia).

Under the Talis Connected Commons scheme anyone can have free hosting on the Platform for public domain data, so if a fan community wanted to organize itself around creating additional annotations for BBC programmes (how about character lists? mood assessment? scene breakdowns?) then these can be stored in the Platform for free, and then mashed up with the BBC data on the server-side using features like the Augmentation service, or on the client-side using SPARQL and JSON. Lots of potential there.

Summary

Hopefully that provides a good overview of the BBC linked data graph that we’re now hosting in the Talis Platform. There should be sufficient pointers here, and in some of the example queries and demos we’ve put together, to get you started. If not, then feel free to ask questions on the BBC Backstage mailing list, the n2-dev mailing list, or on IRC in #talis on irc.freenode.net.

SPARQL AJAX Client Library and Example

Over the past few years I’ve tinkered with a number of different implementations of an AJAX client library for SPARQL. Before a standard format for SPARQL JSON results was created, this involved having to jump through the extra hoops of parsing the XML format. But things are much easier now, especially when the JSON support is extended to include the results of CONSTRUCT and DESCRIBE queries.

My personal favourite SPARQL client library though is the one produced by Lee Feigenbaum, Elias Torres, and Wing Yung as part of their work on the SPARQL Calendar Demo.

While the sparql.js library only supports JSON it does have a few convenience features which I like, including global PREFIX bindings and some functions for automatically processing the JSON results to produce some simpler javascript objects (e.g. arrays and hashes) that simplify some scripting tasks and make code more readable.
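To illustrate the kind of simplification such convenience functions perform, here is a small Python sketch that flattens the standard SPARQL JSON results layout into plain dictionaries. This is not the sparql.js API: `rows` is a hypothetical helper of my own, and the sample data is made up.

```python
import json

# A hand-written sample in the standard SPARQL JSON results layout;
# the variable names and values are made up for illustration.
SAMPLE = """
{
  "head": {"vars": ["name", "launch"]},
  "results": {"bindings": [
    {"name": {"type": "literal", "value": "Example Craft 1"},
     "launch": {"type": "literal", "value": "1969-01-01"}},
    {"name": {"type": "literal", "value": "Example Craft 2"}}
  ]}
}
"""


def rows(results):
    """Flatten bindings into plain dicts; unbound variables become None."""
    variables = results["head"]["vars"]
    return [
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in results["results"]["bindings"]
    ]


simple = rows(json.loads(SAMPLE))
for row in simple:
    print(row["name"], row["launch"])
```

Working with a list of flat dictionaries rather than the nested bindings structure is what makes the scripting code shorter and easier to read. The trade-off is that the datatype and language information attached to each value is thrown away.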

Using this on the Platform is quite straightforward, as you can upload this library, and any other related Javascript files, directly into the Contentbox of your store. This not only avoids any cross-domain issues, but also means that you can deploy simple AJAX applications directly from a store.

I’ve put together a super simple demo that uses the NASA spaceflight data. The source code is here, and I’ve uploaded the two files into the n2-examples store contentbox, so you can play with the running application.

The demo simply fetches the name, homepage, description and launch date for every spacecraft launched in a particular year, also retrieving a link to a photo if there’s one available. The results are dropped into an HTML table for viewing.

The code is well commented so rather than repeat that here, you can look through the Javascript file that does the actual interaction. I’ve used JQuery to help with the DOM manipulation, etc. This is delivered through the Google JQuery CDN rather than the Platform. But the rest of the application is served directly from the Platform.

A rather easy and trivial example, but sometimes it’s useful to reiterate the basics. And if you want to incorporate the NASA spaceflight data in your own mashups, then you can do so easily by simply using the version of sparql.js in the space data store.

In my view, SPARQL + JSON + scripting languages like JS and Ruby hit a nice sweet spot for working with RDF, especially with the ability to bring together data from multiple sources using a single standard API.

Note: Keith Alexander has written up some of his own experiments with playing with JQuery against the platform here and here. His JQuery plugin provides some additional Platform specific functionality.

Using Twinkle to SPARQL the Platform

A few years ago I wrote Twinkle, a simple GUI for working with SPARQL. While it’s not the most polished of user interfaces and it’s in sore need of an update, it’s still serviceable and has been successfully used as a development tool by teams of engineers I’ve worked with in the past.

I gave a short talk on Twinkle at an Oxford SWIG meeting, so you can flick through the slides if you want a quick overview of the functionality. I also moved the code to a Google Code project to start the process of updating it.

Twinkle has the capability to work with a range of different data sources and includes a full SPARQL client, so you can use it to work with any SPARQL endpoint that is accessible from your desktop. Out of the box Twinkle is already configured to work with the Govtrack and DbPedia endpoints, but you can easily add more by changing the configuration.

If you download and unzip the distribution into a directory you should end up with an etc/config.n3 file. This file contains all of the configuration that drives the user interface, including a section that configures remote SPARQL endpoints, e.g:


<http://dbpedia.org/sparql> a sources:Endpoint
    ; sources:defaultGraph "http://dbpedia.org"
    ; rdfs:label "DBpedia.org".

<http://www.rdfabout.com/sparql> a sources:Endpoint
    ; rdfs:label "GovTrack.us".

The above snippet configures two remote endpoints, and applies labels to them so that they appear in the Twinkle UI, under the “Remote Services” section on the left-hand menu. Because some endpoints, such as DbPedia, require you to specify a default graph in the SPARQL protocol request, you can also specify that in the configuration if necessary.
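For anyone wondering what specifying a default graph looks like on the wire, here is a rough Python sketch of a SPARQL protocol GET request URL. The `query` and `default-graph-uri` parameter names come from the SPARQL protocol; `sparql_get_url` is a hypothetical helper, and I’m not claiming this is exactly how Twinkle builds its requests.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint, query, default_graph=None):
    """Build a SPARQL protocol GET URL, optionally naming a default graph."""
    params = [("query", query)]
    if default_graph is not None:
        # Only sent when the endpoint configuration supplies a default graph.
        params.append(("default-graph-uri", default_graph))
    return endpoint + "?" + urlencode(params)


url = sparql_get_url(
    "http://dbpedia.org/sparql",
    "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
    default_graph="http://dbpedia.org",
)
print(url)
```

An endpoint configured without a default graph, like the GovTrack one above, would simply get a URL carrying the `query` parameter alone.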

If you have a Platform Store, or just want to access some data held in the Platform, then you can use Twinkle to perform your SPARQL queries. For example I have a store containing NASA space flight data. The SPARQL endpoint for this store is at:

http://api.talis.com/stores/space/services/sparql

So to register this in Twinkle, I can edit the configuration file and include the following snippet:


<http://api.talis.com/stores/space/services/sparql> a sources:Endpoint
    ; rdfs:label "NASA Space Data".

Once you’ve restarted the UI you should now be able to click on the Remote “NASA Space Data” service and open up a window into which you can start executing SPARQL queries.

If you’re new to SPARQL, or are interested in playing with the above space data, then you can look over the following slides from a recent SPARQL training session that I ran:


The slides contain a number of sample queries that should help get you started. Unfortunately some of the diagrams don’t look great in slideshare, but you should be able to download them for a closer look.

Authoring RDF data with SPARQL

Yesterday Yves Raimond and I presented a tutorial at WOD-PD where we created some Turtle data and used my online semantic converter tool to convert the data to RDF/XML and POST it to the platform store we set up for the tutorial (wod-pd-sandbox).

In fact though, every SPARQL endpoint that supports CONSTRUCT is already a Turtle -> RDF/XML converter. You can write Turtle with no variables in the CONSTRUCT graph, leave the WHERE graph pattern empty, and you will get back RDF/XML.

e.g.:

PREFIX ex: <http://example.org/>
CONSTRUCT {
  ex:Jimmy ex:eat ex:World .
}
 WHERE {}

returns

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/" >
  <rdf:Description rdf:about="http://example.org/Jimmy">
    <ex:eat rdf:resource="http://example.org/World"/>
  </rdf:Description>
</rdf:RDF>
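If you wanted to automate the trick, the wrapping step could be sketched in Python along these lines. `construct_query` is a hypothetical helper, and it assumes the triples you pass in contain no variables and no `@prefix` lines (those have to be supplied as SPARQL PREFIX declarations instead).

```python
def construct_query(prefixes, triples):
    """Wrap variable-free triples in a CONSTRUCT with an empty WHERE."""
    # Turtle's "@prefix" syntax is not valid in SPARQL, so prefixes are
    # passed separately and written out as PREFIX declarations.
    header = "\n".join(f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items())
    body = "\n".join("  " + line.strip() for line in triples.strip().splitlines())
    return f"{header}\nCONSTRUCT {{\n{body}\n}}\nWHERE {{}}"


query = construct_query(
    {"ex": "http://example.org/"},
    "ex:Jimmy ex:eat ex:World .",
)
print(query)
```

The resulting string is the same query as the example above, ready to send to any endpoint that supports CONSTRUCT.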

You can also use CONSTRUCT to create new data inferred from existing data. For instance, I wanted to add some triples about the conference, and I knew that everyone in the store with a URI in the store’s own namespace had been following the tutorial, and so was also attending the conference. So I made this query, and then POSTed the results into the store:

PREFIX schema: <http://api.talis.com/stores/wod-pd-sandbox/items/Schema/>
PREFIX sandbox: <http://api.talis.com/stores/wod-pd-sandbox/items/Things/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

CONSTRUCT {

  schema:Conference a rdfs:Class ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "Conference" .

  schema:startDate a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "start date" .

  schema:endDate a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "end date" .

  schema:attendee a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "attendee" ;
    owl:inverseOf schema:attended .

  schema:attended a rdf:Property ;
    rdfs:isDefinedBy schema: ;
    rdfs:label "attended" ;
    owl:inverseOf schema:attendee .

  sandbox:WOD-PD a schema:Conference ;
    rdfs:label "Web of Data" ;
    schema:startDate "2008-10-22" ;
    schema:endDate "2008-10-23" ;
    schema:attendee ?person .

  ?person schema:attended sandbox:WOD-PD .
}
WHERE {
  ?person a <http://xmlns.com/foaf/0.1/Person> .

  FILTER(REGEX(STR(?person), "sandbox/items/People/"))
}

I used PREFIX to declare a prefix for a couple of namespaces with the store’s contentbox URIs – this meant that these URIs would dereference and work as Linked Data – 303ing to their RDF descriptions. This is a really nice feature of the platform, and makes it easy to mint new URIs that will play nice on the semantic web.

You might also have noticed that there are some new properties and classes defined there in the CONSTRUCT. This isn’t absolutely ideal – there is no documentation, and the terms are unlikely to be used again – but on the other hand, the descriptions are dereferenceable according to the principles of linked data, and just as persistent as the data they describe. Moreover, as Richard Cyganiak said today – if you worry about doing RDF ‘right’ to the extent that it stops you doing RDF, you’re not doing it right.
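As a footnote on the FILTER in the query above: it is what restricts the attendees to people whose URIs live in the store’s own namespace. The selection it performs can be mimicked in a few lines of Python. The candidate URIs here are made up for illustration, and this is only a sketch of the matching behaviour, not of how the store evaluates the query.

```python
import re

# Hypothetical URIs for illustration; only the first is in the store's
# People namespace matched by the FILTER pattern.
candidates = [
    "http://api.talis.com/stores/wod-pd-sandbox/items/People/alice",
    "http://example.org/people/bob",
]

pattern = re.compile("sandbox/items/People/")

# REGEX without anchors matches anywhere in the string, like re.search.
attendees = [uri for uri in candidates if pattern.search(uri)]
print(attendees)
```

Because the pattern is unanchored, any URI containing that path fragment would match, which was fine for a sandbox store but is worth tightening if the data is less controlled.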