Archive for the 'Tutorials' Category

SPARQLing data.gov.uk: Edubase Data

Last week the Cabinet Office issued a call for Open Data Developers to sign up for a preview of the forthcoming UK Government public data website. The site includes a directory of existing datasets plus a growing number of datasets that have been converted to RDF and which will shortly be available as Linked Data. This data is being stored in the Talis Platform, providing developers with access to SPARQL endpoints as a means to query the data; we’ll also be including search and other access mechanisms at a later date.

In this series of postings I wanted to show some example SPARQL queries that can be used to access the data. If you’re new to SPARQL then you might want to look at Lee Feigenbaum’s SPARQL by Example tutorial, or my own short slide deck that covers all the basic syntax.

The first dataset I wanted to highlight is an extract of the Edubase dataset available from the Department of Children, Schools and Families. The conversion was carried out by the team at HP Labs and has been loaded into a Talis Platform store. The public facing SPARQL endpoint is available from: http://services.data.gov.uk/education/sparql.
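For anyone who would rather script the queries than paste them into a form, here is a minimal Python sketch of how a request URL for the endpoint above can be built. It assumes the endpoint accepts the standard SPARQL protocol `query` parameter over HTTP GET; the function name `sparql_request_url` is just for illustration.

```python
from urllib.parse import urlencode

# Endpoint quoted in the post; it may no longer be live.
ENDPOINT = "http://services.data.gov.uk/education/sparql"


def sparql_request_url(query, endpoint=ENDPOINT):
    """Return a SPARQL protocol GET URL that carries the query text."""
    # The protocol passes the query in a URL-encoded "query" parameter.
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(sparql_request_url("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"))
```

Fetching the resulting URL (for example with `urllib.request.urlopen`) should return the results in whatever format the endpoint offers by default.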

Here are some sample SPARQL queries you can use against the data:


#1. Select the names of schools in the Administrative District of the City of London
# Ordering results by name of the school
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?name WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00AA> .
}
ORDER BY ?name

Results


#2. Which schools in the BANES area have a nursery?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
SELECT ?name WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00HA> ;
     sch-ont:nurseryProvision "true"^^xsd:boolean
}
ORDER BY ?name

Results


#3. Select the names and addresses of schools in the Administrative District of the City of London
# Ordering results by name of the school
# Note: we use OPTIONAL here as not every school has an address listed in the data
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?name ?address1 ?address2 ?postcode ?town WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00AA> .

  OPTIONAL {
    ?school sch-ont:address ?address .
    ?address sch-ont:address1 ?address1 ;
       sch-ont:address2 ?address2 ;
       sch-ont:postcode ?postcode ;
       sch-ont:town ?town .
  }
}
ORDER BY ?name

Results


#4. Select the name, lowest and highest age ranges, capacity and pupil:teacher ratio
# for all schools in the Bath & North East Somerset district
# Again we use OPTIONAL to allow for missing data items.
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?name ?lowage ?highage ?capacity ?ratio WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:districtAdministrative
        <http://statistics.data.gov.uk/id/local-authority-district/00HA> .
  OPTIONAL {
    ?school sch-ont:statutoryLowAge ?lowage .
  }

  OPTIONAL {
    ?school sch-ont:statutoryHighAge ?highage .
  }

  OPTIONAL {
    ?school sch-ont:schoolCapacity ?capacity .
  }

  OPTIONAL {
    ?school sch-ont:pupilTeacherRatio ?ratio .
  }
}
ORDER BY ?name

Results


#5. What is the uri, name, and opening date of the oldest school in the UK?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?school ?name ?date WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:openDate ?date.
}
ORDER BY ASC(?date)
LIMIT 1

Results


#6. Select the name, easting and northing for the 100 newest schools in the UK.
# Can be used to plot them on a map
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
SELECT ?school ?name ?date ?easting ?northing WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:openDate ?date ;
     sch-ont:easting ?easting ;
     sch-ont:northing ?northing .
}
ORDER BY DESC(?date)
LIMIT 100

Results


#7. Select the uri, name, easting and northing for all schools opened in 2008
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
SELECT ?school ?name ?date ?easting ?northing WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name;
     sch-ont:openDate ?date ;
     sch-ont:easting ?easting ;
     sch-ont:northing ?northing .
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}

Results


#8. Select the uri, name, and the reason for closing for all schools that are currently
# scheduled for closure. The reason is a URI from a controlled vocabulary in the ontology.
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
SELECT ?school ?name ?reason WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Open_but_proposed_to_close ;
     sch-ont:reasonEstablishmentClosed ?reason .
}

Results


#9. In which parliamentary constituencies did schools close in 2008?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?cons ?label WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ;
     sch-ont:closeDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
ORDER BY ?cons

Results


#10. In which parliamentary constituencies did schools open in 2008?
prefix sch-ont:  <http://education.data.gov.uk/def/school/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?cons ?label WHERE {
  ?school a sch-ont:School;
     sch-ont:establishmentName ?name ;
     sch-ont:openDate ?date ;
     sch-ont:parliamentaryConstituency ?cons .
  ?cons rdfs:label ?label.
  FILTER (?date >= "2008-01-01"^^xsd:date && ?date < "2009-01-01"^^xsd:date)
}
ORDER BY ?cons

Results

Hopefully that’s enough to get you started. If you want a bit more background on the modelling and a look at the ontology, then read this posting to the uk-government-data mailing list by Stuart Williams.

Note: updated 16 Nov 2009 to reflect changes to the EduBase data. The first version of this dataset was created before the proposed guidelines for public sector URIs were published. The school ontology used in that first dataset had a URI of http://education.data.gov.uk/ontology/school# which has now been replaced with http://education.data.gov.uk/def/school/. Also, the URIs for administrative districts were temporary placeholders containing the phrase “placeholder-id” in their path. These have now been updated to URIs based on the Office for National Statistics district codes, for example http://statistics.data.gov.uk/id/local-authority-district/00AA

Notes on Cross-Domain Ajax

Background

I asked for a little project I could get my teeth into, and Leigh suggested something very tasty: an analytics app, along the lines of Google Analytics or the (very impressive) open source Piwik. Basically it tracks things like page visits, referrers, outbound clicks and so on. The difference from the existing apps is that it takes advantage of semweb goodness, specifically a Talis Platform store as a backend.

What this required was something that would run in the browser when someone visited a given Web page and pass on relevant data to a server, which would push that data into the store. A script discreetly embedded on the page of interest picks up the activity and posts it to the server-side logging system. There wasn’t really a sensible choice other than to use Javascript client-side, and to keep things reasonably portable server-side I opted for PHP. The server-side processing is relatively straightforward (although I’m not actually capturing much yet), but the browser-server comms part turned out to be a real doozy.

It’s not difficult to call an HTTP server from inside Javascript wrapped in HTML loaded in a browser. The snag is that the security model common to popular browsers blocks access to server domains other than the one that originated the page containing the Javascript. I got some code running from http://hyperdata.org that nicely delivered some basic logging of visits to pages on http://hyperdata.org (including the Wiki I have there – though it took a while to find the right template…). Problems started when I tried the same script in pages hosted under http://danny.ayers.name. Browser no likey: wrapping the server call in a try...catch block and throwing up an alert(error) always revealed Exception… “Access to restricted URI denied” code: “1012” – this is the same origin policy. What follows are the workarounds for this. Googling the titles here will provide a variety of sample code that implements the solutions. I’ve opted for Hidden Form, it being straightforward for my purposes and standards-friendly.

Cross-domain proxy

Conceptually the easiest, this approach uses a server-side pipeline that lives on the same domain as the delivered pages containing the Javascript. It essentially relays calls from the delivering server to the remote server that does the work. This didn’t seem a good choice for the analytics app as every end-user would require such a proxy on their own server.

  • Pros: straightforward; independent of browser vagaries; spec friendly
  • Cons: needed for every host delivering pages with embedded scripts (if all the servers involved are yours, this is probably a good choice)

Tag Overload Hacks

When a typical browser hits HTML tags <script> and <img> (any others?) it will quite happily do an HTTP GET on them, irrespective of domain. There’s been a fair bit of finesse applied around the use of <script> – notably the elegant but brain-boiling JSONP (JSON with Padding), in which the server wraps its JSON in a call to a callback function named in the request, so the data gets handed to your code when the browser runs the script. I won’t comment further on this, except to say I understood it for about 5 minutes then lost it again when I went to make a coffee. I’m told jQuery will do this automagically if you choose dataType: "jsonp".
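As JSONP is commonly described, the "padding" is a function call wrapped around the JSON by the server, so that the response is a script the browser can execute. A minimal sketch of the server side of that idea, in Python, assuming the callback name arrives as a request parameter (the function `jsonp_body` and the name check are illustrative, not taken from any particular library):

```python
import json
import re

# Accept only plain (optionally dotted) JavaScript identifiers as callbacks,
# so a request cannot inject arbitrary script through the callback name.
_CALLBACK = re.compile(r"[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*", re.ASCII)


def jsonp_body(callback, data):
    """Wrap JSON-serialisable data in a call to the named callback."""
    if not _CALLBACK.fullmatch(callback):
        raise ValueError("unsafe callback name: %r" % callback)
    return "%s(%s);" % (callback, json.dumps(data))
```

The page defines the callback function, adds a <script> element pointing at the remote URL, and receives the data when that script runs.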

The <img> approach has been around seemingly forever – it’s also known as a Web Bug. Usually you have a 1×1 pixel image in the page of interest (probably inserted dynamically through DOM calls), and every time the page is loaded the browser does a GET on that image’s URI. The trick for tracking is to append a bunch of query parameters to the image URI and have your server intercept the GET call. Apparently this is how Google Analytics does its stuff.

  • Pros: good library support
  • Cons: limited to GETs; rather an ugly hack
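The <img> trick described above boils down to encoding the tracking data as query parameters and decoding them again when the GET arrives. A rough Python sketch of both halves, assuming made-up parameter names `page` and `ref` and a hypothetical pixel location:

```python
from urllib.parse import urlencode, urlparse, parse_qs


def pixel_url(base, page, referrer):
    """Build the image URI that the page would request (names are assumptions)."""
    return base + "?" + urlencode({"page": page, "ref": referrer})


def read_hit(request_uri):
    """Recover the tracking parameters from the requested image URI."""
    params = parse_qs(urlparse(request_uri).query)
    # parse_qs returns a list per name; keep the first value of each.
    return {name: values[0] for name, values in params.items()}
```

The server would answer the request with a tiny image as usual and log whatever `read_hit` returns.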

Flash Proxy

Most people suggested this when I was asking around Twitter and the jQuery mailing list. Turns out there’s a really convenient library that does all the hard work (Google “flXHR”). But I’m afraid I prefer to give Flash a miss when there are open standards available, so I didn’t investigate.

  • Pros: easy (apparently) with library support
  • Cons: uses proprietary stuff

Hidden Form

When I first saw references to this I overlooked it – it seemed to demand an iFrame and ugly hackery. But then (largely thanks to this discussion of cross-domain Ajax) I realised it was almost certainly the best bet for the analytics app. Essentially you dynamically push a <form> into the HTML DOM with your data as input values, then call a form.submit(). Most references to this I found did involve an iFrame to receive the HTTP response – necessary if you’re doing a mashup or something, but not if you only need to POST data off to the server. In this latter circumstance you need to get the server to return a 204 No Content status code, but that’s trivial in PHP (header('HTTP/1.1 204 No Content');), otherwise the browser will try to load the target URI material.

  • Pros: supports and is very simple for POSTing to server; standards-friendly in this context
  • Cons: gets uglier if you want a response
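To illustrate the server half of the Hidden Form approach, here is a minimal sketch of an endpoint that accepts the form POST and answers 204 No Content so the browser stays on the current page. It is written in Python as a WSGI app rather than the PHP used in the real app; the `HITS` list is an in-memory stand-in for pushing data into a store.

```python
from urllib.parse import parse_qs

HITS = []  # stand-in for the real logging backend


def log_app(environ, start_response):
    """WSGI app: record a form POST and reply 204 No Content."""
    if environ.get("REQUEST_METHOD") != "POST":
        start_response("405 Method Not Allowed", [("Allow", "POST")])
        return [b""]
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    body = environ["wsgi.input"].read(length).decode("utf-8")
    HITS.append(parse_qs(body))  # form fields arrive URL-encoded
    # 204 tells the browser there is nothing to display.
    start_response("204 No Content", [])
    return [b""]
```

For a quick local try-out the app can be served with `wsgiref.simple_server.make_server("", 8000, log_app).serve_forever()`.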

I’ve not properly doc’d my app code yet (and the functionality is a very long way from complete, let alone tidied up), but you can find it all via my latest Wiki – there’s an example of the Javascript in test.html (just before the closing </body> tag). I’ve only tested it on Firefox so far, but I reckon there’s a good chance of the LazyWeb giving me solutions to any cross-browser issues.

Many thanks for all the helpful suggestions: from this thread on the jQuery mailing list and Twitterers @rjw @flensed @gridinoc @weblivz @JeniT @jQueryHowto.

I’d love to hear of any other solutions to cross-domain Ajax, please drop in comments, mail me or tweet me.

Presenting BBC and NASA data using Freemix and SPARQL

One of the most interesting applications I saw at the recent Semantic Technology conference was Freemix. The application, which is currently in limited beta, allows anyone to easily create customized views over data that they upload into the system. There’s also the usual networking features providing an additional social dimension to data sharing and publishing. As I understand it Zepheira have plans for expanding the range of features in all sorts of ways, including new visualisations, the ability to merge and remix data from several sources, and naturally enough a commercial version that can be deployed within the enterprise.

The core of Freemix is Simile Exhibit and a drag and drop interface for building up an Exhibit presentation over data that the user has uploaded. Data can be presented in several different ways, including simple tabular spreadsheets and in the Exhibit JSON format. Exhibit provides a number of different existing views suitable for presenting data, including lists, tables, maps, timelines, etc. As a web developer it’s straightforward to build up your own Exhibits; Freemix takes this to the next level, making it trivial to build a presentation in just a few clicks, without the need to learn any markup: you just have to understand your own data.

Naturally enough I was curious to know whether Freemix could be used to build presentations of Linked Data, and specifically whether I could feed it with data from the Talis Platform. I’ve been working with the BBC data quite extensively recently, and have been compiling a space flight dataset. So I thought I’d use those as my test cases. Both of these datasets are in Platform stores, so I explored the options for extracting data using a SPARQL query in order to build a presentation in Freemix. It turns out it’s really easy.

Freemix supports importing JSON data from a URL, so I knew that in theory I could write a SPARQL query against a Platform store and use the SPARQL protocol request URL as the import target. As I didn’t want to extract the whole dataset, just some interesting subset for my presentation, a SPARQL CONSTRUCT query seemed like the best option. Like Exhibit, Freemix requires a relatively flat data structure — i.e. resources with properties, rather than a true directed graph. This means that within the CONSTRUCT query I would need to simplify the graph structure, removing some of the richer modelling, to re-shape the data to fit Freemix’s expectations.

Here’s a query I came up with for my NASA data:


PREFIX rdfs:
PREFIX dc:
PREFIX space:
PREFIX xsd:
PREFIX foaf: 

CONSTRUCT {
  ?spacecraft foaf:name ?name;
    space:agency ?agency;
    space:mass ?mass;
    foaf:homepage ?homepage;
    space:launched ?launched;
    dc:description ?description;
    space:discipline ?label.
}
WHERE {
  ?launch space:launched ?launched.

  ?spacecraft foaf:name ?name;
    space:agency ?agency;
    space:mass ?mass;
    foaf:homepage ?homepage;
    space:launch ?launch;
    dc:description ?description;
    space:discipline ?discipline.

  ?discipline rdfs:label ?label.

  FILTER (?launched > "2005-01-01"^^xsd:date)
}

The query finds all spacecraft launched since 2005, extracting the name, agency, mass, etc. The labels of the disciplines (subject categories) and the launch dates, which are originally associated with separate resources in the underlying graph, are re-presented here as properties of the spacecraft itself. Not ideal from a modelling or data interchange perspective, but a reasonable trade-off for shaping data for presentation purposes.
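The re-shaping that the CONSTRUCT performs can be pictured without any RDF tooling. The Python sketch below treats a graph as a list of (subject, predicate, object) tuples and copies a property of a linked resource up onto the resource that points at it. The function name `flatten` and the identifiers in the usage note are made up for illustration.

```python
def flatten(triples, via, prop, as_prop):
    """Lift `prop` values of linked resources onto the linking subject.

    For every (s, via, x) and (x, prop, v) in `triples`, emit (s, as_prop, v).
    """
    linked = {}
    for s, p, o in triples:
        if p == prop:
            linked.setdefault(s, []).append(o)
    return [
        (s, as_prop, value)
        for s, p, o in triples
        if p == via
        for value in linked.get(o, [])
    ]
```

With a toy graph in which `ex:craft1` has a `space:launch` link to `ex:launch1`, and `ex:launch1` has a `space:launched` date, calling `flatten(graph, "space:launch", "space:launched", "space:launched")` yields that date as a direct property of `ex:craft1`, which is the flat shape Freemix expects.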

So far so good. The Talis Platform supports a range of output options from CONSTRUCT queries including both RDF/XML and RDF/JSON. Unfortunately Freemix doesn't support RDF/JSON as an input option although this would make a nice addition to the range of import options. In order to convert from the RDF/XML to the Exhibit/JSON format for Freemix I used the Talis Morph service. Morph is a simple service that provides a number of options for converting between semantic web formats. RDF/XML to Exhibit/JSON is one of those options, so all I needed to do was pipe the original SPARQL query URL through the morph service to get my final import target for Freemix.
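Piping one service URL through another mostly comes down to URL-encoding the inner request so that it survives as a single parameter of the outer one. A sketch of that nesting in Python follows; the converter address and its `data-uri` parameter are placeholders, since the post does not give the actual Morph request syntax.

```python
from urllib.parse import urlencode


def sparql_url(endpoint, query):
    """SPARQL protocol GET URL for a query (standard `query` parameter)."""
    return endpoint + "?" + urlencode({"query": query})


def pipe_through(converter, source_url, param="data-uri"):
    """Wrap source_url as one encoded parameter of the converter URL.

    `converter` and `param` are hypothetical; substitute the real service's
    address and parameter name.
    """
    return converter + "?" + urlencode({param: source_url})
```

The important detail is that the inner URL is encoded a second time, so its own `?`, `&` and `=` characters are not mistaken for parameters of the outer request.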

You can view the imported data on my Freemix homepage. And here's a presentation of that same data. As you can see the presentation provides list and table views, pie charts that break down launches by agency and discipline, and also a timeline view of the launches. This was incredibly quick to put together.

I tried the same approach with some BBC data. So here's a simple Dr Who episode guide as a Freemix. The presentation options are a little more limited here, partly because there aren't as many natural facets to the BBC data, but also because Freemix doesn't (yet?) offer the ability to, e.g., create a coverflow presentation of images, or a tag cloud over blocks of text. The ability to mark fields as numbers and sort tables by multiple fields would also be useful. Having said that, try searching for "Rose" in the search box to see which episode descriptions mention her; note that the series facet on the left also automatically updates.

As you can see from the SPARQL query, some massaging of the graph structure was required to include series titles against each episode.


PREFIX foaf:
PREFIX rdfs:
PREFIX po:
PREFIX dc:
PREFIX freemix: 

CONSTRUCT {
  ?episode a po:Episode;
    foaf:depiction ?depiction;
    freemix:seriesTitle ?seriestitle;
    dc:title ?title;
    po:position ?position;
    po:short_synopsis ?syn.
}
WHERE
{
po:series ?series.

  ?series dc:title ?seriestitle;
    po:episode ?episode.

  ?episode a po:Episode;
    foaf:depiction ?depiction;
    dc:title ?title;
    po:position ?position;
    po:short_synopsis ?syn.

}

My only other issue with Freemix is the live-ness of the data. Ideally instead of having to import data directly into the system, it should instead be fetched from source either on demand or on a regular basis. I suspect this is the kind of feature that will end up in a commercial version of the product.

Overall though I was quite pleased with how easy it was to create these kinds of presentations. I'm convinced that for Linked Data to truly hit the mainstream we need simple tools like Freemix that let all of us easily compile and create custom presentations of data. Obviously, we also need to be able to easily select the data that we want to display, and very few people will want to bother with SPARQL queries. So I think there is some interesting work to be done to create SPARQL query builders that tie into browsers, e.g. so I can select the data facets I'm interested in as I browse and then choose to represent those facets in different ways.

Data Migration using SPARQL and Changesets

A tagline sometimes used for RDF is “self-describing data”. Sometimes though, you make your data describe itself badly; perhaps you’ve used a vocabulary term that has since been deprecated, or perhaps you’ve found a term which is more widely supported, or more appropriate to the data; maybe there was a typo in the script you generated your triples with. At any rate, it’s pretty common to have to fix your data, and if you have a live application with fresh ‘bad’ triples being created all the time, and a lot of bad triples to fix anyway, this can get tricky.

We’re having to do this in a project at the moment, and this is the method Nad and I came up with. We separated adding good triples and removing bad triples into separate stages because our application will continue to function the same with both good and bad triples present; once we roll out the code that expects the good triples, we need the good triples to be there, whereas the bad triples can be removed at our leisure.

Adding New Good Triples

So, say for example, one of the things we want to fix is using dcterms:creator instead of dc:creator. We can get the good triples by querying for:

    CONSTRUCT {
    # good
                ?s <http://purl.org/dc/terms/creator> ?o
    } WHERE {
    # bad
                ?s <http://purl.org/dc/elements/1.1/creator> ?o
    }

And posting that back into the store. (First make sure that your application won’t do anything too weird if you add in these triples without changing any code).

If there are a lot of triples in the store, you may not be able to retrieve and post them all at once. To scale to large numbers of triples, just page through the results at, say, 1000 triples at a time by adding LIMIT 1000 OFFSET 0 and incrementing the OFFSET by the LIMIT (1000) until you don’t get triples back anymore. Strictly speaking, paging is only predictable when the query also has an ORDER BY, so add one if your store does not return results in a stable order.

Wrap this little procedure up in a script because you’ll need to run it again.
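A minimal sketch of such a script's paging loop, in Python rather than PHP. The `fetch` and `post` callables are placeholders for however you run a CONSTRUCT against your store and post the resulting triples back; nothing here is specific to the Platform API.

```python
def paged_queries(base_query, page_size=1000):
    """Yield the base query with successive LIMIT/OFFSET clauses appended."""
    offset = 0
    while True:
        yield "%s LIMIT %d OFFSET %d" % (base_query, page_size, offset)
        offset += page_size


def copy_in_pages(base_query, fetch, post, page_size=1000):
    """Fetch pages of triples and post each one until a page comes back empty.

    fetch(query) -> list of triples; post(triples) -> None (both hypothetical).
    Returns the number of triples posted.
    """
    total = 0
    for query in paged_queries(base_query, page_size):
        triples = fetch(query)
        if not triples:
            break
        post(triples)
        total += len(triples)
    return total
```

Because the WHERE clause only matches the bad triples, posting good triples between pages does not change what later pages return.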

Deploy Code

As soon as you have finished adding the new /good/ triples, deploy your new code that uses dcterms:creator instead of dc:creator.

Now run the add good triples script again. This is because, while you were deploying the code, users may have been plugging away at your app, happily creating more bad old triples. Running the script again will add good triples for any of these bad triples that have been created meantime. And because you’ve now deployed your code changes, the application won’t create any more bad triples.

All we have left to do is get rid of the bad old triples. With any luck (and a bit of foresight, and testing), your application will function perfectly well with both bad and good triples in the store, so we can take our time a bit getting rid of the bad triples.

Removing Bad Old Triples

We’ll write a SPARQL query to give us back the triples we want to remove, and then we’ll create a Changeset to remove them:

    CONSTRUCT {
    # bad
             ?s <http://purl.org/dc/elements/1.1/creator> ?o
    } WHERE {

    # bad
            ?s <http://purl.org/dc/elements/1.1/creator> ?o .
    # good
            ?s <http://purl.org/dc/terms/creator> ?o

    }

(you should apply a LIMIT, but, so long as you are waiting for each changeset batch to succeed before sending the next one, you don’t need to page – just get back the first 1000 until you don’t get anything back. It’s worth remembering that the number of triples in a changeset document will be about ten-fold the number of triples you are removing, so you may need to make the limit a bit smaller than before).
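The removal loop differs from the paging loop used for adding triples: each successful batch shrinks the result set, so you always ask for the first page again. A rough Python sketch, where `fetch_first` and `apply_removal` are hypothetical stand-ins for the limited CONSTRUCT and for submitting a changeset and confirming it succeeded:

```python
def remove_in_batches(fetch_first, apply_removal, batch_size=100, max_batches=10000):
    """Repeatedly remove the first `batch_size` bad triples until none remain.

    fetch_first(n) -> list of up to n triples still to remove.
    apply_removal(batch) -> True once the batch has definitely been removed.
    """
    removed = 0
    for _ in range(max_batches):
        batch = fetch_first(batch_size)
        if not batch:
            return removed
        # Wait for each batch to succeed before asking for the next one;
        # otherwise the same triples would be fetched again.
        if not apply_removal(batch):
            raise RuntimeError("changeset batch failed; stopping")
        removed += len(batch)
    raise RuntimeError("gave up after %d batches" % max_batches)
```

The `max_batches` guard is only there so a batch that silently fails to remove anything cannot loop forever.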

An important point on the platform’s Changesets API: if you send more than 14 changesets in a batch (i.e., in the same document), they will be performed asynchronously and you should get back an HTTP 202 Accepted status code. A potential problem is that, if you are trying to remove a statement that doesn’t exist, then all the changes in that batch will fail, but you will still get back a 202 Accepted (because the platform hasn’t tried processing them yet). You need a way of knowing whether the batch has failed or not. One way to do this is to include in your batch of changes the addition of a marker triple, which you can then poll the store for to see whether it exists.

If you’re using PHP, you can use Moriarty to create your changesets:

    
<?php
// Assumes Moriarty's Store and Changeset classes are already loaded, and that
// $rdfToRemove holds the triples returned by the CONSTRUCT described above.
define('STORE_URI', 'http://api.talis.com/stores/sandbox1');
$time = time();
$markerURI = STORE_URI.'/items/'.$time; // marker resource minted in the store's URI space

$rdfToAdd = " <{$markerURI}> <http://purl.org/dc/terms/created> \"{$time}\" . ";

$args = array(
    'before' => $rdfToRemove, // got this from the CONSTRUCT described above
    'after' => $rdfToAdd, // this is the marker triple
);
$cs = new Changeset($args);
$store = new Store(STORE_URI); // you will probably need to add your login credentials - see the moriarty docs
$response = $store->get_metabox()->apply_changeset($cs);

if (!$response->is_success()) {
    // log the error and stop
    error_log("Changeset failed: ".$response->status_code."\n ".$response->body." \n  Changeset: \n".$cs->to_rdfxml());
    exit(1);
}

    

At this point, you may also want to poll the store for the existence of your marker triple to see if the batch has been processed. Since we minted the marker URI in the store’s URI space, we can just try to dereference it; as soon as we get back a 200 or 303 response, we can move on, but if we still get back a 404 after say 10 seconds, the changeset has probably failed and we need to log that and investigate.
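The polling step can be sketched as a small loop with a timeout. This Python version takes the HTTP check as a callable (`fetch_status`, a placeholder for dereferencing the marker URI and returning the status code) so the timing logic stays independent of any HTTP client.

```python
import time


def wait_for_marker(fetch_status, timeout=10.0, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until the marker URI resolves or the timeout passes.

    fetch_status() -> HTTP status code for the marker URI (hypothetical).
    Returns True on 200 or 303, False if still unresolved after `timeout`.
    """
    deadline = clock() + timeout
    while True:
        if fetch_status() in (200, 303):
            return True
        if clock() >= deadline:
            return False  # probably failed: log it and investigate
        sleep(interval)
```

The `clock` and `sleep` parameters exist only to make the loop easy to test; in normal use the defaults are fine.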

If everything goes OK, you can then make double sure you’ve got rid of all the bad triples by running a quick ASK query against your store’s SPARQL service.

    
        ASK {
            # bad
            ?s <http://purl.org/dc/elements/1.1/creator> ?o
        }
    

And if you get back FALSE, you’re finished.
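If you want a script to make that final check, the answer has to be read out of the response. A minimal Python sketch, assuming the SPARQL service returns the ASK result in the standard SPARQL Query Results XML Format (the function name `ask_result` is illustrative):

```python
import xml.etree.ElementTree as ET

SPARQL_RESULTS_NS = "http://www.w3.org/2005/sparql-results#"


def ask_result(xml_text):
    """Return the boolean carried by a SPARQL XML results document."""
    root = ET.fromstring(xml_text)
    node = root.find("{%s}boolean" % SPARQL_RESULTS_NS)
    if node is None or node.text is None:
        raise ValueError("no <boolean> element in response")
    return node.text.strip().lower() == "true"
```

A result of `False` for the ASK above means no bad triples are left.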

Well done!

Searching the BBC Data in the Talis Platform

I’ve previously blogged about how easy it is to create a custom search index using the Platform. So obviously during the process of loading the BBC programmes and music data into the Platform we’ve used this feature to build a search engine across their data.

In this post I wanted to show a few example queries and then review how we’ve configured the search indexes so you can not only get the most from the feature, but also see how it can be used against real-world data.

Sample Queries

Here are some sample queries. The Platform is more of a search engine tool-kit than a search engine per se: the results aren’t a human-readable web page, they’re an RSS 1.0 document that contains enough structured metadata about each item in order to build a presentation of the results. And where additional metadata is required, this can be extracted using the describe service, additional searches, augmentation or a SPARQL query.

However, for the purposes of this article it’s enough to view the examples in your browser. Application developers will want to dig into the underlying markup to see what extra data is included.
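As a starting point for digging into that markup, here is a small Python sketch that pulls the URI, title and link of each item out of an RSS 1.0 document. It relies only on the standard RSS 1.0 and RDF namespaces; any extra metadata in the search results would need further lookups along the same lines.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}


def list_items(rss_text):
    """Return a dict of uri/title/link for each item in an RSS 1.0 feed."""
    root = ET.fromstring(rss_text)
    results = []
    # In RSS 1.0 the items are direct children of the rdf:RDF root.
    for item in root.findall("rss:item", NS):
        results.append({
            "uri": item.get("{%s}about" % NS["rdf"]),
            "title": item.findtext("rss:title", default="", namespaces=NS),
            "link": item.findtext("rss:link", default="", namespaces=NS),
        })
    return results
```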

  • A search for “Banksy”
  • A search for “The Prodigy” — returning the artist, the dbpedia entry, and episode titles and descriptions in which they are mentioned
  • A search for “Terry Pratchett” — again produces a mixture of different types.
  • A search for “Prodigy” limited to things that are of type “http://purl.org/stuff/rev#Review” (results).
  • A facetted search for “Prodigy” grouping the results based on their RDF type (results). This shows us that we have results not only in episodes but in a variety of other types too. We can drill down into these to form the following search:
  • A search for “Prodigy” limited to Music Segments (results).

If you want to try out your own queries, then use this simple form.

The Configuration

To show how we’ve configured the Field Predicate Map and Query Profile for the BBC Backstage store, I’ve uploaded them to our public SVN: fmap.rdf and queryprofile.rdf

Looking at the Field Predicate Map, you can see we’ve configured the Platform store to index the key predicates in the BBC data, including titles, labels, descriptions and synopses. You can use any of the named fields in the configuration to refine searches to specific predicates in the data, allowing construction of an “advanced search form”. E.g. we can search for name:"Stephen Fry" to find a person called Stephen Fry (results).

The RDF type property is also included in the Field Predicate Map to allow us to limit searches to particular types of resource. It also enables us to do facetted searches based on type, giving us an alternate view of the data. It’s easy to see how that functionality could be used to help build some useful additional options to restrict the search results presented in a user interface.

To configure the relevance ranking we’ve chosen to boost hits in “labels” (names, labels, titles) over “descriptions” (description, synopses, review). We could easily change the boosting to favour one or other type of predicate to further tweak the results. But this configuration provides a reasonable set of search results for the tests we’ve done. Let us know how you get on and whether you think any of this should be changed. We’re happy to alter the configuration to make sure that people can get the most from the BBC data.

Fishing for BBC Data using Augmentation

In some of my recent talks I’ve used the metaphor of streams, pools and reservoirs for describing the flow and collection of data across the web. I usually refer to some of the different forms of data extraction that we support on the Platform, which cover keyword searching as well as more structured queries.

Another form of data extraction is the Augmentation Service, which might be described as “fishing for data, using URIs as bait”. I thought I’d put together a little illustration that shows the potential for this kind of data extraction, as it’s both powerful and simple to use: so simple that you don’t need to write any queries at all.

Let’s look at a sample RSS 1.0 feed that contains a review of an episode of Dr Who. For brevity, I’ll only include the metadata for the single item in the feed:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
  <rev:text>A classic episode of Dr Who...</rev:text>
  <foaf:primaryTopic rdf:resource="http://www.bbc.co.uk/programmes/b0074gpl#programme"/>
</item>

The item has the standard RSS 1.0 elements for title and link, but as the item is also a review, it includes some additional metadata using the review vocabulary. The relationship between the review item and the Episode that is being reviewed is made using the foaf:primaryTopic property. The precise vocabularies don’t really matter; the important thing is that there is a reference to a BBC /programmes URI: this is our bait.

The Augmentation Service allows the URL of an RSS 1.0 feed to be passed in as a parameter. You can use the form provided from the augment service on the BBC Backstage store and paste in the URL of the sample RSS 1.0 feed, or click here to review the results. Within the browser you won’t see that a great deal has changed, although you should see that the results are themselves an RSS 1.0 feed. What the Augmentation Service does is process an RSS feed to augment the metadata in the feed items against data present in the Platform Store.

Here’s the same RSS item after it’s been augmented. The additional metadata is everything now nested inside the foaf:primaryTopic element:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <foaf:primaryTopic>
 <ns.0:Episode rdf:about="http://www.bbc.co.uk/programmes/b0074gpl#programme">
  <ns.0:medium_synopsis>In an old, abandoned house, the Weeping Angels wait.
  Only the Doctor can stop them, but he's lost in time.</ns.0:medium_synopsis>
  <rdf:type>
    <rdf:Description rdf:about="http://purl.org/ontology/po/Episode"/>
  </rdf:type>
  <ns.0:position>10</ns.0:position>
  <ns.0:short_synopsis>Only the Doctor can stop the Weeping Angels, but he's lost in time.</ns.0:short_synopsis>
  <ns.0:genre>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/genres/drama/scifiandfantasy#genre"/>
  </ns.0:genre>
  <ns.0:microsite>
    <rdf:Description rdf:about="http://www.bbc.co.uk/doctorwho/"/>
  </ns.0:microsite>
  <ns.0:version>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b0073km9#programme"/>
  </ns.0:version>
  <foaf:depiction>
    <rdf:Description rdf:about="http://www.bbc.co.uk/iplayer/images/episode/b0074gpl_512_288.jpg"/>
  </foaf:depiction>
  <ns.1:label>Blink</ns.1:label>
  <ns.0:masterbrand>
    <rdf:Description rdf:about="http://www.bbc.co.uk/bbcone#service"/>
  </ns.0:masterbrand>
  <dc:title>Blink</dc:title>
 </ns.0:Episode>
</foaf:primaryTopic>
<rev:text>A classic episode of Dr Who...</rev:text>
<rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
</item>

As you can see the feed now includes all of the key metadata about the episode, including its title, a synopsis, a link to a depiction of the episode, and to the Dr Who microsite on the BBC. All without writing any queries.

The trigger for the augmentation is simply the presence of a URI in the feed that is also present in the RDF in the Platform Store. If the URI is not found then it is ignored. But if the URI is present then a description of that resource is automatically added to the RSS feed. In formal RDF terms that description is the Concise Bounded Description of the resource. More simplistically it will be all simple literal properties associated with the resource (e.g. the title and the synopsis) plus links to any related resources (e.g. the microsite, the genre). The end result is a feed that has been either completely or partially enriched against the data.
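The matching rule described above can be sketched in a few lines of Ruby. This is only an illustration of the idea, not the Platform's implementation: the `store` hash is a toy stand-in for the RDF held in a Platform store, and the `augment` function name is hypothetical.

```ruby
# Toy stand-in for a Platform store: resource URI => description.
# The property names and values are copied from the augmented item above.
store = {
  "http://www.bbc.co.uk/programmes/b0074gpl#programme" => {
    "dc:title"      => "Blink",
    "ns.0:position" => "10"
  }
}

# Given the URIs mentioned in a feed item, return descriptions for those
# the store knows about. URIs the store has never seen are simply ignored.
def augment(uris, store)
  uris.each_with_object({}) do |uri, found|
    found[uri] = store[uri] if store.key?(uri)
  end
end

item_uris = [
  "http://www.example.org/reviews/1",                   # not in the store
  "http://www.bbc.co.uk/programmes/b0074gpl#programme"  # the "bait"
]

descriptions = augment(item_uris, store)
descriptions.each { |uri, props| puts "#{uri}: #{props.inspect}" }
```

Only the BBC URI yields a description; the review's own URI is passed over because the toy store holds nothing about it.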

This kind of data augmentation is uniquely possible with RDF because of its reliance on URIs for global identifiers. It makes dipping into a pool of data very easy to do. It’s also possible to augment a service against multiple stores, pipelining the augmentation across multiple datasets to gather up all of the relevant data. As the output of a search against a Platform store is also RSS 1.0, you can enrich search results against multiple stores starting from an initial keyword search.

You can also see how this kind of enrichment can be used as part of, e.g., a Yahoo Pipeline. This is the primary reason why the service has been initially designed to work on RSS 1.0 feeds: it’s well supported, easy to generate, and, of all the varieties of RSS, RSS 1.0 is processable as both an RDF and an XML vocabulary, making it easy to process in this context. We are intending to expand the support to cover generic RDF input and output, and other flavours of RSS.

In the meantime, happy data fishing!

Augmenting Last.fm Data with BBC data on the Talis Platform

A short while back, I created a Linked Data wrapper on the Last.FM API for Events and Artists. The artist data links to the BBC’s data about each artist using owl:sameAs.

Now that the BBC RDF is available in a Talis Platform store, I can put some of my Last.FM data into a store (it’s currently generated on the fly from the Last.FM API), search on it, and then augment it with data from the BBC.

So I put some Last.FM data into the Sandbox1 store.

Now I can search on it with the items query endpoint like:

http://api.talis.com/stores/sandbox1/items?query=Black

This gives us the results as RSS 1.0, which is also RDF/XML, and contains a graph with 12 resources in it.

We can now pass the URI of this (or any RSS 1.0) document to the BBC-Backstage store’s Augment Service like this:

http://api.talis.com/stores/bbc-backstage/services/augment?data-uri=http%3A%2F%2Fapi.talis.com%2Fstores%2Fsandbox1%2Fitems%3Fquery%3DBlack
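Because the data-uri parameter is itself a URL, it has to be percent-encoded before it is appended to the request. A minimal Ruby sketch of building the request above follows; the `augment_url` helper is a hypothetical name, and the URL pattern is simply the one shown in this post.

```ruby
require "uri"

# Build an Augment Service request URL for a store.
# The feed URL is percent-encoded so that its own "?" and "=" characters
# are not mistaken for part of the outer query string.
def augment_url(store_name, feed_url)
  "http://api.talis.com/stores/#{store_name}/services/augment" \
    "?data-uri=#{URI.encode_www_form_component(feed_url)}"
end

search_url = "http://api.talis.com/stores/sandbox1/items?query=Black"
puts augment_url("bbc-backstage", search_url)
```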

The Augment service will look at the URIs in the RSS results, and add DESCRIBEs for any of those URIs that it finds in its own store, giving you back the RSS augmented with BBC data.

So the graph we get back now contains 15 resources, where the BBC-Backstage store has found descriptions for 3 of the URIs in the original RSS.

For further information, see Leigh Dodds’s slides on Getting Started with the Talis Platform.

SPARQL AJAX Client Library and Example

Over the past few years I’ve tinkered with a number of different implementations of an AJAX client library for SPARQL. Before a standard format for SPARQL JSON results was created, this involved having to jump through the extra hoops of parsing the XML format. But things are much easier now, especially when the JSON support is extended to include the results of CONSTRUCT and DESCRIBE queries.

My personal favourite SPARQL client library though is the one produced by Lee Feigenbaum, Elias Torres, and Wing Yung as part of their work on the SPARQL Calendar Demo.

While the sparql.js library only supports JSON it does have a few convenience features which I like, including global PREFIX bindings and some functions for automatically processing the JSON results to produce some simpler javascript objects (e.g. arrays and hashes) that simplify some scripting tasks and make code more readable.
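To give a feel for the kind of simplification involved, here is a rough Ruby sketch that flattens the standard SPARQL JSON results format into plain hashes. It is not a port of sparql.js: the `simplify` function name is hypothetical, and the spacecraft names and homepage are made-up sample values.

```ruby
require "json"

# A small SELECT result in the standard SPARQL JSON results format.
json = <<~JSON
  {
    "head": { "vars": ["name", "homepage"] },
    "results": {
      "bindings": [
        { "name":     { "type": "literal", "value": "Example Probe 1" },
          "homepage": { "type": "uri", "value": "http://www.example.org/probe1" } },
        { "name":     { "type": "literal", "value": "Example Probe 2" } }
      ]
    }
  }
JSON

# Flatten each result row to a hash of variable name => value.
# Variables left unbound in a row (e.g. by OPTIONAL) come back as nil.
def simplify(results)
  vars = results["head"]["vars"]
  results["results"]["bindings"].map do |row|
    vars.to_h { |var| [var, row.dig(var, "value")] }
  end
end

rows = simplify(JSON.parse(json))
rows.each { |row| puts "#{row['name']} #{row['homepage']}" }
```

Working with an array of plain hashes rather than the nested bindings structure is what keeps the table-building code in a small demo short.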

Using this on the Platform is quite straightforward, as you can upload this library, and any other related JavaScript files, directly into the Contentbox of your store. This not only avoids any cross-domain issues, but also means that you can deploy simple AJAX applications directly from a store.

I’ve put together a super simple demo that uses the NASA spaceflight data. The source code is here, and I’ve uploaded the two files into the n2-examples store contentbox, so you can play with the running application.

The demo simply fetches the name, homepage, description and launch date for every spacecraft launched in a particular year, also retrieving a link to a photo if there’s one available. The results are dropped into an HTML table for viewing.

The code is well commented so rather than repeat that here, you can look through the Javascript file that does the actual interaction. I’ve used JQuery to help with the DOM manipulation, etc. This is delivered through the Google JQuery CDN rather than the Platform. But the rest of the application is served directly from the Platform.

A rather easy and trivial example, but sometimes it’s useful to reiterate the basics. And if you want to incorporate the NASA spaceflight data in your own mashups, then you can do so easily by simply using the version of sparql.js in the space data store.

In my view, SPARQL + JSON + scripting languages like JS and Ruby hit a nice sweet spot for working with RDF, especially with the ability to bring together data from multiple sources using a single standard API.

Note: Keith Alexander has written up some of his own experiments with playing with JQuery against the platform here and here. His JQuery plugin provides some additional Platform specific functionality.

voiD, datasets, graphs, documents, and dcterms:isPartOf backlinks

One question I have heard several times now regarding voiD is how to say that data is part of a dataset.

Frédérick Giasson asked about this recently in #swig, and wondered why the voiD guide recommended using dcterms:isPartOf. I thought, since this is something that has been asked about a few times, I would blog about it and explain the reasoning behind this.

So, it wouldn’t be right to say something like:

<http://lastfm.rdfize.com/artists/Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

… because we don’t want to say that “Black Sabbath is part of the lastfm.rdfize.com dataset”.
We want to say “a description of Black Sabbath (composed of triples) is part of the lastfm.rdfize.com dataset“.

One approach to encapsulating this meaning would be to reify each individual triple and state that the triple is part of the dataset … but we felt that this would be neither practical nor popular.

So, in the voiD guide, we advocate that when you publish Linked Data, and you want to say that the data you are publishing is part of a voiD Dataset, you add a triple linking the document in which the data is published, to the dataset. eg:

<http://lastfm.rdfize.com/?artistName=Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

(where <http://lastfm.rdfize.com/?artistName=Black+Sabbath> is a document containing a description of <http://lastfm.rdfize.com/artists/Black+Sabbath>)

This way, when a Linked Data client dereferences <http://lastfm.rdfize.com/artists/Black+Sabbath> they get redirected to a document, and can follow the dcterms:isPartOf link from the document URI to the voiD Dataset.

What some people don’t like so much, is the implication that their dataset consists of documents, when what they really want to say is that their dataset consists of descriptions of resources.

The conceptual problem, if there is one, is that here the document URI is identifying an RDF/XML document, not the graph of RDF data encoded in that document. So, if you wanted to explicitly state that the graph, rather than the document, is part of the dataset, it could perhaps be done like this:

[ a <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
  <http://purl.org/vocab/frbr/core#embodiment> <http://lastfm.rdfize.com/?artistName=Black+Sabbath&output=rdf> ;
  dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset>
] .

But I’m really not too sure if that is either semantically correct, or in any way a more practically useful description than simply saying the document is part of the dataset.

We (the voiD guide authors) think that the <document> dcterms:isPartOf <dataset> pattern is the most pragmatic approach to making a dataset discoverable from a LOD document.
But we are also open to suggestions for improvement as we evolve the vocabulary and guide in line with popular usage and the requirements of LOD publishers.

What do you think?

Building a Custom Search Index

The Platform is more than just a triple store with a SPARQL interface. It provides a number of other services which are useful for application developers. The most useful of these is the built-in search engine. Each Platform Store has its own search engine that can be used to perform queries over the hosted metadata. So as well as having the option to query your data using the SPARQL query language, you also have the ability to do simple queries over the data with results being returned as RSS 1.0 (with the OpenSearch extensions). This is a nice feature as sometimes you don’t need the full power of SPARQL and for some use cases a more specialized text indexing system is a better option.

The Platform API allows you to configure the system to build a full-text index over any or all of the RDF literals in your stored data. The exception to this is the RDF type predicate: it is the only predicate that will have resource values indexed, making it possible for you to construct a search index and queries that can be used to find matches in specific types of RDF resource.

The remainder of this post shows how to configure the Platform to build a custom search index, with example Ruby code using Pho.

It’s common in search engine syntax to use a simple friendly name to identify a specific field that you want to search. For example in a Google search you can use “intitle:Blah” to search for the text “blah” only in the HTML title element of indexed pages. The Platform uses a similar mechanism to allow you to map any RDF property URI to a short friendly name suitable for submitting in a search query.

The complete set of these mappings is referred to as a FieldPredicateMap. The mapping is specific to each store, allowing different stores to have their own mappings. The Platform API exposes these mappings, allowing you to retrieve and update them yourself.

It is the presence of a mapping of a property URI to a friendly name that triggers the Platform to start indexing the literal values associated with that property. To put this another way: all you have to do to start indexing your literals is define a mapping. It’s as simple as that.

Once a mapping is in place, whenever you submit some RDF/XML to your store, the Platform will automatically index all of the mapped triples. The indexing is done asynchronously so there might be a short delay between the deposit of new content and the indexes being updated. Standard stuff.

The Pho Ruby API for the Platform provides programmatic access to this functionality, allowing you to script up the management and creation of the FieldPredicateMap. See the rdocs for the FieldPredicateMap class for details.

Here’s an example Ruby script that illustrates how to manage mappings.

To run the script you’ll first need to fill in the name of your store and your admin username and password. You’ll also need to make sure you’ve installed Pho: gem install pho should do the necessary.

The script does several things. Once a store object has been created, the script creates two new mappings. One for the FOAF name predicate, and one for RDF type:


#create the mappings we want
name = Pho::FieldPredicateMap.create_mapping(store, "http://xmlns.com/foaf/0.1/name", "name")

type = Pho::FieldPredicateMap.create_mapping(store, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "type")

The create_mapping method allows you to quickly generate a mapping suitable for adding to a specific store. In order to fetch the current list of mappings the script then does:


#read the existing mappings
mappings = Pho::FieldPredicateMap.read_from_store(store)

#remove anything for this uri
mappings.remove_by_uri("http://xmlns.com/foaf/0.1/name")
mappings.remove_by_uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

#append the new field-name mappings
mappings << name
mappings << type

The read_from_store method does the actual work, doing a GET request to the Platform to retrieve the mappings as JSON, which are then parsed into some useful Ruby objects. The remaining lines then add the newly created mappings to the current collection, after first ensuring that any previous mappings for those URIs have been removed. At this stage we’ve updated our local copy of the mappings but have not yet saved them back to the Platform.

Storing the updated mappings in the Platform is then just a matter of calling the upload method on the mapping object. This serializes the list of mappings as RDF/XML and then PUTs them back to the store. This will overwrite any of the current configuration with the updated copy we’ve got locally: this is one reason why we fetch the current copy before making the changes, to ensure the rest of the configuration is preserved.


resp = mappings.upload(store)
if resp.status_code != 200
  abort("Failed to upload mappings!")
end

The upload method, like many of the lower-level method calls in the Pho library, returns an HTTP::Message object that you can inspect to determine if the Platform request was successful.

The remaining lines in the sample script simply upload some test data to your store: astronauts.rdf contains a short list of a few astronauts, modelled as simple foaf:Person instances with a foaf:name property. This allows you to test out your newly created search index.

You can now construct item searches with syntax like “name:Buzz” to search for the name Buzz in any foaf:name predicate. Or you can find all foaf:Person instances by performing a search for:

type:"http://xmlns.com/foaf/0.1/Person"

Note that you have to quote the URI. And you can obviously combine those to find only foaf:Person resources with a specific foaf:name.
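When these searches are submitted over HTTP, the colon, quotes and slashes in the query need to be percent-encoded. A small Ruby sketch follows, assuming the items endpoint follows the same /stores/{name}/items?query= pattern used elsewhere on the Platform; the `item_search_url` helper is a hypothetical name.

```ruby
require "uri"

# Build an item search URL for a store, percent-encoding the query so
# that field prefixes and quoted URIs survive the trip intact.
def item_search_url(store_name, query)
  "http://api.talis.com/stores/#{store_name}/items" \
    "?query=#{URI.encode_www_form_component(query)}"
end

puts item_search_url("n2-examples", "name:Buzz")
puts item_search_url("n2-examples", 'type:"http://xmlns.com/foaf/0.1/Person"')
```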

I’ve run the script against the n2-examples store, so you can use the item search form to test it out. Or just click here to list all the people.

If you peek at the source of the returned RSS feed you’ll find that the essential metadata for each result — in this case the foaf:name property and the rdf:type — is automatically included. Incidentally if you have a FieldPredicateMapping defined with a property name of title then this will automatically be used as the title for the RSS item, allowing you some minor degree of control over the feed structure if you wanted to make it more human-readable.

The Platform provides you with a few more options for managing your search indexes than I’ve covered here. For example the FieldPredicateMap can also be used to associate an Analyzer with the field allowing you to control the indexing rules. You can also control the relevance ranking of the search results through the use of a Query Profile (which is also exposed through an API, and is manageable using Pho). The query profile lets you associate a weighting with a field, so that when a user performs a search without indicating which field they want to search (thereby searching all fields), then the Platform will alter the relevance ranking of the results to suit your preferences.

That concludes our look at the basic steps involved in building a custom search index over the Platform. While the Pho library provides some useful support, it’s worth remembering that it’s simply a thin veneer over several HTTP operations, so achieving the same effects in another language (or even from the console using plain old curl) should be easy enough. Hopefully the examples have also illustrated the simplicity of working with the Platform to create some quite powerful features and, importantly, that developing against the Platform doesn’t require you to be a SPARQL wizard: there are other ways to get data out of the system, but the power of SPARQL is there when you need it.

Any questions, then leave a comment and I’ll try to answer them.