Archive for the 'Tips and Tricks' Category

Configuring Guice Dependencies Post-Deployment

In a number of our projects, the Platform engineering team use Guice as a dependency injection framework. The benefits of DI with regard to increasing modularity, lowering coupling and facilitating reuse are well documented, and a killer feature for us is the vast improvement in testability. One of the reasons we like Guice is that all of your dependency wiring is done in code and so is checked by the compiler. Guice also seems to strike just the right balance between features and bloat: the core library makes it easy to do the things you really need, without including lots of stuff you don’t want. There’s also an active community developing extensions and additions to integrate or adapt Guice for specific uses.

Sometimes, we do want the ability to control the composition of an app at deploy time, which for us means specifying which combination of Guice Modules to configure our Injector with. Ordinarily, the main method (or something called early on in the application lifecycle) would contain some code to initialise the Injector with a list of Modules. Like so:

Injector injector = Guice.createInjector(new NetworkModule(),
                                         new SequencingModule(),
                                         new MySQLModule(),
                                         new JMSModule());
SomeThing thing = injector.getInstance(SomeThing.class);

Our use case was this: we wanted to deploy the same distribution of an application to multiple places and configure which implementations of various internal services were used in each environment. So in the example above, we wanted to be able to choose between the bindings specified in MySQLModule and PostgresModule after deployment. Initially, it didn’t seem that there was an existing solution, until we ran into java.util.ServiceLoader. This enables multiple concrete implementations of abstract services (i.e. interfaces/abstract classes) to be specified at runtime using a simple descriptor file on the classpath (the javadocs have a much fuller explanation). So, in this case the abstract service that we want to load is defined by com.google.inject.Module and the concrete implementations are the specific combination of modules we want to use to configure our app. The hardcoded Injector bootstrapping is replaced with this one-liner:

Injector injector = Guice.createInjector(ServiceLoader.load(Module.class));

The spec of which modules to load is contained in a classpath resource named META-INF/services/com.google.inject.Module and is just a simple list of fully qualified class names:

com.talis.network.NetworkModule
com.talis.sequence.SequencingModule
com.talis.db.mysql.MySQLModule
com.talis.jms.JMSModule

It’s possible to provide the service configuration file over HTTP by specifying remote URLs on the classpath, but at the moment we’re controlling which config gets deployed where using our regular deployment tool, Puppet.
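To make the descriptor format concrete, here is a minimal Python sketch of how such a file is read: one class name per line, with blank lines and anything after a `#` ignored, as the ServiceLoader javadocs describe. It only illustrates the file format and is not how ServiceLoader itself is implemented; the `com.talis.db.postgres.PostgresModule` name is a hypothetical stand-in for whichever module an environment swaps in.

```python
def parse_provider_config(text):
    """Return the class names listed in a ServiceLoader-style descriptor."""
    names = []
    for line in text.splitlines():
        # Everything after '#' is a comment; surrounding whitespace is ignored.
        name = line.split("#", 1)[0].strip()
        if name and name not in names:
            names.append(name)
    return names


# Hypothetical descriptor for an environment that uses Postgres instead of MySQL.
descriptor = """\
# META-INF/services/com.google.inject.Module
com.talis.network.NetworkModule
com.talis.sequence.SequencingModule

com.talis.db.postgres.PostgresModule   # swapped in for MySQLModule
com.talis.jms.JMSModule
"""

modules = parse_provider_config(descriptor)
print(modules)
```

Swapping one line in that file is all it takes to change which bindings an environment gets, which is why the file is a convenient thing to manage with a deployment tool.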

Automatically Creating Inverse Changesets and When They Don’t Behave as Expected

The Talis Platform uses changesets as a mechanism for updating RDF. As the configuration of the Platform is itself stored as RDF, we also use changesets to modify its configuration. This can be as part of a release or to make requested changes to a customer’s store.

I recently needed to apply a large number of changesets to the Platform configuration. But before applying them, I wanted to create another set of changesets which would, if necessary, reverse all the changes – I wanted to be able to roll back if anything went wrong.

So my changesets looked something like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:cs="http://purl.org/vocab/changeset/schema#">
   <cs:ChangeSet rdf:about="http://example.com/changesets#change-1">
    <cs:subjectOfChange rdf:resource="http://api.talis.com/stores/mystore/exampleconfig"/>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://api.talis.com/stores/mystore/exampleconfig"/>
        <rdf:predicate rdf:resource="http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty"/>
        <rdf:object rdf:resource="http://api.talis.com/stores/mystore/exampleconfig/old"/>
      </rdf:Statement>
    </cs:removal>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://api.talis.com/stores/mystore/exampleconfig"/>
        <rdf:predicate rdf:resource="http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty"/>
        <rdf:object rdf:resource="http://api.talis.com/stores/mystore/exampleconfig/new"/>
      </rdf:Statement>
    </cs:addition>
  </cs:ChangeSet>
</rdf:RDF>

This changeset can be reversed by changing the removals to additions and changing the additions to removals. This is easy to achieve with sed:

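mkdir -p rollback
for f in changesetdirectory/* ; do
  # swap the element names via temporary placeholders; write each result under rollback/
  sed -e 's/cs:addition/TOBEAREMOVAL/g' -e 's/cs:removal/TOBEANADDITION/g' \
    -e 's/TOBEAREMOVAL/cs:removal/g'  -e 's/TOBEANADDITION/cs:addition/g' "$f" > "rollback/$(basename "$f")"
done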

The above script creates an inverse of every changeset in the specified changesetdirectory and places them in the rollback directory. The inverse of the example changeset above is created as below:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:cs="http://purl.org/vocab/changeset/schema#">
   <cs:ChangeSet rdf:about="http://example.com/changesets#change-1">
    <cs:subjectOfChange rdf:resource="http://api.talis.com/stores/mystore/exampleconfig"/>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://api.talis.com/stores/mystore/exampleconfig"/>
        <rdf:predicate rdf:resource="http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty"/>
        <rdf:object rdf:resource="http://api.talis.com/stores/mystore/exampleconfig/old"/>
      </rdf:Statement>
    </cs:addition>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://api.talis.com/stores/mystore/exampleconfig"/>
        <rdf:predicate rdf:resource="http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty"/>
        <rdf:object rdf:resource="http://api.talis.com/stores/mystore/exampleconfig/new"/>
      </rdf:Statement>
    </cs:removal>
  </cs:ChangeSet>
</rdf:RDF>
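The same swap can be sketched in Python, which avoids the temporary placeholder names by replacing both element names in a single pass. This is a minimal sketch that assumes the changeset vocabulary is always bound to the `cs:` prefix, as in the examples here; the function name is my own.

```python
import re


def invert_changeset(xml_text):
    """Swap cs:addition and cs:removal element names in a changeset document."""
    swap = {"addition": "removal", "removal": "addition"}
    # Matches both opening (<cs:addition>) and closing (</cs:addition>) tags.
    return re.sub(
        r"(</?cs:)(addition|removal)\b",
        lambda m: m.group(1) + swap[m.group(2)],
        xml_text,
    )


original = (
    "<cs:removal><rdf:Statement/></cs:removal>"
    "<cs:addition><rdf:Statement/></cs:addition>"
)
inverse = invert_changeset(original)
print(inverse)
```

Inverting twice gives back the original document, which is a handy sanity check before relying on the rollback files.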

So the original changeset removes the triple:

http://api.talis.com/stores/mystore/exampleconfig 

http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty

http://api.talis.com/stores/mystore/exampleconfig/old

and replaces it with:

http://api.talis.com/stores/mystore/exampleconfig 

http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty

http://api.talis.com/stores/mystore/exampleconfig/new

The inverse changeset removes the triple:

http://api.talis.com/stores/mystore/exampleconfig 

http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty

http://api.talis.com/stores/mystore/exampleconfig/new

and replaces the original:

http://api.talis.com/stores/mystore/exampleconfig 

http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty

http://api.talis.com/stores/mystore/exampleconfig/old

Using this technique, I successfully created inverse changesets which, if I had needed to, would have rolled back the changes to the configuration.

However, there is a caveat. The set semantics of a triplestore can be a gotcha.

Suppose the following triple already exists:

http://api.talis.com/stores/mystore/exampleconfig 

http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty

http://api.talis.com/stores/mystore/exampleconfig/alreadyexists

The following changeset could be applied:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:cs="http://purl.org/vocab/changeset/schema#">
   <cs:ChangeSet rdf:about="http://example.com/changesets#change-1">
    <cs:subjectOfChange rdf:resource="http://api.talis.com/stores/mystore/exampleconfig"/>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://api.talis.com/stores/mystore/exampleconfig"/>
        <rdf:predicate rdf:resource="http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty"/>
        <rdf:object rdf:resource="http://api.talis.com/stores/mystore/exampleconfig/alreadyexists"/>
      </rdf:Statement>
    </cs:addition>
  </cs:ChangeSet>
</rdf:RDF>

This changeset is accepted but doesn’t actually modify the triples as the triple it adds already existed. Creating an inverse of this changeset gives us:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:cs="http://purl.org/vocab/changeset/schema#">
   <cs:ChangeSet rdf:about="http://example.com/changesets#change-1">
    <cs:subjectOfChange rdf:resource="http://api.talis.com/stores/mystore/exampleconfig"/>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://api.talis.com/stores/mystore/exampleconfig"/>
        <rdf:predicate rdf:resource="http://schemas.talis.com/2006/bigfoot/configuration#exampleproperty"/>
        <rdf:object rdf:resource="http://api.talis.com/stores/mystore/exampleconfig/alreadyexists"/>
      </rdf:Statement>
    </cs:removal>
  </cs:ChangeSet>
</rdf:RDF>

However, applying the inverse changeset removes the triple. As the triple existed before applying the first changeset the inverse of the changeset did not have the result we were looking for. It ended up deleting the triple which existed before we started.

So creating inverse changesets in this way can be useful, but only when you know with certainty that any triples added in the original changeset did not already exist.
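The gotcha is easy to reproduce with a toy model in which the store is just a Python set of triples. The sketch below is not the Platform's changeset implementation (for one thing, it silently ignores removals of triples that are not there); it only shows why the naive inverse loses data, and how an inverse computed against the store before the original is applied avoids that. All names are illustrative.

```python
def apply_changeset(store, additions=(), removals=()):
    """Return a new set of triples with removals taken out and additions put in."""
    return (set(store) - set(removals)) | set(additions)


def safe_inverse(store, additions=(), removals=()):
    """Build an inverse that only undoes what the changeset will really change.

    Must be computed against the store as it is BEFORE the original is applied.
    """
    undo_removals = [t for t in additions if t not in store]  # genuinely new triples
    undo_additions = [t for t in removals if t in store]      # genuinely removed triples
    return undo_additions, undo_removals


existing = ("exampleconfig", "exampleproperty", "exampleconfig/alreadyexists")
store = {existing}

# The original changeset "adds" a triple that is already there: nothing changes.
after = apply_changeset(store, additions=[existing])

# The naive inverse removes it, so the rollback deletes pre-existing data.
naive_rollback = apply_changeset(after, removals=[existing])

# The store-aware inverse has nothing to undo, so the triple survives.
undo_add, undo_rem = safe_inverse(store, additions=[existing])
safe_rollback = apply_changeset(after, additions=undo_add, removals=undo_rem)
```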

SPARQL Hacks: moving query logic into data

There are too many terms that mean the same thing sometimes. Take labels. rdfs:label is perhaps the most obvious choice if you want to label something in RDF, but there are a whole bunch of semantically equivalent predicates in high usage for doing the same thing. For a while, it seems, it was common practice for every vocabulary to define their own equivalent – though very few bother to rdfs:subPropertyOf rdfs:label (and some predate rdfs:label), so even if you can do some reasoning in your query engine, this might not help you much. So when you want to get the label for something, but you don’t know which predicate the data uses, you might end up doing something like this:


construct { ?s rdfs:label ?l }
where
{
?s ?p ?o
optional
{ ?s rdfs:label ?l }
optional
{ ?s foaf:name ?l }
optional
{ ?s sioc:name ?l }
optional
{ ?s dc:title ?l }
optional
{ ?s dcterms:title ?l }
}

Nasty. And maybe later you find another label predicate in the data somewhere and have to go modify your queries.

But, if I add these triples to my store:


<#a> rdfapp:labelPredicate dc:title, rdfs:label, dcterms:title, foaf:name, sioc:name .

I can instead do:


prefix rdfapp: <http://kwijibo.talis.com/vocabs/rdfapp#>
construct { ?s rdfs:label ?l }
where
{
<#a> rdfapp:labelPredicate ?labelPredicate .
?s ?labelPredicate ?l .
}
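To see what the second query is doing, here is a small in-memory sketch in Python: the list of label predicates is read from the data and then joined against every other triple. It uses plain tuples with prefixed names as strings rather than a real triple store or SPARQL engine, and the `ex:` resources are made up for the example.

```python
RDFS_LABEL = "rdfs:label"
LABEL_PREDICATE = "rdfapp:labelPredicate"

triples = {
    # configuration: which predicates count as labels
    ("#a", LABEL_PREDICATE, "dc:title"),
    ("#a", LABEL_PREDICATE, "foaf:name"),
    ("#a", LABEL_PREDICATE, "rdfs:label"),
    # data (hypothetical resources)
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:report", "dc:title", "Annual Report"),
    ("ex:report", "dc:date", "2009"),
}


def construct_labels(triples, config="#a"):
    """Mimic the CONSTRUCT: relabel anything stated with a configured predicate."""
    label_predicates = {
        o for s, p, o in triples if s == config and p == LABEL_PREDICATE
    }
    return {(s, RDFS_LABEL, o) for s, p, o in triples if p in label_predicates}


labels = construct_labels(triples)
```

Supporting another label predicate then means adding one configuration triple; the query logic stays the same.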

Notes on Cross-Domain Ajax

Background

I asked for a little project I could get my teeth into, and Leigh suggested something very tasty: an analytics app, along the lines of Google Analytics or the (very impressive) open source Piwik. Basically tracking things like page visits, referrers, outbound clicks and so on. The difference from the existing apps is that it takes advantage of semweb goodness, specifically a Talis Platform store as a backend.

What this required was something that would run in the browser when someone visited a given Web page and pass on relevant data to a server which would push that data into the store. A script discreetly embedded on the page of interest picks up the activity and posts it to the server-side logging system. There wasn’t really a sensible choice other than to use Javascript client-side, and to keep things reasonably portable server-side I opted for PHP. The server-side processing is relatively straightforward (although I’m not actually capturing much yet), but the browser-server comms part turned out to be a real doozy.

It’s not difficult to call an HTTP server from inside Javascript wrapped in HTML loaded in a browser. The snag is that the security model common to popular browsers blocks access to server domains other than the one that originated the page containing the Javascript. I got some code running from http://hyperdata.org that nicely delivered some basic logging of visits to pages on http://hyperdata.org (including the Wiki I have there – though it took a while to find the right template…). Problems started when I tried the same script in pages hosted under http://danny.ayers.name. Browser no likey: wrapping the server call in a try...catch block and throwing up an alert(error) always revealed Exception… “Access to restricted URI denied” code: “1012” – this is the same origin policy. What follows are the workarounds for this. Googling the titles here will provide a variety of sample code that implements the solutions. I’ve opted for Hidden Form, it being straightforward for my purposes and standards-friendly.

Cross-domain proxy

Conceptually the easiest, this approach uses a server-side pipeline that lives on the same domain as the delivered pages containing the Javascript. It essentially echoes calls from the delivering server to the remote server that does the work. This didn’t seem a good choice for the analytics app as every end-user would require such a proxy on their own server.

  • Pros: straightforward; independent of browser vagaries; spec friendly
  • Cons: needed for every host delivering pages with embedded scripts (if all the servers involved are yours, this is probably a good choice)

Tag Overload Hacks

When a typical browser hits HTML tags <script> and <img> (any others?) it will quite happily do an HTTP GET on them, irrespective of domain. There’s been a fair bit of finesse applied around the use of <script> – notably the elegant but brain-boiling JSONP (JSON with Padding), in which the server wraps its JSON in a call to a callback function named in the request, so that the response runs as a script when the browser loads it. I won’t comment further on this, except to say I understood it for about 5 minutes then lost it again when I went to make a coffee. I’m told jQuery will do this automagically if you choose dataType: "jsonp".

The <img> approach has been around seemingly forever – it’s also known as a Web Bug. Usually you have a 1×1 pixel image in the page of interest (probably inserted dynamically through DOM calls), and every time the page is loaded that image’s URI gets a GET. The trick for tracking is to append a bunch of query parameters to the image URI and have your server intercept the GET call. Apparently this is how Google Analytics does its stuff.

  • Pros: good library support
  • Cons: limited to GETs; rather an ugly hack
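The Web Bug trick comes down to building an image URI whose query string carries the data. Here is a minimal sketch of that step in Python; in a real page the URI would be built in Javascript, and the host and parameter names (`stats.example.org`, `page`, `ref`) are made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def tracking_pixel_url(base, page, referrer):
    """Append the tracking data to the image URI as encoded query parameters."""
    query = urlencode({"page": page, "ref": referrer})
    return f"{base}?{query}"


url = tracking_pixel_url(
    "http://stats.example.org/pixel.gif",  # hypothetical logging server
    "http://example.com/a page",
    "http://example.net/",
)

# What the server would recover from the GET it intercepts.
params = parse_qs(urlsplit(url).query)
```

The encoding step matters: page URIs contain characters such as `:` and `/`, and sometimes spaces, that have to be escaped before they can travel inside another URI's query string.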

Flash Proxy

Most people suggested this when I was asking around Twitter and the jQuery mailing list. Turns out there’s a really convenient library that does all the hard work (Google “flXHR”). But I’m afraid I prefer to give Flash a miss when there are open standards available, so I didn’t investigate.

  • Pros: easy (apparently) with library support
  • Cons: uses proprietary stuff

Hidden Form

When I first saw references to this I overlooked it – it seemed to demand an iFrame and ugly hackery. But then (largely thanks to this discussion of cross-domain Ajax) I realised it was almost certainly the best bet for the analytics app. Essentially you dynamically push a <form> into the HTML DOM with your data as input values, then call a form.submit(). Most references to this I found did involve an iFrame to receive the HTTP response – necessary if you’re doing a mashup or something, but not if you only need to POST data off to the server. In this latter circumstance you need to get the server to return a 204 No Content status code, but that’s trivial in PHP (header('HTTP/1.1 204 No Content');), otherwise the browser will try to load the target URI material.

  • Pros: supports and is very simple for POSTing to server; standards-friendly in this context
  • Cons: gets uglier if you want a response
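The server half of the Hidden Form approach is small: accept the POSTed form fields and answer 204 No Content so the browser stays on the page. My own endpoint is PHP; the sketch below shows the same idea with Python's standard library, appending to a list where a real logger would push the data into a store. The handler, the `/log` path and the `post_once` helper are all illustrative names.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

received = []  # stands in for pushing the data into a store


class LogHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        received.append(parse_qs(body))
        # 204 tells the browser there is nothing to display, so it stays put.
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


def make_server(port=0):
    """Bind to localhost; port 0 asks the OS for any free port."""
    return HTTPServer(("127.0.0.1", port), LogHandler)


def post_once(form_body):
    """Serve a single request, POST a form body to it, and return the status."""
    server = make_server()
    worker = threading.Thread(target=server.handle_request)
    worker.start()
    request = urllib.request.Request(
        f"http://127.0.0.1:{server.server_port}/log",
        data=form_body.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        status = response.status
    worker.join()
    server.server_close()
    return status
```

`post_once("page=%2Fhome&ref=none")` plays the part of the browser's form.submit() here; the body is the same URL-encoded format a form sends by default.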

I’ve not properly doc’d my app code yet (and the functionality is a very long way from complete, let alone tidied up), but you can find it all via my latest Wiki – there’s an example of the Javascript in test.html (just before the closing </body> tag). I’ve only tested it on Firefox so far, but I reckon there’s a good chance of the LazyWeb giving me solutions to any cross-browser issues.

Many thanks for all the helpful suggestions: from this thread on the jQuery mailing list and Twitterers @rjw @flensed @gridinoc @weblivz @JeniT @jQueryHowto.

I’d love to hear of any other solutions to cross-domain Ajax, please drop in comments, mail me or tweet me.

Visualising BBC Programme Categories

Whilst I was exploring the BBC programmes data looking for possible demonstration applications I thought it might be interesting to try and create a visualisation of the relationships between different categories of BBC programmes. The BBC datasets use SKOS as a categorization scheme, with separate taxonomies for formats (e.g. documentaries, animation, etc) and genres (e.g. children’s programmes, science fiction, etc). If you poke around a little, you can also see a nascent category system for places and people, although there doesn’t seem to be much data there at present (and what is there seems to change regularly).

For my purposes, the genre classifications looked most interesting. Episodes are associated with their genre category via the po:category property. As I was interested in finding relationships between genres, what I was looking for was a way to relate together individual categories, other than by the obvious super/sub-category relationship.

It occurred to me that if two categories were associated with the same episode, then this could be viewed as a declaration of some implicit relationship between the categories. Extracting this in SPARQL is straight-forward, as we just need to match episodes that have more than one category:


SELECT ?categoryLabel ?relatedLabel WHERE
{
  ?episode a po:Episode;
    po:category ?category;
    po:category ?related. 

  ?category a po:Genre;
    rdfs:label ?categoryLabel. 

  ?related a po:Genre;
    rdfs:label ?relatedLabel. 

  FILTER (?category != ?related)
}
ORDER BY ?categoryLabel

In the above SPARQL query we match any episode that has at least two categories (because we use two po:category patterns), and where those categories are different (in the FILTER). This excludes the unwanted result where the ?category and ?related variables are bound to the same value. I didn’t bother with pruning out duplicates as this could easily be done on the client-side.

In order to visualise the results, I decided to use MooWheel. This provides a simple Javascript visualisation toolkit for presenting connections between a set of resources. MooWheel can be configured using a JSON data structure, so generating a MooWheel visualisation from a SPARQL query is relatively straight-forward: the query results can be retrieved as SPARQL/JSON which can then be massaged into the appropriate JSON structure to generate the MooWheel visualisation. Check out the source code of the demonstration for sample code to do this (look at the success callback).
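The massaging step, including the client-side pruning of duplicates mentioned earlier, can be sketched as follows. The demonstration does this in Javascript; this Python version just shows the shape of the transformation. The input follows the standard SPARQL JSON results layout (`results.bindings`), but the output is a plain category-to-related-categories map and is not claimed to be MooWheel's actual input format.

```python
import json


def related_categories(sparql_json):
    """Collapse SPARQL JSON result rows into a map of category -> related categories."""
    data = json.loads(sparql_json)
    wheel = {}
    for row in data["results"]["bindings"]:
        category = row["categoryLabel"]["value"]
        related = row["relatedLabel"]["value"]
        wheel.setdefault(category, set()).add(related)  # sets drop duplicate pairs
    return {category: sorted(related) for category, related in wheel.items()}


def binding(category, related):
    """Build one result row in the SPARQL JSON results layout."""
    return {
        "categoryLabel": {"type": "literal", "value": category},
        "relatedLabel": {"type": "literal", "value": related},
    }


# Made-up result rows; two episodes share the same pair of genres.
sample = json.dumps({
    "head": {"vars": ["categoryLabel", "relatedLabel"]},
    "results": {"bindings": [
        binding("Comedy", "Drama"),
        binding("Comedy", "Drama"),
        binding("Comedy", "Sitcoms"),
        binding("Drama", "Comedy"),
    ]},
})

connections = related_categories(sample)
```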

My first attempt at a visualisation simply executed the above query across the entire BBC dataset. This generated a huge wheel of connections between the categories, but ultimately the visualisation wasn’t that useful. So I decided to refine the visualisation to generate separate category wheels for each of the main BBC TV channels. This involved refining the SPARQL query to include an extra triple pattern to limit Episodes to just those associated with a specific channel (po:masterbrand). The following revised query restricts results to BBC 1:


SELECT ?categoryLabel ?relatedLabel WHERE
{
  ?episode a po:Episode;
    po:masterbrand ;
    po:category ?category;
    po:category ?related. 

  ?category a po:Genre;
    rdfs:label ?categoryLabel. 

  ?related a po:Genre;
    rdfs:label ?relatedLabel. 

  FILTER (?category != ?related)
}
ORDER BY ?categoryLabel

The results of this visualisation are much more interesting.

Each of the BBC channels has a different range of programming and this emphasis is really clear in the visualisation. Compare for example BBC 1 and BBC 3, or either with BBC 4. For those of us in the UK who have already internalised this, there may not be a great deal of new information here, but it’s nice to see how this feature of the dataset can be easily surfaced with very little effort. There’s more analysis that could be done here though, particularly if the BBC open up their programme archives. For example, how does the range of programme categories for a channel change over time? Which programmes actually link the different categories together? Could other visualisations provide more insight into the programming than a simple relationship wheel? For example, could a treemap style visualisation give some indication of the amount of schedule time devoted to a particular category of programme?

Why not see what you can come up with?

Presenting BBC and NASA data using Freemix and SPARQL

One of the most interesting applications I saw at the recent Semantic Technology conference was Freemix. The application, which is currently in limited beta, allows anyone to easily create customized views over data that they upload into the system. There are also the usual networking features providing an additional social dimension to data sharing and publishing. As I understand it Zepheira have plans for expanding the range of features in all sorts of ways, including new visualisations, the ability to merge and remix data from several sources, and naturally enough a commercial version that can be deployed within the enterprise.

The core of Freemix is Simile Exhibit and a drag and drop interface for building up an Exhibit presentation over data that the user has uploaded. Data can be presented in several different ways, including simple tabular spreadsheets and in the Exhibit JSON format. Exhibit provides a number of different existing views suitable for presenting data, including lists, tables, maps, timelines, etc. As a web developer it’s straight-forward to build up your own Exhibits; Freemix takes this to the next level, making it trivial to build a presentation in just a few clicks, without the need to learn any markup: you just have to understand your own data.

Naturally enough I was curious to know whether Freemix could be used to build presentations of Linked Data, and specifically whether I could feed it with data from the Talis Platform. I’ve been working with the BBC data quite extensively recently, and have been compiling a space flight dataset. So I thought I’d use those as my test cases. Both of these datasets are in Platform stores, so I explored the options for extracting data using a SPARQL query in order to build a presentation in Freemix. It turns out it’s really easy.

Freemix supports importing JSON data from a URL, so I knew that in theory I could write a SPARQL query against a Platform store and use the SPARQL protocol request URL as the import target. As I didn’t want to extract the whole dataset, just some interesting subset for my presentation, a SPARQL CONSTRUCT query seemed like the best option. Like Exhibit, Freemix requires a relatively flat data structure — i.e. resources with properties, rather than a true directed graph. This means that within the CONSTRUCT query I would need to simplify the graph structure, removing some of the richer modelling, to re-shape the data to fit Freemix’s expectations.

Here’s a query I came up with for my NASA data:


PREFIX rdfs:
PREFIX dc:
PREFIX space:
PREFIX xsd:
PREFIX foaf: 

CONSTRUCT {
?spacecraft foaf:name ?name;
space:agency ?agency;
space:mass ?mass;
foaf:homepage ?homepage;
space:launched ?launched;
dc:description ?description;
space:discipline ?label.
}
WHERE {
?launch space:launched ?launched.

?spacecraft foaf:name ?name;
space:agency ?agency;
space:mass ?mass;
foaf:homepage ?homepage;
space:launch ?launch;
dc:description ?description;
space:discipline ?discipline.

?discipline rdfs:label ?label.

FILTER (?launched > "2005-01-01"^^xsd:date)
}

The query finds all spacecraft launched since 2005, extracting the name, agency, mass, etc. The labels of the disciplines (subject categories) and the launch dates, which are originally associated with separate resources in the underlying graph, are re-presented here as properties of the spacecraft itself. Not ideal from a modelling or data interchange perspective, but a reasonable trade-off for shaping data for presentation purposes.
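Turning a query like this into an import target means encoding it into a SPARQL protocol request URL, with the query text passed in the `query` parameter of a GET. Here is a minimal sketch of that step in Python; the endpoint address is a placeholder and not the real location of the store's SPARQL service, and the query is a cut-down stand-in for the one above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sparql_request_url(endpoint, query):
    """Build a SPARQL protocol GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


endpoint = "http://example.org/stores/mystore/sparql"  # hypothetical endpoint
query = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10"

url = sparql_request_url(endpoint, query)
print(url)
```

Because the result is an ordinary URL, it can be handed to anything that imports data from a web address.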

So far so good. The Talis Platform supports a range of output options from CONSTRUCT queries including both RDF/XML and RDF/JSON. Unfortunately Freemix doesn't support RDF/JSON as an input option although this would make a nice addition to the range of import options. In order to convert from the RDF/XML to the Exhibit/JSON format for Freemix I used the Talis Morph service. Morph is a simple service that provides a number of options for converting between semantic web formats. RDF/XML to Exhibit/JSON is one of those options, so all I needed to do was pipe the original SPARQL query URL through the morph service to get my final import target for Freemix.

You can view the imported data on my Freemix homepage. And here's a presentation of that same data. As you can see the presentation provides list and table views, pie charts that break down launches by agency and discipline, and also a timeline view of the launches. This was incredibly quick to put together.

I tried the same approach with some BBC data. So here's a simple Dr Who episode guide as a Freemix. The presentation options are a little more limited here, partly because there aren't as many natural facets to the BBC data, but also because Freemix doesn't (yet?) offer the ability to, e.g. create a coverflow presentation of images, or a tag cloud over blocks of text. The ability to mark fields as numbers and sort tables by multiple fields would also be useful. Having said that, try searching for "Rose" in the search box to see which episode descriptions mention her; note that the series facet on the left also automatically updates.

As you can see from the SPARQL query, some massaging of the graph structure was required to include series titles against each episode.


PREFIX foaf:
PREFIX rdfs:
PREFIX po:
PREFIX dc:
PREFIX freemix: 

CONSTRUCT {

?episode a po:Episode;
foaf:depiction ?depiction;
freemix:seriesTitle ?seriestitle;
dc:title ?title;

po:position ?position;
po:short_synopsis ?syn.
}
WHERE
{
po:series ?series.

?series dc:title ?seriestitle;
po:episode ?episode.

?episode a po:Episode;
foaf:depiction ?depiction;
dc:title ?title;
po:position ?position;
po:short_synopsis ?syn.

}

My only other issue with Freemix is the live-ness of the data. Ideally instead of having to import data directly into the system, it should instead be fetched from source either on demand or on a regular basis. I suspect this is the kind of feature that will end up in a commercial version of the product.

Overall though I was quite pleased with how easy it was to create these kinds of presentations. I'm convinced that for Linked Data to truly hit the mainstream we need simple tools like Freemix that let all of us easily compile and create custom presentations of data. Obviously, we also need to be able to easily select the data that we want to display, and very few people will want to bother with SPARQL queries. So I think there is some interesting work to be done to create SPARQL query builders that tie into browsers, e.g. so I can select the data facets I'm interested in as I browse and then choose to represent those facets in different ways.

Data Migration using SPARQL and Changesets

A tagline sometimes used for RDF is “self-describing data”. Sometimes though, you make your data describe itself badly; perhaps you’ve used a vocabulary term that has since been deprecated, or perhaps you’ve found a term which is more widely supported, or more appropriate to the data; maybe there was a typo in the script you generated your triples with. At any rate, it’s pretty common to have to fix your data, and if you have a live application with fresh ‘bad’ triples being created all the time, and a lot of bad triples to fix anyway, this can get tricky.

We’re having to do this in a project at the moment, and this is the method Nad and I came up with. We separated out adding good triples and removing bad triples into separate stages because our application will continue to function the same with both good and bad triples, but, once we roll out the code that expects the good triples, we need the good triples, whereas the bad triples can be removed at our leisure.

Adding New Good Triples

So, say for example, one of the things we want to fix is using dcterms:creator instead of dc:creator. We can get the good triples by querying for:

    CONSTRUCT {
    # good
                ?s <http://purl.org/dc/terms/creator> ?o
    } WHERE {
    # bad
                ?s <http://purl.org/dc/elements/1.1/creator> ?o
    }

And posting that back into the store. (First make sure that your application won’t do anything too weird if you add in these triples without changing any code).

If there are a lot of triples in the store, you may not be able to retrieve and post them all at once. To scale to large numbers of triples, just page through the results at, say, 1000 triples at a time by adding LIMIT 1000 OFFSET 0 and incrementing the OFFSET by the LIMIT (1000) until you don’t get triples back anymore.

Wrap this little procedure up in a script because you’ll need to run it again.
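The paging loop might look something like this. It is a sketch of the control flow only: `fetch_page` and `post_triples` are hypothetical callables standing in for "run the CONSTRUCT with LIMIT and OFFSET" and "POST the result back into the store", and the fake versions below exist purely to exercise the loop.

```python
def add_good_triples(fetch_page, post_triples, page_size=1000):
    """Page through the CONSTRUCT results and post each page back to the store."""
    offset = 0
    total = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        if not page:
            break  # no triples left
        post_triples(page)
        total += len(page)
        offset += page_size  # increment the OFFSET by the LIMIT
    return total


# Stand-ins for the store: 2500 'bad' dc:creator triples.
bad = [(f"ex:doc{i}", "dc:creator", "ex:alice") for i in range(2500)]
posted = []


def fake_fetch(limit, offset):
    """Pretend to run the CONSTRUCT with LIMIT/OFFSET, returning 'good' triples."""
    return [(s, "dcterms:creator", o) for s, _, o in bad[offset:offset + limit]]


count = add_good_triples(fake_fetch, posted.extend)
```

One thing to watch with a real endpoint: SPARQL only makes LIMIT and OFFSET pages predictable when the query also has an ORDER BY, so without one a triple could be skipped in one run. Running the script a second time, as described in the next section, picks up anything that was missed.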

Deploy Code

As soon as you have finished adding the new /good/ triples, deploy your new code that uses dcterms:creator instead of dc:creator.

Now run the add good triples script again. This is because, while you were deploying the code, users may have been plugging away at your app, happily creating more bad old triples. Running the script again will add good triples for any of these bad triples that have been created in the meantime. And because you’ve now deployed your code changes, the application won’t create any more bad triples.

All we have left to do is get rid of the bad old triples. With any luck (and a bit of foresight, and testing), your application will function perfectly well with both bad and good triples in the store, so we can take our time a bit getting rid of the bad triples.

Removing Bad Old Triples

We’ll write a SPARQL query to give us back the triples we want to remove, and then we’ll create a Changeset to remove them:

    CONSTRUCT {
    # bad
             ?s <http://purl.org/dc/elements/1.1/creator> ?o
    } WHERE {

    # bad
            ?s <http://purl.org/dc/elements/1.1/creator> ?o .
    # good
            ?s <http://purl.org/dc/terms/creator> ?o

    }

You should apply a LIMIT here too but, so long as you wait for each changeset batch to succeed before sending the next one, you don’t need to page: each successful batch removes the triples it matched, so you can just fetch the first 1000 again and again until nothing comes back. It’s worth remembering that the number of triples in a changeset document will be about ten times the number of triples you are removing, so you may need to make the limit a bit smaller than before.

An important point on the platform’s Changesets API: if you send more than 14 changesets in a batch (i.e., in the same document), they will be performed asynchronously and you should get back an HTTP 202 Accepted status code. A potential problem is that, if you are trying to remove a statement that doesn’t exist, then all the changes in that batch will fail, but you will still get back a 202 Accepted (because the platform hasn’t tried processing them yet). You need a way of knowing whether the batch has failed or not. One way to do this is to include in your batch of changes the addition of a marker triple, which you can then poll the store for to see whether it exists.

If you’re using PHP, you can use Moriarty to create your changesets:

    
define('STORE_URI', 'http://api.talis.com/stores/sandbox1');
// take the timestamp once, so the marker URI and its literal value always agree
$time = time();
$markerURI = STORE_URI.'/items/'.$time;

$rdfToAdd = " <{$markerURI}> <http://purl.org/dc/terms/created> \"{$time}\" . ";

$args = array(
    'before' => $rdfToRemove, // got this from the CONSTRUCT described above
    'after' => $rdfToAdd, // this is the marker triple
);
$cs = new Changeset($args);
$store = new Store(STORE_URI); // you will probably need to add your login credentials - see the moriarty docs
$response = $store->get_metabox()->apply_changeset($cs);

if(!$response->is_success()){
    // log the error and stop; log_error() stands for your own logging function,
    // and break assumes this code runs inside the loop that sends each batch
    log_error("Changeset failed: ".$response->status_code ."\n " . $response->body ." \n  Changeset: \n" . $cs->to_rdfxml());
    break;
}

    

At this point, you may also want to poll the store for the existence of your marker triple to see if the batch has been processed. Since we minted the marker URI in the store’s URI space, we can just try to dereference it; as soon as we get back a 200 or 303 response, we can move on, but if we still get back a 404 after, say, 10 seconds, the changeset has probably failed and we need to log that and investigate.
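As a rough sketch of that polling step in Python: `wait_for_marker` and `get_status` are made-up names, and `get_status` is assumed to be any function that requests the marker URI and returns the HTTP status code. The 10 second timeout and 1 second interval are just the example figures.

```python
import time
from typing import Callable


def wait_for_marker(get_status: Callable[[], int],
                    timeout: float = 10.0,
                    interval: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the marker URI resolves (200 or 303) or the timeout passes.

    Returns True if the marker appeared, False if we gave up.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if get_status() in (200, 303):
            return True  # the batch containing the marker was applied
        if clock() >= deadline:
            return False  # still missing: log it and investigate
        sleep(interval)
```

Bear in mind that many HTTP clients follow a 303 redirect automatically, in which case `get_status` would report the status of the final response instead.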

If everything goes OK, you can then make double sure you’ve got rid of all the bad triples by running a quick ASK query against your store’s SPARQL service.

    
        ASK {
            # bad
            ?s <http://purl.org/dc/elements/1.1/creator> ?o
        }
    

And if you get back FALSE, you’re finished.
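If you are scripting this final check, the answer can be read out of the response body. The sketch below assumes the SPARQL service returns the standard W3C SPARQL Query Results XML Format; fetching the response is left out, and `ask_result` is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format
SPARQL_RESULTS_NS = "http://www.w3.org/2005/sparql-results#"


def ask_result(xml_text: str) -> bool:
    """Return the boolean answer carried by an ASK query response."""
    root = ET.fromstring(xml_text)
    node = root.find(f"{{{SPARQL_RESULTS_NS}}}boolean")
    if node is None or node.text is None:
        raise ValueError("not an ASK result: no <boolean> element found")
    return node.text.strip().lower() == "true"
```

A result of `False` for the ASK query above means no bad triples are left.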

Well done!

Searching the BBC Data in the Talis Platform

I’ve previously blogged about how easy it is to create a custom search index using the Platform. So obviously during the process of loading the BBC programmes and music data into the Platform we’ve used this feature to build a search engine across their data.

In this post I wanted to show a few example queries and then review how we’ve configured the search indexes so you can not only get the most from the feature, but also see how it can be used against real-world data.

Sample Queries

Here are some sample queries. The Platform is more of a search engine tool-kit than a search engine per se: the results aren’t a human-readable web page, they’re an RSS 1.0 document that contains enough structured metadata about each item in order to build a presentation of the results. And where additional metadata is required, this can be extracted using the describe service, additional searches, augmentation or a SPARQL query.

However, for the purposes of this article it’s enough to view the examples in your browser. Application developers will want to dig into the underlying markup to see what extra data is included.

  • A search for “Banksy”
  • A search for “The Prodigy”, returning the artist, the dbpedia entry, and episode titles and descriptions in which they are mentioned
  • A search for “Terry Pratchett”, which again produces a mixture of different types
  • A search for “Prodigy” limited to things of type http://purl.org/stuff/rev#Review (results)
  • A facetted search for “Prodigy”, grouping the results by their RDF type (results). This shows us that we have results not only in episodes but in a variety of other types too. We can drill down into these to form the following search:
  • A search for “Prodigy” limited to Music Segments (results)

If you want to try out your own queries, then use this simple form.

The Configuration

To show how we’ve configured the Field Predicate Map and Query Profile for the BBC Backstage store, I’ve uploaded them to our public SVN: fmap.rdf and queryprofile.rdf

Looking at the Field Predicate Map, you can see we’ve configured the Platform store to index the key predicates in the BBC data, including titles, labels, descriptions and synopses. You can use any of the named fields in the configuration to refine searches to specific predicates in the data, allowing construction of an “advanced search form”. E.g. we can search for name:"Stephen Fry" to find a person called Stephen Fry (results).
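For anyone building such a form, here is a minimal Python sketch of putting a fielded query into a search URL. It assumes the `items?query=` endpoint pattern used for Platform stores and the `bbc-backstage` store name; `field_query` and `search_url` are hypothetical helper names.

```python
from urllib.parse import urlencode


def field_query(field: str, phrase: str) -> str:
    """Build a fielded phrase query such as name:"Stephen Fry"."""
    return f'{field}:"{phrase}"'


def search_url(store: str, query: str) -> str:
    """Build an items search URL, percent-encoding the query string."""
    base = f"http://api.talis.com/stores/{store}/items"
    return base + "?" + urlencode({"query": query})


url = search_url("bbc-backstage", field_query("name", "Stephen Fry"))
```

Encoding matters here because the colon, quotes and space in the query are not safe to paste into a URL as they are.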

The RDF type property is also included in the Field Predicate Map to allow us to limit searches to particular types of resource. It also enables us to do facetted searches based on type, giving us an alternate view of the data. It’s easy to see how that functionality could be used to help build some useful additional options to restrict the search results presented in a user interface.

To configure the relevance ranking we’ve chosen to boost hits in “labels” (names, labels, titles) over “descriptions” (descriptions, synopses, reviews). We could easily change the boosting to favour one or other type of predicate to further tweak the results. But this configuration provides a reasonable set of search results for the tests we’ve done. Let us know how you get on and whether you think any of this should be changed. We’re happy to alter the configuration to make sure that people can get the most from the BBC data.

Fishing for BBC Data using Augmentation

In some of my recent talks I’ve used the metaphor of streams, pools and reservoirs for describing the flow and collection of data across the web. I usually refer to some of the different forms of data extraction that we support on the Platform, which cover keyword searching as well as more structured queries.

Another form of data extraction is the Augmentation Service, which might be described as “fishing for data, using URIs as bait”. I thought I’d put together a little illustration that shows the potential for this kind of data extraction, as it’s both powerful and simple to use: so simple that you don’t need to write any queries at all.

Let’s look at a sample RSS 1.0 feed that contains a review of an episode of Dr Who. For brevity, I’ll only include the metadata for the single item in the feed:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
  <rev:text>A classic episode of Dr Who...</rev:text>
  <foaf:primaryTopic rdf:resource="http://www.bbc.co.uk/programmes/b0074gpl#programme"/>
</item>

The item has the standard RSS 1.0 elements for title and link, but as the item is also a review, it includes some additional metadata using the review vocabulary. The relationship between the review item and the Episode that is being reviewed is made using the foaf:primaryTopic property. The precise vocabularies don’t really matter; the important thing is that there is a reference to a BBC /programmes URI: this is our bait.

The Augmentation Service allows the URL of an RSS 1.0 feed to be passed in as a parameter. You can use the form provided by the augment service on the BBC Backstage store and paste in the URL of the sample RSS 1.0 feed, or click here to review the results. Within the browser you won’t see that a great deal has changed, although you should see that the results are themselves an RSS 1.0 feed. What the Augmentation Service does is process an RSS feed to augment the metadata in the feed items against data present in the Platform Store.

Here’s the same RSS item after it’s been augmented; the additional metadata is the Episode description now nested inside foaf:primaryTopic:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <foaf:primaryTopic>
    <ns.0:Episode rdf:about="http://www.bbc.co.uk/programmes/b0074gpl#programme">
      <ns.0:medium_synopsis>In an old, abandoned house, the Weeping Angels wait.
      Only the Doctor can stop them, but he's lost in time.</ns.0:medium_synopsis>
      <rdf:type>
        <rdf:Description rdf:about="http://purl.org/ontology/po/Episode"/>
      </rdf:type>
      <ns.0:position>10</ns.0:position>
      <ns.0:short_synopsis>Only the Doctor can stop the Weeping Angels, but he's lost in time.</ns.0:short_synopsis>
      <ns.0:genre>
        <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/genres/drama/scifiandfantasy#genre"/>
      </ns.0:genre>
      <ns.0:microsite>
        <rdf:Description rdf:about="http://www.bbc.co.uk/doctorwho/"/>
      </ns.0:microsite>
      <ns.0:version>
        <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b0073km9#programme"/>
      </ns.0:version>
      <foaf:depiction>
        <rdf:Description rdf:about="http://www.bbc.co.uk/iplayer/images/episode/b0074gpl_512_288.jpg"/>
      </foaf:depiction>
      <ns.1:label>Blink</ns.1:label>
      <ns.0:masterbrand>
        <rdf:Description rdf:about="http://www.bbc.co.uk/bbcone#service"/>
      </ns.0:masterbrand>
      <dc:title>Blink</dc:title>
    </ns.0:Episode>
  </foaf:primaryTopic>
  <rev:text>A classic episode of Dr Who...</rev:text>
  <rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
</item>

As you can see, the feed now includes all of the key metadata about the episode, including its title, a synopsis, a link to a depiction of the episode, and a link to the Dr Who microsite on the BBC. All without writing any queries.

The trigger for the augmentation to look up the data is simply the presence of a URI in the feed that is also present in the RDF in the Platform Store. If the URI is not found then it is ignored. But if the URI is present then a description of that resource is automatically added to the RSS feed. In formal RDF terms that description is the Concise Bounded Description of the resource. More simplistically, it will be all the simple literal properties associated with the resource (e.g. the title and the synopsis) plus links to any related resources (e.g. the microsite, the genre). The end result is a feed that has been either completely or partially enriched against the data.
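The matching rule itself is simple enough to model in a few lines of Python. To be clear, this is a toy model and not the Platform’s implementation: the store is an ordinary dictionary keyed by URI, each description is a flat dictionary of made-up property values, and `augment` is an illustrative name.

```python
from typing import Dict, Iterable

Description = Dict[str, str]


def augment(feed_uris: Iterable[str],
            store: Dict[str, Description]) -> Dict[str, Description]:
    """Return a description for every feed URI the store knows about.

    URIs that are not in the store are simply ignored.
    """
    return {uri: store[uri] for uri in feed_uris if uri in store}


# Toy store holding one resource; the property value is illustrative only.
store = {
    "http://www.bbc.co.uk/programmes/b0074gpl#programme": {"dc:title": "Blink"},
}
feed_uris = [
    "http://www.example.org/reviews/1",                    # unknown: ignored
    "http://www.bbc.co.uk/programmes/b0074gpl#programme",  # known: described
]
augmented = augment(feed_uris, store)
```

Only the programme URI picks up a description; the review’s own URI passes through untouched because the toy store has nothing to say about it.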

This kind of data augmentation is uniquely possible with RDF because of its reliance on URIs for global identifiers. It makes dipping into a pool of data very easy to do. It’s also possible to augment a feed against multiple stores, pipelining the augmentation across multiple datasets to gather up all of the relevant data. As the output of a search against a Platform store is also RSS 1.0, you can enrich search results against multiple stores starting from an initial keyword search.

You can also see how this kind of enrichment can be used as part of, e.g., a Yahoo Pipeline. This is the primary reason why the service has initially been designed to work on RSS 1.0 feeds: it’s well supported, easy to generate and, of all the varieties of RSS, RSS 1.0 is processable as both an RDF and an XML vocabulary, making it easy to process in this context. We are intending to expand the support to cover generic RDF input and output, and other flavours of RSS.

In the meantime, happy data fishing!

Augmenting Last.fm Data with BBC data on the Talis Platform

A short while back, I created a Linked Data wrapper on the Last.FM API for Events and Artists. The artist data links to the BBC’s data about each artist using owl:sameAs.

Now that the BBC RDF is available in a Talis Platform store, I can put some of my Last.FM data into a store (it’s currently generated on the fly from the Last.FM API), search on it, and then augment it with data from the BBC.

So I put some Last.FM data into the Sandbox1 store.

Now I can search on it with the items query endpoint like:

http://api.talis.com/stores/sandbox1/items?query=Black

This gives us the results as RSS 1.0, which is also RDF/XML, and contains a graph with 12 resources in it.

We can now pass the URI of this (or any RSS 1.0) document to the BBC-Backstage store’s Augment Service like this:

http://api.talis.com/stores/bbc-backstage/services/augment?data-uri=http%3A%2F%2Fapi.talis.com%2Fstores%2Fsandbox1%2Fitems%3Fquery%3DBlack

The Augment service will look at the URIs in the RSS results, and add DESCRIBEs for any of those URIs that it finds in its own store, giving you back the RSS augmented with BBC data.
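Building that request URL by hand is error-prone, because the feed URL has to be percent-encoded before it goes into the `data-uri` parameter. A small Python sketch of the encoding step (`augment_url` is a made-up helper name; the endpoint pattern is taken from the URL above):

```python
from urllib.parse import urlencode


def augment_url(store: str, data_uri: str) -> str:
    """Build an Augment Service URL with the feed URL safely encoded."""
    base = f"http://api.talis.com/stores/{store}/services/augment"
    return base + "?" + urlencode({"data-uri": data_uri})


url = augment_url("bbc-backstage",
                  "http://api.talis.com/stores/sandbox1/items?query=Black")
```

The same helper works for any RSS 1.0 document URL, including the results of another store’s search.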

So the graph we get back now contains 15 resources, where the BBC-Backstage store has found descriptions for 3 of the URIs in the original RSS.

For further information, see Leigh Dodds’ slides on Getting Started with the Talis Platform.