Subscribe

Archive for the 'Tips and Tricks' Category

Searching the BBC Data in the Talis Platform

I’ve previously blogged about how easy it is to create a custom search index using the Platform. So obviously during the process of loading the BBC programmes and music data into the Platform we’ve used this feature to build a search engine across their data.

In this post I wanted to show a few example queries and then review how we’ve configured the search indexes so you can not only get the most from the feature, but also see how it can be used against real-world data.

Sample Queries

Here are some sample queries. The Platform is more of a search engine tool-kit than a search engine per se: the results aren’t a human-readable web page, they’re an RSS 1.0 document that contains enough structured metadata about each item in order to build a presentation of the results. And where additional metadata is required, this can be extracted using the describe service, additional searches, augmentation or a SPARQL query.

However for the purposes of this article its enough to view the example in your browser. Application developers will want to dig into the underlying markup to see what extra data is included.

  • A search for “Banksy
  • A search for “The Prodigy” — returning the artist, the dbpedia entry, and episode titles and descriptions in which they are mentioned
  • A search for “Terry Pratchett” — again produces a mixture of different types.
  • A search for “Prodigy” limiting to things that are of type “”http://purl.org/stuff/rev#Review” — Results.
  • A facetted search for “Prodigy” grouping the results based on their RDF type — Results. This shows us that we have results in not only episodes but in a variety of other types too. We can drill down these into form the following search:
  • A search for “Prodigy” limits to Music Segments. Results.

If you want to try out your own queries, then use this simple form.

The Configuration

To show how we’ve configured the Field Predicate Map and Query Profile for the BBC Backstage store, I’ve uploaded them to our public SVN: fmap.rdf and queryprofile.rdf

Looking at the Field Predicate Map, you can see we’ve configured the Platform store to index the key predicates in the BBC data, including titles, labels, descriptions and synopses. You can use any of the named fields in the configuration to refine searches to specific predicates in the data, allowing construction of an “advanced search form”. E.g. we can search for name:”Stephen Fry” to search for a person called Stephen Fry (results).

The RDF type property is also included in the Field Predicate Map to allow us to limit searches to particular types of resource, it also enables us to do facetted searches based on type, giving us an alternate view of the data. Its easy to see how that functionality could be used to help build some useful additional options to restrict the search results presented in a user interface.

To configure the relevance ranking we chosen to boost hits in “labels” (names, labels, titles) over “descriptions” (description, synopses, review). We could easily change the boosting to favour one or other type of predicate to further tweak the results. But this configuration provides a reasonable set of search results for the tests we’ve done. Let us know how you get on and whether you think any of this should be changed. We’re happy to alter the configuration to make sure that people can get the most from the BBC data.

Fishing for BBC Data using Augmentation

In some of my recent talks I’ve used the metaphor of streams, pool and reservoirs for describing the flow and collection of data across the web. I usually refer to some of the different forms of data extraction that we support on the Platform, which covers keyword searching as well as more structured queries.

Another form of data extraction is the Augmentation Service is what might be described as “fishing for data, using URIs as bait”. I thought I’d put together a little illustration that shows the potential for this kind of data extraction, as its both powerful and simple to use — so simple that you don’t need to write any queries at all.

Lets look at a sample RSS 1.0 feed that contains a review of an episode of Dr Who. For brevity, I’ll only include the metadata for the single item in the feed:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
  <rev:text>A classic episode of Dr Who...</rev:text>
  <foaf:primaryTopic rdf:resource="http://www.bbc.co.uk/programmes/b0074gpl#programme"/>
</item>

The item has the standard RSS 1.0 elements for title and link, but as the item is also a review, it also includes some additional metadata using the review vocabulary. The relationship between the review item and the Episode that is being reviewed is made using the foaf:primaryTopic property. The precise vocabularies don’t really matter, the important thing is that there is a reference to an BBC /programmes URI: this is our bait.

The Augmentation Service allows the URL of an RSS 1.0 feed to be passed in as a parameter. You can use the form provided from the augment service on the BBC Backstage store and paste in the URL of the sample RSS 1.0 feed, or click here to review the results. Within the browser you won’t see that a great deal as changed, although you should see that that the results are themselves an RSS 1.0 feed. What the Augmentation service does is process an RSS feed to augment the metadata in the feed items against data present in the Platform Store.

Here’s the same RSS item after its been augmented, with the additional metadata shown in red:

<item rdf:about="http://www.example.org/reviews/1">
  <title>Review of "Blink"</title>
  <link>http://www.example.org/reviews/1</link>
  <foaf:primaryTopic>
 <ns.0:Episode rdf:about="http://www.bbc.co.uk/programmes/b0074gpl#programme">
  <ns.0:medium_synopsis>In an old, abandoned house, the Weeping Angels wait.
  Only the Doctor can stop them, but he's lost in time.</ns.0:medium_synopsis>
  <rdf:type>
    <rdf:Description rdf:about="http://purl.org/ontology/po/Episode"/>
  </rdf:type>
  <ns.0:position>10</ns.0:position>
  <ns.0:short_synopsis>Only the Doctor can stop the Weeping Angels, but he's lost in time.</ns.0:short_synopsis>
  <ns.0:genre>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/genres/drama/scifiandfantasy#genre"/>
  </ns.0:genre>
  <ns.0:microsite>
    <rdf:Description rdf:about="http://www.bbc.co.uk/doctorwho/"/>
  </ns.0:microsite>
  <ns.0:version>
    <rdf:Description rdf:about="http://www.bbc.co.uk/programmes/b0073km9#programme"/>
  </ns.0:version>
  <foaf:depiction>
    <rdf:Description rdf:about="http://www.bbc.co.uk/iplayer/images/episode/b0074gpl_512_288.jpg"/>
  </foaf:depiction>
  <ns.1:label>Blink</ns.1:label>
  <ns.0:masterbrand>
    <rdf:Description rdf:about="http://www.bbc.co.uk/bbcone#service"/>
  </ns.0:masterbrand>
  <dc:title>Blink</dc:title>
 </ns.0:Episode>
</foaf:primaryTopic>
<rev:text>A classic episode of Dr Who...</rev:text>
<rev:title>Review of Dr Who Series 3, Episode 10 "Blink"</rev:title>
</item>

As you can see the feed now includes all of the key metadata about the episode, including its title, a synopsis, a link to a depiction of the episode, and to the Dr Who microsite on the BBC. All without writing any queries.

The trigger for the augmentation to looking up the data is simply the presence of a URI in the feed, that is also present in the RDF in the Platform Store. If the URI is not found then it is ignored. But if the URL is present then a description of that resource is automatically added to the RSS feed. In formal RDF terms that description is the Concise Bounded Description of the resource. More simplistically it will be all simple literal properties associated with the resource (e.g. the title and the synopsis) plus links to any related resources (e.g. the microsite, the genre). The end result is a feed that has been either completely or partially enriched against the data.

This kind of data augmentation is uniquely possible with RDF because of its reliance on URIs for global identifiers. Its makes dipping into a pool of data very easy to do. It’s also possible to augment a service against multiple stores, pipelining the augmentation across multiple datasets to gather up all of the relevant data. As the output of a search against a Platform store is also RSS 1.0, you can enrich search results against multiple stores starting from an initial keyword search.

You can also see how this kind of enrichment can be used as part of, e.g. a Yahoo Pipeline. This is the primary reason why the service has been initially designed to work on RSS 1.0 feeds — its well supported; easy to generate; and of all the varieties of RSS, RSS 1.0 is processable as both an RDF and an XML vocabulary, making it easy to process in this context. We are intending to expand the support to cover generic RDF input and output, and other flavours of RSS.

In the meantime, happy data fishing!

Augmenting Last.fm Data with BBC data on the Talis Platform

A short while back, I created a Linked Data wrapper on the Last.FM API for Events and Artists. The artist data links to the BBC’s data about each artist using owl:sameAs.

Now that the BBC RDF is available in a Talis Platform store, I can put some of my Last.FM data into a store (it’s currently generated on the fly from the Last.FM API), search on it, and then augment it with data from the BBC.

So I put some Last.FM data into the Sandbox1 store.

Now I can search on it with the items query endpoint like:

http://api.talis.com/stores/sandbox1/items?query=Black

This gives us the results as RSS 1.0, which is also RDF/XML, and contains a graph with 12 resources in it.

We can now pass the URI of this (or any RSS 1.0) document to the BBC-Backstage store’s Augment Service like this:

http://api.talis.com/stores/bbc-backstage/services/augment?data-uri=http%3A%2F%2Fapi.talis.com%2Fstores%2Fsandbox1%2Fitems%3Fquery%3DBlack

The Augment service will look at the URIs in the RSS results, and add DESCRIBEs for any of those URIs that it finds in its own store, giving you back the RSS augmented with BBC data.

So the graph we get back now contains 15 resources, where the BBC-Backstage store has found descriptions for 3 of the URIs in the original RSS.

For further information, see Leigh Dodd’s slides on Getting Started with the Talis Platform.

SPARQL AJAX Client Library and Example

Over the past few years I’ve tinkered with a number of different implementations of an AJAX client library for SPARQL. Before a standard format for SPARQL JSON results was created, this involved having to jump through the extra hoops of parsing the XML format. But things are much easier now, especially when the JSON support is extended to include the results of CONSTRUCT and DESCRIBE queries.

My personal favourite SPARQL client library though is the one produced by Lee Feigenbaum, Elias Torres, and Wing Yung as part of their work on the SPARQL Calendar Demo.

While the sparql.js library only supports JSON it does have a few convenience features which I like, including global PREFIX bindings and some functions for automatically processing the JSON results to produce some simpler javascript objects (e.g. arrays and hashes) that simplify some scripting tasks and make code more readable.

Using this on the Platform is quite straight-forward, as you can upload this library, and any other related Javascript files directly into the Contentbox of your store. This not only avoids any cross-domain issues, but also means that you can deploy simple AJAX applications directly from a store.

I’ve put together a super simple demo that uses the NASA spaceflight data. The source code is here, and I’ve uploaded the two files into the n2-examples store contentbox, so you can play with the running application.

The demo simply fetches the name, homepage, description and launch date for every spacecraft launched in a particular year, also retrieving a link to a photo if there’s one available. The results are dropped into an HTML table for viewing.

The code is well commented so rather than repeat that here, you can look through the Javascript file that does the actual interaction. I’ve used JQuery to help with the DOM manipulation, etc. This is delivered through the Google JQuery CDN rather than the Platform. But the rest of the application is served directly from the Platform.

A rather easy and trivial example, but sometimes its useful to reiterate the basics. And if you want to incorporate the NASA spaceflight data in your own mashups, then you can do so easily by simple using the version of sparql.js in the space data store.

In my view, SPARQL + JSON + scripting languages like JS and Ruby hit a nice sweet spot for working with RDF, especially with the ability to bring together data from multiple sources using a single standard API.

Note: Keith Alexander has written up some of his own experiments with playing with JQuery against the platform here and here. His JQuery plugin provides some additional Platform specific functionality.

Quick OpenCalais Hack

I’ve been doing some more work on the Ruby client for the Platform recently, and one of my main goals is to provide functionality that makes it easier to copy, merge, interlink and relate together datasets. So far I’ve been concentrating on providing some framework code to make it easier to mash-up data across SPARQL endpoints, but there are many more services that one might want to use when enriching a dataset.

One of those services is OpenCalais. I’ve played with the service on and off, and have previously built a Java client to the service to explore similar functionality using Java and Jena. But as I’m primarily working with Ruby at the moment, I thought I’d look for a Ruby client for Calais. Happily there is one on Github and its available as a Ruby gem.

Documentation is a bit light, and I had to jump through a few hoops to get it working, needing to manually install the curb gem and some native libraries, the following worked for me:

sudo apt-get install libcurl3-dev
sudo gem install curb
sudo gem install calais

With that installed it was a breeze to run a document through the OpenCalais service, and then store the resulting RDF in the Platform:


# Use OpenCalais to find entities in a document specified on the command-line, then store the results
# in a Platform store
#
# Set the following environment variables:
#
# TALIS_USER:: username on Platform
# TALIS_PASS:: password
# TALIS_STORE:: store in which data will be stored
# CALAIS_KEY:: Calais license key
require 'rubygems'
require 'pho'
require 'calais'

store = Pho::Store.new(ENV["TALIS_STORE"], ENV["TALIS_USER"], ENV["TALIS_PASS"])
content = File.new(ARGV[0]).read()
resp = Calais.enlighten( :content => content, :content_type => :text, :license_id => ENV["CALAIS_KEY"])
resp = store.store_data(resp)
puts resp.status

The code is here, and here’s some sample input and sample output.

The code is pretty trivial and error handling is non-existent, but I was pleased with how easy it was to get some data out of OpenCalais and pushed into the Platform. A bit of SPARQL can then be used to do some analysis or further processing of the results

So how do I plan to use this?

As a personal project I’m building out a dataset of NASA space-flight data, this will also include some metadata about astronauts and their roles on each mission. What I want to do is take some documents from the web and then store additional data to state relationships like “Buzz Aldrin is the foaf:primaryTopic of this document”.

The workflow I’m considering is using a Google custom search to give me a high-level index of content, e.g. selecting only the NASA websites. I can then run some representative searches to find documents use OpenCalais to do entity extraction on each result. I can then store the OpenCalais RDF data in the store in a private graph — as I don’t want the raw data in the main dataset — I want to assert triples using my ids and preferred vocabularies.

If the data is in a private graph then I can use the stores’ multisparql service to do some SPARQL queries to match up the resources and CONSTRUCT new triples to store in the public graph.

I’ll post again with some more details on this as I progress, but I thought I’d start out by showing just how simple it is to mashup OpenCalais and the Talis Platform.

Don’t forget, if you want a Platform store to play with for development purposes then drop us a line.

A MalBestPractice with RDF: Making Assumptions

Michael Hausenblas has a new blog post listing some common malpractices when working with RDF.

RDF is a model, not a format

I especially agree with his point about “Thinking of RDF on the serialisation level” (as a malpractice) - grabbing values from RDF/XML or RDFa wih XPath or regexes is not wise. It is making an unsafe assumption about the stability of the serialisation. In fact, if you are writing a Linked Data application, there are very few assumptions you can safely make, about either the serialisation, or the model.

RDF isn’t SQL, XML, OO …

So maybe my favourite MalBestPractising is: trying to treat RDF too much like some other software paradigm - too much like a relational database, too much like OO, too much like XML. It’s enticing to try to write software that treats RDF as if it was something that the mainstream of software development are more familiar with, to try to use the same kind of techniques and shortcuts. But these shortcuts often rely on assumptions that can’t be made about RDF data (at least, not proper, organic, free-range RDF from the web). You can’t assume that the same RDF graph will be serialised the same way as last time. You can’t assume that the http://xmlns.com/foaf/0.1/ namespace will always be bound to the foaf prefix. You can’t assume that a resource will, or won’t have a particular property, just because it has another property, or a particular type. If you don’t know that a statement exists, you can’t assume it doesn’t, only that you don’t know about it. et cetera.

Not making these assumptions can be tedious, and at times problematic, but ultimately, the less assumptions you write into your code, the more interesting, open, and ‘webby’ your application can be.

Less assumption, less code, more data, more web

The huge game-changing thing about web development with the Web of Data though, is not the set of assumptions you can’t make, but the assumptions you don’t have to make . Thanks to the Follow Your Nose principle espoused by Linked Data, you don’t need to write assumptions about your data into your code; you can instead let the application “follow its nose” to find out more about the data.

You can follow vocabulary term URIs to find out how they can be used, how they can be labeled, and what inferences can be drawn from their use. You can follow owl:sameAs and rdfs:seeAlso links to find out more about a resource. You can use semantic index services like Sindice to find occurrences of a URI or keyword across the Web of Data. You can follow dcterms:partOf links from RDF documents back to voiD Datasets, which will often have links you can follow to licenses that tell you how the data can be used, and to other services (such as SPARQL endpoints).

The more data is published, not just within datasets, but about datasets, and about services , the more we can write applications that open up to the web, and the fewer lines of code we will need to do it!

Building a Custom Search Index

The Platform is more than just a triple store with a SPARQL interface. It provides a number of other services which are useful for application developers. The most useful of these is the built-in search engine. Each Platform Store has its own search engine that can be used to perform queries over the hosted metadata. So as well as having the option to query your data using the SPARQL query language, you also have the ability to do simple queries over the data with results being returned as RSS 1.0 (with the OpenSearch extensions). This is a nice feature as sometimes you don’t need the full power of SPARQL and for some use cases a more specialized text indexing system is a better option.

The Platform API allows you to configure the system to build a full-text index over any or all of the RDF literals in your stored data. The exception to this the RDF type predicate, this is the only predicate that will have resource values indexed, making it possible for you to construct a search index and queries that can be used to find matches in specific types of RDF resource.

The remainder of this post shows how to configure the Platform to build a custom search index, with example Ruby code using Pho.

Its common in search engine syntax to use a simple friendly name to identify a specific field that you want to search. For example in a Google search you can use “intitle:Blah” to search for the text “blah” only in the HTML title element of indexed pages. The Platform uses a similar mechanism to allow you to map any RDF property URI to a short friendly name suitable for submitting in a search query.

The complete set of these mappings are referred to as a FieldPredicateMap. The mapping is specific to each store, allowing different stores to have their own mappings. The Platform API exposes these mappings allowing you to retrieve and update the mappings yourself.

It is the presence of a mapping of a property URI to a friendly name that triggers the Platform to start indexing the literal values associated with that property. To put this another way: all you have to do to start indexing your literals is define a mapping. Its a simple as that.

Once a mapping is in place, whenever you submit some RDF/XML to your store, the Platform will automatically index all of the mapped triples. The indexing is done asychronously so there might be a short delay between the deposit of new content and the indexes being updated. Standard stuff.

The Pho Ruby API for the Platform provides programmatic access to this functionality, allowing you to script up the management and creation of the FieldPredicateMap. See the rdocs for the FieldPredicateMap class for details.

Here’s an example Ruby script that illustrates how to manage mappings.

To run the script you’ll first need to fill in the name of your store and your admin username and password. You’ll also need to make sure you’ve installed Pho: gem install pho should do the necessary.

The script does several things. Once a store object has been created, the script creates two new mappings. One for the FOAF name predicate, and one for RDF type:


#create the mappings we want
name = Pho::FieldPredicateMap.create_mapping(store, "http://xmlns.com/foaf/0.1/name", "name")

type = Pho::FieldPredicateMap.create_mapping(store, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "type")

The create_mapping method allows you to quickly generate a mapping suitable for adding to a specific store. In order to fetch the current list of mappings the script then does:


#read the existing mappings
mappings = Pho::FieldPredicateMap.read_from_store(store)

#remove anything for this uri
mappings.remove_by_uri("http://xmlns.com/foaf/0.1/name")
mappings.remove_by_uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

#append the new field-name mappings
mappings << name
mappings << type

The read_from_store method does the actual work, doing a GET request to the Platform to retrieve the mappings as JSON, which are then parsed into some useful Ruby objects. The remaining lines then add the newly created mappings to the current collection, after first ensuring that any previous mappings for those URIs have been removed. At this stage we’ve updated our local copy of the mappings but have not yet saved them back to the Platform.

Storing the updated mappings in the Platform is then just a matter of calling the upload method on the mapping object. This serializes the list of mappings as RDF/XML and then PUTs them back to the store. This will overwrite any of the current configuration with the updated copy we’ve got locally: this is one reason why we fetch the current copy before making the changes, to ensure the rest of the configuration is preserved.


resp = mappings.upload(store)
if resp.status_code != 200
  abort("Failed to upload mappings!")
end

The upload method, like many of the lower-level method calls in the Pho library return an HTTP::Message object that you can inspect to determine if the Platform request was successul.

The remaining lines in the sample script simply upload some test data to your store: astronauts.rdf contains a short list of a few astronauts, modelled as simple foaf:Person instances with a foaf:name property. This allows you to test out your newly created search index.

You can now construct item searches with syntax like “name:Buzz” to search for the name Buzz in any foaf:name predicate. Or you can find all foaf:Person instances by performing a search for:

type:"http://xmlns.com/foaf/0.1/Person"

Note that you have to quote the predicate URI. And you can obviously combine those to find only foaf:Person resources with a specific foaf:name.

I’ve run the script against the n2-examples store, so you can use the item search form to test it out. Or just click here to list all the people.

If you peek at the source of the returned RSS feed you’ll find that the essential metadata for each result — in this case the foaf:name property and the rdf:type — is automatically included. Incidentally if you have a FieldPredicateMapping defined with a property name of title then this will automatically be used as the title for the RSS item, allowing you some minor degree of control over the feed structure if you wanted to make it more human-readable.

The Platform provides you with a few more options for managing your search indexes than I’ve covered here. For example the FieldPredicateMap can also be used to associate an Analyzer with the field allowing you to control the indexing rules. You can also control the relevance ranking of the search results through the use of a Query Profile (which is also exposed through an API, and is manageable using Pho). The query profile lets you associate a weighting with a field, so that when a user performs a search without indicating which field they want to search (thereby searching all fields), then the Platform will alter the relevance ranking of the results to suit your preferences.

That concludes our look at the basic steps involved in building a custom search index over the Platform. While the Pho library provides some useful support its worth remembering that its simply a thin veneer over several HTTP operations so achieving the same effects in another language — or even from the console using plain old curl should be easy enough. Hopefully the examples have also illustrated the simplicity of working with the Platform to create some quite powerful features and, importantly, that developing against the Platform doesn’t require your to be a SPARQL wizard: there are other ways to get data out of the system, but the power of SPARQL is there when you need it.

Any questions, then leave a comment and I’ll try to answer them.

Vocabify: Instance Data -> Vocab

One thing about writing RDF vocabularies that occurred to me listening to people talk at VoCamps (Oxford and Galway), is that typically what you are trying to do isn’t defining new terms, it’s modeling data, and at some stage in the modeling you discover you need to write a new vocabulary. Vocabulary authors often want to describe how their terms can best be used with existing complimentary vocabularies, like FOAF and Dublin Core, but the only commonly practiced way of doing so is to put it in human-readable form in the documentation annotations. In voiD, we wrote a guide, principally because we wanted to describe how the terms ought to be used together with existing vocabulary terms.

In tandem with this thought, when sketching out vocabularies myself, I tend not to start out by defining Classes and Properties, which is both tediously repetitive, and a step removed from the data-modeling (which is what I’m actually trying to do in the first place). Instead, I define a prefix for a new namespace, and pretend a vocabulary already exists at it. Probably quite a lot of people do this. I think of them as “pretend schemas“; I’ve heard ldodds call them “just in time schemas” (only bother to write it when someone actually asks to see it).

So last night I coded up Vocabify, which you can feed some instance data that uses your “just in time vocabulary“, tell it which namespace URI is the pretend one, and it will generate a schema from the instance data, which you can then edit and publish.

The classes and properties are also linked to the instances they are generated from with ov:exampleResource, so it is clear to readers how they can be used together with other properties.

Tip: Mirroring a directory of RDF/XML into a Platform Store

When converting data to RDF whether as a result of scraping it from the web; locally analysing a dataset; or simply dumping data from a database. I very often collect the data into a number of RDF/XML files before loading it into a Platform store. Loading the store is de-coupled from the data munging process which typically goes through rapid cycles of development. So I only occasionally publish the data into the Platform to publish it for others to use, or to test within an application.

Whilst writing Pho one of my motivations was to make it easier to support this kind of workflow. For example I’ve made sure that its easy to submit jobs to the Platform to allow a store to be reset before being re-populated. This is as simple as:


store = Pho::Store.new("http://api.talis.com/stores/xyz", "user", "pass")
store.reset()

In the next release I’ll add the necessary support for polling the returned job metadata so that client code can wait until the job is completed before continuing.

But there’s already one useful chunk of code in the form of the RDFCollection class. This provides a simple utility for easily mirroring a directory full of RDF/XML documents into a Platform store. It handles and captures errors; has some support for allowing a load to be resumed if it has to be killed; and checks for new files so it only has to load new content.

Here’s a trivial example of how it works. Lets assume that the directory /example/rdf contains two RDF files: good.rdf and bad.rdf. As the name suggests the second of these files is malformed, so will be rejected by the Platform. If I want to load the contents of the directory into the store I can write:


require 'rubygems'
require 'pho'

#use your own store name and credentials
store = Pho::Store.new("http://api.talis.com/stores/xyz", "user", "pass")

collection = Pho::RDFCollection.new(store, "/example/rdf")
collection.store()
puts collection.summary()

This will POST all the RDF/XML found in the directory and print a simple summary at the end, something like:


/example/rdf contains 2 files: 1 stored, 1 failed, 0 new

The summary indicates that, as expected, one of the files was stored and one was rejected. If you were to look in the (imaginary!) directory you’d now see a couple of extra files: good.ok and bad.fail. The RDFCollection code uses “.ok” files to note which files have been stored and “.fail” files to indicate rejections. The fail files contain a dump of the platform HTTP response. Any files that aren’t accompanied by either of these are considered to be new, i.e. ripe for submission.

If you were to re-run the above script only new files would be resubmitted. If you wanted to re-try failures then you could use:


collection.retry_failures()

While the summary is useful, its more likely that you want to script up a load and then iterate over the list of failures in the directory, perhaps to generate a more complete report. Its easy to do that using:


collection.failures().each do |failed|
...
end

If you want to clear out all of the status tracking files and attempt to resubmit all of the data, then you can just do:


collection.reset()
collection.store()

Obviously there are much slicker ways that the submission status of the files can be tracked and I may well integrate these into later iterations, but I found this to be good enough for a first pass as it supports my initial use case. The technique should also be useful for managing test data when developing against the Platform.

Using Twinkle to SPARQL the Platform

A few years ago I wrote Twinkle, a simple GUI interface for working with SPARQL. While its not the most polished of user interfaces and its in sore need of an update, it’s still serviceable and has been successfully used as a development tool by teams of engineers I’ve worked with in the past.

I gave a short talk on Twinkle at an Oxford SWIG meeting, so you can flick through the slides if you want a quick overview of the functionality. I also moved the code to a google code project to start the process of updating it

Twinkle has the capability to work with a range of different data sources and includes a full SPARQL client, so you can use it to work with any SPARQL endpoint that is accessible from your desktop. Out of the box Twinkle is already configured to work with the Govtrack and DbPedia endpoints, but you can easily add more by changing the configuration.

If you download and unzip the distribution into a directory you should end up with an etc/config.n3 file. This file contains all of the configuration that drives the user interface, including a section that configures remote SPARQL endpoints, e.g:


<http://dbpedia.org/sparql> a sources:Endpoint
    ; sources:defaultGraph "http://dbpedia.org"
    ; rdfs:label "DBpedia.org".

<http://www.rdfabout.com/sparql> a sources:Endpoint
    ; rdfs:label "GovTrack.us".

The above snippet configures two remote endpoints, and applies labels to them so that they appear in the Twinkle UI, under the “Remote Services” section on the left-hand menu. Because some endpoints, such as DbPedia, require to specify a default graph in the SPARQL protocol request, you can also specifiy that in the configuration if necessary.

If you have a Platform Store, or just want to access some data held in the Platform, then you can use Twinkle to perform your SPARQL queries. For example I have a store containing NASA space flight data. The SPARQL endpoint for this store is at:

http://api.talis.com/stores/space/services/sparql

So to register this in Twinkle, I can edit the configuration file and include the following snippet:


<http://api.talis.com/stores/services/sparql> a sources:Endpoint
    ; rdfs:label "NASA Space Data".

Once you’ve restarted the UI you should now be able to click on the Remote “NASA Space Data” service and open up a window into which you can start executing SPARQL queries.

If you’re new to SPARQL, or are interested in playing with the above space data, then you can look over the following slides from a recent SPARQL training session that I ran:


By rob

The slides contain a number of sample queries that should help get you started. Unfortunately some of the diagrams don’t look great in slideshare, but you should be able to download them for a closer look.