
Quick OpenCalais Hack

I’ve been doing some more work on the Ruby client for the Platform recently, and one of my main goals is to provide functionality that makes it easier to copy, merge, interlink and relate together datasets. So far I’ve been concentrating on providing some framework code to make it easier to mash-up data across SPARQL endpoints, but there are many more services that one might want to use when enriching a dataset.

One of those services is OpenCalais. I’ve played with the service on and off, and have previously built a Java client to the service to explore similar functionality using Java and Jena. But as I’m primarily working with Ruby at the moment, I thought I’d look for a Ruby client for Calais. Happily there is one on GitHub and it’s available as a Ruby gem.

Documentation is a bit light, and I had to jump through a few hoops to get it working, needing to manually install the curb gem and some native libraries. The following worked for me:

sudo apt-get install libcurl3-dev
sudo gem install curb
sudo gem install calais

With that installed it was a breeze to run a document through the OpenCalais service, and then store the resulting RDF in the Platform:


# Use OpenCalais to find entities in a document specified on the command-line, then store the results
# in a Platform store
#
# Set the following environment variables:
#
# TALIS_USER:: username on Platform
# TALIS_PASS:: password
# TALIS_STORE:: store in which data will be stored
# CALAIS_KEY:: Calais license key
require 'rubygems'
require 'pho'
require 'calais'

store = Pho::Store.new(ENV["TALIS_STORE"], ENV["TALIS_USER"], ENV["TALIS_PASS"])
content = File.read(ARGV[0])
resp = Calais.enlighten( :content => content, :content_type => :text, :license_id => ENV["CALAIS_KEY"])
resp = store.store_data(resp)
puts resp.status

The code is here, and here’s some sample input and sample output.

The code is pretty trivial and error handling is non-existent, but I was pleased with how easy it was to get some data out of OpenCalais and pushed into the Platform. A bit of SPARQL can then be used to do some analysis or further processing of the results.
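As a first step towards the missing error handling, the script could check its settings before calling either service. This is a minimal sketch using only the standard library; the helper name missing_settings is my own invention and is not part of Pho or the calais gem.

```ruby
# Environment variables the script above expects to be set.
REQUIRED_SETTINGS = %w[TALIS_USER TALIS_PASS TALIS_STORE CALAIS_KEY].freeze

# Returns the names of any required settings that are unset or blank.
# `env` is anything hash-like, so ENV itself can be passed in.
# (Hypothetical helper, for illustration only.)
def missing_settings(env)
  REQUIRED_SETTINGS.select { |name| env[name].to_s.strip.empty? }
end
```

Calling missing_settings(ENV) at the top of the script and aborting with the returned names when the list is non-empty gives a clearer failure than an error part-way through a request.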

So how do I plan to use this?

As a personal project I’m building out a dataset of NASA space-flight data, which will also include some metadata about astronauts and their roles on each mission. What I want to do is take some documents from the web and then store additional data to state relationships like “Buzz Aldrin is the foaf:primaryTopic of this document”.

The workflow I’m considering is using a Google custom search to give me a high-level index of content, e.g. selecting only the NASA websites. I can then run some representative searches to find documents and use OpenCalais to do entity extraction on each result. I can then store the OpenCalais RDF data in the store in a private graph: I don’t want the raw data in the main dataset, as I want to assert triples using my own ids and preferred vocabularies.

If the data is in a private graph then I can use the store’s multisparql service to run some SPARQL queries that match up the resources and CONSTRUCT new triples to store in the public graph.
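To make the idea concrete, here is a rough sketch of the kind of CONSTRUCT query I have in mind, built as a Ruby string. The OpenCalais predicate namespace and the docId, subject and name predicates are assumptions that should be checked against real Calais output. The query matches purely on name literals, and any GRAPH clauses needed by the multisparql service are left out.

```ruby
# Assumed OpenCalais predicate namespace; verify against actual Calais RDF.
CALAIS_PRED = "http://s.opencalais.com/1/pred/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Builds a CONSTRUCT query that links each document to any foaf:Person in
# my own data whose foaf:name equals the name of an extracted entity.
# (Hypothetical helper; the c:docId, c:subject and c:name predicates are
# assumptions about the Calais vocabulary.)
def primary_topic_query(calais_pred = CALAIS_PRED)
  <<~SPARQL
    PREFIX foaf: <#{FOAF}>
    PREFIX c: <#{calais_pred}>
    CONSTRUCT { ?doc foaf:primaryTopic ?person }
    WHERE {
      ?mention c:docId ?doc ;
               c:subject ?entity .
      ?entity c:name ?name .
      ?person a foaf:Person ;
              foaf:name ?name .
    }
  SPARQL
end

puts primary_topic_query
```

Exact literal matching is crude, and a mention is not necessarily a primary topic, so in practice the results would need reviewing before being stored in the public graph.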

I’ll post again with some more details on this as I progress, but I thought I’d start out by showing just how simple it is to mashup OpenCalais and the Talis Platform.

Don’t forget, if you want a Platform store to play with for development purposes then drop us a line.

Pho 0.5

I’ve just sent a quick announcement to the n2-dev mailing list to announce a new release of Pho, the Ruby client for the Talis Platform.

The CHANGES document includes a complete summary of the changes made since the 0.4.1 release. There have been quite a few of these as I’ve added more code for handling submission of directories of files to the Content Box; initial support for Changesets to support updates of RDF graphs, including support for “diffing” two graphs to generate the appropriate changesets for submission to the Platform; as well as a SPARQL protocol client and a number of utility classes to make working with SPARQL queries slightly easier.
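The idea behind diffing two graphs can be sketched with plain set arithmetic. This is an illustration of the general technique rather than Pho’s implementation: triples are modelled as simple three-element arrays, the method name diff_graphs is hypothetical, and blank nodes (which make real graph comparison harder) are ignored.

```ruby
require 'set'

# A triple is modelled as a three-element array: [subject, predicate, object].
# Returns the triples to remove and to add in order to turn `before` into `after`.
# (Hypothetical helper; ignores blank nodes.)
def diff_graphs(before, after)
  old_triples = Set.new(before)
  new_triples = Set.new(after)
  {
    removals:  (old_triples - new_triples).to_a,
    additions: (new_triples - old_triples).to_a
  }
end

before = [["http://example.org/aldrin", "http://xmlns.com/foaf/0.1/name", "Buzz"]]
after  = [["http://example.org/aldrin", "http://xmlns.com/foaf/0.1/name", "Buzz Aldrin"]]
changes = diff_graphs(before, after)
puts "remove: #{changes[:removals].inspect}"
puts "add:    #{changes[:additions].inspect}"
```

The removals and additions are the raw material a changeset would describe.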

The addition of RDF/JSON for serializing results of CONSTRUCT and DESCRIBE queries makes it much easier to work with query results, as the JSON is much simpler to parse.
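To show why RDF/JSON is so convenient, here is a small sketch that flattens a hand-written RDF/JSON document into triples using only the standard json library. The sample data and the helper name triples_from_rdf_json are made up for illustration; the code does not use Pho.

```ruby
require 'json'

# RDF/JSON nests objects under predicates under subjects:
#   { subject => { predicate => [ { "value" => ..., "type" => ... } ] } }
SAMPLE_RDF_JSON = <<~DOC
  {
    "http://example.org/astronauts/aldrin": {
      "http://xmlns.com/foaf/0.1/name": [
        { "value": "Buzz Aldrin", "type": "literal" }
      ],
      "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
        { "value": "http://xmlns.com/foaf/0.1/Person", "type": "uri" }
      ]
    }
  }
DOC

# Flattens an RDF/JSON document into [subject, predicate, value] triples.
# (Hypothetical helper, for illustration only.)
def triples_from_rdf_json(text)
  JSON.parse(text).flat_map do |subject, properties|
    properties.flat_map do |predicate, objects|
      objects.map { |object| [subject, predicate, object["value"]] }
    end
  end
end

triples_from_rdf_json(SAMPLE_RDF_JSON).each { |triple| puts triple.join(" ") }
```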

The latest release does add a dependency on the Redland Ruby language bindings to provide support for parsing of RDF/XML, Turtle and N-Triples. So you’ll need to install the relevant Redland packages to make use of the latest release.

Pho 0.4 — Job Control and Command-Line Tool

I’ve just uploaded the latest release of Pho, the Ruby client for the Talis Platform. The primary change in the 0.4 release is a reworking of the code relating to Snapshots and Jobs to provide access to the detailed job lifecycle data that was added in Release 21 of the Platform.

Because jobs are executed asynchronously, the API now includes code to wait for a job to finish, with an option to monitor the progress updates.

For example, the following code fragment illustrates how to submit a reindex job and then report on its progress:


store = Pho::Store.new("http://api.talis.com/stores/my-store", "user", "pass")
resp = Pho::Jobs.submit_reindex(store)
job = Pho::Jobs.wait_for_submitted(resp, store) do | job, message, time |
  puts "#{time} #{message}"
end

The wait_for_submitted method takes the response from submitting a job to the Platform and then polls the API until the job has completed. When the method returns, it returns a Job object populated with the progress updates; it’s also possible to determine whether the job was successful. If you pass a block to the method then the block will be called for each newly encountered progress update, including the start message and completion message.
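The general pattern here (poll, report only the updates not yet seen, stop when finished) can be sketched without any Platform access. This is not Pho’s code: wait_for_job, the fetch_job callable and the :updates and :finished keys are all hypothetical stand-ins for whatever the real API returns.

```ruby
# Polls `fetch_job` (any callable returning a hash with :updates and
# :finished keys) until the job is finished, yielding each progress
# update exactly once. Hypothetical sketch, not Pho's implementation.
def wait_for_job(fetch_job, interval: 1)
  seen = 0
  loop do
    job = fetch_job.call
    # Only report the updates that appeared since the previous poll.
    job[:updates].drop(seen).each { |update| yield update if block_given? }
    seen = job[:updates].length
    return job if job[:finished]
    sleep(interval)
  end
end

# Simulated job states, standing in for successive API responses.
states = [
  { updates: ["Job started"], finished: false },
  { updates: ["Job started", "Reindexing"], finished: false },
  { updates: ["Job started", "Reindexing", "Job completed"], finished: true }
]
fetch = -> { states.length > 1 ? states.shift : states.first }
wait_for_job(fetch, interval: 0) { |message| puts message }
```

Tracking how many updates have already been reported is what guarantees the block sees each message once, even though every poll returns the full list.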

This release also includes a fledgling command-line application for working with a Platform Store. Once you’ve installed the gem you’ll have a new script called talis_store which you can use to interact with a Platform store. There’s still some work to be done to make it nicer to use, but it covers the majority of the core operations already. To get help on this run:

talis_store help

As a quick example, here’s how to take a snapshot of a store and then download that snapshot to a local directory. In the process of downloading the snapshot the MD5 will be automatically verified:


talis_store backup -u user -p pass -s my-store -d ~/backups
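The MD5 check is a straightforward digest comparison. If you ever need to verify a downloaded file by hand, a minimal sketch with Ruby’s standard digest library looks like this; the method name md5_matches? is my own, and how the expected checksum is obtained is left to the caller.

```ruby
require 'digest/md5'

# Returns true when the file's MD5 hex digest matches the expected value.
# Whitespace and letter case in the expected value are ignored.
# (Hypothetical helper, for illustration only.)
def md5_matches?(path, expected_md5)
  Digest::MD5.file(path).hexdigest == expected_md5.strip.downcase
end
```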

For the next release I’m planning to add support for Changesets, which is a format for describing changes to RDF graphs.

Building a Custom Search Index

The Platform is more than just a triple store with a SPARQL interface. It provides a number of other services which are useful for application developers. The most useful of these is the built-in search engine. Each Platform Store has its own search engine that can be used to perform queries over the hosted metadata. So as well as having the option to query your data using the SPARQL query language, you also have the ability to do simple queries over the data with results being returned as RSS 1.0 (with the OpenSearch extensions). This is a nice feature as sometimes you don’t need the full power of SPARQL and for some use cases a more specialized text indexing system is a better option.

The Platform API allows you to configure the system to build a full-text index over any or all of the RDF literals in your stored data. The exception to this is the RDF type predicate: this is the only predicate that will have resource values indexed, making it possible for you to construct a search index and queries that can be used to find matches in specific types of RDF resource.

The remainder of this post shows how to configure the Platform to build a custom search index, with example Ruby code using Pho.

It’s common in search engine syntax to use a simple friendly name to identify a specific field that you want to search. For example in a Google search you can use “intitle:Blah” to search for the text “blah” only in the HTML title element of indexed pages. The Platform uses a similar mechanism to allow you to map any RDF property URI to a short friendly name suitable for submitting in a search query.

The complete set of these mappings is referred to as a FieldPredicateMap. The mapping is specific to each store, allowing different stores to have their own mappings. The Platform API exposes these mappings, allowing you to retrieve and update them yourself.

It is the presence of a mapping of a property URI to a friendly name that triggers the Platform to start indexing the literal values associated with that property. To put this another way: all you have to do to start indexing your literals is define a mapping. It’s as simple as that.

Once a mapping is in place, whenever you submit some RDF/XML to your store, the Platform will automatically index all of the mapped triples. The indexing is done asynchronously so there might be a short delay between the deposit of new content and the indexes being updated. Standard stuff.
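The relationship between mappings and indexing can be illustrated with a toy in-memory index. This is only a conceptual sketch and says nothing about how the Platform implements its search engine. The FIELD_MAP hash and build_index method are hypothetical, and tokenising is reduced to lowercasing and splitting on whitespace.

```ruby
# Property URI => friendly field name, in the spirit of a FieldPredicateMap.
FIELD_MAP = {
  "http://xmlns.com/foaf/0.1/name" => "name"
}.freeze

# Builds a toy index of field name => { lowercased term => [subject, ...] },
# using only triples whose predicate has a mapping.
# (Hypothetical sketch; not how the Platform itself works internally.)
def build_index(triples, field_map = FIELD_MAP)
  index = Hash.new { |fields, field| fields[field] = Hash.new { |terms, term| terms[term] = [] } }
  triples.each do |subject, predicate, literal|
    field = field_map[predicate]
    next unless field # unmapped predicates are never indexed
    literal.downcase.split.each { |term| index[field][term] << subject }
  end
  index
end

triples = [
  ["http://example.org/aldrin", "http://xmlns.com/foaf/0.1/name", "Buzz Aldrin"],
  ["http://example.org/aldrin", "http://xmlns.com/foaf/0.1/nick", "Dr. Rendezvous"]
]
index = build_index(triples)
puts index["name"]["buzz"].inspect
puts index.key?("nick")
```

Only the foaf:name literal ends up searchable; the foaf:nick value is ignored until a mapping for that predicate is added.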

The Pho Ruby API for the Platform provides programmatic access to this functionality, allowing you to script up the management and creation of the FieldPredicateMap. See the rdocs for the FieldPredicateMap class for details.

Here’s an example Ruby script that illustrates how to manage mappings.

To run the script you’ll first need to fill in the name of your store and your admin username and password. You’ll also need to make sure you’ve installed Pho: gem install pho should do the necessary.

The script does several things. Once a store object has been created, the script creates two new mappings, one for the FOAF name predicate and one for RDF type:


#create the mappings we want
name = Pho::FieldPredicateMap.create_mapping(store, "http://xmlns.com/foaf/0.1/name", "name")

type = Pho::FieldPredicateMap.create_mapping(store, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "type")

The create_mapping method allows you to quickly generate a mapping suitable for adding to a specific store. In order to fetch the current list of mappings the script then does:


#read the existing mappings
mappings = Pho::FieldPredicateMap.read_from_store(store)

#remove anything for this uri
mappings.remove_by_uri("http://xmlns.com/foaf/0.1/name")
mappings.remove_by_uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

#append the new field-name mappings
mappings << name
mappings << type

The read_from_store method does the actual work, doing a GET request to the Platform to retrieve the mappings as JSON, which are then parsed into some useful Ruby objects. The remaining lines then add the newly created mappings to the current collection, after first ensuring that any previous mappings for those URIs have been removed. At this stage we’ve updated our local copy of the mappings but have not yet saved them back to the Platform.

Storing the updated mappings in the Platform is then just a matter of calling the upload method on the mapping object. This serializes the list of mappings as RDF/XML and then PUTs them back to the store. This will overwrite any of the current configuration with the updated copy we’ve got locally: this is one reason why we fetch the current copy before making the changes, to ensure the rest of the configuration is preserved.


resp = mappings.upload(store)
if resp.status_code != 200
  abort("Failed to upload mappings!")
end

The upload method, like many of the lower-level method calls in the Pho library, returns an HTTP::Message object that you can inspect to determine if the Platform request was successful.

The remaining lines in the sample script simply upload some test data to your store: astronauts.rdf contains a short list of a few astronauts, modelled as simple foaf:Person instances with a foaf:name property. This allows you to test out your newly created search index.

You can now construct item searches with syntax like “name:Buzz” to search for the name Buzz in any foaf:name predicate. Or you can find all foaf:Person instances by performing a search for:

type:"http://xmlns.com/foaf/0.1/Person"

Note that you have to quote the URI. And you can obviously combine those to find only foaf:Person resources with a specific foaf:name.
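If you are generating these searches from code, a small helper can take care of the quoting. The sketch below assumes Lucene-style AND for combining clauses, which you should confirm against the Platform search documentation. The helper names are hypothetical, and embedded double quotes in values are not escaped.

```ruby
require 'uri'

# Formats a single field:value clause, quoting values that contain
# anything other than word characters (URIs, phrases with spaces).
# (Hypothetical helper; does not escape embedded double quotes.)
def field_clause(field, value)
  value.match?(/\A\w+\z/) ? "#{field}:#{value}" : %(#{field}:"#{value}")
end

# Joins clauses with AND; assumes Lucene-style boolean operators.
def item_query(fields)
  fields.map { |field, value| field_clause(field, value) }.join(" AND ")
end

query = item_query({ "name" => "Buzz", "type" => "http://xmlns.com/foaf/0.1/Person" })
puts query
# URL-encode the query before appending it to a search URL.
puts URI.encode_www_form(query: query)
```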

I’ve run the script against the n2-examples store, so you can use the item search form to test it out. Or just click here to list all the people.

If you peek at the source of the returned RSS feed you’ll find that the essential metadata for each result — in this case the foaf:name property and the rdf:type — is automatically included. Incidentally if you have a FieldPredicateMapping defined with a property name of title then this will automatically be used as the title for the RSS item, allowing you some minor degree of control over the feed structure if you wanted to make it more human-readable.

The Platform provides you with a few more options for managing your search indexes than I’ve covered here. For example the FieldPredicateMap can also be used to associate an Analyzer with the field allowing you to control the indexing rules. You can also control the relevance ranking of the search results through the use of a Query Profile (which is also exposed through an API, and is manageable using Pho). The query profile lets you associate a weighting with a field, so that when a user performs a search without indicating which field they want to search (thereby searching all fields), then the Platform will alter the relevance ranking of the results to suit your preferences.

That concludes our look at the basic steps involved in building a custom search index over the Platform. While the Pho library provides some useful support, it’s worth remembering that it’s simply a thin veneer over several HTTP operations, so achieving the same effects in another language (or even from the console using plain old curl) should be easy enough. Hopefully the examples have also illustrated the simplicity of working with the Platform to create some quite powerful features and, importantly, that developing against the Platform doesn’t require you to be a SPARQL wizard: there are other ways to get data out of the system, but the power of SPARQL is there when you need it.
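To give a flavour of what that thin veneer looks like, here is a sketch that builds the GET request for a store’s mappings with Ruby’s standard net/http library, without Pho. The request is constructed but deliberately not sent. The URL layout in the example is an assumption and should be checked against the Platform API documentation, and authentication is omitted entirely.

```ruby
require 'net/http'
require 'uri'

# Builds (but does not send) a GET request for a store's mappings as JSON.
# The URL is supplied by the caller; authentication is left out entirely.
# (Hypothetical helper, for illustration only.)
def mappings_request(fpmap_url)
  request = Net::HTTP::Get.new(URI.parse(fpmap_url))
  request["Accept"] = "application/json"
  request
end

# Assumed URL layout, for illustration only.
request = mappings_request("http://api.talis.com/stores/my-store/config/fpmaps/1")
puts "#{request.method} #{request.path}"
```

Sending it is then a matter of passing the request to a Net::HTTP connection for the same host, once credentials have been added.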

Any questions, then leave a comment and I’ll try to answer them.

Tip: Mirroring a directory of RDF/XML into a Platform Store

When converting data to RDF, whether as a result of scraping it from the web, locally analysing a dataset, or simply dumping data from a database, I very often collect the data into a number of RDF/XML files before loading it into a Platform store. Loading the store is de-coupled from the data munging process, which typically goes through rapid cycles of development. So I only occasionally load the data into the Platform, either to publish it for others to use or to test within an application.

Whilst writing Pho one of my motivations was to make it easier to support this kind of workflow. For example I’ve made sure that it’s easy to submit jobs to the Platform to allow a store to be reset before being re-populated. This is as simple as:


store = Pho::Store.new("http://api.talis.com/stores/xyz", "user", "pass")
store.reset()

In the next release I’ll add the necessary support for polling the returned job metadata so that client code can wait until the job is completed before continuing.

But there’s already one useful chunk of code in the form of the RDFCollection class. This provides a simple utility for easily mirroring a directory full of RDF/XML documents into a Platform store. It handles and captures errors; has some support for allowing a load to be resumed if it has to be killed; and checks for new files so it only has to load new content.

Here’s a trivial example of how it works. Let’s assume that the directory /example/rdf contains two RDF files: good.rdf and bad.rdf. As the names suggest, the second of these files is malformed, so will be rejected by the Platform. If I want to load the contents of the directory into the store I can write:


require 'rubygems'
require 'pho'

#use your own store name and credentials
store = Pho::Store.new("http://api.talis.com/stores/xyz", "user", "pass")

collection = Pho::RDFCollection.new(store, "/example/rdf")
collection.store()
puts collection.summary()

This will POST all the RDF/XML found in the directory and print a simple summary at the end, something like:


/example/rdf contains 2 files: 1 stored, 1 failed, 0 new

The summary indicates that, as expected, one of the files was stored and one was rejected. If you were to look in the (imaginary!) directory you’d now see a couple of extra files: good.ok and bad.fail. The RDFCollection code uses “.ok” files to note which files have been stored and “.fail” files to indicate rejections. The fail files contain a dump of the platform HTTP response. Any files that aren’t accompanied by either of these are considered to be new, i.e. ripe for submission.
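The tracking scheme is simple enough to reproduce with the standard library. The sketch below is not the RDFCollection source: classify_rdf_files and summary are hypothetical helpers that mimic the behaviour described above, classifying each .rdf file by whether a matching .ok or .fail file sits alongside it.

```ruby
require 'tmpdir'
require 'fileutils'

# Classifies each .rdf file in `dir` by its sidecar status file:
# "name.ok" means stored, "name.fail" means rejected, neither means new.
# (Hypothetical helper mimicking the behaviour described above.)
def classify_rdf_files(dir)
  Dir.glob(File.join(dir, "*.rdf")).sort.group_by do |path|
    base = path.sub(/\.rdf\z/, "")
    if File.exist?("#{base}.ok")
      :stored
    elsif File.exist?("#{base}.fail")
      :failed
    else
      :new
    end
  end
end

# Formats the counts in the same shape as the summary line shown above.
def summary(dir)
  groups = classify_rdf_files(dir)
  counts = %i[stored failed new].map { |state| "#{groups.fetch(state, []).length} #{state}" }
  total = groups.values.sum(&:length)
  "#{dir} contains #{total} files: #{counts.join(', ')}"
end

# Demonstration in a throwaway directory with empty placeholder files.
Dir.mktmpdir do |dir|
  %w[good.rdf bad.rdf fresh.rdf good.ok bad.fail].each do |name|
    FileUtils.touch(File.join(dir, name))
  end
  puts summary(dir)
end
```

Keeping the status next to the data means a killed load can be resumed simply by re-running the script, at the cost of littering the directory with extra files.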

If you were to re-run the above script only new files would be resubmitted. If you wanted to re-try failures then you could use:


collection.retry_failures()

While the summary is useful, it’s more likely that you want to script up a load and then iterate over the list of failures in the directory, perhaps to generate a more complete report. It’s easy to do that using:


collection.failures().each do |failed|
...
end

If you want to clear out all of the status tracking files and attempt to resubmit all of the data, then you can just do:


collection.reset()
collection.store()

Obviously there are much slicker ways that the submission status of the files can be tracked and I may well integrate these into later iterations, but I found this to be good enough for a first pass as it supports my initial use case. The technique should also be useful for managing test data when developing against the Platform.

Pho: A Ruby Client for the Talis Platform

This is a short blog post to announce a project I’ve been working on in my spare time. Pho is a Ruby client for the Talis Platform. It’s hosted on Rubyforge so getting started couldn’t be easier:

gem install pho

This will download and install the Pho gem, along with the documentation, which you can also read online.

The distribution comes with a couple of example scripts that show how to add items to the Content Box, perform SPARQL queries, check status of a store, etc.

However, the API currently does a lot more than that, giving you full access to all of the core Platform services including: storing binary data & RDF metadata, SPARQL queries, faceted browsing, job control, store configuration options, etc.

There’s still plenty of work to be done but at version 0.3 I think there’s enough functionality available that you can build useful applications using the API. For example there’s sufficient code there now to use the library to script some simple data management activities for publishing or managing data in the Platform, or to build a simple linked data browser. I hope to post examples of doing exactly those things over the next few weeks, as I’m planning some updates to my space data store (briefly described here) that will be handled using Pho.

The next steps are to plug some of the gaps in the API, specifically parsing of search results, access to the job metadata we exposed in version 21, and better support for changeset management. I’m also going to explore some simple Ruby-RDF mapping functionality. The latter should help turn Pho into something that can provide the core functionality required for building linked data backed applications.

I’d love to get feedback on this, so feel free to post bug reports or feature suggestions either on the Rubyforge project or the n2-dev mailing list.