Subscribe

Quick OpenCalais Hack

I’ve been doing some more work on the Ruby client for the Platform recently, and one of my main goals is to provide functionality that makes it easier to copy, merge, interlink and relate together datasets. So far I’ve been concentrating on providing some framework code to make it easier to mash-up data across SPARQL endpoints, but there are many more services that one might want to use when enriching a dataset.

One of those services is OpenCalais. I’ve played with the service on and off, and have previously built a Java client to the service to explore similar functionality using Java and Jena. But as I’m primarily working with Ruby at the moment, I thought I’d look for a Ruby client for Calais. Happily there is one on Github and its available as a Ruby gem.

Documentation is a bit light, and I had to jump through a few hoops to get it working, needing to manually install the curb gem and some native libraries, the following worked for me:

sudo apt-get install libcurl3-dev
sudo gem install curb
sudo gem install calais

With that installed it was a breeze to run a document through the OpenCalais service, and then store the resulting RDF in the Platform:


# Use OpenCalais to find entities in a document specified on the command-line, then store the results
# in a Platform store
#
# Set the following environment variables:
#
# TALIS_USER:: username on Platform
# TALIS_PASS:: password
# TALIS_STORE:: store in which data will be stored
# CALAIS_KEY:: Calais license key
require 'rubygems'
require 'pho'
require 'calais'

store = Pho::Store.new(ENV["TALIS_STORE"], ENV["TALIS_USER"], ENV["TALIS_PASS"])
content = File.new(ARGV[0]).read()
resp = Calais.enlighten( :content => content, :content_type => :text, :license_id => ENV["CALAIS_KEY"])
resp = store.store_data(resp)
puts resp.status

The code is here, and here’s some sample input and sample output.

The code is pretty trivial and error handling is non-existent, but I was pleased with how easy it was to get some data out of OpenCalais and pushed into the Platform. A bit of SPARQL can then be used to do some analysis or further processing of the results

So how do I plan to use this?

As a personal project I’m building out a dataset of NASA space-flight data, this will also include some metadata about astronauts and their roles on each mission. What I want to do is take some documents from the web and then store additional data to state relationships like “Buzz Aldrin is the foaf:primaryTopic of this document”.

The workflow I’m considering is using a Google custom search to give me a high-level index of content, e.g. selecting only the NASA websites. I can then run some representative searches to find documents use OpenCalais to do entity extraction on each result. I can then store the OpenCalais RDF data in the store in a private graph — as I don’t want the raw data in the main dataset — I want to assert triples using my ids and preferred vocabularies.

If the data is in a private graph then I can use the stores’ multisparql service to do some SPARQL queries to match up the resources and CONSTRUCT new triples to store in the public graph.

I’ll post again with some more details on this as I progress, but I thought I’d start out by showing just how simple it is to mashup OpenCalais and the Talis Platform.

Don’t forget, if you want a Platform store to play with for development purposes then drop us a line.