Quick OpenCalais Hack
I’ve been doing some more work on the Ruby client for the Platform recently, and one of my main goals is to provide functionality that makes it easier to copy, merge, interlink and relate together datasets. So far I’ve been concentrating on providing some framework code to make it easier to mash-up data across SPARQL endpoints, but there are many more services that one might want to use when enriching a dataset.
One of those services is OpenCalais. I’ve played with the service on and off, and have previously built a Java client to the service to explore similar functionality using Java and Jena. But as I’m primarily working with Ruby at the moment, I thought I’d look for a Ruby client for Calais. Happily there is one on GitHub and it’s available as a Ruby gem.
Documentation is a bit light, and I had to jump through a few hoops to get it working, needing to manually install the curb gem and some native libraries. The following worked for me:
sudo apt-get install libcurl3-dev
sudo gem install curb
sudo gem install calais
With that installed it was a breeze to run a document through the OpenCalais service, and then store the resulting RDF in the Platform:
# Use OpenCalais to find entities in a document specified on the command-line, then store the results
# in a Platform store
#
# Set the following environment variables:
#
# TALIS_USER:: username on Platform
# TALIS_PASS:: password
# TALIS_STORE:: store in which data will be stored
# CALAIS_KEY:: Calais license key
require 'rubygems'
require 'pho'
require 'calais'
store = Pho::Store.new(ENV["TALIS_STORE"], ENV["TALIS_USER"], ENV["TALIS_PASS"])
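# Read the document named on the command-line
content = File.read(ARGV[0])
# Ask OpenCalais to extract entities; the result comes back as RDF
rdf = Calais.enlighten( :content => content, :content_type => :text, :license_id => ENV["CALAIS_KEY"])
# Store the RDF in the Platform store
resp = store.store_data(rdf)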
puts resp.status
The code is here, and here’s some sample input and sample output.
The code is pretty trivial and error handling is non-existent, but I was pleased with how easy it was to get some data out of OpenCalais and pushed into the Platform. A bit of SPARQL can then be used to do some analysis or further processing of the results.
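Since the script above skips error handling, here is a minimal sketch of how the same steps might be wrapped with a few basic checks. The helper name `enrich_and_store` and its keyword arguments are hypothetical and not part of pho or calais; the extractor and store are passed in so the checks can be tried out without touching either service. It assumes, as the script above does, that the store's response exposes a `status`, and it treats any 2xx status as success.

```ruby
# Hypothetical helper (not part of pho or calais): runs a file through an
# entity extractor and stores the resulting RDF, failing loudly on bad input.
#
# extractor:: anything responding to call(text) and returning RDF as a String,
#             e.g. a lambda wrapping Calais.enlighten
# store::     anything responding to store_data(rdf) whose result has a
#             status, as the store response does in the script above
def enrich_and_store(path, extractor:, store:)
  raise ArgumentError, "No such file: #{path}" unless File.file?(path)

  content = File.read(path)
  raise ArgumentError, "File is empty: #{path}" if content.strip.empty?

  rdf = extractor.call(content)
  raise "Extractor returned no data" if rdf.nil? || rdf.empty?

  resp = store.store_data(rdf)
  # Assumption: any 2xx status means the store accepted the data.
  unless (200..299).cover?(resp.status)
    raise "Store rejected data with status #{resp.status}"
  end
  resp.status
end
```

To wire this to the real services, the extractor could be a lambda that calls `Calais.enlighten` with the same options as the script above, and the store could be the `Pho::Store` instance created there.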
So how do I plan to use this?
As a personal project I’m building out a dataset of NASA space-flight data. This will also include some metadata about astronauts and their roles on each mission. What I want to do is take some documents from the web and then store additional data to state relationships like “Buzz Aldrin is the foaf:primaryTopic of this document”.
The workflow I’m considering is to use a Google custom search to give me a high-level index of content, e.g. selecting only the NASA websites. I can then run some representative searches to find documents and use OpenCalais to do entity extraction on each result. I can then store the OpenCalais RDF data in a private graph in the store: I don’t want the raw data in the main dataset, because I want to assert triples using my own ids and preferred vocabularies.
If the data is in a private graph then I can use the store’s multisparql service to do some SPARQL queries to match up the resources and CONSTRUCT new triples to store in the public graph.
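As a rough illustration of the kind of CONSTRUCT query this might involve, here is a small sketch that builds one as a string. The helper name `primary_topic_query` and the example document URI are hypothetical. The query assumes that OpenCalais describes people as `em/e/Person` resources with a `c:name` literal, and that my own dataset names astronauts with `foaf:name`; both would need checking against the actual data, and how the query is submitted to the multisparql service is not shown.

```ruby
# Hypothetical helper: builds a CONSTRUCT query that links a document to a
# resource in my own dataset when OpenCalais found a Person with that name.
#
# Assumptions: OpenCalais people are em/e/Person resources with a c:name
# literal, and astronauts in my own dataset carry a matching foaf:name.
def primary_topic_query(document_uri)
  <<~SPARQL
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX c: <http://s.opencalais.com/1/pred/>

    CONSTRUCT {
      <#{document_uri}> foaf:primaryTopic ?astronaut .
    }
    WHERE {
      ?entity a <http://s.opencalais.com/1/type/em/e/Person> ;
              c:name ?name .
      ?astronaut foaf:name ?name .
    }
  SPARQL
end

# Example document URI is made up for illustration.
puts primary_topic_query("http://example.org/docs/apollo-11")
```

This is only a first approximation: it links the document to every person whose name matches exactly, so a real query would probably need to narrow the matches (for example to the most relevant entity) and cope with differences in how names are written.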
I’ll post again with some more details on this as I progress, but I thought I’d start out by showing just how simple it is to mash up OpenCalais and the Talis Platform.
Don’t forget, if you want a Platform store to play with for development purposes then drop us a line.


June 8th, 2009 at 12:22 pm
Leigh:
Tom Tague from OpenCalais here.
Thanks for the post. This is great. We always appreciate it when someone takes the time to publish examples / cookbooks that help jump-start others’ exploration of Calais.
For those of you not directly involved in NASA space flight data, you might see interesting results taking in full text news feeds such as the Huffington Post or the Reuters Spotlight feed (or many others). By automatically adding semantic metadata (entities are interesting, but events like natural disasters, product recalls, management changes and bankruptcies are even more so) you can quickly create a very interesting exploratory database around a particular domain (sports, business, etc). With a combination of SPARQL plus the Calais Linked Data assets you can ask interesting questions like “What natural disasters were mentioned in relation to South America” or “What Sports Teams are mentioned that are located in New York”. You get the idea.
Again, thanks for the examples.
Regards,