Archive for the 'Projects' Category

Store Admin Interface

If you have a Talis store, or even if you’re just interested in browsing around existing Talis stores, you might be interested in an admin interface I’ve been working on.

Once you have selected a store, you can browse resources by type (rdf:type), search across the contentbox index, edit resources, view pending jobs and send new ones, import data, and configure the field-predicate mapping for your stores.

Please send bug reports and feature requests to keith dot alexander at talis.com

If you do want a Talis store, just ask in #talis on irc.freenode.net, or email danny dot ayers at talis.com

Moriarty Development List

I noticed that I was the only one getting notifications of commits to Moriarty’s subversion. I thought the best way to fix that was to create a Google group for Moriarty and ensure the commit reports get sent there. So if you’re interested in keeping track of changes to Moriarty please sign up: moriarty-dev

Exploring OpenLibrary Part One

I thought it was about time I got around to taking a better look at what might be possible with the OpenLibrary data.

My plan is to try and convert it into meaningful RDF and see what we can find out about things along the way. The project is an own-time project mostly, so progress isn’t likely to be very rapid. Let’s see how it goes. I’ll diary here as stuff gets done.

To save me typing loads of stuff out here, today’s source code is tagged and in the n2 subversion as day 1 of OpenLibrary.

Day one, 3rd October 2008, I downloaded the authors data from OpenLibrary and unzipped it. I’m also downloading the editions data from OpenLibrary, but that’s bigger (1.8Gb) so I’m playing with the author data while that comes down the tubes.

The data has been exported by OpenLibrary as JSON, so is pretty easy to work with. I’m going to write some PHP scripts on the command line to mess with it and it looks great for doing that.

Each line of the JSON in the authors file represents a single author, although some authors will have more than one entry. Taking a look at Iain Banks (aka Iain M Banks) we have the following entries:


{"name": "Banks, Iain", "personal_name": "Banks, Iain", "key": "\/a\/OL32312A", "birth_date": "1954", "type": {"key": "\/type\/type"}, "id": 81616}
{"name": "Banks, Iain.", "type": {"key": "\/type\/type"}, "id": 3011389, "key": "\/a\/OL954586A", "personal_name": "Banks, Iain."}
{"type": {"key": "\/type\/type"}, "id": 9897124, "key": "\/a\/OL2623466A", "name": "Iain Banks"}
{"type": {"key": "\/type\/type"}, "id": 9975649, "key": "\/a\/OL2645303A", "name": "Iain Banks         "}
{"type": {"key": "\/type\/type"}, "id": 10565263, "key": "\/a\/OL2774908A", "name": "IAIN M. BANKS"}
{"type": {"key": "\/type\/type"}, "id": 10626661, "key": "\/a\/OL2787336A", "name": "Iain M. Banks"}
{"type": {"key": "\/type\/type"}, "id": 12035518, "key": "\/a\/OL3127859A", "name": "Iain M Banks"}
{"type": {"key": "\/type\/type"}, "id": 12078804, "key": "\/a\/OL3137983A", "name": "Iain M Banks         "}
{"type": {"key": "\/type\/type"}, "id": 12177832, "key": "\/a\/OL3160648A", "name": "IAIN M.BANKS"}

In total the file contains 4,174,245 entries. First job is to get a more manageable set of data to work with. So, I wrote a short script to extract 1 line in every 10 from a file. The resulting sample author data file contains 417,424 entries. This is more manageable for quick testing of what I’m doing.
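The sampling step could look something like the sketch below. It is not the script from subversion: the function name sample_lines is made up, and keeping the last line of each group of ten is an assumption, chosen because it gives 417,424 lines from 4,174,245.

```php
// Hypothetical helper: copy every $n-th line from one stream to another
// and return how many lines were kept.
function sample_lines($in, $out, int $n = 10): int
{
    $kept = 0;
    $lineNumber = 0;
    while (($line = fgets($in)) !== false) {
        $lineNumber++;
        // Keep lines 10, 20, 30, ... (assumption: the last line of each group).
        if ($lineNumber % $n === 0) {
            fwrite($out, $line);
            $kept++;
        }
    }
    return $kept;
}
```

With a final line calling sample_lines(STDIN, STDOUT), the script slots into a pipe alongside grep, sort and friends.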

So now we can start writing some code to produce some RDF. Given the size of these files, I need to stream the data in and out again in chunks. The easiest format I find for that is turtle, which has the added benefit of being human readable. YMMV. Previously I’ve streamed stuff out using n-triples. That has some great benefits too, like being able to generate different parts of the graph, for the same subject, in different parts of the file and then bring them together using a simple command line sort. It’s also a great format for chunking the resulting data into reasonably sized files, as breaking on whole lines doesn’t break the graph, whereas with rdf/xml and turtle it does.

So, I may end up dropping back to n-triples, but for now I’m going to use turtle.

I also like working on the command line and love the unix pipes model, so I’ll be writing the cli (command line) tools to read from STDIN and write to STDOUT so I can mess with the data using grep, sed, awk, sort, uniq and so on.

First things first: let’s find out what’s really in the authors data. Reading the json line by line and converting each line into an associative array is simple in PHP, so let’s do that, keep track of all the keys we find in the arrays and recurse into the nested arrays to look at them, then dump the result out. The arrays contain this set of keys:

alternate_names
alternate_names\0
alternate_names\1
alternate_names\2
alternate_names\3
bio
birth_date
comment
date
death_date
entity_type
fuller_name
id
key
location
name
numeration
personal_name
photograph
title
type
type\key
website
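A rough sketch of that key survey follows. The function names are hypothetical and this is not the code in subversion. Nested keys are joined with a backslash as in the list above, and list positions show up as numeric path segments.

```php
// Hypothetical helper: record every key path found in a decoded JSON array,
// joining nested keys with a backslash.
function collect_keys(array $data, array &$seen, string $prefix = ''): void
{
    foreach ($data as $key => $value) {
        $path = $prefix === '' ? (string) $key : $prefix . '\\' . $key;
        $seen[$path] = true;
        if (is_array($value)) {
            collect_keys($value, $seen, $path);
        }
    }
}

// Read one JSON object per line from a stream and return the sorted key paths.
function keys_in_stream($in): array
{
    $seen = array();
    while (($line = fgets($in)) !== false) {
        $record = json_decode($line, true);
        if (is_array($record)) {   // skip blank or malformed lines
            collect_keys($record, $seen);
        }
    }
    $paths = array_map('strval', array_keys($seen));
    sort($paths, SORT_STRING);
    return $paths;
}
```

Calling keys_in_stream(STDIN) and printing one path per line gives a listing that can be piped through the usual command line tools.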

So, they have names, birth dates, death dates, alternate names and a few other bits and pieces. And they have a ‘key’ which turns out to be the resource part of the OpenLibrary url. That means we can link back into OpenLibrary nice and easy. Going back to our previous Iain Banks examples, we want to create something like this for each one:


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.com/a/OL32312A>
	foaf:name "Banks, Iain";
	foaf:isPrimaryTopicOf <http://openlibrary.org/a/OL32312A>;
	bio:event <http://example.com/a/OL32312A#birth>;
	a foaf:Person .

<http://example.com/a/OL32312A#birth>
	bio:date "1954";
	a bio:Birth .

This gives us a foaf:Person for the author and tracks his birth date using a bio:Birth event. While tracking the birth as a separate entity may seem odd, it gives the opportunity to say things about the birth itself. We’ll model death dates the same way, for the same reason. I’ve written some basic code to generate FOAF from the OpenLibrary authors.

Linking back to the OpenLibrary url has been done here using foaf:isPrimaryTopicOf. I didn’t use owl:sameAs because the url at OpenLibrary is that of a web page, whereas the uri here (http://example.com/a/OL32312A) represents a person. Clearly a person is not the same as a web page that contains information about them.
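For illustration, here is a minimal sketch of turning one decoded author line into turtle. It is not the code in subversion: the function names are hypothetical, the http://example.com base comes from the example above, and the property names are the FOAF terms foaf:name and foaf:isPrimaryTopicOf as defined in the FOAF specification. The @prefix lines are assumed to be written once by the caller.

```php
// Quote and escape a string as a turtle literal.
function turtle_literal(string $value): string
{
    $map = array('\\' => '\\\\', '"' => '\\"', "\n" => '\\n', "\r" => '\\r', "\t" => '\\t');
    return '"' . strtr($value, $map) . '"';
}

// Hypothetical converter: one decoded OpenLibrary author record to turtle.
// Assumes the record always carries a 'key' such as "/a/OL32312A".
function author_to_turtle(array $author, string $base = 'http://example.com'): string
{
    $uri = $base . $author['key'];
    $lines = array();
    if (isset($author['name'])) {
        $lines[] = "\tfoaf:name " . turtle_literal($author['name']) . ';';
    }
    $lines[] = "\tfoaf:isPrimaryTopicOf <http://openlibrary.org" . $author['key'] . '>;';
    if (isset($author['birth_date'])) {
        $lines[] = "\tbio:event <" . $uri . '#birth>;';
    }
    $lines[] = "\ta foaf:Person .";
    $turtle = '<' . $uri . ">\n" . implode("\n", $lines) . "\n";

    // The birth is a resource of its own so that more can be said about it later.
    if (isset($author['birth_date'])) {
        $turtle .= "\n<" . $uri . "#birth>\n"
            . "\tbio:date " . turtle_literal($author['birth_date']) . ";\n"
            . "\ta bio:Birth .\n";
    }
    return $turtle;
}
```

Death dates could be handled with a second block of the same shape.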

The only thing worrying me is that the uris we’re using are constructed from OpenLibrary’s keys. This makes matching them up with other data sources hard. Matching with other data sources requires a natural key, but there’s not enough data in these author entries to create one. The best I can do is to create a natural key that will enable people to discover the group of authors that share a name.


@prefix mine: <http://example.com/mine/schema#> .
<http://example.com/names/banksiain>
	mine:name_of <http://example.com/a/OL32312A>;
	a mine:Name .

These uris will enable me to find authors that share the same name easily, either because they do share the same name or because they’re duplicates. The natural key is simply the author’s name with any casing, whitespace or punctuation stripped out. That might need to evolve as I start looking at the names in more detail later.
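A minimal sketch of that natural key, with hypothetical function names. It lower-cases the name and drops everything that is not a letter or digit. Falling back to strtolower() when the mbstring extension is missing is an assumption that only matters for non-ASCII names.

```php
// Hypothetical helper: the author's name with casing, whitespace and
// punctuation stripped out.
function name_key(string $name): string
{
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($name, 'UTF-8')
        : strtolower($name);
    $key = preg_replace('/[^\p{L}\p{N}]+/u', '', $lower);
    return $key === null ? '' : $key;   // null means the input was not valid UTF-8
}

// Build the name uri used in the example above.
function name_uri(string $name, string $base = 'http://example.com/names/'): string
{
    return $base . rawurlencode(name_key($name));
}
```

Note that "Banks, Iain" and "Iain Banks" still produce different keys (banksiain and iainbanks), which is one of the ways the key might need to evolve.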

Next step is to look in more detail at the dates in here. We have some simple cases of trailing whitespace or trailing punctuation, but also some more interesting cases of approximate dates or possible ranges, which occur for historical authors mostly. The complete list of distinct dates within the authors file is in svn. If you know anything about dates, feel free to throw me some free advice on what to do with them…

Alternative to CURL in Moriarty

I just checked in a small update to Moriarty that might solve a problem some people have experienced using curl. It appears that even though curl implemented support for HTTP digest way back in 2003 with version 7.10.6, it took several more releases to iron out the bugs. The version I develop with, 7.18.0 (also the version installed on Talis application servers), works without issue, but many webhosts have much older versions. In fact my own webhost is still on 7.10.6, which means that digest authentication doesn’t work as expected. To date there has been no workaround.

The latest change to Moriarty adds support for using httpclient, written by Manuel Lemos. This is a complete HTTP client written in PHP. To use digest authentication you also need sasl, which is also written by Manuel Lemos. Moriarty looks for those two classes and uses them if it finds them; otherwise it falls back to using curl as before.

To use httpclient with Moriarty you just need to ensure that http_class and sasl_interact_class are loaded before using any HTTP actions. Adding lines like the following to your index.php (or somewhere similar) should do the trick:

    require_once '/path/to/moriarty/lib/httpclient/http.php';

    require_once '/path/to/moriarty/lib/sasl/sasl.php';
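The fallback described above can be pictured roughly like this. It is an illustrative sketch, not Moriarty’s own code: the function name and its return values are made up, and only the class names http_class and sasl_interact_class come from the text.

```php
// Hypothetical sketch: prefer httpclient when both classes are loaded,
// otherwise fall back to curl.
function choose_http_backend(?callable $classExists = null): string
{
    // The lookup is injectable only so the sketch can be tried without the libraries.
    $classExists = $classExists ?? function (string $name): bool {
        return class_exists($name, false);
    };
    if ($classExists('http_class') && $classExists('sasl_interact_class')) {
        return 'httpclient';
    }
    return 'curl';
}
```

Because the check only looks at what is already loaded, the require_once lines have to run before the first HTTP action.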

About Moriarty… Moriarty is a simple PHP library for accessing the Talis Platform. It follows the Platform API very closely and wraps up many common tasks into convenient classes while remaining very lightweight. It also provides some simple RDF classes that are based on the excellent ARC2 class library. Moriarty is primarily being developed by Ian Davis and is in continual alpha, subject to occasional rapid bursts of change. You can read more about Moriarty on the n² wiki or visit its Google Code project

Moriarty Now Hosted on Google Code

A couple of weeks ago I moved Moriarty from my playground area of the n² SVN repository to a new project at Google Code. This brings the advantage of neat issue tracking and code review capabilities as well as better management of contributors and collaborators. The new SVN repository is now http://moriarty.googlecode.com/svn/trunk/ (with an interactive view too). Just email me {at} iandavis.com if you’d like to be added to the project.

Moriarty Update

After a short break, it’s time for an update to Moriarty. Actually the changes in this version have been under development for several weeks but I wasn’t able to release them until Platform release 13 went live at the beginning of this week. There is one organisational change and two functional changes, one of which is a major addition. The notes in this blog post relate to revision 679 in Moriarty’s subversion project.

Firstly, constants.inc.php has been deprecated in favour of moriarty.inc.php, a name that is less likely to clash with other files. constants.inc.php is now just a shell that includes moriarty.inc.php so no code should break. However you should update your applications to include moriarty.inc.php because in some future release I shall be removing constants.inc.php entirely.

This renaming is in preparation for a wider breaking change that I would like to make. Because PHP has traditionally had no namespacing capability the community has adopted library naming conventions to avoid name conflicts. For example, classes in Konstruct are prefixed with k_ (like k_Document) and classes in ARC are prefixed with ARC2_ (e.g. ARC2_RDFXMLParser). Moriarty doesn’t do this, which leads to a higher chance of naming clashes with client code. The right thing to do in a future release is to rename all the classes. So instead of Store we might have MORIARTY_Store or M_Store. I’d like some feedback on what you prefer so please do comment on this post.

It’s worth remembering that moriarty.inc.php defines the MORIARTY_DIR constant, setting it to be the directory in which moriarty.inc.php lives (this isn’t new, constants.inc.php used to do this). The preferred way of including Moriarty classes is like this:

require_once '/path/to/moriarty.inc.php';
require_once MORIARTY_DIR . 'store.class.php';

The major piece of new functionality in Moriarty is HTTP caching support. The Platform supports etags and other related caching headers in many places and for a long time I’ve wanted Moriarty to automatically take advantage of these. I added this support over a period of several weeks, refining it and tuning it so that it could work with the minimum of effort on the client developer’s part. Enabling caching in Moriarty is very simple. Just define a constant called MORIARTY_HTTP_CACHE_DIR and set it to be a valid, writable directory. Moriarty will then start using that directory to cache responses from HTTP requests. For example, add something like this at the main entry point of your code:

define('MORIARTY_HTTP_CACHE_DIR', '/var/cache');

Moriarty uses cached etag headers to intercept standard GET requests and turn them automatically into conditional ones. Although it still requires a network transaction, the amount of bandwidth used for a cache hit is very small. This kind of caching is smart. Dumb caching just keeps content for a pre-determined time period and only requests a fresh version when the time period has expired. That means it won’t be aware of any changes in the source until minutes or hours later. This may work well for content that doesn’t change often but causes extreme difficulties for interactive applications that involve updating as well as reading content. Many Platform-based applications use a simple pattern of fetching a current resource description, diffing it with the one entered by a user and generating a changeset to apply to the store. Dumb caching interferes with this by not fetching the true state of the resource description, and to fix it requires close coordination between user-supplied updates and cache invalidation. Conditional GETs avoid this by revalidating the cached content with the source on every request. The result is a slight trade off in performance for better consistency.
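To make the conditional GET idea concrete, here is a small sketch of the two halves of the exchange. It is not Moriarty’s implementation: the function names are hypothetical, an in-memory array stands in for the cache directory, and no network request is made.

```php
// Build request headers, adding If-None-Match when an etag is cached for the uri.
function conditional_headers(array $cache, string $uri): array
{
    $headers = array('Accept' => '*/*');
    if (isset($cache[$uri]['etag'])) {
        $headers['If-None-Match'] = $cache[$uri]['etag'];
    }
    return $headers;
}

// Combine the server's answer with the cache.
// 304 Not Modified means the cached body is still current.
function resolve_body(array &$cache, string $uri, int $status, array $responseHeaders, string $body): string
{
    if ($status === 304 && isset($cache[$uri])) {
        return $cache[$uri]['body'];
    }
    if ($status === 200 && isset($responseHeaders['ETag'])) {
        // Remember the new representation and its etag for next time.
        $cache[$uri] = array('etag' => $responseHeaders['ETag'], 'body' => $body);
    }
    return $body;
}
```

Every request still reaches the server, but a 304 response carries no body, which is where the bandwidth saving comes from.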

If you’re confident that you only need dumb caching then you can switch it on by defining the MORIARTY_HTTP_CACHE_READ_ONLY constant somewhere in your application. Moriarty doesn’t care about the value of this constant, just whether it is defined or not. When this constant is defined Moriarty will use the max-age headers in HTTP responses to determine how long retrieved content should be considered to be fresh for. It intercepts HTTP requests to the Platform and if it finds a fresh cache entry then it will immediately return that without making any network request. If the cache entry is stale then the request proceeds as normal and the entry gets updated with the newly retrieved content. Use this constant when your application is predominantly read-only and you don’t care if content is stale for a few hours.
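The freshness test behind this mode might be sketched as follows. The function names are hypothetical and this is a simplification rather than Moriarty’s code: it reads only max-age from a Cache-Control header value and ignores other headers such as Age and Expires.

```php
// Extract max-age (in seconds) from a Cache-Control header value, or null if absent.
function max_age(string $cacheControl): ?int
{
    if (preg_match('/(?:^|[\s,])max-age\s*=\s*"?(\d+)/i', $cacheControl, $m)) {
        return (int) $m[1];
    }
    return null;
}

// An entry is fresh while its age is below max-age; without max-age it is never fresh.
function is_fresh(int $storedAt, string $cacheControl, int $now): bool
{
    $maxAge = max_age($cacheControl);
    return $maxAge !== null && ($now - $storedAt) < $maxAge;
}
```

Since Moriarty only checks whether the constant is defined, something like define('MORIARTY_HTTP_CACHE_READ_ONLY', true); next to the cache directory definition is enough to switch the mode on.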

Moriarty supports one other caching related constant: MORIARTY_HTTP_CACHE_USE_STALE_ON_FAILURE. Define this constant if you want Moriarty to return a cache entry when it can’t communicate with the Platform. This enables your application to continue running even if there are network problems, a tradeoff of apparent liveness against freshness of content. (I use this constant in my application when I’m developing offline on the train. While I’m on the network I hit a few pages to freshen up the cache and then when I disconnect I can still browse and test the application using cached content.)

One caveat you need to be aware of: the cache files are not encrypted. Avoid using Moriarty’s caching support if you are dealing with private or secure information that you don’t want to be stored unencrypted on a web server file system. I might add encryption of cache files if there is demand.

Finally, this version of Moriarty includes support for the new describe service. This was included as part of release 13 of the Platform and is now the preferred way of obtaining resource descriptions from the metabox or a private graph. See the section labeled “GET” in the Metabox documentation. You can use it in Moriarty like this:

$store = new Store('http://api.talis.com/stores/mystore');
$mb = $store->get_metabox();
$response = $mb->describe('http://example.com/foo');

The best thing is that the new describe service supports etags for resource descriptions, which means that Moriarty’s new caching functions can really speed up applications that use describe heavily (and if you’re building open world applications, then you should be). The Platform’s SPARQL services don’t currently support etags, so caching is less efficient. To support efficient HTTP caching we’d need to determine whether the resultset has changed since the last time the client issued the query. The only way to do that in SPARQL is by executing the query which could be very cheap or it could be horrendously expensive. The describe service packages up a very common use of SPARQL into a constrained service that is very easy to relate to changes in the underlying graph. That means we can really optimise this service, provide decent caching support and generally boost performance a lot more easily than we can for arbitrary SPARQL queries. Expect us to expand on the describe service in future releases and also to bite off a few other constrained derivatives of SPARQL.

Moriarty Version 1.0

Tonight Moriarty turns 1.0. We’re starting to use Moriarty more seriously within Talis so we need some discipline around its development. To help this I’ve formally tagged the current version of Moriarty as 1.0. The intention is that all versions of Moriarty with the same major version number will be backwards compatible, so version 1.5 will be a drop in replacement for 1.0. Version 2.0, however, might see us introduce some breaking changes. We’ll try to avoid that of course but often it’s inevitable.

You can download the latest release: moriarty-1.0.tgz or you can check it out of subversion using http://n2.talis.com/svn/playground/iand/moriarty/tags/1.0/. The trunk is still the bleeding edge and can be found here: http://n2.talis.com/svn/playground/iand/moriarty/trunk/

Breaking Changes for Moriarty

As I alluded to earlier I have made some breaking changes to Moriarty (now in subversion as revision 657). These changes are to the index structure used by SimpleGraph which make it compatible with the RDF/PHP Specification. Most of the effects will be internal to Moriarty but some applications may be using the index directly via the get_index method.

I think these are the last breaking changes needed for the foreseeable future so this is probably going to be version 1.0. More on that and versioning policy in a while.

Specifically the changes to the index structure are:

  • The val key is renamed to value
  • The dt key is renamed to datatype
  • The type key now takes values of uri | bnode | literal instead of iri | bnode | literal
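If your application keeps data in the old shape, a conversion along these lines might help. This is a sketch with a hypothetical function name, and it assumes the index is nested as subject, then predicate, then a list of object arrays, as in the RDF/PHP Specification.

```php
// Hypothetical converter from the old index keys to the new ones.
function upgrade_index(array $index): array
{
    $upgraded = array();
    foreach ($index as $subject => $properties) {
        foreach ($properties as $predicate => $objects) {
            foreach ($objects as $object) {
                if (array_key_exists('val', $object)) {   // val becomes value
                    $object['value'] = $object['val'];
                    unset($object['val']);
                }
                if (array_key_exists('dt', $object)) {    // dt becomes datatype
                    $object['datatype'] = $object['dt'];
                    unset($object['dt']);
                }
                if (isset($object['type']) && $object['type'] === 'iri') {
                    $object['type'] = 'uri';              // iri becomes uri
                }
                $upgraded[$subject][$predicate][] = $object;
            }
        }
    }
    return $upgraded;
}
```

Objects that already use the new keys pass through unchanged, so running the conversion twice does no harm.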

Moriarty Facets

The latest batch of changes to Moriarty made it into subversion at the end of last week (svn revision 655). The main change is the addition of a new FacetService class. You use it in the usual way. Either indirectly via the Store:

$store = new Store("http://api.talis.com/stores/mystore");
$fs = $store->get_facet_service();

Or directly if you know its URI:

$fs = new FacetService("http://api.talis.com/stores/mystore/service/facet");

Using the FacetService class is pretty simple: just call the facets method passing in the query, an array of fields to facet on and optionally the number of terms to return for each facet. As usual this method returns an HttpResponse:

$response = $fs->facets('query', array('field1','field2'));
if ($response->is_success()) {
  // do something useful
}
else {
  // mummy...
}

You can parse the XML response using the parse_facet_xml method which returns a nested array of data representing the facet data:

array (
  'field1' => array (
        0 => array ( 'value' => 'term1', 'number' => '5' ),
        1 => array ( 'value' => 'term2', 'number' => '4' ),
        2 => array ( 'value' => 'term3', 'number' => '2' ),
       ),
  'field2' => array (
        0 => array ( 'value' => 'term4', 'number' => '5' ),
        1 => array ( 'value' => 'term5', 'number' => '4' ),
        2 => array ( 'value' => 'term6', 'number' => '2' ),
       ),
)

If you like living dangerously then you can combine both the previous steps into one using facets_to_array. If an error occurs this method simply returns an empty array:

$facets = $fs->facets_to_array('query', array('field1','field2'));
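Once you have the nested array, walking it is straightforward. The sketch below uses a hypothetical render_facets helper and hard-coded sample data in the shape shown above. A real application would pass in the result of facets_to_array instead.

```php
// Sample data in the documented shape, standing in for a real response.
$facets = array(
    'field1' => array(
        array('value' => 'term1', 'number' => '5'),
        array('value' => 'term2', 'number' => '4'),
    ),
);

// Hypothetical helper: one line per field, then one indented line per term.
function render_facets(array $facets): string
{
    $out = '';
    foreach ($facets as $field => $terms) {
        $out .= $field . "\n";
        foreach ($terms as $term) {
            // The counts arrive as strings, so cast before using them as numbers.
            $out .= '  ' . $term['value'] . ' (' . (int) $term['number'] . ")\n";
        }
    }
    return $out;
}

echo render_facets($facets);
```

An empty array, which is what facets_to_array returns on error, simply renders as nothing.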

That’s it. A simple class for a simple but powerful service. You can read more about the Facet Service on the n² wiki.

There are a couple of big changes that I want to make pretty soon and I’m giving a heads up here because they may not be backwards compatible. The version of ARC I’m using is quite out of date (January 2008) so I need to update to the latest version. I’m not sure what that will involve. Maybe it’ll be completely smooth with no significant changes needed.

The second change is needed to make SimpleGraph’s index compatible with our RDF/PHP specification. I can see at least one major breaking change: I need to rename the hash key “val” to “value”. That is a pretty major breakage but I want to make Moriarty compatible with the RDF/PHP spec and with ARC2. I’m going to try and do that very soon.

ARC2_IndexUtils plugin

ARC2_IndexUtils is a plugin for ARC providing a few simple functions for processing rdf/json-shaped data:

  • ARC2_IndexUtils::filter() takes 2 parameters: a data array, and an associative array of filters. You might use it like this:
    ARC2_IndexUtils::filter($data, array('property' => function ($u, $p, $os) { return $p == 'http://xmlns.com/foaf/0.1/name'; }));

    This would return a data array with only those statements having http://xmlns.com/foaf/0.1/name in the property position.

  • ARC2_IndexUtils::merge takes a variable length list of parameters, where each parameter is an rdf/json style data array, and merges them into one data array.
  • ARC2_IndexUtils::diff takes a variable length list of parameters, returning a data array consisting only of statements from the first array that didn’t exist in any of the subsequent arrays
  • ARC2_IndexUtils::intersect takes a variable length list of parameters, returning a data array consisting only of statements from the first array that also exist in all of the subsequent arrays
  • ARC2_IndexUtils::reify reifies an rdf/php data array (you might use this for creating a changeset, or for making provenance statements about your triples)
  • ARC2_IndexUtils::dereify dereifies an rdf/php data array (reified statements can be hard to read, you might want to dereify them to see what they say more easily)
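To show what a diff over rdf/json-shaped arrays involves, here is a standalone sketch. It is not the plugin’s implementation and the function names are hypothetical. It assumes the usual subject, predicate, list-of-objects nesting and treats two objects as equal when their type, value, datatype and lang all match.

```php
// Two objects are the same when type, value, datatype and lang all match.
function same_object(array $a, array $b): bool
{
    foreach (array('type', 'value', 'datatype', 'lang') as $k) {
        if (($a[$k] ?? null) !== ($b[$k] ?? null)) {
            return false;
        }
    }
    return true;
}

// Does the index contain the statement ($s, $p, $o)?
function has_statement(array $index, string $s, string $p, array $o): bool
{
    foreach ($index[$s][$p] ?? array() as $candidate) {
        if (same_object($candidate, $o)) {
            return true;
        }
    }
    return false;
}

// Statements from the first array that appear in none of the others.
function index_diff(array $first, array ...$others): array
{
    $result = array();
    foreach ($first as $s => $properties) {
        foreach ($properties as $p => $objects) {
            foreach ($objects as $o) {
                $found = false;
                foreach ($others as $other) {
                    if (has_statement($other, $s, $p, $o)) {
                        $found = true;
                        break;
                    }
                }
                if (!$found) {
                    $result[$s][$p][] = $o;
                }
            }
        }
    }
    return $result;
}
```

An intersect in the same style would keep a statement only when every one of the other arrays contains it.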