Importing Large RDF Documents: Streaming Parsing of RDF/XML with ARC2

A common problem when parsing RDF is running out of memory because the document is too large. ARC2 solves this (for RDF/XML) by streaming: it can hand each triple to your code as it is parsed, so you never need to hold the whole document's triples in memory.

If you want to take advantage of the streaming, you just need to extend the ARC2_RDFXMLParser class and override the addT method:

<?php
require 'arc/ARC2.php';
require 'arc/parsers/ARC2_RDFXMLParser.php'; 

class Streamer extends ARC2_RDFXMLParser { 

	function addT($s, $p, $o, $s_type, $o_type, $o_dt = '', $o_lang = ''){
		var_dump($s, $p, $o, $s_type, $o_type, $o_dt, $o_lang);
	} 

} 

$p = new Streamer(); 

$p->parse('big-data.rdf'); 

?>

In this simple example, I’m just var_dumping the triples as they come in, but of course you should replace that with whatever you actually want to do with each triple.
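As a rough sketch of what a more useful handler might look like, the code below tallies predicates instead of dumping every triple, so memory use grows with the number of distinct predicates rather than the number of triples. The trait name CountsPredicates and the DemoHandler class are hypothetical; only the addT signature is taken from the example above. DemoHandler is a stand-in so the sketch runs without ARC2 installed.

```php
<?php
// Hypothetical trait: counts how often each predicate is seen.
trait CountsPredicates {
    public $predicate_counts = array();

    // Same signature as the addT method overridden in the example above.
    function addT($s, $p, $o, $s_type, $o_type, $o_dt = '', $o_lang = '') {
        if (!isset($this->predicate_counts[$p])) {
            $this->predicate_counts[$p] = 0;
        }
        $this->predicate_counts[$p]++;
    }
}

// Stand-in class so this runs on its own. With ARC2 loaded you would write:
//   class Streamer extends ARC2_RDFXMLParser { use CountsPredicates; }
class DemoHandler {
    use CountsPredicates;
}

$h = new DemoHandler();
$h->addT('http://example.com/a', 'http://xmlns.com/foaf/0.1/name', 'Alice', 'uri', 'literal');
$h->addT('http://example.com/b', 'http://xmlns.com/foaf/0.1/name', 'Bob', 'uri', 'literal');
$h->addT('http://example.com/a', 'http://xmlns.com/foaf/0.1/knows', 'http://example.com/b', 'uri', 'uri');
print_r($h->predicate_counts);
```

The same shape works for writing triples to a file or a database: do the work inside addT and keep nothing you don't need.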

Moriarty Update

After a short break, it’s time for an update to Moriarty. Actually the changes in this version have been under development for several weeks but I wasn’t able to release them until Platform release 13 went live at the beginning of this week. There is one organisational change and some functional changes, one of which is a major addition. The notes in this blog post relate to revision 679 in Moriarty’s subversion project.

Firstly, constants.inc.php has been deprecated in favour of moriarty.inc.php, a name that is less likely to clash with other files. constants.inc.php is now just a shell that includes moriarty.inc.php so no code should break. However, you should update your applications to include moriarty.inc.php because in some future release I shall be removing constants.inc.php entirely.

This renaming is in preparation for a wider breaking change that I would like to make. Because PHP has traditionally had no namespacing capability, the community has adopted library naming conventions to avoid name conflicts. For example, classes in Konstrukt are prefixed with k_ (like k_Document) and classes in ARC are prefixed with ARC2_ (e.g. ARC2_RDFXMLParser). Moriarty doesn’t do this, which leads to a higher chance of naming clashes with client code. The right thing to do in a future release is to rename all the classes. So instead of Store we might have MORIARTY_Store or M_Store. I’d like some feedback on what you prefer so please do comment on this post.

It’s worth remembering that moriarty.inc.php defines the MORIARTY_DIR constant, setting it to be the directory in which moriarty.inc.php lives (this isn’t new, constants.inc.php used to do this). The preferred way of including Moriarty classes is like this:

require_once '/path/to/moriarty.inc.php';
require_once MORIARTY_DIR . 'store.class.php';

The major piece of new functionality in Moriarty is HTTP caching support. The Platform supports etags and other related caching headers in many places and for a long time I’ve wanted Moriarty to automatically take advantage of these. I added this support over a period of several weeks, refining it and tuning it so that it could work with the minimum of effort on the client developer’s part. Enabling caching in Moriarty is very simple. Just define a constant called MORIARTY_HTTP_CACHE_DIR and set it to be a valid, writable directory. Moriarty will then start using that directory to cache responses from HTTP requests. For example, add something like this at the main entry point of your code:

define('MORIARTY_HTTP_CACHE_DIR', '/var/cache');

Moriarty uses cached etag headers to intercept standard GET requests and turn them automatically into conditional ones. Although it still requires a network transaction, the amount of bandwidth used for a cache hit is very small. This kind of caching is smart. Dumb caching just keeps content for a pre-determined time period and only requests a fresh version when the time period has expired. That means it won’t be aware of any changes in the source until minutes or hours later. This may work well for content that doesn’t change often but causes extreme difficulties for interactive applications that involve updating as well as reading content. Many Platform-based applications use a simple pattern of fetching a current resource description, diffing it with the one entered by a user and generating a changeset to apply to the store. Dumb caching interferes with this by not fetching the true state of the resource description, and to fix it requires close coordination between user-supplied updates and cache invalidation. Conditional GETs avoid this by revalidating the cached content with the source on every request. The result is a slight trade off in performance for better consistency.
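The revalidation logic can be sketched independently of Moriarty. This is not Moriarty's implementation: the function name conditional_get, the layout of the cache array and the $fetch callable are all assumptions made for illustration, and the "server" is a fake closure so the sketch runs without a network.

```php
<?php
// $fetch is a hypothetical callable: ($uri, $headers) ->
//   array('status' => int, 'headers' => array, 'body' => string)
function conditional_get($uri, array &$cache, callable $fetch) {
    $headers = array();
    if (isset($cache[$uri]['etag'])) {
        // Turn the plain GET into a conditional one.
        $headers['If-None-Match'] = $cache[$uri]['etag'];
    }
    $response = $fetch($uri, $headers);
    if ($response['status'] == 304 && isset($cache[$uri])) {
        // Not modified: the server sent no body, so reuse the cached one.
        return $cache[$uri]['body'];
    }
    if ($response['status'] == 200 && isset($response['headers']['ETag'])) {
        // Remember the body and its etag for next time.
        $cache[$uri] = array(
            'etag' => $response['headers']['ETag'],
            'body' => $response['body'],
        );
    }
    return $response['body'];
}

// Fake origin server: one resource whose etag is always "v1".
$calls = array();
$fake_fetch = function ($uri, $headers) use (&$calls) {
    $calls[] = $headers;
    if (isset($headers['If-None-Match']) && $headers['If-None-Match'] === '"v1"') {
        return array('status' => 304, 'headers' => array(), 'body' => '');
    }
    return array('status' => 200, 'headers' => array('ETag' => '"v1"'), 'body' => 'hello');
};

$cache = array();
$first = conditional_get('http://example.com/foo', $cache, $fake_fetch);
$second = conditional_get('http://example.com/foo', $cache, $fake_fetch);
```

Both calls reach the server, but the second carries If-None-Match and gets back an empty 304, which is where the bandwidth saving comes from.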

If you’re confident that you only need dumb caching then you can switch it on by defining the MORIARTY_HTTP_CACHE_READ_ONLY constant somewhere in your application. Moriarty doesn’t care about the value of this constant, just whether it is defined or not. When this constant is defined Moriarty will use the max-age headers in HTTP responses to determine how long retrieved content should be considered fresh. It intercepts HTTP requests to the Platform and if it finds a fresh cache entry then it will immediately return that without making any network request. If the cache entry is stale then the request proceeds as normal and the entry gets updated with the newly retrieved content. Use this constant when your application is predominantly read-only and you don’t care if content is stale for a few hours.
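A freshness check based on max-age might look something like the following. The helper name is_fresh and its parameters are hypothetical, and this is only a sketch of the idea rather than Moriarty's own code; in particular it ignores other Cache-Control directives.

```php
<?php
// Hypothetical helper: is a response stored at $stored_at still fresh at $now,
// judging only by the max-age directive of its Cache-Control header?
function is_fresh($cache_control, $stored_at, $now) {
    if (!preg_match('/max-age\s*=\s*(\d+)/i', $cache_control, $m)) {
        return false; // no max-age: treat as stale and go to the network
    }
    return ($now - $stored_at) < (int) $m[1];
}

$stored_at = 1000; // timestamps are plain seconds
var_dump(is_fresh('public, max-age=3600', $stored_at, $stored_at + 60));
var_dump(is_fresh('public, max-age=3600', $stored_at, $stored_at + 7200));
var_dump(is_fresh('no-cache', $stored_at, $stored_at + 1));
```

Only when a check like this fails does a request need to go out at all, which is why this mode is faster than revalidating but can serve out-of-date content.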

Moriarty supports one other caching related constant: MORIARTY_HTTP_CACHE_USE_STALE_ON_FAILURE. Define this constant if you want Moriarty to return a cache entry when it can’t communicate with the Platform. This enables your application to continue running even if there are network problems, a tradeoff of apparent liveness against freshness of content. (I use this constant in my application when I’m developing offline on the train. While I’m on the network I hit a few pages to freshen up the cache and then when I disconnect I can still browse and test the application using cached content.)

One caveat you need to be aware of: the cache files are not encrypted. Avoid using Moriarty’s caching support if you are dealing with private or secure information that you don’t want to be stored unencrypted on a web server file system. I might add encryption if there is demand.
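The stale-on-failure behaviour boils down to a simple fallback. In this sketch the function name get_with_stale_fallback and the $fetch callable are hypothetical, and a network failure is modelled as an exception, which is an assumption rather than a description of how Moriarty reports errors.

```php
<?php
// $fetch is a hypothetical callable that returns a body or throws on failure.
function get_with_stale_fallback($uri, array &$cache, callable $fetch) {
    try {
        $body = $fetch($uri);
        $cache[$uri] = $body; // refresh the cache on every success
        return $body;
    } catch (Exception $e) {
        if (array_key_exists($uri, $cache)) {
            return $cache[$uri]; // possibly stale, but better than nothing
        }
        throw $e; // nothing cached: the failure has to surface
    }
}

// Simulated network that can be switched off.
$online = true;
$fetch = function ($uri) use (&$online) {
    if (!$online) {
        throw new RuntimeException("network unreachable");
    }
    return "fresh copy of $uri";
};

$cache = array();
$a = get_with_stale_fallback('http://example.com/foo', $cache, $fetch);
$online = false; // "get on the train"
$b = get_with_stale_fallback('http://example.com/foo', $cache, $fetch);
```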

Finally, this version of Moriarty includes support for the new describe service. This was included as part of release 13 of the Platform and is now the preferred way of obtaining resource descriptions from the metabox or a private graph. See the section labeled “GET” in the Metabox documentation. You can use it in Moriarty like this:

$store = new Store('http://api.talis.com/stores/mystore');
$mb = $store->get_metabox();
$response = $mb->describe('http://example.com/foo');

The best thing is that the new describe service supports etags for resource descriptions, which means that Moriarty’s new caching functions can really speed up applications that use describe heavily (and if you’re building open world applications, then you should be). The Platform’s SPARQL services don’t currently support etags, so caching is less efficient. To support efficient HTTP caching we’d need to determine whether the resultset has changed since the last time the client issued the query. The only way to do that in SPARQL is by executing the query which could be very cheap or it could be horrendously expensive. The describe service packages up a very common use of SPARQL into a constrained service that is very easy to relate to changes in the underlying graph. That means we can really optimise this service, provide decent caching support and generally boost performance a lot more easily than we can for arbitrary SPARQL queries. Expect us to expand on the describe service in future releases and also to bite off a few other constrained derivatives of SPARQL.

About Moriarty… Moriarty is a simple PHP library for accessing the Talis Platform. It follows the Platform API very closely and wraps up many common tasks into convenient classes while remaining very lightweight. It also provides some simple RDF classes that are based on the excellent ARC2 class library. Moriarty is primarily being developed by Ian Davis and is in continual alpha, subject to occasional rapid bursts of change. You can read more about Moriarty on the n² wiki and get its source from the n² subversion repository

Moriarty Version 1.0

Tonight Moriarty turns 1.0. We’re starting to use Moriarty more seriously within Talis so we need some discipline around its development. To help this I’ve formally tagged the current version of Moriarty as 1.0. The intention is that all versions of Moriarty with the same major version number will be backwards compatible, so version 1.5 will be a drop-in replacement for 1.0. Version 2.0, however, might see us introduce some breaking changes. We’ll try to avoid that of course but often it’s inevitable.

You can download the latest release: moriarty-1.0.tgz or you can check it out of subversion using http://n2.talis.com/svn/playground/iand/moriarty/tags/1.0/. The trunk is still the bleeding edge and can be found here: http://n2.talis.com/svn/playground/iand/moriarty/trunk/

Talis Store Plugin for ARC

The PHP coders amongst you may be interested in a Talis Store Plugin. To install it:

cd arc/plugins # your ARC plugins directory

svn co http://n2.talis.com/svn/playground/kwijibo/PHP/arc/plugins/trunk/talis/ talis
svn export http://n2.talis.com/svn/playground/kwijibo/PHP/arc/plugins/trunk/ARC2_SPARQLSerializerPlugin/ARC2_SPARQLSerializerPlugin.php ARC2_SPARQLSerializerPlugin.php

Then to use it:

require_once '../ARC2.php';   

/* configuration */
$talis_config = array(
  // 'db_user' => 'your_username',
  // 'db_pwd' => 'your_password',
  'store_name' => 'kwijibo-dev3', // your store name
  'fetch_graphs' => false, // If set to true, using FROM will fetch the graph as a datasource over the web, and store it in /meta
);
$store = ARC2::getComponent('Talis_StorePlugin', $talis_config);
$store->query("LOAD ");

What this does is let you use a Talis store instead of the ARC MySQL store. It supports a subset of ARC’s SPARQL+ functionality. Specifically, it supports INSERT and DELETE (which I could translate to Changesets thanks to Benji’s SPARQL parser), but not the aggregate functions (which I don’t see a way to support in a client layer at this point).

Some differences:

Named Graphs are currently a bit different in Talis stores – you can’t (yet) create your own on the fly as you can with ARC, so LOAD will put the data into the public graph by default.

The Talis Platform transforms bnodes into URIs.

I also added a few methods to the API:

$store->import($arc_store);
$store->export($arc_store);

(The idea is that you can move data between an ARC store and a Talis store).

I also added a $store->change($before_rdf, $after_rdf) method for submitting changes to an RDF graph.

It’s quite interesting comparing the two different ways of making changes (changesets and SPARQL+). I think that changesets (especially with the coming Batch Changeset support) are maybe a bit more amenable to programmatic resource updates from forms and the like. However, changesets are a bit verbose to hand-write for making quick edits and testing stuff, or pattern-based changes, and I’m finding SPARQL+ really handy for stuff like this.

What I’ve been thinking would be pretty neat would be if the SPARQL parser could be a bit more user extensible, and pre-query hooks could be set up (like ARC’s triggers, which happen post-query), so that plugin/hook writers could extend the SPARQL functionality, or just do stuff pre-query. Use cases might include:

  • rewriting SPARQL for performance improvement, or access control
  • pre-fetching data from FROM graphs over the web and adding it to the store (you can set a ‘fetch_graphs’=> true parameter in the config array you set up the talis store with, and it will do this)
  • adding versioned changesets to the ARC store
  • inventing new keywords – eg: ABOUT <http://example.org/foo> could be rewritten to DESCRIBE ?s WHERE {{ ?s rdf:subject <http://example.org/foo> } UNION {?s cs:subjectOfChange <http://example.org/foo> } } – Similarly you could add syntactic support for rollbacks, transactions, updates
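To make the keyword idea concrete, here is a minimal sketch of a pre-query rewrite for the invented ABOUT keyword. The function name rewrite_about is hypothetical, and neither ARC nor the Talis plugin is assumed to offer a hook to call it from; it also assumes the rdf: and cs: prefixes are declared elsewhere.

```php
<?php
// Hypothetical pre-query hook: expands "ABOUT <uri>" into the DESCRIBE query
// given above. Any other query is returned unchanged.
function rewrite_about($query) {
    return preg_replace(
        '/^\s*ABOUT\s+(<[^>]+>)\s*$/i',
        'DESCRIBE ?s WHERE { { ?s rdf:subject $1 } UNION { ?s cs:subjectOfChange $1 } }',
        $query
    );
}

$rewritten = rewrite_about('ABOUT <http://example.org/foo>');
$untouched = rewrite_about('SELECT ?s WHERE { ?s ?p ?o }');
echo $rewritten, "\n";
```

A regular expression is enough for a single-keyword shorthand like this; anything that has to understand the structure of a query would need the extensible parser described above.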

You can see more usage examples at: http://n2.talis.com/svn/playground/kwijibo/PHP/arc/plugins/trunk/talis/Talis_StorePlugin.demo.php

Drupal and the opportunity of RDF

At the start of this week, Dries Buytaert presented the keynote presentation at DrupalCon 2008. The most exciting revelation came at the end: Drupal’s future is in the semantic web.

While Dries talks about the semantic web, and RDF, you don’t hear much reaction from the crowd; but then he says “Let me show you a video of the future” and proceeds to demonstrate SPARQLing on linked data from sources like dbpedia, dbtunes, geodata, events, friends lists, and google spreadsheets, mashed-up in Exhibit.

This gets a lot of applause :)

In the keynote, he puts emphasis on data interoperability, decentralisation, remote querying, and how having a lot of data is great fun :)

It’s a really great talk, with a lot of excellent quotes about the value of RDF for Drupal. Here are some of my favourites:

Web 3.0 (much as I hate to use the term) is all about infinite interoperability

We have the opportunity to be mentioned in the history books of the web … This is where the web is going. And this right time, and the right place, to make it happen.

Using RDF you can connect all these different parts of data, that live in different parts of the web.

RDF turns the web into a database

The real opportunity we have here is to start sprinkling this map [of linked open data sources] with Drupal. Every single Drupal site can be an RDF repository that people can query

Google are trying to build a world social graph, connecting people … but what we are doing with RDF is connecting not just people, but everything

With RDF, the import/export problem we have in Drupal just goes away. It just works, without having to describe database schemas… It just works. It’s a problem that is already solved.

You can listen to the audio of the presentation at archive.org (~45MB – the RDF stuff starts at around 53 minutes), and view a video of the RDF demonstration.

You can also read more about Drupal and RDF here.

ARC2_IndexUtils plugin

ARC2_IndexUtils is a plugin for ARC providing a few simple functions for processing RDF/JSON-shaped data:

  • ARC2_IndexUtils::filter() takes 2 parameters: a data array, and an associative array of filters. You might use it like this:
    ARC2_IndexUtils::filter($data, array('property'=>  create_function('$u,$p,$os','return $p=="http://xmlns.com/foaf/0.1/name";'), ))
    

    Which would return a data array with only those statements having http://xmlns.com/foaf/0.1/name in the property position.

  • ARC2_IndexUtils::merge takes a variable length list of parameters, where each parameter is an rdf/json style data array, and merges them into one data array.
  • ARC2_IndexUtils::diff takes a variable length list of parameters, returning a data array consisting only of statements from the first array that didn’t exist in any of the subsequent arrays
  • ARC2_IndexUtils::intersect takes a variable length list of parameters, returning a data array consisting only of statements from the first array that also exist in all of the subsequent arrays
  • ARC2_IndexUtils::reify reifies an rdf/php data array (you might use this for creating a changeset, or for making provenance statements about your triples)
  • ARC2_IndexUtils::dereify dereifies an rdf/php data array (reified statements can be hard to read, you might want to dereify them to see what they say more easily)
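For readers who haven't met the RDF/JSON shape, it is a nested array of subject, then predicate, then a list of object arrays. The sketch below shows what a two-array diff over that shape involves. It is a standalone illustration, not the plugin's code: the names index_diff and same_object are hypothetical, and it handles only two arrays rather than a variable-length list.

```php
<?php
// Two object arrays are the same if they have identical keys and values,
// regardless of key order.
function same_object(array $x, array $y) {
    ksort($x);
    ksort($y);
    return $x === $y;
}

// Return the statements in $a that do not appear in $b.
function index_diff(array $a, array $b) {
    $diff = array();
    foreach ($a as $s => $props) {
        foreach ($props as $p => $objects) {
            $others = isset($b[$s][$p]) ? $b[$s][$p] : array();
            foreach ($objects as $o) {
                $found = false;
                foreach ($others as $other) {
                    if (same_object($o, $other)) {
                        $found = true;
                        break;
                    }
                }
                if (!$found) {
                    $diff[$s][$p][] = $o;
                }
            }
        }
    }
    return $diff;
}

$name = 'http://xmlns.com/foaf/0.1/name';
$before = array('http://example.com/a' => array($name => array(
    array('value' => 'Alice', 'type' => 'literal'),
    array('value' => 'Ally', 'type' => 'literal'),
)));
$after = array('http://example.com/a' => array($name => array(
    array('type' => 'literal', 'value' => 'Alice'),
)));

$removed = index_diff($before, $after); // statements only in $before
$added = index_diff($after, $before);   // statements only in $after
print_r($removed);
```

Running a diff in both directions like this gives the removals and additions between two versions of a description.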

Ask Moriarty?

Another day, another incremental improvement to Moriarty (svn revision 490)! After my last set of changes I thought I’d better hurry up and add the copy_to function to the FieldPredicateMap too. You can now clone Field/Predicate Maps from one store to another:

  $fp = new FieldPredicateMap("http://api.talis.com/stores/mystore/config/fpmaps/1");
  $response = $fp->get_from_network();
  if ( $response->is_success() ) {
    $new_fp = $fp->copy_to("http://api.talis.com/stores/otherstore/config/fpmaps/1");
    $new_fp->put_to_network();
  }

I then set about thinking through my plan for adding HTTP caching support to Moriarty. I want this to work automatically and transparently, taking advantage of conditional GETs on the Platform. I’ll let it be switched off by defining a constant but I want it to be there by default so the developer gets the benefit without any effort.

I stubbed out some initial ideas for the HttpCache class on the train this morning. Then at lunchtime today, Danny pinged me on IRC wondering why Moriarty didn’t have SPARQL ASK support. “Not by design”, I said, “more by lack of time. But it should be easy to add, give me 15 minutes”. Then I promptly went into a series of meetings that ate the rest of my day. In the end the code did only take 15 minutes, but I finished it 11 hours later than I expected. Hopefully Danny didn’t spend all that time waiting for me to respond on IRC :-)

You can perform an ASK query on a store like this:

  $store = new Store("http://api.talis.com/stores/mystore");
  $sparql = $store->get_sparql_service();
  $response = $sparql->ask( "ASK WHERE {?s a .}" );
  if ($response->is_success()) {
    $result = $sparql->parse_ask_results( $response->body);
  }

Enjoy, Danny!

Styles of Web Application – FlowPHP

Ian blogged a while back about why MVC is a rubbish pattern for web development because it doesn’t describe the problem in a way that helps you understand it better. I completely agree, and it’s surprising how much “received wisdom” there is about MVC being the right way to do things, but the natural response is, “Well, what isn’t a rubbish pattern then?”

Someone asks that in the comments on the blog post, and Ian replies:

Doesn’t REST define the pattern you need: resource/representation? Your application uses the URI to locate the appropriate resource and asks it to produce the appropriate representation.

I’m not completely happy with that as an answer though. To me, REST defines the interface to your application, and while it helps define at least that part of the problem, it doesn’t really give you enough of a solution. It doesn’t help you decide how to structure your code in the same way that MVC does (even if that decision is ultimately suboptimal).

I’ve been writing web apps in a similar style to that used by RESTful frameworks like Tonic and web.py, which I guess could be described as what rsinger called “_VC” on #talis the other day. Basically you have different ‘Resource’ classes that map to your application’s url design and return representations when, eg, a GET, or a POST method is called on them. A great boon of developing with RDF is that, because all data is the same shape, you can do things pretty generically, and write less domain-specific code. So I tried to keep my resource classes as generic as possible, and have different url routes set up the classes with different parameters as need be.

However, I’ve been growing pretty dissatisfied with this way of doing it, because it still seemed to be obscuring too much of the problem for me conceptually. There was still a problem of, ‘OK, where is the best place to put this‘, and a constant tension between whether to try to extend a generic class to cope with another situation, or writing a new one to do what you want. So I’d end up with a lot of classes that did a lot of pretty similar things (retrieving SPARQL queries, parsing them, passing data to the template), but not similar enough to be able just to do it with one class. I also found that class inheritance was a slightly messy way to share functionality, and it could be annoying to try to remember which class was used for which url space, and look it up in the routing configuration, and it wasn’t very amenable to serving representations derived from a combination of data sources.

So the other day I had an idea for a different style, which I’m pretentiously code-naming ‘FlowPHP’ (pronounced floaf – the P is silent ;) ).

The motivation is to try to model the process of receiving a request and returning a response as a chain of modular bits of code that create a response from the incoming request, and filter it until it is served. I’ve been trying this idea out, and so far, it looks like this:


try {
    $KwijiboDev1 = new Store('http://api.talis.com/stores/kwijibo-dev1');
    $R = new Request(array('SERVER' => $_SERVER, 'GET' => $_GET));
    switch (true):
        case $R->is('GET', '/posts'): // method is GET and url is /posts
            $R->response()
                ->checkCache()
                ->RDFList($KwijiboDev1, SIOC.'Post')
                ->SmushGraph()
                ->serve('posts', 'main');
            break;
        case $R->is('GET', '/post', array('uri')): // GET /post with a required 'uri' parameter
            $R->response()
                ->checkCache()
                ->CBD($KwijiboDev1, $R->GET['uri'])
                ->serve('post', 'main');
            break;
        default:
            throw new HTTP_404("Page could not be found");
    endswitch;
}
catch (Exception $e) {
    echo $e->serve('error', 'main');
}

So what this is doing is:

  • Building a Request object with data from the $_SERVER and $_GET variables.
  • Checking the HTTP REQUEST METHOD, the REQUEST URI, and (optionally) for the existence of any required parameters.
  • Processing the Request and serving a Response by:
    1. Checking for a cached version we could serve first
    2. Retrieving the data (eg: CBD)
    3. Processing the data (eg: SmushGraph)
    4. Serving it in templates (serve() takes a variable length list of templates as parameters, rendering each inside the next template in the list)
  • Responding with an appropriate error if necessary (eg, HTTP 404, 405, 406, 500 – I pinched the idea of modelling 4xx and 5xx as Exceptions from Konstrukt)

Each ‘method’ in the chain, up until ‘serve()’, is returning the altered response object for the next method to manipulate. The methods that deal with adding data to the response, doing stuff with data, etc, aren’t really methods at all, but dynamically-called functions from a separate file. The reason I did it like this is I think it might be more modular and extensible, whilst not necessitating the creation of lots of different subclasses of Response.
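One way to get "methods" that are really free functions is PHP's __call magic method. The sketch below shows the general mechanism only; it is not the FlowPHP code. The flow_ prefix, the two step functions and this cut-down Response class are all hypothetical.

```php
<?php
// Steps are plain functions: they take the response first and return it.
function flow_add_data($response, $key, $value) {
    $response->data[$key] = $value;
    return $response;
}

function flow_uppercase($response) {
    $response->data = array_map('strtoupper', $response->data);
    return $response;
}

class Response {
    public $data = array();

    // Any unknown method name is looked up as a function with a flow_ prefix.
    function __call($name, $args) {
        $function = 'flow_' . $name;
        if (!function_exists($function)) {
            throw new BadMethodCallException("No step named $name");
        }
        array_unshift($args, $this); // the response becomes the first argument
        return call_user_func_array($function, $args);
    }

    // The end of the chain: returns output rather than the response.
    function serve() {
        return implode(', ', $this->data);
    }
}

$r = new Response();
$out = $r->add_data('title', 'hello')
         ->add_data('author', 'kwijibo')
         ->uppercase()
         ->serve();
echo $out, "\n";
```

Adding a new step then means adding a function to a file, with no new subclass of Response.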

This is still all evolving of course, and some/all of the ideas might turn out to be rubbish, but the thing I’m liking so far is the transparency: I think it’s relatively easy to see what’s going on with the code – what happens where, and when. The thing I’m experimenting with, I suppose, is the level of abstraction – my previous approach was perhaps too high-level and inflexible, which resulted in either lots of code, or lots of configuration, and the routing was kept too separate from the logic of returning the response.

The particular tension I’m finding with trying to develop FlowPHP at the moment is finding a good idiom for setting variables midway through the chain of events – I’m loath to have to break out of the chained methods, but maybe that’s only for aesthetic reasons.

Query Profiles in Moriarty

I just committed another batch of changes to Moriarty (svn revision 482). This version contains some important changes to the way classes are included (thanks to prompting by kwijibo on #talis over the weekend). Previously Moriarty assumed that your classes were in directories in your include path. Now Moriarty expects its classes to reside in the directory defined by MORIARTY_DIR. If this isn’t already defined then Moriarty will define it to be the same directory as that containing constants.inc.php. A similar constant MORIARTY_ARC_DIR defines the directory where Moriarty expects to find ARC2.php. If this isn’t set then it will assume ARC is in a sibling directory. Take a look at constants.inc.php for the logic.

I also added support for query profiles which control the relative weights applied to each field in a text search. In Moriarty this class is a NetworkResource so you can easily populate the object by getting it from the network:

  $qp = new QueryProfile("http://api.talis.com/stores/mystore/config/queryprofiles/1");
  $response = $qp->get_from_network();
  if ( $response->is_success() ) {
    // do something with qp...
  }

Setting a query profile for a store is also quite easy. This example shows how to create a new query profile, set some field weights and then save it to the Platform:

  $qp = new QueryProfile("http://api.talis.com/stores/mystore/config/queryprofiles/1");
  $qp->add_field_weight('name', '2.0'); // the name field is twice as important as average
  $qp->add_field_weight('comments', '0.5'); // the comments field is half as important as average
  $response = $qp->put_to_network();
  if ( $response->is_success() ) {
    // do something with qp...
  }

You can also remove field weights and replace them with alternate ones:

  $qp = new QueryProfile("http://api.talis.com/stores/mystore/config/queryprofiles/1");
  $response = $qp->get_from_network();
  if ( $response->is_success() ) {
    $qp->remove_field_weight('comments');
    $qp->add_field_weight('comments', '3');
  }

Finally, I added a utility function to assist when copying a query profile from one store to another. It recalculates all the URIs so they apply to the new store rather than the old one. Here’s how you could use it to clone a query profile from one store to another:

  $qp = new QueryProfile("http://api.talis.com/stores/mystore/config/queryprofiles/1");
  $response = $qp->get_from_network();
  if ( $response->is_success() ) {
    $new_qp = $qp->copy_to("http://api.talis.com/stores/otherstore/config/queryprofiles/1");
    $new_qp->put_to_network();
  }

You might be wondering why I chose the long method names get_from_network and put_to_network over shorter ones like load or save. The reason is that I strongly believe that it’s wrong to hide the network from the application. One of Moriarty’s principles is that it is the thinnest wrapper around HTTP that is possible. That’s why many network operations return the actual response object from the HTTP interaction. The developer can then inspect any headers that the server sends. Naming these methods explicitly reminds the developer that these are network operations and not local ones. The developer needs to be aware because networked applications need to be written differently to those operating on a single machine. Networks have latency, so it’s not wise to be calling these methods a thousand times a second, and they are unreliable, so the developer needs to be able to handle failure gracefully and be prepared to retry (these are a couple of the 8 Fallacies of Distributed Computing). Moriarty doesn’t try to hide these issues from the developer.

I also added query profile support to the Config class. The get_first_query_profile method is guaranteed to get you the query profile of your store, regardless of its URI. As explained in the FAQ, query profiles can exist in a number of locations depending on the store configuration. I worked out the logic for every existing store on the platform, so this code will always get your query profile:
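Being prepared to retry can be as simple as a small wrapper like the one sketched here. The name with_retries is hypothetical and not part of Moriarty, and failure is modelled as an exception; since Moriarty's methods return a response object, your callable would need to throw when the response is not a success.

```php
<?php
// Call $operation until it succeeds or $max_attempts is reached,
// waiting longer after each failure.
function with_retries(callable $operation, $max_attempts = 3, $base_delay_ms = 100) {
    $attempt = 0;
    while (true) {
        $attempt++;
        try {
            return $operation($attempt);
        } catch (Exception $e) {
            if ($attempt >= $max_attempts) {
                throw $e; // out of attempts: let the caller handle it
            }
            // Exponential backoff: the base delay, doubled after each failure.
            usleep($base_delay_ms * 1000 * pow(2, $attempt - 1));
        }
    }
}

// Simulated flaky operation: fails twice, then succeeds.
$result = with_retries(function ($attempt) {
    if ($attempt < 3) {
        throw new RuntimeException("simulated network failure");
    }
    return "ok after $attempt attempts";
}, 3, 1);
echo $result, "\n";
```

Retrying is only safe for operations that can be repeated without harm, such as a GET, or a PUT of the same representation.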

  $store = new Store("http://api.talis.com/stores/mystore");
  $config = $store->get_config();
  $qp = $config->get_first_query_profile();

If you just want the query profile URI then you can call $config->get_first_query_profile_uri().

Experimental Convert Service

Lately I’ve been working on an experimental Convert service. The idea is much like dajobe’s triplr or Simile’s babel – accept a variety of semantic formats as input, and make them available in other flavours as output.

RDF -> RDF

The service accepts HTML (preferably with eRDF, RDF, or microformats), RDF/XML, Turtle, or RDF/JSON as input, outputting to a variety of RDF serialisations. For the parsing of most of these RDF formats, the service uses Benjamin Nowack’s excellent ARC library for PHP.

SPARQL/XML and Facet/XML

The service also accepts SPARQL/XML and the XML from the Talis Platform Facet service, transforming to either JSON, JSONP, or HTML.

The conversions are done by a PHP library, available in the n² SVN repository.