Subscribe

Archive for the 'Projects' Category

Using Moriarty for Serving Linked Data

Although Moriarty is a general purpose library for building applications with the Talis Platform (and tried and tested in Talis Prism and Talis Aspire) one of the most common uses is simply to provide a browsable interface for linked data held in a Talis Platform store. Typically these scripts take the URI sent by the web browser and use a SPARQL query or the Talis Platform’s describe service to fetch linked data about that URI. They then style that as HTML or send it directly back as RDF. There are a series of technical details they all need to deal with: 303 redirects, content negotiation, converting RDF to HTML etc.

I’ve worked on more comprehensive libraries (e.g. Paget) to manage this kind of publishing but I thought the simple case of fetching and styling the data would make a good example of how to use Moriarty. I spent a bit of time this afternoon putting an example script together based on several I’ve written in the past. You can find the result in the dataspace subdirectory of the examples folder. That subdirectory contains four files:

  • dataspace.php — this is the example script. It contains the logic to fetch the relevant description, handle content negotiation of the best output format and styling the result appropriately. It’s not designed to be called directly, but to be included from a configuration file…
  • index.php — this is an example configuration file. It is designed to be dropped into a web server directory and then intercept all requests beginning with that URI. It contains the configuration describing which Talis Platform store to use, where cache files can be written and where to find Moriarty and ARC2. The last thing it does is to load dataspace.php which then handles the browser request.
  • sample.htaccess — this is a sample .htaccess file for Apache webservers. It redirects all requests via index.php.
  • plain.tmpl.html — this is the default template used to render the HTML views. This can be overridden in the configuration.

Using the example script is simple: just copy index.php to the root directory of your linked data space. If you’re using Apache then you need to copy sample.htaccess into the same directory and rename it to .htaccess. Edit index.php so it refers to your store and your URIs and that’s it! You can see it in action with the default template on my own linked data space.

About Moriarty… Moriarty is a simple PHP library for accessing the Talis Platform. It follows the Platform API very closely and wraps up many common tasks into convenient classes while remaining very lightweight. It also provides some simple RDF classes that are based on the excellent ARC2 class library. Moriarty is being developed by small community of developers and is in continual beta, subject to a slow stream of updates. You can read more about Moriarty on the n² wiki or visit its Google Code project

Notes on Cross-Domain Ajax

Background

I asked for a little project I could get my teeth into, Leigh suggested something very tasty. An analytics app, along the lines of Google Analytics or the (very impressive) open source Piwik. Basically tracking things like page visits, referers, outbound clicks and so on. The difference from the existing apps being taking advantage of semweb goodness, specifically a Talis Platform store as a backend.

What this required was something that would run in the browser when someone visited a given Web page and pass on relevant data to a server which would push that data into the store. A script discretely embedded on the page of interest picks up the activity and posts it to the server-side logging system. There wasn’t really a sensible choice other than to use Javascript client-side, and to keep things reasonably portable server-side I opted for PHP. The server-side processing is relatively straightforward (although I’m not actually capturing much yet), but the browser-server comms part turned out to be a real doozy.

It’s not difficult to call a HTTP server from inside Javascript wrapped in HTML loaded in a browser. The snag is that the security model common to popular browsers blocks access to server domains other than the one that originated the page containing the Javascript. I got some code running from http://hyperdata.org that nicely delivered some basic logging of visits to pages on http://hyperdata.org (including the Wiki I have there – though it took a while to find the right template…). Problems started when I tried the same script in pages hosted under http://danny.ayers.name. Browser no likey, wrapping the server call in a try...catch block and throwing up an alert(error) always revealed Exception… “Access to restricted URI denied” code: “1012″ – this is the same origin policy. What follows are the workarounds for this. Googling the titles here will provide a variety of sample code that implements the solutions. I’ve opted for Hidden Form, it being straightforward for my purposes and standards-friendly.

Cross-domain proxy

Conceptually the easiest, this approach uses a server-side pipeline that lives on the same domain as the delivered pages containing the Javascript. It essentially echos calls from the delivering server to the remote server that does the work. This didn’t seem a good choice for the analytics app as every end-user would require such a proxy on their own server.

  • Pros: straightforward; independent of browser vagaries; spec friendly
  • Cons: needed for every host delivering pages with embedded scripts (if all the servers involved are yours, this is probably a good choice)

Tag Overload Hacks

When a typical browser hits HTML tags <script> and <img> (any others?) it will quite happily do a HTTP GET on them, irrespective of domain. There’s been a fair bit of finesse applied around the use of <script> – notably the elegant but brain-boiling JSONP (JSON with Padding) which passes around scripts padded to be non-executable and involves callbacks. Somehow. I won’t comment further on this, except to say I understood it for about 5 minutes then lost it again when I went to make a coffee. I’m told jQuery will do something similar automagically if you choose datatype: "json" and method: "get".

The <img> approach has been around seemingly forever – it’s also known as a Web Bug. Usually you have a 1×1 pixel image in the page of interest (probably inserted dynamically through DOM calls), every time the page is loaded that image’s URI gets a GET. The trick for tracking is to append the image URI with a bunch of query parameters and have your server intercept the GET call. Apparently this is how Google Analytics does its stuff.

  • Pros: good library support
  • Cons: limited to GETs; rather an ugly hack

Flash Proxy

Most people suggested this when I was asking around Twitter and the jQuery mailing list. Turns out there’s a really convenient library that does all the hard work (Google “flXHR”). But I’m afraid I prefer to give Flash a miss when there are open standards available, so I didn’t investigate.

  • Pros: easy (apparently) with library support
  • Cons: uses proprietary stuff

Hidden Form

When I first saw references to this I overlooked it – it seemed to demand an iFrame and ugly hackery. But then (largely thanks to this discussion of cross-domain Ajax) I realised it was almost certainly the best bet for the analytics app. Essentially you dynamically push a <form> into the HTML DOM with your data as input values, then call a form.submit(). Most references to this I found did involve an iFrame to receive the HTTP response – necessary if you’re doing a mashup or something, but not if you only need to POST data off to the server. In this latter circumstance you need to get the server to return a 204 No Content status code, but that’s trivial in PHP (header('HTTP/1.1 204 No Content');), otherwise the browser will try to load the target URI material.

  • Pros: supports and is very simple for POSTing to server; standards-friendly in this context
  • Cons: gets uglier if you want a response

I’ve not properly doc’d my app code yet (and the functionality is a very long way from complete, let alone tidied up), but you can find it all via my latest Wiki – there’s an example of the Javascript in test.html (just before the closing </body> tag). I’ve only tested it on Firefox so far, but I reckon there’s a good chance of the LazyWeb giving me solutions to any cross-browser issues.

Many thanks for all the helpful suggestions: from this thread on the jQuery mailing list and Twitterers @rjw @flensed @gridinoc @weblivz @JeniT @jQueryHowto.

I’d love to hear of any other solutions to cross-domain Ajax, please drop in comments, mail me or tweet me.

Moriarty Progess Report

It’s been a while since I wrote about Moriarty, the PHP library I created for working with the Talis Platform. That’s not to say that there have been no changes: on the contrary, there have been lots of improvements and some major new areas of functionality. I’m going to summarise them in this post and then, time permitting, follow up with more detailed posts on particular areas.

  • Fresnel Selector Language — This is a major new addition to Moriarty. A new class called GraphPath has been added which implements almost all of the Fresnel Selector Language specification. I’ve been interested in RDF path languages for a long time and FSL now appears to be the strongest contender. Currently this is a stand-alone class, but after a few more cycles of testing it would be nice to add a convenience method to SimpleGraph to allow selection of resources using paths.
  • Zend-compatible caching — I did a substantial refactoring a while back to convert Moriarty’s HTTP caching implementation to be compatible with Zend’s cache interfaces. Whereas before you were limited to caching HTTP responses to disk, you can now supply any of Zend’s built-in cache classes to enable caching in databases, memcached and many other systems. You do this by creating an instance of HttpRequestFactory with your cache class and then pass the factory to the Store class.
  • JSON usageRelease 24 of the Talis Platform introduced RDF/JSON serialisation for describe and constuct SPARQL queries. Moriarty now requests this format where it and because the SimpleGraph class uses an RDF/JSON structure as its index often there is no result parsing involved at all.
  • OAI Service Support — OAIService is a new class that represents a store’s OAI-PMH Service. It provides simple access to the OAI service allowing all resources in a store to be listed.
  • Automated Builds — We have added Moriarty to an Hudson server which monitors the subversion repository and runs the unit test suite after every checkin. It’s not ideal because the server is not accessible to users outside of Talis (come on Google Code – we need Hudson support!). However it adds an extra level of confidence to checkins because test failures are emailed out to moriarty-dev@googlegroups.com which is open for anyone to join. A few times in the past we have run the unit tests locally and then forgotten to check in some critical dependency so the subversion trunk contains a broken build. Hudson will alert us to these kinds of errors much more quickly.
  • Extended Describes — The Sparql classes now accept an extra parameter to their describe methods that allows you to specify the type of description you want. By default you get the Platform’s default graph which is a list of triples that have the subject you specify (no bnodes remember!). Moriarty allows you to easily request other types such as symmetric bounded description (triples where the URI being described is subject or object), labelled bounded description (like the default description plus the addition of label properties for URIs in the description set) and symmetric labelled bounded description (a combination of the previous two). See the Bounded Description page on the n2 wiki for more information.
  • Richer Store Interface — Up until now Moriarty has used a object model that closely follows the Platform’s separation of services. That tended to make code using Moriarty quite verbose. We’re now gradually introucing convenience methods onto the Store class so common operations can be accessed with less code.

The moriarty Google Code project now has several committers although Keith and myself are still the most prolific. However, having multiple committers is one more step away from this being a personal project and towards it being community owned. Moriarty is being used in lots of small projects in and around Talis, but significantly it is also in the core of two of our most important products: Talis Prism and Talis Aspire. That’s great validation for Moriarty, although it brings a lot more responsibility in terms of quality of testing. I now consider Moriarty to be out of continual alpha and into continual beta!

About Moriarty… Moriarty is a simple PHP library for accessing the Talis Platform. It follows the Platform API very closely and wraps up many common tasks into convenient classes while remaining very lightweight. It also provides some simple RDF classes that are based on the excellent ARC2 class library. Moriarty is being developed by small community of developers and is in continual beta, subject to a slow stream of updates. You can read more about Moriarty on the n² wiki or visit its Google Code project

Pho: A Ruby Client for the Talis Platform

This is a short blog post to announce a project I’ve been working on in my spare time. Pho, is a Ruby client for the Talis Platform. Its hosted on Rubyforge so getting started couldn’t be easier:

gem install pho

Will download and install the Pho gem, along with the documentation which you can also read online.

The distribution comes with a couple of example scripts that show how to add items to the Content Box, perform SPARQL queries, check status of a store, etc.

However the API currently does a lot more than that giving you full access to all of the core Platform services including: storing binary data & RDF metadata, SPARQL queries, faceted browsing, job control, store configuration options, etc.

There’s still plenty of work to be done but at version 0.3 I think there’s enough functionality available that you can build useful applications using the API. For example there’s sufficient code there now to use the library to script some simple data management activities for publishing or managing data in the Platform, or to build a simple linked data browser. I hope to post examples of doing exactly those things over the next few weeks, as I’m planning some updates to my space data store (briefly described here) that will be handled using Pho.

The next steps are to plug some of the gaps in the API  — specifically parsing of search results, access to the job metadata we exposed in version 21, and better support for changeset management. I’m also going to explore some simple Ruby-RDF mapping functionality. The latter should help turn Pho into something that can provide the core functionality required for building linked data backed applications.

I’d love to get feedback on this, so feel free to post bug reports or feature suggestions either on the Rubyforge project or the n2-dev mailing list.

Metamorph Open Source project for Semantic Converter Web Service

I’ve published the code behind the Talis Convert Service (production release at stable URL coming soon) as an open source project on Google Code, called Metamorph .

Metamorph is a service aimed at semantic web developers. It is much like triplr, babel, swignition and any23 (please leave a comment pointing to any other similar services).

You give it a(n http) URI, an (optional) input format, and an output format, and it will fetch the document from the web, and convert it into the output format.

Understood input values include:

  • Semantic HTML (RDFa, eRDF, microformats, POSH)
  • RDF (XML, Turtle, JSON)
  • SPARQL-XML
  • Facet XML (the response format of the facets service available on all platform stores)

Output for all input formats can be:

  • JSON
  • JSONP
  • HTML

If the input is some form of RDF, you can also ask for:

  • RDF (XML, Turtle, JSON, – and the default HTML is rendered as RDFa)
  • RSS 1.0
  • TriX
  • Exhibit (web page, JSON, JSONP)

In addition, if the input is an RDF format, you can specify multiple data URIs, and the results will be merged in the output document. For instance, this conversion merges data from two of my homepages, and a Turtle file.

I’m thinking about removing the TriX output, as I’m not sure it would be used by anyone – the reason I didn’t bother to write a parser for it was because I haven’t seen any data published as TriX in the first place.

I welcome any input on what else would be useful from this web service. I suspect that more output options, while fairly easy to add, would not be very useful. More input options may be useful, but perhaps not significantly so.

I suspect what might be more useful, and more likely to distinguish this from similar RDF converter services, are graph transformation services, which might include:

  • Diffs
  • Intersects
  • Smushing
  • Augmenting on property and class type URIs with labels and comments, perhaps retrieved from SchemaCache

Metamorph is coded in PHP, and uses ARC for parsing RDF and HTML, and serialising RDF/XML and Turtle.

Please use the issue tracker for raising any bugs or feature requests.

Moriarty Release 1.1

After some nudging from the Talis development team I tagged the current trunk of Moriarty as version 1.1:

http://moriarty.googlecode.com/svn/tags/release-1.1/

This is a stable release and should be backwards compatible with 1.0. The trunk continues to be the bleeding edge.

Moriarty Documentation

I started adding some API documentation to Moriarty using the excellent PHPDoctor. The documentation is in subversion but you can also view it online.

Exploring OpenLibrary Part Two

More than two weeks on from my last look at the OpenLibrary authors data and I’m finally finding some time to look a bit deeper. Last time I finished off thinking about the complete list of distinct dates within the authors file and how to model those.

Where I’ve got to today is tagged as day 2 of OpenLibrary in the n2 subversion.

First off, a correction – foaf:Name should have been foaf:name. Thanks to Leigh for pointing that out. I haven’t fixed in this tag, tagged before I realised I’d forgotten it, but next time, honestly.

It’s clear that there is some stuff in the data that simply shouldn’t be there, things that cannot possibly be a birth date such [from old catalog] and *. and simply ,. When I came across —oOo— I was somewhat dismayed. MARC data, where most of this data has come from, has a long and illustrious history, but one of the mistakes made early on was to put display data into the records in the form of ISBD punctuation. This, combined with the real inflexibility of most ILSs and web-based catalogs has forced libraries to hack there records with junk like —oOo— to fix display errors. This one comes from Antonio Ignacio Margariti.

In total there are only 6,156 unique birth date datums and 4,936 unique death dates. Of course there is some overlap, so in total there’s only 9,566 datums to worry about overall.

So what I plan to do is to set up the recognisable patterns in code and discard anything I don’t recognise as a date or date range. Doing that may mean I lose some date information, but I can add that back in later as more patterns get spotted. So far I’ve found several patterns (shown here using regex notation)…

“^[0-9]{1,4}$” – A straightforward number of 4 digits or fewer, no letters, punctuation or whitespace. These are simple years, last week I popped them in using bio:date . That’s not strictly within the rules of the bio schema as that really requires a date formatted in accordance with ISO8601. Ian had already implied his dis-pleasure with my use of bio:date and suggested I use the more relaxed dc elements date. However, on further chatting what we actually have is a date range within which the event occurred, so we need to show that the event happened somewhere within a date range. This can be solved using the W3C Time Ontology which allows for better description.

I spent some time getting hung up on exactly what is being said by these date assertions on a bio:Birth event. That is, are we saying that the birth took place somewhere within that period, or that the event happened over that period. This may seem a daft question to ask, but as others start modelling events in peoples’ bios this could easily become indistinguishable. Say I want to model my grandfather’s experience of the second world war. I’d very likely model that as an event occurring over a four year period. So, I feel the need to distinguish between an event happening over a period and an event happening at an unknown time within a period. I thought I was getting too pedantic about this, but Ian assured me I’m not and that the distinction matters.

The model we end up with is like this


@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/TR/owl-time/> .

<http://example.com/a/OL149323A>
	foaf:Name "Schaller, Heinrich";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL149323A>;
	bio:event <http://example.com/a/OL149323A#birth>;
	a foaf:Person .

<http://example.com/a/OL149323A#birth>
	dc:date <http://example.com/a/OL149323A#birthDate>;
	a bio:Birth .

<http://example.com/names/schallerheinrich>
	mine:name_of <http://example.com/a/OL149323A>;
	a mine:Name .

<http://example.com/dates/gregorian/ad/years/1900>
	time:unitType time:unitYear;
	time:year "1900";
	a time:DateTimeDescription .

<http://example.com/a/OL149323A#birthDate>
	time:inDateTime <http://example.com/dates/gregorian/ad/years/1900>;
	a time:Instant .

The simple year accounts for 731,304 of the 748,291 birth dates and for 13,151 of the 181,696 death dates, about 80% of the dates overall. Following the 80/20 rule almost perfectly, the remaining 20% is going to be painful. It has been suggested I should stop here, but it seems a shame to not have access to the rest if we can dig in, and I can, so…

First of the remaining correct entries are the approximate years, recorded as ca. 1753 or (ca.) 1753 and other variants of that. These all suffer from leading and trailing junk, but I’ll catch the clean ones of these with “^[(]?ca\.[)]? ([0-9]{1,4})$”. The difficulty with these is that you can’t really convert these into a single year or even a date range as what people consider as within the “circa” will vary widely in different contexts. So, the interval can be described in the same way as a simple year, but the relationship with the authors birth is not simply time:inDateTime. I haven’t found a sensible circa predicate, so for now I’ll drop into mine.


@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/TR/owl-time/> .

<http://example.com/a/OL151554A>
	foaf:Name "Altdorfer, Albrecht";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL151554A>;
	bio:event <http://example.com/a/OL151554A#birth>;
	bio:event <http://example.com/a/OL151554A#death>;
	a foaf:Person .

<http://example.com/a/OL151554A#birth>
	dc:date <http://example.com/a/OL151554A#birthDate>;
	a bio:Birth .

<http://example.com/a/OL151554A#death>
	dc:date <http://example.com/a/OL151554A#deathDate>;
	a bio:Death .

<http://example.com/names/altdorferalbrecht>
	mine:name_of <http://example.com/a/OL151554A>;
	a mine:Name .

<http://example.com/dates/gregorian/ad/years/1480>
	time:unitType time:unitYear;
	time:year "1480";
	a time:DateTimeDescription .

<http://example.com/a/OL151554A#birthDate>
	mine:circaDateTime <http://example.com/dates/gregorian/ad/years/1480>;
	a time:Instant .

Ok, it’s time to stop there until next time. I have several remaining forms to look at and some issues of data cleanup.

Next time I’ll be looking at parsing out date ranges of a few years, shown in the data 1103 or 4. These will go in as longer date time descriptions so no new modelling needed.

Then we have centuries, 7th cent., again just a broader date time description required I hope. There are some entries for works from before the birth of Christ – 127 B.C.. I’ll have to take a look at how those get described. Then we have entries starting with an l like l854. I had thought that these may indicate a different calendaring system, but it appear not. Perhaps it’s bad OCRing as there are also entries like l8l4. Not sure what to do with those just yet.

In terms of data cleanup, there are dates in the birth_date field of the form d. 1823 which means that it’s actually a death date. There are also dates prefixed with fl. which means they are flourishing dates. These are used when a birth date is unknown but the period in which the creator was active is known. These need to be pulled out and handled separately.

Of course, I haven’t dealt with the leading and trailing punctuation yet or those that have names mixed in with the dates, so still much work to do in transforming this into a rich graph.

Store Admin Interface

If you have a Talis store, or even if you’re just interested in browsing around existing talis stores, you might be interested in an admin interface  I’ve been working on.

Once you have selected a store, you can browse resources by type (rdf:type), search across the contentbox index, edit resources, view pending jobs and send new ones, import data, and configure the field-predicate mapping for your stores.

Please send bug reports and feature requests to keith dot alexander at talis.com

If you do want a talis store, just ask in #talis on irc.freenode.net, or email danny dot ayers  at talis.com

Moriarty Development List

I noticed that I was the only one getting notificiations of commits to Moriarty‘s subversion. I thought the best way to fix that was to create a Google group for moriarty and ensure the commit reports get sent there. So if you’re interested in keeping track of changes to Moriarty please sign up: moriarty-dev