
A MalBestPractice with RDF: Making Assumptions

Michael Hausenblas has a new blog post listing some common malpractices when working with RDF.

RDF is a model, not a format

I especially agree with his point about “Thinking of RDF on the serialisation level” (as a malpractice) – grabbing values from RDF/XML or RDFa with XPath or regexes is not wise. It makes an unsafe assumption about the stability of the serialisation. In fact, if you are writing a Linked Data application, there are very few assumptions you can safely make, about either the serialisation or the model.
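To make the fragility concrete, here is a minimal Python sketch using only the standard library. The two documents and the example.org URI are invented for illustration: they serialise the same one-triple graph as RDF/XML, but bind the FOAF namespace to different prefixes. A regex keyed on the prefix finds the name in one and misses it in the other. Looking the element up by its full namespace URI survives this particular change, though it is still working at the XML level and would break on other legal RDF/XML spellings of the same graph, so treat it as a demonstration of the problem rather than a recommendation.

```python
import re
import xml.etree.ElementTree as ET

FOAF = "http://xmlns.com/foaf/0.1/"

# Two made-up serialisations of the same graph; only the prefix differs.
DOC_A = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/alice#me">
    <foaf:name>Alice</foaf:name>
  </foaf:Person>
</rdf:RDF>"""

DOC_B = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:f="http://xmlns.com/foaf/0.1/">
  <f:Person rdf:about="http://example.org/alice#me">
    <f:name>Alice</f:name>
  </f:Person>
</rdf:RDF>"""


def name_by_regex(doc):
    """Serialisation-level scraping: assumes the prefix is always 'foaf'."""
    match = re.search(r"<foaf:name>([^<]*)</foaf:name>", doc)
    return match.group(1) if match else None


def name_by_namespace(doc):
    """Looks the element up by namespace URI, whatever prefix is bound to it."""
    element = ET.fromstring(doc).find(".//{%s}name" % FOAF)
    return element.text if element is not None else None


print(name_by_regex(DOC_A), name_by_regex(DOC_B))          # Alice None
print(name_by_namespace(DOC_A), name_by_namespace(DOC_B))  # Alice Alice
```

In a real application a proper RDF parser should do this job, so that the code only ever sees triples.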

RDF isn’t SQL, XML, OO …

So maybe my favourite MalBestPractising is: trying to treat RDF too much like some other software paradigm – too much like a relational database, too much like OO, too much like XML. It’s enticing to write software that treats RDF as if it were something more familiar to mainstream software developers, and to reach for the same techniques and shortcuts. But these shortcuts often rely on assumptions that can’t be made about RDF data (at least, not proper, organic, free-range RDF from the web). You can’t assume that the same RDF graph will be serialised the same way as last time. You can’t assume that the http://xmlns.com/foaf/0.1/ namespace will always be bound to the foaf prefix. You can’t assume that a resource will or won’t have a particular property just because it has another property or a particular type. If you don’t know that a statement exists, you can’t assume it doesn’t, only that you don’t know about it. And so on.
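The last two points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is a plain set of (subject, predicate, object) tuples, the example.org resources are made up, and the helper names `objects`, `is_stated` and `label_for` are hypothetical. The idea is that a lookup may return zero, one or many values, and that a missing statement is reported as "unknown" (`None`) rather than "false".

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Made-up data: two foaf:Person resources, only one of which has a foaf:name.
GRAPH = {
    ("http://example.org/alice", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/alice", FOAF + "name", "Alice"),
    ("http://example.org/bob", RDF_TYPE, FOAF + "Person"),
}


def objects(graph, subject, predicate):
    """Return every known value: there may be none, one, or several."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def is_stated(graph, triple):
    """True if the statement is known; None (not False) if we have no data."""
    return True if triple in graph else None


def label_for(graph, subject):
    """Use a foaf:name if one is known, otherwise fall back to the URI."""
    names = objects(graph, subject, FOAF + "name")
    return names[0] if names else subject


print(label_for(GRAPH, "http://example.org/alice"))  # Alice
print(label_for(GRAPH, "http://example.org/bob"))    # http://example.org/bob
```

Being typed as a foaf:Person tells the code nothing about whether a name is present, so `label_for` never indexes into the result without checking it first.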

Not making these assumptions can be tedious, and at times problematic, but ultimately, the fewer assumptions you write into your code, the more interesting, open, and ‘webby’ your application can be.

Less assumption, less code, more data, more web

The huge game-changing thing about web development with the Web of Data though, is not the set of assumptions you can’t make, but the assumptions you don’t have to make. Thanks to the Follow Your Nose principle espoused by Linked Data, you don’t need to write assumptions about your data into your code; you can instead let the application “follow its nose” to find out more about the data.

You can follow vocabulary term URIs to find out how they can be used, how they can be labeled, and what inferences can be drawn from their use. You can follow owl:sameAs and rdfs:seeAlso links to find out more about a resource. You can use semantic index services like Sindice to find occurrences of a URI or keyword across the Web of Data. You can follow dcterms:isPartOf links from RDF documents back to voiD Datasets, which will often have links you can follow to licenses that tell you how the data can be used, and to other services (such as SPARQL endpoints).
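A rough sketch of the link-following part, in Python. Several things here are assumptions made for illustration: triples are plain tuples, `fetch` stands in for whatever dereferences a URI and parses the result, and `FAKE_WEB` is an invented in-memory substitute for the web so the example runs offline. The walk is breadth-first, remembers what it has seen, and stops after a fixed number of hops so it cannot wander forever.

```python
from collections import deque

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOLLOW = {OWL_SAME_AS, RDFS_SEE_ALSO}  # predicates worth following


def follow_your_nose(start, fetch, max_hops=2):
    """Collect triples from `start` and from URIs linked via FOLLOW predicates."""
    seen = {start}
    queue = deque([(start, 0)])
    collected = set()
    while queue:
        uri, hops = queue.popleft()
        triples = fetch(uri)
        collected.update(triples)
        if hops == max_hops:
            continue  # read this document, but do not follow its links
        for _, predicate, obj in triples:
            if predicate in FOLLOW and obj not in seen:
                seen.add(obj)
                queue.append((obj, hops + 1))
    return collected


# Invented stand-in for the web: URI -> triples found when dereferencing it.
A, B, C, D = ("http://example.org/a", "http://example.com/b",
              "http://example.net/c", "http://example.net/d")
FAKE_WEB = {
    A: [(A, OWL_SAME_AS, B)],
    B: [(B, RDFS_SEE_ALSO, C), (B, FOAF_NAME, "B")],
    C: [(C, RDFS_SEE_ALSO, D)],
    D: [(D, FOAF_NAME, "D")],
}


def fake_fetch(uri):
    return FAKE_WEB.get(uri, [])


print(len(follow_your_nose(A, fake_fetch)))  # 4: documents a, b and c were read
```

With the default limit of two hops, the walk reads a, b and c but does not go on to d; raising `max_hops` to 3 picks up d as well.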

The more data is published, not just within datasets, but about datasets, and about services, the more we can write applications that open up to the web, and the fewer lines of code we will need to do it!

(Semantic) Web Agents and OSGi

A little fyi/progress report.

For a couple of years now I’ve been mooching around refactoring the intelligent agent paradigm to cover (RESTful) Web services. The kind of intelligence I have in mind is potentially, well, non-existent: a regular Web site could be considered an agent. The motivation is mostly that developing spec-compliant systems on the Web is in general a lot of work, and that this leads to either cutting corners/breaking specs or using frameworks that limit one’s opportunities for innovation. When we introduce Semantic Web technologies into the mix, things get even more difficult.

So what I was after was a simple abstraction of (Semantic) Web systems/services that would allow a lot of the gruesome details of implementation to be hidden away, without breaking the Web. What I came up with looks like this:

An archetypal agent would feature (access from) a HTTP server and a local HTTP client for input & output, a local RDF model for its working memory along with some kind of business logic (behaviour) that would determine what it actually did. (I’m putting on hold one of the usual features of intelligent agents – mobility – though a story on this would be nice for issues like scalability). Agents are effectively self-contained, event-driven components with a common interface (HTTP).

A regular Web site could fit this abstraction in a degenerate form: no HTTP client, content is held in a persistent model, the behaviour is just to deliver that content to any other agents that make appropriate requests (in this case those other agents would typically be browsers, well-known degenerates).
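The shape of that abstraction can be sketched in a few lines. This is not the framework code (which was Java), just a minimal Python illustration under assumptions of my choosing: the working memory is a set of plain tuples rather than a real RDF model, the HTTP server is reduced to a `handle` method it would call, and the names `Agent` and `serve_content` are hypothetical. The degenerate Web site is then an agent with no client whose behaviour only hands back what is in its model.

```python
class Agent:
    """Working memory (a set of triples) plus a pluggable behaviour."""

    def __init__(self, behaviour, model=None, client=None):
        self.model = set(model or ())  # local "RDF model", here just tuples
        self.behaviour = behaviour     # callable deciding what the agent does
        self.client = client           # outgoing HTTP client; None means no client

    def handle(self, method, path, body=None):
        """Entry point an HTTP server would call; returns (status, payload)."""
        return self.behaviour(self, method, path, body)


def serve_content(agent, method, path, body):
    """Degenerate 'Web site' behaviour: deliver stored content, nothing else."""
    if method != "GET":
        return 405, []
    matches = sorted(t for t in agent.model if t[0] == path)
    return (200, matches) if matches else (404, [])


# A made-up one-page site.
TITLE = "http://purl.org/dc/terms/title"
site = Agent(serve_content, model={("/about", TITLE, "About this site")})

print(site.handle("GET", "/about"))    # (200, [('/about', ..., 'About this site')])
print(site.handle("GET", "/missing"))  # (404, [])
```

A less degenerate agent would swap in a different behaviour and a real client, while the `handle` interface stays the same.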

In the past I must confess I’ve tried to express this stuff via MVC, which was a bit of a stretch – I agree with Ian’s view that this isn’t really appropriate for the Web. RMR, ROA or WOA (take your pick!) is a much better fit. Having said that, I’m not sure how much the developer should be operating on the level of resources and representations, they seem more like bricks and cement than architecture – e.g. conneg and httpRange-14 303s should Just Work.

So now (or rather, quite a while ago) I needed a proof of concept system that would allow easy construction of this kind of agent, and I spent a good many free-time hours putting together a little framework. My approach (in Java) was for the framework to provide a container for agents, with those agents aware of whether or not they were in the same container. If they were, they could address each other directly, while still supporting HTTP I/O for communications otherwise.
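The container idea can be illustrated with a short sketch. Again this is Python rather than the original Java, and everything in it is an assumption made for the example: agents are simple callables, `Container` and `send` are hypothetical names, and the HTTP transport is an injected function so the example runs without a network. A message goes straight to the agent when it lives in the same container, and over HTTP otherwise.

```python
class Container:
    """Hosts agents and picks the cheapest route for each message."""

    def __init__(self, http_send):
        self._agents = {}
        self._http_send = http_send  # fallback transport for remote agents

    def register(self, uri, agent):
        self._agents[uri] = agent

    def send(self, uri, message):
        """Return (route, reply); route is 'direct' or 'http'."""
        agent = self._agents.get(uri)
        if agent is not None:
            return "direct", agent(message)  # same container: plain call
        return "http", self._http_send(uri, message)


def echo_agent(message):
    return "echo: " + message


def fake_http_send(uri, message):
    # Stand-in for a real HTTP request to an agent hosted elsewhere.
    return "sent %r to %s" % (message, uri)


container = Container(fake_http_send)
container.register("http://example.org/agents/echo", echo_agent)

print(container.send("http://example.org/agents/echo", "hi"))
print(container.send("http://example.com/agents/other", "hi"))
```

Either way the caller addresses the agent by URI, so moving an agent out of the container changes the route but not the calling code.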

I got quite a long way, despite hitting numerous snags (incorporating asynchronous eventing into the HTTP request/response cycle was a good one). But then, as of a few months ago, I didn’t have much opportunity to look at this stuff.

Fast forward to a few weeks ago. In my todo queue was getting down deep with OpenID and OAuth (which I’m familiar with but haven’t really stress-tested), and it was hard not to imagine using the agent approach to play with these components. Coincidentally I went up to visit Reto in Switzerland and the company he now works for – Trialox who are (amongst other things) building a Semantic Web CMS. While I was up there, Reto gave me an intro to OSGi (formerly the Open Services Gateway initiative) which is essentially a set of specs for a Java-based service platform – it’s used in Eclipse, for example. Somewhat bizarrely I think I missed out on learning about this previously because I must have glazed over when seeing the acronym, confusing it with OGSI (the Open Grid Services Infrastructure).

To cut a long story marginally shorter, I’ve now ditched my own agent framework code (I can no doubt recycle bits) in favour of OSGi, and am currently noodling with creating the appropriate bundles – as OSGi calls its components – for the agent stuff, using Apache Felix as the host framework. I’ve still a good way to go before I get to my proof of concept, but after only a couple of days learning/coding I’m already making much more rapid progress than I was with my own ad hoc stuff. With a bit of luck I’ll have testbed stuff together for OpenID & OAuth (and related setups like FOAF+SSL) within the next week or so. I’m obviously also going to be looking at hooks into the Talis Platform. I can’t remember offhand whether it was Ian, Leigh or Sam, but someone’s already put together a load of Java client code to wrap HTTP interactions with the Platform, so most of the work there’s already been done.

Oh yeah, and I reckon OSGi might well give me a neat approach to the Semantic Web in a Box.

[Work in progress is currently in my personal svn://hyperdata.org/svn/ but I'll move it into the n² svn once I've got something more functional].

GRDDLing DeWitt’s Friends

DeWitt Clinton has a great write-up of Creating a HTML “friends” page from a Google Reader subscription list, a bit of hackery which leads to a hCard microformat-enriched friends list. A little tweak to the HTML can make it more machine-friendly, just adding a HTML Meta Data profile URI:

<head profile="http://www.w3.org/2006/03/hcard">

That profile is GRDDL-enabled, so any GRDDL-aware agent can interpret the source document as RDF. This part’s easy to demonstrate, thanks to the online W3C GRDDL service. So I’ve put a tweaked version of the HTML online, and here’s DeWitt’s friends page as RDF (in Turtle syntax, rendered a little verbosely).

Having set this up I realised the data wasn’t actually expressing the friend relationship, so went on to put together some SPARQL to sort that out – below. But afterwards I realised that DeWitt’s HTML was actually expressing the relationships using XFN class names, but again without the profile URI to make it machine-friendly. So another tweak:

<head profile="http://www.w3.org/2006/03/hcard http://www.w3.org/2003/g/td/xfn-workalike">

- the corresponding service output (scroll down to see the extra bits). I suppose I should mention that you can have as many space-separated profiles as you like, and the GRDDL-aware agent will interpret them independently, just accumulating all the triples. The second profile URI adds xfn:friend relationships. I think it would have been more useful with foaf:knows as well, but it is only a demo. One of these days the microformats folks might get around to tweaking the official profile appropriately…
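As an illustration of the space-separated profiles, here is a small Python sketch of just the first step a GRDDL-aware agent would need: pulling the profile URIs out of the head element. It uses only the standard library; the class and function names are hypothetical, and fetching each profile and running its transformation is deliberately left out.

```python
from html.parser import HTMLParser


class ProfileFinder(HTMLParser):
    """Collects the URIs listed in the profile attribute of <head>."""

    def __init__(self):
        super().__init__()
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            value = dict(attrs).get("profile") or ""
            # Profiles are separated by whitespace; each is handled on its own.
            self.profiles.extend(value.split())


def profile_uris(html):
    finder = ProfileFinder()
    finder.feed(html)
    return finder.profiles


PAGE = ('<html><head profile="http://www.w3.org/2006/03/hcard '
        'http://www.w3.org/2003/g/td/xfn-workalike">'
        '<title>Friends</title></head><body></body></html>')

for uri in profile_uris(PAGE):
    print(uri)
```

An agent would then process each URI independently and merge the resulting triples into one graph.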

The SPARQL I mentioned looks like this:

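prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix vcard: <http://www.w3.org/2006/vcard/ns#>
prefix foaf: <http://xmlns.com/foaf/0.1/>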

CONSTRUCT
{
[ a foaf:Person;
foaf:homepage ;
foaf:name "DeWitt Clinton" ;
]
foaf:knows
[ a foaf:Person;
foaf:homepage ?homepage ;
foaf:name ?name ] .
}
WHERE
{
[ a vcard:VCard ;
vcard:url ?homepage ;
vcard:fn ?name ]
}

- when applied to DeWitt’s data (as RDF), this will map it across from the vCard vocabulary to FOAF. The WHERE clause pattern is matched against the data to bind the ?variables, and each set of bindings is then substituted into the CONSTRUCT template to produce some new RDF.
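For anyone who finds the match-then-substitute step easier to read as ordinary code, here is a hand-rolled Python imitation of what the query does. It is a sketch, not a SPARQL engine: triples are plain tuples, the vCard namespace URI is my assumption about what the hCard profile emits, and the owner's homepage is passed in as a parameter with a made-up example.org value. One detail it does mirror is that blank nodes in a CONSTRUCT template are created afresh for every solution.

```python
from itertools import count

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VCARD = "http://www.w3.org/2006/vcard/ns#"  # assumed namespace for the hCard output
FOAF = "http://xmlns.com/foaf/0.1/"


def construct_knows(graph, owner_name, owner_homepage):
    """Imitate the CONSTRUCT: one filled-in template per WHERE solution."""
    fresh = count()
    out = set()
    cards = sorted(s for s, p, o in graph
                   if p == RDF_TYPE and o == VCARD + "VCard")
    for card in cards:
        homepages = sorted(o for s, p, o in graph
                           if s == card and p == VCARD + "url")
        names = sorted(o for s, p, o in graph
                       if s == card and p == VCARD + "fn")
        for homepage in homepages:      # each (homepage, name) pair
            for name in names:          # is one solution of the WHERE pattern
                owner = "_:b%d" % next(fresh)   # fresh blank nodes per solution
                friend = "_:b%d" % next(fresh)
                out |= {
                    (owner, RDF_TYPE, FOAF + "Person"),
                    (owner, FOAF + "homepage", owner_homepage),
                    (owner, FOAF + "name", owner_name),
                    (owner, FOAF + "knows", friend),
                    (friend, RDF_TYPE, FOAF + "Person"),
                    (friend, FOAF + "homepage", homepage),
                    (friend, FOAF + "name", name),
                }
    return out


# Made-up input: two complete cards and one without a name.
DATA = {
    ("_:c1", RDF_TYPE, VCARD + "VCard"),
    ("_:c1", VCARD + "url", "http://example.org/ann"),
    ("_:c1", VCARD + "fn", "Ann"),
    ("_:c2", RDF_TYPE, VCARD + "VCard"),
    ("_:c2", VCARD + "url", "http://example.org/bo"),
    ("_:c2", VCARD + "fn", "Bo"),
    ("_:c3", RDF_TYPE, VCARD + "VCard"),
    ("_:c3", VCARD + "url", "http://example.org/nameless"),
}

result = construct_knows(DATA, "DeWitt Clinton", "http://example.org/owner")
print(len(result))  # 14: seven triples for each of the two matching cards
```

Because every solution gets its own pair of blank nodes, the owner is described once per friend rather than once overall, which goes some way to explaining why output from a query like this is heavy on bnodes.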

I tried this on the Redland SPARQL demo, and I think it’s producing the RDF I wanted. Unfortunately the serialization is really ugly – lots of bnodes, and it’s hard to check visually. It appears to confuse Tabulator too, and the W3C RDF Validator, which is handy for this kind of visualization, appears to be down. (Here’s a copy of the RDF/XML). Still, it was only a workaround – with the right profiles in place it’s not needed.

I’m not sure if there’s a microformat way of expressing that the source data was a subscription/reading list. To get the richest RDF out it might be easier to do what DeWitt did, but to a full RDF serialization rather than microformatted HTML (which is effectively a CustomRdfDialect), producing something like Planet RDF’s blogroll.