Subscribe

A MalBestPractice with RDF: Making Assumptions

Michael Hausenblas has a new blog post listing some common malpractices when working with RDF.

RDF is a model, not a format

I especially agree with his point about “Thinking of RDF on the serialisation level” (as a malpractice) – grabbing values from RDF/XML or RDFa wih XPath or regexes is not wise. It is making an unsafe assumption about the stability of the serialisation. In fact, if you are writing a Linked Data application, there are very few assumptions you can safely make, about either the serialisation, or the model.

RDF isn’t SQL, XML, OO …

So maybe my favourite MalBestPractising is: trying to treat RDF too much like some other software paradigm – too much like a relational database, too much like OO, too much like XML. It’s enticing to try to write software that treats RDF as if it was something that the mainstream of software development are more familiar with, to try to use the same kind of techniques and shortcuts. But these shortcuts often rely on assumptions that can’t be made about RDF data (at least, not proper, organic, free-range RDF from the web). You can’t assume that the same RDF graph will be serialised the same way as last time. You can’t assume that the http://xmlns.com/foaf/0.1/ namespace will always be bound to the foaf prefix. You can’t assume that a resource will, or won’t have a particular property, just because it has another property, or a particular type. If you don’t know that a statement exists, you can’t assume it doesn’t, only that you don’t know about it. et cetera.

Not making these assumptions can be tedious, and at times problematic, but ultimately, the less assumptions you write into your code, the more interesting, open, and ‘webby’ your application can be.

Less assumption, less code, more data, more web

The huge game-changing thing about web development with the Web of Data though, is not the set of assumptions you can’t make, but the assumptions you don’t have to make . Thanks to the Follow Your Nose principle espoused by Linked Data, you don’t need to write assumptions about your data into your code; you can instead let the application “follow its nose” to find out more about the data.

You can follow vocabulary term URIs to find out how they can be used, how they can be labeled, and what inferences can be drawn from their use. You can follow owl:sameAs and rdfs:seeAlso links to find out more about a resource. You can use semantic index services like Sindice to find occurrences of a URI or keyword across the Web of Data. You can follow dcterms:partOf links from RDF documents back to voiD Datasets, which will often have links you can follow to licenses that tell you how the data can be used, and to other services (such as SPARQL endpoints).

The more data is published, not just within datasets, but about datasets, and about services , the more we can write applications that open up to the web, and the fewer lines of code we will need to do it!

Styles of Web Application – FlowPHP

Ian blogged a while back about why MVC is a rubbish pattern for web development because it doesn’t describe the problem in a way that helps you understand it better. I completely agree, and it’s surprising how much “received wisdom” there is about MVC being the right way to do things, but the natural response is, Well, what isn’t a rubbish pattern then?

Someone asks that in the comments on the blog post, and Ian replies:

Doesn’t REST define the pattern you need: resource/representation? Your application uses the URI to locate the appropriate resource and asks it to produce the appropriate representation.

I’m not completely happy with that as an answer though. To me, REST defines the interface to your application, and while it helps define at least that part of the problem, it doesn’t really give you enough of a solution. It doesn’t help you decide how to structure your code in the same way that MVC does (even if that decision is ultimately suboptimal).

I’ve been writing web apps in a similar style to that used by RESTful frameworks like Tonic and web.py, which I guess could be described as what rsinger called “_VC” on #talis the other day. Basically you have different ‘Resource’ classes that map to your application’s url design and return representations when, eg, a GET, or a POST method is called on them. A great boon of developing with RDF is that, because all data is the same shape, you can do things pretty generically, and write less domain-specific code. So I tried to keep my resource classes as generic as possible, and have different url routes set up the classes with different parameters as need be.

However, I’ve been growing pretty dissatisfied with this way of doing it, because it still seemed to be obscuring too much of the problem for me conceptually. There was still a problem of, ‘OK, where is the best place to put this‘, and a constant tension between whether to try to extend a generic class to cope with another situation, or writing a new one to do what you want. So I’d end up with a lot of classes that did a lot of pretty similar things (retrieving SPARQL queries, parsing them, passing data to the template), but not similar enough to be able just to do it with one class. I also found that class inheritance was a slightly messy way to share functionality, and it could be annoying to try to remember which class was used for which url space, and look it up in the routing configuration, and it wasn’t very amenable to serving representations derived from a combination of data sources.

So the other day I had an idea for a different style, which I’m pretentiously code-naming ‘FlowPHP’ (pronounced floaf – the P is silent ;) ).

The motivation is to try to model the process of receiving a request and returning a response as a chain of modular bits of code that create a response from the incoming request, and filter it until it is served. I’ve been trying this idea out, and so far, it looks like this:


try{
$KwijiboDev1 = new Store('http://api.talis.com/stores/kwijibo-dev1');
$R = new Request(array('SERVER' => $_SERVER, 'GET' => $_GET));
switch(true):
	case $R->is('GET','/posts'): // method is GET and url is /posts
		$R->response()->
                        checkCache()->
                            RDFList($KwijiboDev1, SIOC.'Post')->
                                SmushGraph()->
                                   serve('posts','main');
		break;
	case $R->is('GET','/post', array('uri')):
		 $R->response()->
                            checkCache()->
                                 CBD($KwijiboDev1, $R->GET['uri'])->
                                    serve('post','main');
		break;
	default:
		throw new HTTP_404("Page could not be found");
endswitch;
}
catch (Exception $e){
	echo $e->serve('error','main');
}

So what this is doing, is:

  • building a Request object with data from the $_SERVER and $_GET variables.
  • Checking the HTTP REQUEST METHOD, the REQUEST URI, and (optionally) for the existence of any required parameters.
  • Processing the Request and serving a Response by:
    1. Checking for a cached version we could serve first
    2. Retrieving the data: eg, CBD
    3. processing the data (eg: SmushGraph)
    4. Serving it in templates (serve() takes a variable length list of templates as parameters, rendering each inside the next template in the list)
  • Responding with an appropriate error if necessary (eg, HTTP 404, 405, 406, 500 – I pinched the idea of modelling 4xx and 5xx as Exceptions from Konstrukt)

Each ‘method’ in the chain, up until ‘serve()’, is returning the altered response object for the next method to manipulate. The methods that deal with adding data to the response, doing stuff with data, etc, aren’t really methods at all, but dynamically-called functions from a separate file. The reason I did it like this is I think it might be more modular and extensible, whilst not necessitating the creation of lots of different subclasses of Response.

This is still all evolving of course, and some/all of the ideas might turn out to be rubbish, but the thing I’m liking so far is the transparency: I think it’s relatively easy to see what’s going on with the code – what happens where, and when. The thing I’m experimenting with, I suppose, is the level of abstraction – my previous approach was perhaps too high-level and inflexible, which resulted in either lots of code, or lots of configuration, and the routing was kept too separate from the logic of returning the response.

The particular tension I’m finding with trying to develop flowphp at the moment, is to find a good idiom for setting variables midway through the chain of events – I’m loathe to have to break out of the chained methods, but maybe that’s only for aesthetic reasons.