Archive for the 'Ideas and Experiments' Category

voiD, datasets, graphs, documents, and dcterms:isPartOf backlinks

One question I have heard several times now regarding voiD is how to say that data is part of a dataset.

Frédérick Giasson asked about this recently in #swig, and wondered why the voiD guide recommended using dcterms:isPartOf. I thought, since this is something that has been asked about a few times, I would blog about it and explain the reasoning behind this.

So, it wouldn’t be right to say something like:

<http://lastfm.rdfize.com/artists/Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

… because we don’t want to say that “Black Sabbath is part of the lastfm.rdfize.com dataset”.
We want to say “a description of Black Sabbath (composed of triples) is part of the lastfm.rdfize.com dataset”.

One approach to encapsulating this meaning would be to reify each individual triple and state that the triple is part of the dataset … but we felt that this would be neither practical nor popular.

So, in the voiD guide, we advocate that when you publish Linked Data, and you want to say that the data you are publishing is part of a voiD Dataset, you add a triple linking the document in which the data is published to the dataset, e.g.:

<http://lastfm.rdfize.com/?artistName=Black+Sabbath> dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset> .

(where <http://lastfm.rdfize.com/?artistName=Black+Sabbath> is a document containing a description of <http://lastfm.rdfize.com/artists/Black+Sabbath>)

This way, when a Linked Data client dereferences <http://lastfm.rdfize.com/artists/Black+Sabbath> they get redirected to a document, and can follow the dcterms:isPartOf link from the document URI to the voiD Dataset.
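
To make the lookup concrete, here is a minimal sketch of the "redirect, then follow dcterms:isPartOf" step. It does no real HTTP or RDF parsing: the `REDIRECTS` table and `TRIPLES` list are hypothetical in-memory stand-ins for what a client would get by dereferencing the URIs, and `find_datasets` is a name made up for illustration.

```python
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"

# Hypothetical stand-in for the redirect a client would receive.
REDIRECTS = {
    "http://lastfm.rdfize.com/artists/Black+Sabbath":
        "http://lastfm.rdfize.com/?artistName=Black+Sabbath",
}

# Hypothetical stand-in for the triples found in the retrieved document.
TRIPLES = [
    ("http://lastfm.rdfize.com/?artistName=Black+Sabbath",
     DCTERMS_IS_PART_OF,
     "http://lastfm.rdfize.com/meta.n3#Dataset"),
]


def find_datasets(resource_uri, redirects=REDIRECTS, triples=TRIPLES):
    """Return the datasets that the document describing resource_uri is part of."""
    # Step 1: the resource URI leads (via redirect) to a document URI.
    document_uri = redirects.get(resource_uri, resource_uri)
    # Step 2: follow dcterms:isPartOf from the document, not from the resource.
    return [o for (s, p, o) in triples
            if s == document_uri and p == DCTERMS_IS_PART_OF]
```

The important detail is that the subject of the dcterms:isPartOf triple is the document URI, so the client never concludes that Black Sabbath itself is part of a dataset.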

What some people don’t like so much, is the implication that their dataset consists of documents, when what they really want to say is that their dataset consists of descriptions of resources.

The conceptual problem, if there is one, is that here the document URI is identifying an RDF/XML document, not the graph of RDF data encoded in that document. So, if you wanted to explicitly state that the graph, rather than the document, is part of the dataset, it could perhaps be done like this:

[ a <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
<http://purl.org/vocab/frbr/core#embodiment> <http://lastfm.rdfize.com/?artistName=Black+Sabbath&output=rdf> ;
dcterms:isPartOf <http://lastfm.rdfize.com/meta.n3#Dataset>
] .

But I’m really not too sure if that is either semantically correct, or in any way a more practically useful description than simply saying the document is part of the dataset.

We (the voiD guide authors) think that the <document> dcterms:isPartOf <dataset> pattern is the most pragmatic approach to making a dataset discoverable from a LOD document.
But we are also open to suggestions for improvement as we evolve the vocabulary and guide in line with popular usage and the requirements of LOD publishers.

What do you think?

A MalBestPractice with RDF: Making Assumptions

Michael Hausenblas has a new blog post listing some common malpractices when working with RDF.

RDF is a model, not a format

I especially agree with his point about “Thinking of RDF on the serialisation level” (as a malpractice) - grabbing values from RDF/XML or RDFa with XPath or regexes is not wise. It makes an unsafe assumption about the stability of the serialisation. In fact, if you are writing a Linked Data application, there are very few assumptions you can safely make, about either the serialisation, or the model.

RDF isn’t SQL, XML, OO …

So maybe my favourite MalBestPractising is: trying to treat RDF too much like some other software paradigm - too much like a relational database, too much like OO, too much like XML. It’s enticing to try to write software that treats RDF as if it was something that the mainstream of software development are more familiar with, to try to use the same kind of techniques and shortcuts. But these shortcuts often rely on assumptions that can’t be made about RDF data (at least, not proper, organic, free-range RDF from the web). You can’t assume that the same RDF graph will be serialised the same way as last time. You can’t assume that the http://xmlns.com/foaf/0.1/ namespace will always be bound to the foaf prefix. You can’t assume that a resource will, or won’t have a particular property, just because it has another property, or a particular type. If you don’t know that a statement exists, you can’t assume it doesn’t, only that you don’t know about it. et cetera.

Not making these assumptions can be tedious, and at times problematic, but ultimately, the fewer assumptions you write into your code, the more interesting, open, and ‘webby’ your application can be.
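
The prefix assumption is easy to demonstrate. The sketch below uses two made-up RDF/XML documents (`DOC_A` and `DOC_B`, both assumptions for illustration) that encode the same graph but bind the FOAF namespace to different prefixes. A regex that hard-codes `foaf:` breaks on the second; a lookup by full namespace URI does not. Note that even the second function still works at the XML level, so a real application would hand the document to a proper RDF parser instead.

```python
import re
import xml.etree.ElementTree as ET

FOAF = "http://xmlns.com/foaf/0.1/"

# Same graph, FOAF namespace bound to the prefix "foaf".
DOC_A = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/#me"><foaf:name>Alice</foaf:name></foaf:Person>
</rdf:RDF>"""

# Same graph, FOAF namespace bound to the prefix "f".
DOC_B = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:f="http://xmlns.com/foaf/0.1/">
  <f:Person rdf:about="http://example.org/#me"><f:name>Alice</f:name></f:Person>
</rdf:RDF>"""


def names_by_regex(doc):
    """Fragile: assumes the prefix is always spelled 'foaf'."""
    return re.findall(r"<foaf:name>(.*?)</foaf:name>", doc)


def names_by_namespace(doc):
    """Less fragile: matches on the namespace URI, whatever the prefix."""
    root = ET.fromstring(doc)
    return [el.text for el in root.iter("{%s}name" % FOAF)]
```

Here `names_by_regex(DOC_B)` comes back empty even though the data has not changed, which is exactly the kind of silent failure the serialisation-level approach invites.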

Less assumption, less code, more data, more web

The huge game-changing thing about web development with the Web of Data though, is not the set of assumptions you can’t make, but the assumptions you don’t have to make. Thanks to the Follow Your Nose principle espoused by Linked Data, you don’t need to write assumptions about your data into your code; you can instead let the application “follow its nose” to find out more about the data.

You can follow vocabulary term URIs to find out how they can be used, how they can be labeled, and what inferences can be drawn from their use. You can follow owl:sameAs and rdfs:seeAlso links to find out more about a resource. You can use semantic index services like Sindice to find occurrences of a URI or keyword across the Web of Data. You can follow dcterms:isPartOf links from RDF documents back to voiD Datasets, which will often have links you can follow to licenses that tell you how the data can be used, and to other services (such as SPARQL endpoints).

The more data is published, not just within datasets, but about datasets, and about services, the more we can write applications that open up to the web, and the fewer lines of code we will need to do it!

Vocabify: Instance Data -> Vocab

One thing about writing RDF vocabularies that occurred to me listening to people talk at VoCamps (Oxford and Galway), is that typically what you are trying to do isn’t defining new terms, it’s modeling data, and at some stage in the modeling you discover you need to write a new vocabulary. Vocabulary authors often want to describe how their terms can best be used with existing complementary vocabularies, like FOAF and Dublin Core, but the only commonly practiced way of doing so is to put it in human-readable form in the documentation annotations. In voiD, we wrote a guide, principally because we wanted to describe how the terms ought to be used together with existing vocabulary terms.

In tandem with this thought, when sketching out vocabularies myself, I tend not to start out by defining Classes and Properties, which is both tediously repetitive, and a step removed from the data-modeling (which is what I’m actually trying to do in the first place). Instead, I define a prefix for a new namespace, and pretend a vocabulary already exists at it. Probably quite a lot of people do this. I think of them as “pretend schemas”; I’ve heard ldodds call them “just in time schemas” (only bother to write it when someone actually asks to see it).

So last night I coded up Vocabify, which you can feed some instance data that uses your “just in time vocabulary”, tell it which namespace URI is the pretend one, and it will generate a schema from the instance data, which you can then edit and publish.

The classes and properties are also linked to the instances they are generated from with ov:exampleResource, so it is clear to readers how they can be used together with other properties.
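
The general idea of deriving a schema from instance data can be sketched in a few lines. This is not Vocabify's actual code, just an illustration under simple assumptions: triples are plain `(subject, predicate, object)` tuples of URI strings, any predicate in the pretend namespace becomes an rdf:Property, and any object of rdf:type in that namespace becomes an rdfs:Class. The function name `sketch_schema` is made up.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"


def sketch_schema(triples, namespace):
    """Infer class and property declarations for terms in `namespace`."""
    schema = set()
    for s, p, o in triples:
        # Any term from the pretend namespace used as a predicate is a property.
        if p.startswith(namespace):
            schema.add((p, RDF_TYPE, RDF_PROPERTY))
        # Any term from the pretend namespace used as a type is a class.
        if p == RDF_TYPE and o.startswith(namespace):
            schema.add((o, RDF_TYPE, RDFS_CLASS))
    return sorted(schema)
```

A fuller version would also record, for each generated term, one of the instances it came from, in the spirit of the ov:exampleResource links mentioned above.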

(Semantic) Web Agents and OSGi

A little fyi/progress report.

For a couple of years now I’ve been mooching around refactoring the intelligent agent paradigm to cover (RESTful) Web services. The kind of intelligence I have in mind is potentially, well, non-existent: a regular Web site could be considered an agent. The motivation is mostly that developing spec-compliant systems on the Web is in general a lot of work, and that this leads to either cutting corners/breaking specs or using frameworks that limit one’s opportunities for innovation. When we introduce Semantic Web technologies into the mix, things get even more difficult.

So what I was after was a simple abstraction of (Semantic) Web systems/services that would allow a lot of the gruesome details of implementation to be hidden away, without breaking the Web. What I came up with looks like this:

An archetypal agent would feature (access from) an HTTP server and a local HTTP client for input & output, a local RDF model for its working memory along with some kind of business logic (behaviour) that would determine what it actually did. (I’m putting on hold one of the usual features of intelligent agents - mobility - though a story on this would be nice for issues like scalability). Agents are effectively self-contained, event-driven components with a common interface (HTTP).

A regular Web site could fit this abstraction in a degenerate form: no HTTP client, content is held in a persistent model, the behaviour is just to deliver that content to any other agents that make appropriate requests (in this case those other agents would typically be browsers, well-known degenerates).

In the past I must confess I’ve tried to express this stuff via MVC, which was a bit of a stretch - I agree with Ian’s view that this isn’t really appropriate for the Web. RMR, ROA or WOA (take your pick!) is a much better fit. Having said that, I’m not sure how much the developer should be operating on the level of resources and representations, they seem more like bricks and cement than architecture - e.g. conneg and httpRange-14 303s should Just Work.

So now (or rather, quite a while ago) I needed a proof of concept system that would allow easy construction of this kind of agent, and I spent a good many free-time hours putting together a little framework. The way I was approaching it (in Java) was for the framework to provide a container for agents, and those agents being aware whether or not they were in the same container. If they were, they could address each other directly, while still supporting HTTP I/O for communications otherwise.

I got quite a long way, despite hitting numerous snags (incorporating asynchronous eventing into the HTTP request/response cycle was a good one). But then, as of a few months ago, I didn’t have much opportunity to look at this stuff.

Fast forward to a few weeks ago. In my todo queue was getting down deep with OpenID and OAuth (which I’m familiar with but haven’t really stress-tested), and it was hard not to imagine using the agent approach to play with these components. Coincidentally I went up to visit Reto in Switzerland and the company he now works for - Trialox who are (amongst other things) building a Semantic Web CMS. While I was up there, Reto gave me an intro to OSGi (formerly the Open Services Gateway initiative) which is essentially a set of specs for a Java-based service platform - it’s used in Eclipse, for example. Somewhat bizarrely I think I missed out on learning about this previously because I must have glazed over when seeing the acronym, confusing it with OGSI (the Open Grid Services Infrastructure).

To cut a long story marginally shorter, I’ve now ditched my own agent framework code (I can no doubt recycle bits) in favour of OSGi, and am currently noodling with creating the appropriate bundles - as OSGi calls its components - for the agent stuff, using Apache Felix as the host framework. I’ve still a good way to go before I get to my proof of concept, but after only a couple of days learning/coding I’m already making much more rapid progress than I was with my own ad hoc stuff. With a bit of luck I’ll have testbed stuff together for OpenID & OAuth (and related setups like FOAF+SSL) within the next week or so. I’m obviously also going to be looking at hooks into the Talis Platform. I can’t remember offhand whether it was Ian, Leigh or Sam, but someone’s already put together a load of Java client code to wrap HTTP interactions with the Platform, so most of the work there’s already been done.

Oh yeah, and I reckon OSGi might well give me a neat approach to the Semantic Web in a Box.

[Work in progress is currently in my personal svn://hyperdata.org/svn/ but I'll move it into the n² svn once I've got something more functional].

Exploring OpenLibrary Part Two

More than two weeks on from my last look at the OpenLibrary authors data and I’m finally finding some time to look a bit deeper. Last time I finished off thinking about the complete list of distinct dates within the authors file and how to model those.

Where I’ve got to today is tagged as day 2 of OpenLibrary in the n2 subversion.

First off, a correction - foaf:Name should have been foaf:name. Thanks to Leigh for pointing that out. I haven’t fixed it in this tag, as I tagged before I realised I’d forgotten it, but next time, honestly.

It’s clear that there is some stuff in the data that simply shouldn’t be there, things that cannot possibly be a birth date such as [from old catalog] and *. and simply ,. When I came across —oOo— I was somewhat dismayed. MARC data, where most of this data has come from, has a long and illustrious history, but one of the mistakes made early on was to put display data into the records in the form of ISBD punctuation. This, combined with the real inflexibility of most ILSs and web-based catalogs has forced libraries to hack their records with junk like —oOo— to fix display errors. This one comes from Antonio Ignacio Margariti.

In total there are only 6,156 unique birth date datums and 4,936 unique death dates. Of course there is some overlap, so in total there’s only 9,566 datums to worry about overall.

So what I plan to do is to set up the recognisable patterns in code and discard anything I don’t recognise as a date or date range. Doing that may mean I lose some date information, but I can add that back in later as more patterns get spotted. So far I’ve found several patterns (shown here using regex notation)…

“^[0-9]{1,4}$” - A straightforward number of 4 digits or fewer, no letters, punctuation or whitespace. These are simple years; last week I popped them in using bio:date. That’s not strictly within the rules of the bio schema as that really requires a date formatted in accordance with ISO8601. Ian had already implied his displeasure with my use of bio:date and suggested I use the more relaxed dc elements date. However, on further chatting what we actually have is a date range within which the event occurred, so we need to show that the event happened somewhere within a date range. This can be solved using the W3C Time Ontology which allows for better description.

I spent some time getting hung up on exactly what is being said by these date assertions on a bio:Birth event. That is, are we saying that the birth took place somewhere within that period, or that the event happened over that period? This may seem a daft question to ask, but as others start modelling events in people’s bios this could easily become indistinguishable. Say I want to model my grandfather’s experience of the second world war. I’d very likely model that as an event occurring over a four year period. So, I feel the need to distinguish between an event happening over a period and an event happening at an unknown time within a period. I thought I was getting too pedantic about this, but Ian assured me I’m not and that the distinction matters.

The model we end up with is like this


@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/2006/time#> .

<http://example.com/a/OL149323A>
	foaf:name "Schaller, Heinrich";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL149323A>;
	bio:event <http://example.com/a/OL149323A#birth>;
	a foaf:Person .

<http://example.com/a/OL149323A#birth>
	dc:date <http://example.com/a/OL149323A#birthDate>;
	a bio:Birth .

<http://example.com/names/schallerheinrich>
	mine:name_of <http://example.com/a/OL149323A>;
	a mine:Name .

<http://example.com/dates/gregorian/ad/years/1900>
	time:unitType time:unitYear;
	time:year "1900";
	a time:DateTimeDescription .

<http://example.com/a/OL149323A#birthDate>
	time:inDateTime <http://example.com/dates/gregorian/ad/years/1900>;
	a time:Instant .

The simple year accounts for 731,304 of the 748,291 birth dates and for 13,151 of the 181,696 death dates, about 80% of the dates overall. Following the 80/20 rule almost perfectly, the remaining 20% is going to be painful. It has been suggested I should stop here, but it seems a shame to not have access to the rest if we can dig in, and I can, so…

First of the remaining correct entries are the approximate years, recorded as ca. 1753 or (ca.) 1753 and other variants of that. These all suffer from leading and trailing junk, but I’ll catch the clean ones of these with “^[(]?ca\.[)]? ([0-9]{1,4})$”. The difficulty with these is that you can’t really convert these into a single year or even a date range as what people consider as within the “circa” will vary widely in different contexts. So, the interval can be described in the same way as a simple year, but the relationship with the author’s birth is not simply time:inDateTime. I haven’t found a sensible circa predicate, so for now I’ll drop into mine.


@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/2006/time#> .

<http://example.com/a/OL151554A>
	foaf:name "Altdorfer, Albrecht";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL151554A>;
	bio:event <http://example.com/a/OL151554A#birth>;
	bio:event <http://example.com/a/OL151554A#death>;
	a foaf:Person .

<http://example.com/a/OL151554A#birth>
	dc:date <http://example.com/a/OL151554A#birthDate>;
	a bio:Birth .

<http://example.com/a/OL151554A#death>
	dc:date <http://example.com/a/OL151554A#deathDate>;
	a bio:Death .

<http://example.com/names/altdorferalbrecht>
	mine:name_of <http://example.com/a/OL151554A>;
	a mine:Name .

<http://example.com/dates/gregorian/ad/years/1480>
	time:unitType time:unitYear;
	time:year "1480";
	a time:DateTimeDescription .

<http://example.com/a/OL151554A#birthDate>
	mine:circaDateTime <http://example.com/dates/gregorian/ad/years/1480>;
	a time:Instant .
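
The "recognise known patterns, discard the rest" approach can be sketched as a small classifier. The two regular expressions are the ones quoted above; the function name `classify_date` and the `"year"` / `"circa"` labels are assumptions made for illustration, and the original scripts were written in PHP rather than Python.

```python
import re

# Patterns quoted in the post: a plain year, and a clean "circa" year.
SIMPLE_YEAR = re.compile(r"^[0-9]{1,4}$")
CIRCA_YEAR = re.compile(r"^[(]?ca\.[)]? ([0-9]{1,4})$")


def classify_date(raw):
    """Return (kind, year) for a recognised pattern, or None to discard."""
    if SIMPLE_YEAR.match(raw):
        return ("year", int(raw))
    match = CIRCA_YEAR.match(raw)
    if match:
        return ("circa", int(match.group(1)))
    # Anything else (junk, ranges, centuries, ...) is dropped for now.
    return None
```

Because unrecognised values simply return `None`, new patterns can be added later as extra branches without touching the ones already handled.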

Ok, it’s time to stop there until next time. I have several remaining forms to look at and some issues of data cleanup.

Next time I’ll be looking at parsing out date ranges of a few years, shown in the data as 1103 or 4. These will go in as longer date time descriptions so no new modelling needed.

Then we have centuries, 7th cent., again just a broader date time description required I hope. There are some entries for works from before the birth of Christ - 127 B.C.. I’ll have to take a look at how those get described. Then we have entries starting with an l like l854. I had thought that these may indicate a different calendaring system, but it appears not. Perhaps it’s bad OCRing as there are also entries like l8l4. Not sure what to do with those just yet.

In terms of data cleanup, there are dates in the birth_date field of the form d. 1823 which means that it’s actually a death date. There are also dates prefixed with fl. which means they are flourishing dates. These are used when a birth date is unknown but the period in which the creator was active is known. These need to be pulled out and handled separately.
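
A possible first step for this cleanup is to route each birth_date value to the field it really belongs in. This is only a sketch: `route_birth_date` is a made-up name, `"flourished"` is a hypothetical field label, and the pattern covers only the clean `d. ` and `fl. ` prefixes mentioned above.

```python
import re

# "d. 1823" is really a death date; "fl. ..." is a flourishing date.
PREFIXED = re.compile(r"^(d|fl)\. (.+)$")


def route_birth_date(raw):
    """Return (field, value) saying where a birth_date value should go."""
    match = PREFIXED.match(raw)
    if not match:
        return ("birth_date", raw)
    field = "death_date" if match.group(1) == "d" else "flourished"
    return (field, match.group(2))
```

The value that comes back still needs to go through the usual date pattern matching; this step only decides which event it describes.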

Of course, I haven’t dealt with the leading and trailing punctuation yet or those that have names mixed in with the dates, so still much work to do in transforming this into a rich graph.

Exploring OpenLibrary Part One

I thought it was about time I got around to taking a better look at what might be possible with the OpenLibrary data.

My plan is to try and convert it into meaningful RDF and see what we can find out about things along the way. The project is an own-time project mostly, so progress isn’t likely to be very rapid. Let’s see how it goes. I’ll diary here as stuff gets done.

To save me typing loads of stuff out here, today’s source code is tagged and in the n2 subversion as day 1 of OpenLibrary.

Day one, 3rd October 2008, I downloaded the authors data from OpenLibrary and unzipped it. I’m also downloading the editions data from OpenLibrary, but that’s bigger (1.8Gb) so I’m playing with the author data while that comes down the tubes.

The data has been exported by OpenLibrary as JSON, so is pretty easy to work with. I’m going to write some PHP scripts on the command line to mess with it and it looks great for doing that.

Each line of the JSON in the authors file represents a single author, although some authors will have more than one entry. Taking a look at Iain Banks (aka Iain M Banks) we have the following entries:


{"name": "Banks, Iain", "personal_name": "Banks, Iain", "key": "\/a\/OL32312A", "birth_date": "1954", "type": {"key": "\/type\/type"}, "id": 81616}
{"name": "Banks, Iain.", "type": {"key": "\/type\/type"}, "id": 3011389, "key": "\/a\/OL954586A", "personal_name": "Banks, Iain."}
{"type": {"key": "\/type\/type"}, "id": 9897124, "key": "\/a\/OL2623466A", "name": "Iain Banks"}
{"type": {"key": "\/type\/type"}, "id": 9975649, "key": "\/a\/OL2645303A", "name": "Iain Banks         "}
{"type": {"key": "\/type\/type"}, "id": 10565263, "key": "\/a\/OL2774908A", "name": "IAIN M. BANKS"}
{"type": {"key": "\/type\/type"}, "id": 10626661, "key": "\/a\/OL2787336A", "name": "Iain M. Banks"}
{"type": {"key": "\/type\/type"}, "id": 12035518, "key": "\/a\/OL3127859A", "name": "Iain M Banks"}
{"type": {"key": "\/type\/type"}, "id": 12078804, "key": "\/a\/OL3137983A", "name": "Iain M Banks         "}
{"type": {"key": "\/type\/type"}, "id": 12177832, "key": "\/a\/OL3160648A", "name": "IAIN M.BANKS"}

In total the file contains 4,174,245 entries. First job is to get a more manageable set of data to work with. So, I wrote a short script to extract 1 line in every 10 from a file. The resulting sample author data file contains 417,424 entries. This is more manageable for quick testing of what I’m doing.
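
The sampling step is simple enough to sketch as a filter over lines. The original script is not shown, so this assumes it kept the 10th, 20th, 30th line and so on, which is consistent with 4,174,245 lines yielding 417,424; `every_nth` is a name chosen for illustration.

```python
def every_nth(lines, n=10):
    """Yield every n-th item (the 10th, 20th, ... when n is 10)."""
    for index, line in enumerate(lines, start=1):
        if index % n == 0:
            yield line

# Possible use as a command line filter (reads STDIN, writes STDOUT):
#   import sys
#   sys.stdout.writelines(every_nth(sys.stdin))
```

Because it is a generator, the whole file never has to be held in memory.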

So now we can start writing some code to produce some RDF. Given the size of these files, I need to stream the data in and out again in chunks. The easiest format I find for that is turtle which has the added benefit of being human readable. YMMV. Previously I’ve streamed stuff out using n-triples. That has some great benefits too, like being able to generate different parts of the graph, for the same subject, in different parts of the file then bringing them together using a simple command line sort. It’s also a great format for chunking the resulting data into reasonable size files as breaking on whole lines doesn’t break the graph, whereas with rdf/xml and turtle it does.

So, I may end up dropping back to n-triples, but for now I’m going to use turtle.

I also like working on the command line and love the unix pipes model, so I’ll be writing the cli (command line) tools to read from STDIN and write to STDOUT so I can mess with the data using grep, sed, awk, sort, uniq and so on.

First things first, let’s find out what’s really in the authors data. Reading the json line by line and converting each line into an associative array is simple in PHP, so let’s do that, keep track of all the keys we find in the arrays and recurse into the nested arrays to look at them - then dump the result out. The arrays contain this set of keys:

alternate_names
alternate_names\0
alternate_names\1
alternate_names\2
alternate_names\3
bio
birth_date
comment
date
death_date
entity_type
fuller_name
id
key
location
name
numeration
personal_name
photograph
title
type
type\key
website
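
The key survey was done in PHP, but the same walk can be sketched in Python for comparison. The function names are made up; nested levels are joined with a backslash to match the listing above, and list positions are used as keys the way PHP array indexes would be.

```python
import json


def collect_keys(value, prefix="", found=None):
    """Recursively gather key paths, joining nested levels with a backslash."""
    if found is None:
        found = set()
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions act as keys
    else:
        return found  # scalar: nothing to recurse into
    for key, child in items:
        path = prefix + "\\" + str(key) if prefix else str(key)
        found.add(path)
        collect_keys(child, path, found)
    return found


def keys_in_dump(lines):
    """Return the sorted set of key paths seen across all JSON lines."""
    found = set()
    for line in lines:
        if line.strip():
            collect_keys(json.loads(line), found=found)
    return sorted(found)
```

Passing an open file (or `sys.stdin`) as `lines` keeps memory use flat, since only the set of distinct key paths is retained.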

So, they have names, birth dates, death dates, alternate names and a few other bits and pieces. And they have a ‘key’ which turns out to be the resource part of the OpenLibrary url. That means we can link back into OpenLibrary nice and easy. Going back to our previous Iain Banks examples, we want to create something like this for each one:


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.com/a/OL32312A>
	foaf:Name "Banks, Iain";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL32312A>;
	bio:event <http://example.com/a/OL32312A#birth>;
	a foaf:Person .

<http://example.com/a/OL32312A#birth>
	bio:date "1954";
	a bio:Birth .

This gives us a foaf:Person for the author and tracks his birth date using a bio:Birth event. While tracking the birth as a separate entity may seem odd it gives the opportunity to say things about the birth itself. We’ll model death dates the same way, for the same reason. I’ve written some basic code to generate foaf from the OpenLibrary authors.
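
For readers who want to see the shape of such a generator, here is a minimal Python sketch (the tagged code itself is PHP and is not reproduced here). `author_to_turtle` and `turtle_literal` are hypothetical names, the `http://example.com` base is the placeholder used in the listing above, and prefix declarations are assumed to be written once at the top of the output stream. It emits foaf:name, the property FOAF actually defines, rather than the foaf:Name shown in the listing.

```python
def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def author_to_turtle(record, base="http://example.com"):
    """Render one decoded author record as Turtle (prefixes not included)."""
    key = record["key"]  # e.g. "/a/OL32312A"
    birth = record.get("birth_date")
    lines = [
        f"<{base}{key}>",
        f"\tfoaf:name {turtle_literal(record['name'])};",
        f"\tfoaf:primaryTopicOf <http://openlibrary.org{key}>;",
    ]
    if birth:
        lines.append(f"\tbio:event <{base}{key}#birth>;")
    lines.append("\ta foaf:Person .")
    if birth:
        # The birth is its own resource so more can be said about it later.
        lines += [
            "",
            f"<{base}{key}#birth>",
            f"\tbio:date {turtle_literal(birth)};",
            "\ta bio:Birth .",
        ]
    return "\n".join(lines) + "\n"
```

The literal escaping is deliberately minimal; names containing newlines or other control characters would need more care before this could be trusted on the full file.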

Linking back to the OpenLibrary url has been done here using foaf:primaryTopicOf. I didn’t use owl:sameAs because the url at OpenLibrary is that of a web page, whereas the uri here (http://example.com/a/OL32312A) represents a person. Clearly a person is not the same as a web page that contains information about them.

The only thing worrying me is that the uris we’re using are constructed from OpenLibrary’s keys. This makes matching them up with other data sources hard. Matching with other data sources requires a natural key, but there’s not enough data in these author entries to create one. The best I can do is to create a natural key that will enable people to discover the group of authors that share a name.


@prefix mine: <http://example.com/mine/schema#> .
<http://example.com/names/banksiain>
	mine:name_of <http://example.com/a/OL32312A>;
	a mine:Name .

These uris will enable me to find authors that share the same name easily, either because they do share the same name or because they’re duplicates. The natural key is simply the author’s name with any casing, whitespace or punctuation stripped out. That might need to evolve as I start looking at the names in more detail later.
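
The natural key described here is a one-liner. This sketch assumes "punctuation" means anything that is not a letter or digit; `natural_key` is a name chosen for illustration.

```python
def natural_key(name):
    """Lower-case the name and drop whitespace and punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())
```

With this, "Iain M Banks" (with or without trailing spaces) and "IAIN M.BANKS" both collapse to `iainmbanks`, while "Banks, Iain" becomes `banksiain`, so inverted and direct-order forms of a name still end up under different keys.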

Next step is to look in more detail at the dates in here, we have some simple cases of trailing whitespace or trailing punctuation, but also some more interesting cases of approximate dates or possible ranges - these occur for historical authors mostly. The complete list of distinct dates within the authors file is in svn. If you know anything about dates, feel free to throw me some free advice on what to do with them…

VRM with FOAF + OpenID

A quick note-to-self. I’m currently working on some other FOAF + OpenID stuff, so this is nearby enough that I might well put together a demo in the near future…but not today.

Tim Bray discusses Changing your address in the context of Vendor Relationship Management, prompted by “Feeds-Based VRM”: A Web-Centric Approach to VRM Implementation. The question is how you keep a vendor (or other contact) aware of your current address.

I came to the same conclusion as Tim, that feeds aren’t really necessary for this kind of thing, the data can be put directly on the Web and the contact given the appropriate URI. In comments over there I pointed to Tim Berners-Lee’s Give yourself a URI - an online FOAF profile solves most of the problem. The part it doesn’t solve is access control - you might not want to make your address public. But with the help of linked data, off-the-shelf tools and a little scripting, this is pretty easy to fix.

First of all, looking at how you might represent this information, vCard is the dominant model for this kind of info. Whether that’s expressed in the original vCard format or hCard or RDFa or RDF/XML doesn’t really matter. These can all be mapped to the RDF model, which is key to what follows… Here’s the relevant bit of a vCard in Turtle syntax (first pass, probably not 100% correct):

@prefix : <http://www.w3.org/2006/vcard/ns#> .
[ a :VCard;
:agent <#me> ;
:homeAdr [
a :Address;
:street-address "7, Mozzanella" ;
:country-name "Italy"
]
] .

Now I could just dump this in my public FOAF profile at, say http://example.org/public/me. But because I want the address to be restricted, I’ll separate the information (following the principles of linked data) like this -

in http://example.org/public/me -

@prefix : <http://www.w3.org/2006/vcard/ns#> .
[ a :VCard;
:agent <#me> ;
:homeAdr <http://example.org/restricted/myaddress>
] .

and in <http://example.org/restricted/myaddress> :

@prefix : <http://www.w3.org/2006/vcard/ns#> .
<> a :Address;
:street-address "7, Mozzanella" ;
:country-name "Italy" .

Now I need to wrap the latter part in authentication/authorization. Traditionally I might hard-code a list of who can see this data, but there’s a neater way. Somewhere I’ll put statements like the following (with proper URIs as appropriate):

<#me> foaf:knows <personA> .
<personA> foaf:openid <personAopenID> .
<#me> x:businessContact <personB> .
<personB> foaf:openid <personBopenID> .
<#me> x:businessContact <personC> .
<personC> foaf:openid <personCopenID> .
<#me> x:businessContact <personD> .
<personD> foaf:openid <personDopenID> .

Anyone wishing to see the restricted info will be asked for their OpenID URI. Whether they can see a particular resource can be governed by simple rules, for example expressed through string-templated SPARQL queries:

SELECT ?person
WHERE {
?person foaf:openid $openid$ .
{ <#me> foaf:knows ?person }
UNION
{ <#me> x:businessContact ?person }
}

Ok, that’s very sketchy, but hopefully gives the idea. To be properly declarative in practice you’d probably want to put the access rules in a separate chunk of RDF, and query across the whole lot. But given decent libraries (e.g. the OpenID PHP lib worked pretty much out of the box for me, and ARC is a really straightforward PHP RDF toolkit), we’re talking about maybe a day’s work to write and deploy the scripts - which could be used by anyone else with regular PHP-capable hosting.
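
To show the rule itself without any OpenID or SPARQL machinery, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: triples are plain tuples, the `X` namespace stands in for wherever x:businessContact would live, and `may_view` would in practice be called only after the OpenID has been verified.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
X = "http://example.org/x#"  # hypothetical namespace for x:businessContact
ME = "http://example.org/public/me#me"

# Relationships that grant access to the restricted address.
ALLOWED_RELATIONS = {FOAF + "knows", X + "businessContact"}


def may_view(openid, triples, me=ME):
    """True if a person with this OpenID is linked from `me` by an allowed relation."""
    # Who claims this OpenID?
    people = {s for (s, p, o) in triples
              if p == FOAF + "openid" and o == openid}
    # Is any of them a known contact of mine?
    return any(s == me and p in ALLOWED_RELATIONS and o in people
               for (s, p, o) in triples)
```

Note that having an OpenID on record is not enough on its own; the person must also be connected to `me` by one of the allowed relations.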

A Web-centric approach to VRM should use the Web, and as Berners-Lee himself recently put it:

Linked Data is the Semantic Web done as it should be. It is the Web done as it should be.

Ad hoc plumbing

Half an hour ago I discovered microrevie.ws, a Twitter-based review site. I couldn’t resist a quick play.

Like twitcrit 1.1 [currently not working *] and twitcrit 2.0, these reviews are authored in Twitter using a few (different) conventions. The microrevie.ws page is HTML with embedded microformats, which made me think right away of GRDDL. There were a couple of slight snags - the HTML isn’t XHTML, it’s HTML5, and the page doesn’t declare HTML Meta Data profiles for the microformats. So here we go…

  1. live microrevie.ws page
  2. (1) piped through an online HTML Tidy service to yield an XHTML version
  3. (2) with a simple XSLT applied to insert @profile, using an online XSLT service to yield GRDDL-friendly XHTML
  4. (3) sent through triplr.org to yield RDF - here in Turtle syntax

Look ma…no import/export!

I’m pretty sure the GRDDL transformations aren’t 100% complete/accurate, and I couldn’t find one for hAtom which is used in the source, but there’s enough to show a lot of triples, generated live from the source simply by hooking together the URIs. Check this -
http://triplr.org/turtle/http://www.w3.org/2000/06/webdata/xslt?xslfile=http%3A%2F%2Fhyperdata.org%2Fxslt%2Fprofiles.xsl&xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy%3FdocAddr%3Dhttp%253A%252F%252Fmicrorevie.ws%252F%26forceXML%3Don&transform=Submit

Not exactly the kind of thing you’d want on your business card, but it is bookmarkable/linkable.
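
The long URL is just the four steps nested inside each other, with one more round of percent-encoding at each level. A small sketch of how it can be composed (the function name is made up, and the services may no longer be running at these addresses):

```python
from urllib.parse import quote


def compose_pipeline_url(page="http://microrevie.ws/",
                         stylesheet="http://hyperdata.org/xslt/profiles.xsl"):
    """Nest the Tidy, XSLT and triplr.org steps into one URL."""
    # Step 2: HTML Tidy, with the page address as an encoded parameter.
    tidy = ("http://cgi.w3.org/cgi-bin/tidy?docAddr="
            + quote(page, safe="") + "&forceXML=on")
    # Step 3: the XSLT service, taking the whole Tidy URL as its input document.
    xslt = ("http://www.w3.org/2000/06/webdata/xslt?xslfile="
            + quote(stylesheet, safe="")
            + "&xmlfile=" + quote(tidy, safe="")
            + "&transform=Submit")
    # Step 4: triplr.org takes the source URL appended to its path.
    return "http://triplr.org/turtle/" + xslt
```

Each level of nesting encodes the level inside it again, which is why the innermost address shows up as `http%253A%252F%252F...` in the final URL.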

Oh yeah - want a visualization with that? Flip triplr.org to /rdf/ and try it on the RDF Validator : http://www.w3.org/RDF/Validator/ARPServlet?URI=http%3A%2F%2Ftriplr.org%2Frdf%2Fhttp%3A%2F%2Fwww.w3.org%2F2000%2F06%2Fwebdata%2Fxslt%3Fxslfile%3Dhttp%253A%252F%252Fhyperdata.org%252Fxslt%252Fprofiles.xsl%26xmlfile%3Dhttp%253A%252F%252Fcgi.w3.org%252Fcgi-bin%252Ftidy%253FdocAddr%253Dhttp%25253A%25252F%25252Fmicrorevie.ws%25252F%2526forceXML%253Don%26transform%3DSubmit&PARSE=Parse+URI%3A+&TRIPLES_AND_GRAPH=PRINT_BOTH&FORMAT=PNG_EMBED

(scroll down towards the bottom & right to see the graph - it’s a bit big)
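The validator link is the same trick one level further out. A small Python sketch, assuming only the query parameters visible in the URI above (the function name and the example.org address are placeholders):

```python
from urllib.parse import urlencode

VALIDATOR = "http://www.w3.org/RDF/Validator/ARPServlet"


def validator_url(rdf_url):
    """Wrap a URI that serves RDF/XML in a request to the W3C RDF Validator."""
    return VALIDATOR + "?" + urlencode({
        "URI": rdf_url,                     # the whole inner URI, escaped once more
        "PARSE": "Parse URI: ",
        "TRIPLES_AND_GRAPH": "PRINT_BOTH",  # triple listing plus the graph
        "FORMAT": "PNG_EMBED",              # graph as an embedded PNG
    })


print(validator_url("http://triplr.org/rdf/http://example.org/"))
```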

Should work in the Tabulator too.

Incidentally, there’s a neat trick at microrevie.ws: the subject of reviews gets posted to twitter as a simple string (ending with a ‘;’) but gets turned into a URI, e.g.
http://microrevie.ws/reviewables/Sparks+Alcoholic+Caffeinated+Beverage
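I don't know how microrevie.ws actually does this, but going by the example a guess at the mapping is easy to sketch in Python: drop the trailing ‘;’, form-encode the string, and hang it off a base URI. The function name is hypothetical.

```python
from urllib.parse import quote_plus

BASE = "http://microrevie.ws/reviewables/"


def reviewable_uri(subject):
    """Turn a review subject string (ending with ';') into a URI."""
    name = subject.strip().rstrip(";").strip()
    # quote_plus encodes spaces as '+', matching the example URI.
    return BASE + quote_plus(name)


print(reviewable_uri("Sparks Alcoholic Caffeinated Beverage;"))
```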

Could come in handy for those times you really want a literal as the subject of a triple. Right now microrevie.ws has Google hooked up to help you find out what the thing is. Making that more explicit seems like it might be a job for Linking Open Data.

* I’m pretty sure there won’t be much difference in complexity of the operational code between twitcrit 1.1 and twitcrit 2.0. One possible explanation for the reason the latter is still running (even though I haven’t looked at it in months) and the former isn’t might be that the code proper in the working version is just a simple bit of Python, loosely coupled to some Software as a Service doing storage + SPARQL elsewhere (it’s on the Talis Platform). If I’d had to run that bit of infrastructure myself, I doubt very much whether that’d still be running.

Import/export and the Web

A post on the Open Data Definition list from Ben Werdmuller asks an interesting question - is syndication an easier sell than import/export?

Ok, background first: Open Data Definition is a proposed format for transfer of data between systems, with DataPortability in mind. In many respects it’s a ‘lite’ reinvention of RDF, targeted at the average Web developer. While I and others might question the underlying assumption that RDF is too difficult for typical Web developers, and perhaps express a little gut-reaction pushback, there’s nothing inherently wrong with something like this if it fills a (possibly significant) niche, and plays nicely with other Web standards. Design-wise, there is a sanity check which can be applied, the Test of Independent Invention :

If someone else had already invented your system, would theirs work with yours?

Does/could RDF work with ODD? - well, nearly. Yes, because it should be reasonably straightforward to map between RDF graphs and ODD’s format (there’s an interesting little complication in its indirection of metadata that’d take a bit of figuring, but bashing it with SPARQL & XSLT for a while would probably suggest a good approach). It fails right now because ODD doesn’t as yet allow for transparent interpretation, not having an XML namespace, hence not really placing itself on the global Web. Any automatic conversion would have to be done by sniffing the content - an agent needs complete prior knowledge. [If the ODD folks are willing to give the format a namespace, I'll volunteer to sort out the mappings & GRDDL bits]. Hmm, I wonder if they’ve tried nesting ODD in other XML formats yet…

Anyhow, back to Ben’s question. I think he has a point - syndication should be a relatively easy sell these days because of RSS/Atom. But marketing aside, there are several different ways to get the data from system A to system B:

  1. import/export where the data is transferred through an intermediary (i.e. the desktop)
  2. one-off direct transfer (system B does a GET to system A)
  3. polling - traditional syndication, periodic transfer
  4. linkage - lazy polling, any transfer happens on demand

At this point in time, the first of these isn’t exactly Web-friendly, typically requiring a human intermediary for its operation. In future, with smarter clients maintaining a local cache of data, something like this might make more sense. Such clients could be acting as proxies for any of the other modes of connection. But let’s assume this kind of capability’s already here. If you stand back, the same thing is happening in all these cases - the receiver will be given an identifier for the resource of interest (the profile data or whatever) and can use HTTP on it as appropriate. This is completely independent of what’s in the data itself - even though RSS/Atom formats contain a series of time-stamped entries, the way they get processed is up to the consumer. These different modes are orthogonal to authentication/authorization and privacy or copyright issues. Each is, in its own way, using linked data. To get more information about something, the consumer follows its nose and dereferences the URIs. ‘Course if you bring message content into the equation and/or allow an arbitrary number of agents in the interaction, the number of possible modes explodes.
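To make the "same thing in all these cases" point concrete, here is a toy Python sketch of modes 2 to 4. Everything in it is illustrative: fetch stands in for an HTTP GET, the example.org identifier is a placeholder, and a real poller would wait between rounds. The only difference between the modes is when, and how often, the identifier gets dereferenced.

```python
from typing import Callable, List

Fetch = Callable[[str], str]  # stand-in for HTTP GET: URI in, representation out


def one_off(fetch: Fetch, uri: str) -> str:
    """Mode 2: system B does a single GET against system A."""
    return fetch(uri)


def poll(fetch: Fetch, uri: str, rounds: int) -> List[str]:
    """Mode 3: the same GET, repeated on a schedule."""
    return [fetch(uri) for _ in range(rounds)]


class Link:
    """Mode 4: hold only the identifier; transfer happens when it is followed."""

    def __init__(self, fetch: Fetch, uri: str) -> None:
        self.fetch = fetch
        self.uri = uri

    def follow(self) -> str:
        return self.fetch(self.uri)


def make_fake_fetch(log: List[str]) -> Fetch:
    """Build a fetch function that records every URI it is asked for."""
    def fetch(uri: str) -> str:
        log.append(uri)
        return "representation of " + uri
    return fetch


log: List[str] = []
fetch = make_fake_fetch(log)
profile = "http://example.org/profile"  # placeholder identifier

one_off(fetch, profile)
poll(fetch, profile, rounds=3)
link = Link(fetch, profile)          # nothing has been transferred for the link yet
requests_before_follow = len(log)
link.follow()                        # now it has
```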

So yeah, ok, what point am I trying to make here…dunno, it just seems somehow significant that questions like “syndication or import/export?” should arise, given the underlying infrastructure. More telling of the silo nature of many current Web systems - themselves generally products of a pre-Web mindset - than anything to do with the Web itself. This too shall pass, as they say.

See also: Walled gardens: mapping the parties

PS. Reminds me - in my little DP video I had a mockup of a “Connect!” button. It was only a mockup because of the deadline for videos; the implementation I had in mind was essentially OpenID + HTTP GET + SPARQL CONSTRUCT.

GRDDLing DeWitt’s Friends

DeWitt Clinton has a great write-up of Creating a HTML “friends” page from a Google Reader subscription list, a bit of hackery which leads to a hCard microformat-enriched friends list. A little tweak to the HTML can make it more machine-friendly, just adding a HTML Meta Data profile URI:

<head profile="http://www.w3.org/2006/03/hcard">

That profile is GRDDL-enabled, so any GRDDL-aware agent can interpret the source document as RDF. This part’s easy to demonstrate, thanks to the online W3C GRDDL service. So I’ve put a tweaked version of the HTML online, and here’s DeWitt’s friends page as RDF (in Turtle syntax, rendered a little verbosely).

Having set this up I realised the data wasn’t actually expressing the friend relationship, so went on to put together some SPARQL to sort that out - below. But afterwards I realised that DeWitt’s HTML was actually expressing the relationships using XFN class names, but again without the profile URI to make it machine-friendly. So another tweak:

<head profile="http://www.w3.org/2006/03/hcard http://www.w3.org/2003/g/td/xfn-workalike">

- the corresponding service output (scroll down to see the extra bits). I suppose I should mention that you can have as many space-separated profiles as you like, and the GRDDL-aware agent will interpret them independently, just accumulating all the triples. The second profile URI adds xfn:friend relationships; I think it would have been more useful with foaf:knows as well, but it is only a demo. One of these days the microformats folks might get around to tweaking the official profile appropriately…
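For anyone wanting to script the tweak rather than edit by hand, a minimal Python sketch follows. It assumes the page is well-formed XHTML in the XHTML namespace (real-world HTML would need tidying first), and the function name is made up. Profiles already declared are kept, and new ones are appended space-separated.

```python
import xml.etree.ElementTree as ET
from typing import List

XHTML_NS = "http://www.w3.org/1999/xhtml"
HCARD = "http://www.w3.org/2006/03/hcard"
XFN = "http://www.w3.org/2003/g/td/xfn-workalike"


def add_profiles(xhtml: str, profiles: List[str]) -> str:
    """Return the document with the given profile URIs declared on <head>."""
    ET.register_namespace("", XHTML_NS)  # serialize without an ns0: prefix
    root = ET.fromstring(xhtml)
    head = root.find("{%s}head" % XHTML_NS)
    if head is None:
        raise ValueError("no XHTML <head> element found")
    declared = head.get("profile", "").split()
    for profile in profiles:
        if profile not in declared:      # don't declare a profile twice
            declared.append(profile)
    head.set("profile", " ".join(declared))
    return ET.tostring(root, encoding="unicode")


page = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Friends</title></head><body/></html>"
)
tweaked = add_profiles(page, [HCARD, XFN])
print(tweaked)
```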

The SPARQL I mentioned looks like this:

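prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix vcard: <http://www.w3.org/2006/vcard/ns#>
prefix foaf: <http://xmlns.com/foaf/0.1/>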

CONSTRUCT
{
[ a foaf:Person;
foaf:homepage ;
foaf:name "DeWitt Clinton" ;
]
foaf:knows
[ a foaf:Person;
foaf:homepage ?homepage ;
foaf:name ?name ] .
}
WHERE
{
[ a vcard:VCard ;
vcard:url ?homepage ;
vcard:fn ?name ]
}

- when applied to DeWitt’s data (as RDF), this will map it across from the vCard vocabulary - finding the appropriate ?variables by matching the pattern in the WHERE clause, inserting those ?variables into the CONSTRUCT clause to produce some new RDF.
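If the match-then-instantiate idea isn't familiar, here is a toy version in plain Python, with no SPARQL engine involved. It is only a sketch under heavy simplifications: triples are tuples of strings, prefixed names are left unexpanded, IRIs and literals aren't distinguished, and the owner's homepage is a placeholder parameter. Note that, as in SPARQL, the blank nodes in the template are minted afresh for every match, which goes some way to explaining a bnode-heavy result.

```python
from itertools import count
from typing import Iterator, List, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"


def match_vcards(graph: List[Triple]) -> Iterator[Tuple[str, str]]:
    """WHERE clause: yield (?homepage, ?name) for every matching vCard node."""
    cards = sorted({s for s, p, o in graph if p == RDF_TYPE and o == "vcard:VCard"})
    for card in cards:
        homepages = [o for s, p, o in graph if s == card and p == "vcard:url"]
        names = [o for s, p, o in graph if s == card and p == "vcard:fn"]
        for homepage in homepages:
            for name in names:
                yield homepage, name


def construct_knows(graph: List[Triple], owner_name: str,
                    owner_homepage: str) -> List[Triple]:
    """CONSTRUCT clause: instantiate the template once per match."""
    fresh = count()
    out: List[Triple] = []
    for homepage, name in match_vcards(graph):
        # Template blank nodes are new for each solution.
        me, friend = "_:b%d" % next(fresh), "_:b%d" % next(fresh)
        out += [
            (me, RDF_TYPE, "foaf:Person"),
            (me, "foaf:homepage", owner_homepage),
            (me, "foaf:name", owner_name),
            (me, "foaf:knows", friend),
            (friend, RDF_TYPE, "foaf:Person"),
            (friend, "foaf:homepage", homepage),
            (friend, "foaf:name", name),
        ]
    return out


source = [
    ("_:card1", RDF_TYPE, "vcard:VCard"),
    ("_:card1", "vcard:url", "http://example.org/alice/"),  # made-up friend
    ("_:card1", "vcard:fn", "Alice Example"),
]
result = construct_knows(source, "DeWitt Clinton", "http://example.org/owner/")
for triple in result:
    print(triple)
```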

I tried this on the Redland SPARQL demo, and I think it’s producing the RDF I wanted. Unfortunately the serialization is really ugly - lots of bnodes, and it’s hard to check visually. It appears to confuse Tabulator too, and the W3C RDF Validator which is handy for this kind of visualization appears to be down. (Here’s a copy of the RDF/XML). Still, it was only a workaround - with the right profiles in place it’s not needed.

I’m not sure if there’s a microformat way of expressing that the source data was a subscription/reading list. To get the richest RDF out it might be easier to do what DeWitt did, but to a full RDF serialization rather than microformatted HTML (which is effectively a CustomRdfDialect), producing something like Planet RDF’s blogroll.