By Alex Tucker
This article will feature in Nodalities Magazine, Issue 6
Confessions of a Semantic Web Junkie
Let me start with a confession: I’ve been banging on about the semantic web to anyone who will listen for the past nine years. For some apparently deep-seated reason, which some have even labelled perverse, I’ve kept at it despite endless meetings, misunderstandings, offhand dismissals and blatant refusals to accept what seems to me to be obvious. It’s been a long and sometimes despairing nine years, but looking around now at the state of all things semantic web, one can’t fail to realize that it’s finally taking off and that companies like Talis are fully committed to its success.
During much of this time I’ve been working for various defence organisations showing how using semantic web standards and tools can help solve issues of interoperability — getting different systems to talk to one another. That the semantic web, and more specifically the semantic web ontology language OWL, can address these kinds of issues shouldn’t be too surprising, given that the US Defense Advanced Research Projects Agency (DARPA) helped fund the process which eventually led to OWL, precisely to solve issues of semantic interoperability.
More recently, I’ve had the pleasure of working with various teams and projects within NATO who have embraced the semantic web ideas and expanded on them in novel ways. A fundamental shift in attitude in defence, which has in part been driven by the 9/11 Commission’s findings, is that intelligence agencies must move from a ‘need to know’ to a ‘need to share’ mentality. For me, this shift has echoes in the ‘linked open data’ ideals, and for someone who also bangs on about open source, open standards and all things open, it’s a step in the right direction. Obviously, sharing information in a military context is a little different to the ideals of linked open data, but the fundamental issues around getting the information out of existing silos and usable by other systems are much the same.
A recent success in one such NATO project has been the decision to move from using a bespoke query protocol for RDF, a sort of ‘query by example’, to using the World Wide Web Consortium’s new semantic web query language, SPARQL. This might seem obvious, but a few years ago when the protocol was being developed, SPARQL didn’t exist — therein lies another discussion on how organisations need to be more agile to be able to cope with the length of a ‘Web Year’. As far as the project group are concerned however, this is great news, as they don’t have to come up with a ‘Standardization Agreement’ or STANAG to define their own query protocol and can instead just point to and use an existing standard, SPARQL.
Bonjour
Another standard used by this project is Bonjour (formerly Rendezvous, also known as Zeroconf), Apple’s open standard for no-nonsense network configuration and service discovery. It is the technology behind iTunes’ ability to discover and play music from other iTunes applications on different computers. The great thing about using Bonjour for this project is that it’s decentralized: there’s no need to set up a registry of information providers beforehand. Instead, each provider is free to publish its existence and its capabilities, on both local and wide area networks, without setting up any new infrastructure. Each service in a local network can publish its details in any number of domains. In the accompanying diagram, the clouds represent local networks, the arrows represent the act of publishing, or registering, service details to a domain, the cylinders represent services, and ‘.local’ is the domain for the local network. iTunes, for instance, currently publishes itself only to the local area network (unlike the initial iTunes release, where people realised they could share their music with friends over the internet by advertising their iTunes database under their own domain name; how terrible). The domains themselves correspond to the normal domain name system (DNS) names we’re used to, since DNS is really at the heart of the Bonjour protocol.
Services can then be discovered simply by looking for a particular service type in a particular domain; a typical Bonjour service discovery query would ask for all printer services in the local domain. These service types are given slightly cryptic names (iTunes, for example, uses _daap._tcp), but are all listed on-line at dns-sd.org/ServiceTypes.html for everyone to see and re-use.
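As a rough illustration of the lookup just described: browsing for a service type boils down to asking DNS for the pointer (PTR) records at a name built from the service type and the domain. The Python sketch below only builds that name. The function name browse_name is made up for this example, and sending the actual query is left to whichever Bonjour library you prefer.

```python
def browse_name(service_type: str, domain: str = "local") -> str:
    """Build the DNS name that is queried (for PTR records) when browsing."""
    # Strip stray dots so "_daap._tcp." and "_daap._tcp" behave the same,
    # then join type and domain and terminate with the root dot.
    return f"{service_type.strip('.')}.{domain.strip('.')}."


if __name__ == "__main__":
    # All printer services in the local domain.
    print(browse_name("_printer._tcp"))
    # The same kind of lookup in a wide-area domain (hypothetical domain name).
    print(browse_name("_daap._tcp", "example.org"))
```

The answers to such a query are the names of individual service instances, each of which can then be resolved to a host, a port and a set of properties.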
Hello SPARQL
What we’ve done then is create a service type for SPARQL, allowing information providers to publish their SPARQL ‘endpoints’ to be discovered by information consumers. Technically, the SPARQL service type is _sparql._tcp and is listed at dns-sd.org/ServiceTypes.html along with a short description of the properties which should be published.
In keeping with the RDF mantra that “anyone can say anything about anything”, we’ve tried to ensure that anyone can publish a description of not only their own SPARQL services, but also those of others. A published SPARQL service record has two properties, one called ‘path’ and one called ‘metadata’, from which a client derives two URLs, the former pointing to the SPARQL service endpoint, and the latter pointing to some arbitrary (RDF/XML encoded) metadata it can fetch. The normal approach would be to just use simple paths (e.g. /sparql and /service-metadata.rdf) for the values of these properties, which would be interpreted as pointing to the discovered host (e.g. http://dbpedia.org/sparql and http://dbpedia.org/service-metadata.rdf). However, by using a full URL for the value of either property, a service record can be published which points to a service or some metadata on a different host entirely, allowing us to form what are essentially proxy records for existing SPARQL services which don’t (yet) use Bonjour service discovery.
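The way a client might turn a discovered record into the two URLs can be sketched in a few lines of Python. This is an illustration under assumptions rather than the project's actual client code: the function name resolve_record is invented here, and the sketch assumes plain http as the scheme, as in the dbpedia.org example above.

```python
from urllib.parse import urljoin


def resolve_record(host: str, port: int, path: str, metadata: str) -> dict:
    """Derive the endpoint and metadata URLs from a discovered service record."""
    # Assumption: services are reached over plain HTTP; port 80 is left implicit.
    base = f"http://{host}/" if port == 80 else f"http://{host}:{port}/"
    # urljoin resolves a simple path against the discovered host, but returns
    # a full URL unchanged, which is what makes proxy records possible.
    return {
        "endpoint": urljoin(base, path),
        "metadata": urljoin(base, metadata),
    }


if __name__ == "__main__":
    # Normal case: both properties are simple paths on the discovered host.
    print(resolve_record("dbpedia.org", 80, "/sparql", "/service-metadata.rdf"))
    # Proxy record: the endpoint lives on a different host entirely.
    print(resolve_record("proxy.example", 8080,
                         "http://dbpedia.org/sparql", "/dbpedia-metadata.rdf"))
```

Treating both properties as URL references, rather than as bare paths, means the normal case and the proxy case go through exactly the same code.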
The Bonjour specifications are firmly geared towards making the user’s life simpler, so for instance while the name of a service is really just a normal DNS name, the specifications insist that it should be a human readable name with proper capitalization, spaces etc., although no dots are allowed. The specifications also give guidance as to the sorts of properties and values a service record should contain, and are keen that service records be as concise as possible and leave much of the nitty-gritty details of discovering service capability to the main protocol, in our case SPARQL.
Into the voID
However, the SPARQL protocol doesn’t have much to say about what to expect of an endpoint other than, “here it is.” Our approach to this has been to allow a little bit of extra information in the service record, for instance using the ‘vocabs’ property to declare the URIs of any vocabularies the service uses. To find out more information about a service, a consumer can use the value of the ‘metadata’ property to fetch a fuller service description. The only restriction is that this service description should be marked up in RDF/XML — what better way to encode metadata?
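For the curious, the properties of a service record travel in a DNS TXT record as a series of short ‘key=value’ strings, each preceded by a single length byte. The Python sketch below packs and unpacks such a record for the three properties mentioned so far. The function names and the example values are assumptions for illustration only. In particular, this article doesn’t say how several vocabulary URIs should be combined in ‘vocabs’, so the example sticks to one.

```python
def encode_txt(properties: dict) -> bytes:
    """Pack key/value pairs as length-prefixed 'key=value' strings."""
    out = bytearray()
    for key, value in properties.items():
        entry = f"{key}={value}".encode("utf-8")
        if len(entry) > 255:
            # A single length byte cannot describe a longer string.
            raise ValueError(f"entry for {key!r} exceeds 255 bytes")
        out.append(len(entry))
        out += entry
    return bytes(out)


def decode_txt(data: bytes) -> dict:
    """Unpack length-prefixed 'key=value' strings back into a dict."""
    properties = {}
    i = 0
    while i < len(data):
        length = data[i]
        entry = data[i + 1:i + 1 + length].decode("utf-8")
        key, _, value = entry.partition("=")
        properties[key] = value
        i += 1 + length
    return properties


if __name__ == "__main__":
    record = {
        "path": "/sparql",
        "metadata": "/service-metadata.rdf",
        "vocabs": "http://xmlns.com/foaf/0.1/",  # example value only
    }
    packed = encode_txt(record)
    print(len(packed), "bytes:", decode_txt(packed))
```

The 255-byte ceiling on each entry is one practical reason for keeping service records concise and leaving the detail to the metadata document.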
As to what this service description metadata should contain, well, that’s currently left to the provider. In an ideal world, using RDF and OWL to describe a service would mean that it is completely self-describing and that a consumer could fetch the document, fetch any referenced ontologies, and figure out what it all means completely automatically. In reality, we at least need some existing vocabulary to refer to, even if the client can infer and interpret meaning using ontologies. For us, that vocabulary is a bespoke ontology which is currently being worked on, but which is good enough for our needs right now.
So I was heartened to be shown a glimpse of voID last year, which upon further reading looks to offer exactly the sort of service description vocabulary needed to complement our Bonjour SPARQL service type. Another great thing about RDF (see, I’m such a zealot) is that an RDF document can easily use multiple vocabularies, and the instance data doesn’t even necessarily need to be linked together, so the service description document can happily accommodate more than one service description vocabulary or ontology without breaking anything. We’ll see what happens, but my vote would currently be for voID to be the suggested minimum requirement.
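To give a flavour of what such a description might look like, here is a Python sketch that writes out a tiny RDF/XML document using voID terms. Treat it as an assumption-laden illustration: the terms void:Dataset, void:sparqlEndpoint and void:vocabulary are taken from the voID vocabulary as I understand it, the function name and the example.org URIs are placeholders, and a real description would say a good deal more.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
VOID = "http://rdfs.org/ns/void#"


def void_description(dataset_uri: str, endpoint_url: str, vocab_uris: list) -> str:
    """Return a minimal RDF/XML description of a dataset and its endpoint."""
    # Use readable prefixes instead of ElementTree's generated ns0, ns1, ...
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("void", VOID)
    root = ET.Element(f"{{{RDF}}}RDF")
    dataset = ET.SubElement(root, f"{{{VOID}}}Dataset",
                            {f"{{{RDF}}}about": dataset_uri})
    ET.SubElement(dataset, f"{{{VOID}}}sparqlEndpoint",
                  {f"{{{RDF}}}resource": endpoint_url})
    # One void:vocabulary statement per vocabulary the dataset uses.
    for uri in vocab_uris:
        ET.SubElement(dataset, f"{{{VOID}}}vocabulary",
                      {f"{{{RDF}}}resource": uri})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(void_description("http://example.org/dataset",
                           "http://example.org/sparql",
                           ["http://xmlns.com/foaf/0.1/"]))
```

Because RDF documents can mix vocabularies freely, statements from a bespoke ontology could sit alongside these in the same file.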
Why?
There’s currently a list of SPARQL endpoints on the World Wide Web Consortium’s ESW wiki, along with a comment along the lines that the list will probably not get much bigger in the long run. The comment itself is well reasoned and not necessarily meant to be a negative one, but along with similar comments from peers certainly makes us stand back and wonder at the usefulness of any kind of registry, or even a decentralized set of registries, of SPARQL services.
There’s no doubt that the project I’m working with needs this right now, so it’s certainly useful, but what about when, and if, things get much bigger? A well known web technologist has published Bonjour service types for HTTP and HTTPS at dns-sd.org, but lately there doesn’t seem to be much take-up for this kind of facility, even if Apple’s Safari web browser has built in support with its Bonjour bookmarks feature.
On the other hand, Google does such a wonderful job, giving us a single endpoint to query the whole of the web, that it’s tempting to think that the semantic web will ultimately be sucked up in a similar fashion into a great big semantic data warehouse in the clouds, and there’ll be no need for anyone to offer up their own SPARQL endpoints.
My own expectation is that we’ll end up with something in between. On the one side there will be one or a few massive Google-like entry points which will, by spidering across the web, perhaps following the linked data trail, suck up much of the semantic web and present it to users in a simple entry point that is easy to consume and re-use. But when you’re after something specific, for instance buying a book or trawling through your bank statement, you don’t start at Google; likewise, I think there are plenty of cases where you’d want to be able to automatically discover SPARQL endpoints which you can access directly and which offer you precisely the details you need, there and then.
The semantic web is different to the (syntactic?) web, in that it makes it much easier to take information from multiple sources and join it together into something new and more useful. Imagine if everyone’s desktop were part of the semantic web, with all the information from emails, photos, music etc. offered up through SPARQL services (take a look at the Nepomuk project for an existing implementation of this). Imagining the sorts of applications we could build on top of the Semantic Desktop is an exciting prospect, but it’s not something I’d expect most people would want to make available through a single Google-style data warehouse, even if there were privacy safeguards in place, in the same way that most people don’t necessarily want to share all their photos on Flickr or Picasa.
Using Bonjour to discover SPARQL services means that a user can easily create or select a list of domains where their client applications will look for published endpoints, be they public or private, and given the right credentials they will be allowed to use those services to build the sorts of applications they want, no matter how esoteric: show me a list of all the photos of my son, sorted by the number of teeth he had, using my local photos as well as the photos taken by his grandparents; give me a list of washing machines, along with prices from these suppliers, where the washing machine is no wider than 60cm and has a Which Best Buy rating over 70%.
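To make the washing machine example a little more concrete, here is a Python sketch that wraps a SPARQL query into the kind of HTTP GET request URL the SPARQL protocol defines, with the query text carried in a ‘query’ parameter. Everything specific is assumed for illustration: the ex: vocabulary, its property names and the endpoint address are invented, and the Best Buy rating is left out to keep the query short.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical vocabulary: ex:WashingMachine, ex:widthCm and ex:price are
# placeholders, not terms from any published ontology.
QUERY = """
PREFIX ex: <http://example.org/shop#>
SELECT ?machine ?price WHERE {
  ?machine a ex:WashingMachine ;
           ex:widthCm ?width ;
           ex:price ?price .
  FILTER (?width <= 60)
}
ORDER BY ?price
"""


def query_url(endpoint: str, sparql: str) -> str:
    """Build a GET request URL carrying the query in the 'query' parameter."""
    # Assumes the endpoint URL has no query string of its own.
    return f"{endpoint}?{urlencode({'query': sparql.strip()})}"


if __name__ == "__main__":
    url = query_url("http://example.org/sparql", QUERY)
    print(url)
    # Decoding the URL again recovers the original query text.
    print(parse_qs(urlsplit(url).query)["query"][0])
```

A client that has discovered several endpoints could send the same query to each supplier’s service in turn and merge the results.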
Of course, if we start making all this sort of information available through our own SPARQL services, then there are all sorts of issues around trust, privacy, provenance and accuracy, but at the very least we are in control of our own data.
Try It!
Using Bonjour to publish and discover SPARQL services is simplicity itself, and I invite you to take a look at www.floop.org.uk/eagle/discovering-sparql for some command-line examples.
While at the moment none of the SPARQL servers out there publish their details using Bonjour, I’ve created a simple application for Tomcat which can publish any of the services it runs using Bonjour, and have used it to publish Joseki-based SPARQL services. I’m also working on both a SPARQL endpoint and Bonjour publishing for Plone and Zope. See http://www.floop.org.uk/projects for more details of these nascent projects.
What would be great is if the existing SPARQL servers and triple stores out there could add optional support for registering the details of any endpoints they make available, using Bonjour. It would certainly help the uptake of SPARQL in the projects I’m involved with.
Alex Tucker is a self-employed semantic web consultant, specialising in defence.