Nodalities

From Semantic Web to Web of Data
License

Creative Commons License

Talis’ Tour

It’s been a busy couple of months for the Semantic Web research community. At the very end of May the European Semantic Web Conference returned to Crete, where the series began in 2004. Now in its sixth year, the conference reflected the vibrancy of the research community in this area, the progress made to date, and the increased emphasis on deployment and uptake of Semantic Web technologies. The latter aspect was noticeable in many parts of the conference, not least in the Semantic Web In Use track, a new addition to the ESWC series, co-chaired by Talis Researcher Tom Heath.

With adoption of Semantic Web technologies and Linked Data principles increasing rapidly, many members of the research community met in late June at Schloss Dagstuhl in Germany for a seminar titled “Semantic Web: Reflections and Future Directions”. Almost ten years after the first Dagstuhl seminar on the Semantic Web, the goal of this event was to learn lessons from the past and map out the research agenda for the next ten years of the field. Again acknowledging the practical aspects of the field, there were lengthy and productive discussions on the topics of hosting and persistence of RDF vocabularies, and the urgent need to examine how Linked Data and the Semantic Web can enhance Human-Computer Interaction, both of which are topics close to our hearts at Talis.

The natural question that arises from exploring the next ten years of research in any field is “who’s going to do all the work?” Fortunately, in early July the Seventh Summer School on Ontological Engineering and the Semantic Web took place in Cercedilla, Spain, part-sponsored by Talis. This annual event, directed by Enrico Motta (The Open University) and Asun Gomez Perez (Univ. Politécnica de Madrid), provides over 50 students from Europe and beyond with lectures, invited talks and group projects in cutting-edge areas of the Semantic Web field, supported by a team of leading researchers. In addition to the knowledge gained from this intense week of study, students of the summer school get to network with their peers and build the very community that will drive forward the Semantic Web research agenda over the next ten years.

Interesting semantic web stuff

By Tom Scott
| This guest post originally appeared on Tom Scott’s blog; republished under Creative Commons License, and with kind permission of the author.

It’s starting to feel like the world has suddenly woken up to the whole Linked Data thing, and that’s clearly a very, very good thing. Not only are Google (and Yahoo!) now using RDFa, but a whole bunch of other things are going on, all rather exciting. Below is a round-up of some of the best. But if you don’t know what I’m talking about you might like to start off with TimBL’s talk at TED.

TimBL is working with the UK Cabinet Office (as an advisor) to make our information more open and accessible on the web [cabinetoffice.gov.uk]
The blog states that he’s working on:

  • overseeing the creation of a single online point of access and working with departments to make this part of their routine operations
  • helping to select and implement common standards for the release of public data
  • developing Crown Copyright and ‘Crown Commons’ licenses and extending these to the wider public sector
  • driving the use of the internet to improve consultation processes
  • working with the Government to engage with the leading experts internationally working on public data and standards

The Guardian has an article on the appointment.

Closer to home there have been a few interesting developments

Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections [pdf]
Our paper at this year’s European Semantic Web Conference (ESWC2009) looks at how the BBC has adopted semantic web technologies, including DBpedia, to help provide a better, more coherent user experience. It won best paper of the in-use track; congratulations to Silver and Georgie.

The BBC has announced a couple of SPARQL endpoints, hosted by Talis and OpenLink [welcomebackstage.com]
Both platforms allow you to search and query the BBC data in a number of different ways, including SPARQL — the standard query language for semantic web data. If you’re not familiar with SPARQL, the Talis folk have published a tutorial that uses some NASA data.
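
For anyone who hasn’t met SPARQL before, here is a rough sketch in Python of what talking to an endpoint like these involves. The endpoint URL and the query are placeholders made up for illustration, not the BBC’s actual service or data model. The only things taken as given are the standard SPARQL protocol convention of sending the query in a `query` parameter, and the standard JSON layout for results.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, for illustration only; substitute a real one.
ENDPOINT = "http://example.org/sparql"

# An illustrative query: ten resources and their labels.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing ?label
WHERE { ?thing rdfs:label ?label }
LIMIT 10
"""


def build_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def parse_bindings(body: str) -> list[dict[str, str]]:
    """Flatten a SPARQL JSON result document into plain dictionaries."""
    doc = json.loads(body)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]
```

Handing the result of `build_request(ENDPOINT, QUERY)` to `urllib.request.urlopen` would send the query; the response body can then go through `parse_bindings`.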

A social semantic BBC? [slideshare]
Nice presentation from Simon and Ben on how social discovery of content could work… “show me the radio programmes my friends have listened to, show me the stuff my friends like that I’ve not seen” all built on people’s existing social graph. People meet content via activity.

PricewaterhouseCoopers’ spring technology forecast focuses on Linked Data [pwc.com]
“Linked Data is all about supply and demand. On the demand side, you gain access to the comprehensive data you need to make decisions. On the supply side, you share more of your internal data with partners, suppliers, and—yes—even the public in ways they can take the best advantage of. The Linked Data approach is about confronting your data silos and turning your information management efforts in a different direction for the sake of scalability. It is a component of the information mediation layer enterprises must create to bridge the gap between strategy and operations… The term “Semantic Web” says more about how the technology works than what it is. The goal is a data Web, a Web where not only documents but also individual data elements are linked.”
Including an interview with me!

You should also check out…

sameas.org a service to help link up equivalent URIs
It helps you to find co-references between different data sets. Interestingly it’s also licensed under CC0 which means all copyright and related or neighboring rights are waived.
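
To give a feel for what finding co-references means, here is a small Python sketch that groups URIs into bundles, given pairs asserted to be equivalent. It is one generic way to do the grouping (a union-find structure), not a description of how sameas.org works internally, and the URIs in the usage note are invented.

```python
from collections import defaultdict


def coreference_bundles(pairs):
    """Group URIs into bundles, given pairs asserted to be equivalent."""
    parent = {}

    def find(uri):
        # Follow parent links to the representative, shortening the path.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for left, right in pairs:
        # Merge the two bundles by pointing one representative at the other.
        parent[find(left)] = find(right)

    bundles = defaultdict(set)
    for uri in list(parent):
        bundles[find(uri)].add(uri)
    return list(bundles.values())
```

Given the pairs `(A, B)` and `(B, C)`, the three URIs end up in a single bundle even though `A` and `C` were never directly linked, which is the useful part of such a service.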

Image: “Semantic Web Rubik’s Cube” by dullhunk, CC License, via flickr

Growing the Web of Data with Data Incubator

At Talis we’re huge fans of Linked Data, especially when it’s freely available for reuse too. However, we also realise that not everyone has been smitten by the Linked Data bug yet so we’re always thinking about new ways to help others use, publish and discover the benefits of connecting their data together.

Recently we were wondering how we could help organise the skill and expertise of people who love Linked Data to show data publishers how their data could be even more useful and effective. As the Linking Open Data project has shown, actions speak louder than words so we wanted to do something with practical and visible results.

One problem we face is that until data is available in open and reusable formats it’s not possible to show data owners the power locked up in their own data. Conversely it is hard for the data owner to justify investment in opening up their data without concrete demonstrations of that power. A classic deadlock situation! The goal of our new project is to break this deadlock. We plan to do this by organising people around popular datasets to create mappings to RDF, write conversion code and openly publish the resulting data. The result will be a huge reduction in the investment needed by the data owner: they can simply adapt the work and emit the Linked Data themselves.
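
As a flavour of what “a mapping to RDF plus conversion code” can look like at its very simplest, here is a Python sketch that turns one flat record into N-Triples lines. The field names, the example subject URI and the choice of Dublin Core element predicates are all assumptions made for illustration; a real Data Incubator project would agree its own mapping.

```python
# Hypothetical mapping from source field names to predicate URIs.
# The Dublin Core element URIs are real; choosing them is an assumption.
FIELD_TO_PREDICATE = {
    "title": "http://purl.org/dc/elements/1.1/title",
    "author": "http://purl.org/dc/elements/1.1/creator",
}


def escape_literal(text: str) -> str:
    """Escape the characters that N-Triples requires escaping in a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(subject_uri: str, record: dict) -> list[str]:
    """Turn one flat record into N-Triples lines, skipping unmapped fields."""
    lines = []
    for field, value in record.items():
        predicate = FIELD_TO_PREDICATE.get(field)
        if predicate is None:
            continue  # no mapping agreed for this field yet
        literal = escape_literal(str(value))
        lines.append(f'<{subject_uri}> <{predicate}> "{literal}" .')
    return lines
```

Keeping the mapping as data rather than code is a deliberate choice here: it is the part a data owner would most likely want to review and adapt.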

We call our new project the Data Incubator and if you love Linked Data then we encourage you to join in and help grow the web of data. Although this project is entirely independent of Talis, we are supporting it through the Talis Connected Commons scheme, providing free hosting and services for public domain data.

Already we have started projects to convert the Open Library dataset including much-loved books such as The Hobbit and to convert journal metadata provided by CrossRef, Highwire and the National Library of Medicine. Many more projects are being incubated and we are discussing how we create a repeatable process for contacting and encouraging data owners to take part.

Join the Data Incubator mailing list and get involved.

Talking about the upcoming European Semantic Web Conference, ESWC2009

In my latest podcast we take a look at the programme for this year’s European Semantic Web Conference (ESWC2009), which begins on the Greek island of Crete in a few weeks’ time.

Joining me for the call were the event’s General Chair Fabio Ciravegna and Programme co-Chair Lora Aroyo. We were also joined by Alan Smeaton, one of the invited Keynote speakers, who discussed the scope of his talk on synergies between semantic technologies, video and the emerging Sensor Web.

During the conversation, we refer to the following resources;

This conversation was recorded on Thursday 30 April, 2009.

For other Talis podcasts in this Nodalities series, see here. To subscribe to updates from all of Talis’ podcast series, see here.

Streams, Pools and Reservoirs

by Leigh Dodds
| this article features in Nodalities Magazine, issue 6

As we start to move past the current boot-strapping phase of the semantic web in which we are constructing the web of linked data, it’s useful to begin discussing what other features and infrastructure we need in order to support sustainable usage of this huge and growing data set: what services can be offered over linked data? Do we need to consider how to provide quality of service, stability and longevity to the data, or does the sheer scale of the web make these moot points?

In order to answer these questions it’s useful to compare the ongoing development of the linked data web with that of the web itself.

A Brief History Lesson

There have been several phases of activity in the development of the web. While in truth these phases were of different duration, overlapped with one another, and have happened at different rates within different communities, essentially we have gone through the following basic steps.

Firstly we concentrated on just getting stuff online. The early web was a new medium for document and data exchange and so was at its core a simple publishing device used as a collaborative space between small communities. But as the amount of content and the size and breadth of those communities grew, the emphasis shifted towards linking: tying content together to create (initially hand-crafted) indexes of the web and knit the available content into a greater whole.

The second, manual linking phase was quickly supplemented by a third phase of automated linking between content: search engines. A search engine is simply a way to quickly create a link-base based on some search criteria. The crawling and indexing of the document web by web crawlers allows users to quickly construct links to content of potential interest.

A fourth phase of the web’s development has been triggered by the commoditisation of search and the need for search engines to differentiate themselves and offer additional value-added services. Search engine features are now tailored towards particular uses or types of content (Google Image Search; Google Scholar); offer value-added features that capitalise on the ability of search engines to analyse the structure and traffic flows across the web (PageRank and similar indexing improvements; Google Trends); expand the audience for content (Google Translate); and enable community-driven customisation of the search experience (Google Custom Search; Yahoo! SearchMonkey, etc.).

No doubt there will be subsequent phases of development, and the perspective of history will let us tease out common strands of development, some of which will already be happening. But if we look at the recent, rapid development of the linked data cloud, we can already see that the same pattern is being repeated.

History Recapitulated

There has been RDF data available on the web for many years, used by a limited community of researchers. This slow accumulation of content – echoing the first phase of content publishing on the document web – has been replaced by a rapid increase in data publishing encouraged through the Linking Open Data (LOD) project. By providing clear pragmatic guidance and instructions on how to publish data for the semantic web, that project has enabled us to accelerate our transition through that first content publishing phase. But it has also, crucially, encouraged the linking together of data sets (Phase 2).

This linking has to a great extent been manual. Not in the sense that members of the LOD community are manually entering data to link datasets together, but rather at the level of looking for opportunities to link together datasets, encouraging data publishers to co-ordinate and inter-relate their data, and attempting to organically grow the linked data web by targeting datasets that would usefully annotate or extend the current Linked Data Cloud.

The rapid growth of the Linked Data Cloud means that this “manual” phase will soon be over: there will be sufficient momentum behind the semantic web that increasing amounts of data will become available and no single community will be able (or need) to shepherd its development. The focus will shift towards the subject specific communities who will instead co-ordinate at a more local level. Semantic web search engines will also become a reality.

Semantic Web search engines need to be distinguished from semantically enabled search engines. The latter use techniques like natural language parsing and improved understanding of document semantics in order to provide an improved search experience for humans. A Semantic Web search engine should offer infrastructure for machines. This Third Phase is also beginning to take place. Simple semantic web search engines like Swoogle and Sindice provide a way for machines to construct link bases, based on some simple expressions of what data is of relevance, in order to find data that is of interest to a particular user, community, or within the context of a particular application. And crucially this can be done without having to always crawl or navigate over the entire linked data web. This process can be commoditised just as it has with the web of documents.

Co-Evolution of the Web Infrastructure

Given the strong concordance between the phases of development of the document and linked data web, it is reasonable to make some predictions on how semantic web search engines, and additional supporting infrastructure, are likely to evolve by comparing them with the development of human search engines. For each of the specialisations and value-added features listed earlier it’s possible to see an equivalent for the machine-readable web:

Document Web | Semantic Web Infrastructure | Description
Google Image Search | Type Searching | Ability to discover resources of a particular type: e.g. Person, Review, Book
Google Translate | Vocabulary Normalisation | Application of simple inferencing to expose data in more vocabularies than those made available by the publisher
Google Custom Search | Community Constructed Data Sets and Indexes | Ability to create and manipulate custom subsets of the linked data cloud
Google Trends | Linked Data Analysis & Publishing Trends | Identifying new data sources; new vocabularies; clusters of data; data analysis
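
Two of the rows above, type searching and vocabulary normalisation, are simple enough to sketch in a few lines of Python over triples held as tuples. The rule table below is an assumption chosen for illustration (FOAF does declare `foaf:name` a sub-property of `rdfs:label`, but a real service would load its rules from the vocabularies themselves), and the function names are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"

# Assumed rule set: a triple using the key predicate also implies
# the same triple with the value predicate.
IMPLIED_PREDICATES = {FOAF_NAME: RDFS_LABEL}


def normalise(triples):
    """Return the triples plus those implied by IMPLIED_PREDICATES."""
    extra = {
        (s, IMPLIED_PREDICATES[p], o)
        for s, p, o in triples
        if p in IMPLIED_PREDICATES
    }
    return set(triples) | extra


def resources_of_type(triples, type_uri):
    """Type search: every subject declared to have the given rdf:type."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == type_uri}
```

A client that only understands `rdfs:label` can then read names published with `foaf:name`, without the publisher having changed anything.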

These last two are particularly interesting as they suggest the need to be able to easily aggregate, combine and analyse aspects of the linked data cloud. This infrastructure will need to be able to support the community in working with data in a variety of ways, allowing data to flow and be collected where it is needed. Introducing a metaphor for this process might help highlight some of the processes and their consequences.

Flowing Data

Data is like water and flows of data are like streams. These streams of data can arise from any number of different sources: from a person entering data into a system; from a click stream generated as a side-effect of web browsing; application events; or generated from real-world sensor measurements. There are already many ways that we can tap into these data streams, using web-based query APIs, messaging systems like XMPP, or syndication protocols like Atom and RSS.

While these streams of data are already supporting a huge range of different applications and use cases, they are inherently limited: a stream has no memory. If historical context is required, e.g. to support more complex querying and reporting, then each consuming application must collect and store the data. We can think of these collections of data as pools; each stream of data on the web may feed any number of different application-specific pools.
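
The difference between a stream and a pool can be shown in miniature in Python, where a generator behaves much like a stream: once a value has flowed past, it is gone. The sensor readings and class names are invented for the example.

```python
from typing import Iterable, Iterator


def sensor_stream() -> Iterator[dict]:
    """A stand-in stream: each reading is produced once and then gone."""
    for minute, celsius in enumerate([18.5, 19.0, 21.5]):
        yield {"minute": minute, "celsius": celsius}


class Pool:
    """An application-specific store that remembers what flowed past."""

    def __init__(self) -> None:
        self.readings: list[dict] = []

    def collect(self, stream: Iterable[dict]) -> None:
        """Drain the stream into the pool."""
        self.readings.extend(stream)

    def warmest(self) -> dict:
        # A historical query: not answerable from the stream alone.
        return max(self.readings, key=lambda r: r["celsius"])
```

After `pool.collect(stream)` the stream itself is exhausted, so any other application wanting the same history has to keep a pool of its own, which is exactly the duplication described next.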

A pool of data provides extra flexibility, but comes at the cost of requiring each consuming application to maintain its own infrastructure to hold copies of that data. Even if each source of data provides direct access to its own pool, e.g. by exposing a web-based query interface onto its database, or by exposing linked data, there are still unnecessary overheads. Each data provider must provide their own scalable infrastructure and support a rich set of data access options.

If we start building large pools of data, within a community supported infrastructure, then we have a reservoir. A reservoir is a pool of data that is maintained by and services a specific community. Reservoirs allow issues such as quality of service (reliable supply of water) and infrastructure costs (building of pipelines) to be solved at a community level.

It’s possible to argue that the web already consists of streams, pools, and reservoirs, but there is a distinct difference between a web based on semantic web technology and a Web constructed of a mixture of XML documents or similar formats: like water, at the molecular level, all RDF is the same; it’s all triples. Unlike alternatives, RDF data is more easily pooled and collected and so is much more amenable to explorations of shared infrastructure. Like a relational database, an RDF triple-store can contain a huge variety of different kinds of data. But unlike a relational database, an RDF triple-store has the potential for the aggregate to be much more than the sum of its parts. The seeds of convergence are built in, through reliance at the most fundamental level on a global naming system (URIs) and standardised ways to state equivalence and relationships between resources.
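
The “it’s all triples” point can be made concrete with a tiny Python sketch. The two datasets, the publishers behind them and the rating vocabulary are invented; what matters is that both use the same URI for the book, so pooling them is nothing more than a set union.

```python
# Made-up data from two hypothetical publishers that share one URI.
BOOK = "http://example.org/book/hobbit"

library_data = {
    (BOOK, "http://purl.org/dc/elements/1.1/title", "The Hobbit"),
}
review_data = {
    (BOOK, "http://example.org/vocab/rating", "5"),
}


def describe(graph, subject):
    """Collect what is known about one subject (one value per predicate)."""
    return {p: o for s, p, o in graph if s == subject}


# Pooling two sets of triples is just set union; no schema mapping step.
merged = library_data | review_data
```

Neither source alone knows both the title and the rating, but `describe(merged, BOOK)` does, with no agreement needed between the publishers beyond the shared name.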

In the real world, reservoirs do more than supply a community with water. The aggregate has its own uses: water skiing or hydro-electric power generation for example. And the same will be true of semantic web data reservoirs: large collections of data can be analysed and re-purposed in ways that are not possible – or at least not achievable without a great deal of repeated, redundant integration effort – using other techniques. The reservoir itself can be the source of new facts and new streams of data derived from analysis of its contents.

Flowing Data through the Talis Platform

The goal of the Talis Platform is to support the growth of the Linked Data ecosystem by providing the infrastructure to support the creation of pools of data. For additional background, see my article “Enabling the Linked Data Ecosystem” from Nodalities issue 5.

At present the Platform provides a range of services that allow data to be easily streamed into and out of Platform stores, allowing data to be easily pooled in order to benefit from greater context. Data can be pushed directly into the Platform and we are exploring methods of supporting other forms of data ingestion to make it easier and more natural to begin to accumulate data sets within the Platform.

The core search service, which produces its results in RSS, allows the creation of simple data streams, while the SPARQL interface supports more complex data extraction methods. The Augmentation service provides an interesting twist on these conventional approaches, providing a means for any RSS 1.0 feed to be automatically enriched with extra metadata by feeding it through a Platform data store. This means of interaction is like fishing for data: it is possible to serendipitously find and extract data, capturing it as extra context to items in an RSS feed, without having to deal with writing SPARQL queries or constructing a keyword search. There are many more methods and modes of data extraction that will be added to the Platform to add to these existing services; this is just the beginning.

But the Talis Platform is intended to provide much more than just the ability to work with pools of data. The bigger vision is to support the creation of true data reservoirs, and enable many different ways of manipulating and analysing their contents in order to discover new facts and bring new context to that data. Creation of these larger pools of content will need to be made sustainable for the communities that are creating them, and deriving value from them. Sustainability covers a wide range of issues that go beyond just commercial issues: quality and range of services are additional factors, as are forms of governance, trust and quality that relate to the data sets themselves. The Platform is intended to address all of these issues.

To take a small example, the experimental “store groups” feature that was released at the end of last year provides a simple method for combining datasets, without requiring that data to be completely loaded or copied into a single database. The store groups feature will ultimately support a range of services over the constituent data sets, allowing each pool of data to remain intact whilst still contributing to the whole; this will be important to support the new forms of governance that are beginning to emerge around datasets on the Linked Data web.
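
The general idea of querying several pools as one, without copying them together, can be sketched in Python as follows. This is an illustration of the concept only: the `Store` and `StoreGroup` classes are hypothetical in-memory stand-ins and do not reflect the Talis Platform’s actual API or implementation.

```python
from itertools import chain


class Store:
    """A minimal in-memory stand-in for one pool of triples."""

    def __init__(self, triples):
        self.triples = set(triples)

    def match(self, s=None, p=None, o=None):
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in self.triples:
            if all(want is None or want == got
                   for want, got in zip((s, p, o), triple)):
                yield triple


class StoreGroup:
    """Query several stores as one without copying their contents."""

    def __init__(self, *stores):
        self.stores = stores

    def match(self, s=None, p=None, o=None):
        # Each store answers for itself; results are chained lazily.
        return chain.from_iterable(
            store.match(s, p, o) for store in self.stores
        )
```

Because the group only holds references to its member stores, each pool stays intact and under its owner’s control while still contributing to the combined answer.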

Eric Hillerbrand talks about social commerce and the Semantic Web

In my latest podcast I talk with Eric Hillerbrand about his notions of ‘Social Commerce.’

We discuss some of the ways in which semantic technologies play a part in altering the traditional relationship between people and brands, consumers and retailers.

This conversation was recorded on Friday 17 April, 2009.

For other Talis podcasts in this Nodalities series, see here. To subscribe to updates from all of Talis’ podcast series, see here.

Ivan Herman talks about the Semantic Web and W3C

In my latest podcast I talk with Ivan Herman, Semantic Web Activity Lead at the World Wide Web Consortium (W3C).

We discuss W3C’s continued engagement with Semantic Web activity around the world, touch upon current activity to enhance existing specifications such as SPARQL, and consider the success of the Linked Data meme.

During the conversation, we refer to the following resources;

This conversation was recorded on Wednesday 8 April, 2009.

For other Talis podcasts in this Nodalities series, see here. To subscribe to updates from all of Talis’ podcast series, see here.

Linked Data In(ter)action

By Benjamin Nowack

| This article will feature in Nodalities Magazine, Issue 6

In recent months, the Semantic Web community has been accelerating its progress around web-enhanced information and knowledge management. Specifications such as RDF and SPARQL are increasingly applied by developers and organizations, and RDF software is maturing. Even the initial chicken and egg problem around data and applications has now been solved by the Linking Open Data (LOD) project, which is bringing dataset after dataset online, each following recommended practices for simplified information access and repurposing. The time has finally come to move on and create the distributed data applications we have been dreaming of for so long.

Just like the Web’s true innovation was not hypertext as such, but freeing it from isolated CD ROMs, the Semantic Web’s value proposition is not information integration per se, but doing it on a global scale. Network effects will play an important role and have to be considered by application developers. Mashups on a semantic web are not one-off combinations of existing sources and APIs. They will feed their added value back into a self-reinforcing Linked Data Ecosystem, thus enabling chains of applications, with each reaping the benefits of the previous one. RDF developers these days often use terms like “Meshup” or “Hyperdata” to describe the direction they are headed.

Linked Data is all about portability and off-site use: the more users an application attracts, the more it will let them take their data with them and also integrate external sources. With a bit of luck, we will see not one, but a wealth of killer applications, where the “unique selling proposition” is personal and defined by each user individually.

Despite the ongoing advances, some pieces of the puzzle are still missing. This becomes clearer when we correlate the current state of the Linked Data market to a typical information life cycle classification. While we can name solutions for each value-increasing process (Creation, Organization, Utilization, Distribution, Discovery), the Utilization and Application stage represents a bottleneck. Products start to benefit from Linked Data, but few are also re-distributing their internally enriched information. Additionally, the Creation phase today is mostly driven by dedicated efforts such as the LOD project, although data manipulation and enhancing should also be possible right while people are interacting with semantic web content.

Linked Data Value Spiral

A few months ago, Talis researcher Tom Heath wrote an inspiring IEEE Internet Computing essay titled “How Will We Interact with the Web of Data?” where he described the upcoming challenges and opportunities in the context of human-computer interaction. He suggested that on a web where the granularity is increased from documents to arbitrary things, user interfaces should treat individual objects as first-class citizens, ideally providing context-specific functionality, direct manipulation, and coherence across personal usage scenarios. Application models that go beyond browsing and which are both universal and user-friendly are an ongoing challenge.

A system that aims at finding a sweet spot between simplicity and standardized interaction is Paggr (paggr.com). The basic idea is to combine successful Web 2.0 solutions and trends with Tim Berners-Lee’s concept of an “RDF Clipboard” for polymorphic data exchange between desktop applications. The required technical trick for copy-by-reference across desktop and web applications was introduced by Ray Ozzie three years ago through his “Live Clipboard”. Around the same time, AJAX and converging browser capabilities mass-enabled interactive HTML elements, and personal portal builders such as Netvibes brought widgets and drag and drop to end-users. The number of open datasets and technical possibilities finally led to a first prototype for building Linked Data Dashboards a few months ago.

The system used Netvibes-like pages with three resizable columns that could be populated with so-called Sparqlets. A Sparqlet is a SPARQL-powered widget, defined by a set of queries and result templates. The output consists of machine-readable HTML which addresses three essential requirements:

  • Widgets can easily be copied to other dashboards, as their complete definition is retrievable via HTTP (by de-referencing the widget identifier).
  • Individual items in a widget can be interactively linked to other items, as each element is associated with a URI. This makes semantic drag and drop possible, such as dragging a person representation onto a map or an address book widget.
  • Augmented data can instantly be fed back into the personal or public data cloud.
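
To make the idea of a widget “defined by a set of queries and result templates” a little more tangible, here is a Python sketch of such a definition. It is a guess at the general shape, not Paggr’s actual data model or markup: the class, its fields, the `data-widget` attribute and all URIs are hypothetical, and it handles a single query and template for brevity.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Sparqlet:
    """Hypothetical widget definition: an identifier, a query, a template."""

    uri: str            # dereferenceable identifier of the widget itself
    query: str          # SPARQL text that selects the items to show
    item_template: str  # uses {uri} and {label} placeholders

    def render(self, rows):
        """Render result rows as HTML in which every item carries its URI."""
        items = "\n".join(
            self.item_template.format(
                uri=escape(row["uri"], quote=True),
                label=escape(row["label"]),
            )
            for row in rows
        )
        widget = escape(self.uri, quote=True)
        return f'<ul data-widget="{widget}">\n{items}\n</ul>'
```

Because both the widget and each rendered item expose a URI in the markup, a script on the page has what it needs to copy the widget by reference or to link one dragged item to another.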

Architecture

The prototype received encouraging and very helpful feedback at the International Semantic Web Conference (and even won a prize). We are clearly not ready for the mainstream user yet, but building on established interaction models seems to be a promising acceptance strategy. The next iteration of Paggr is now almost finished and we are looking forward to putting it online. The first public applications will be limited to focused use cases (such as an organizer for conference attendees) as we are still working on certain interface behaviors, but a private alpha phase with fewer restrictions is planned, too.

Linked Data Dashboards face a number of usability challenges. The big question is how to tie the wealth of possibilities to a generic user interface without sacrificing work efficiency. Application convenience often boils down to feature reduction and contextual options, possibly combined with shortcuts for common tasks. To reduce complexity, Paggr lets the user (or app creator) break the theoretically infinite possibilities down into separate dashboards, where options and relations can be further spread across widgets.

The more complicated part starts at the widget level. Semantic drag and drop is often multi-modal. Dragging an event on a calendar does not necessarily mean “Add”; there are many ways to link two persons to each other, etc. Also, working with Linked Data is sometimes like having a backstage pass for a concert: very exciting, but also a bit rough, easily overwhelming, and if you open the wrong door, you can quickly find yourself getting kicked out. Raw data (or equally ugly RDF/HTML dumps) are always just a link away, so application designers will try to carefully shield non-developers from being exposed to things like DBpedia pages. For developers, on the other hand, this equivalent to the early Web’s “view source” feature can be very valuable.

Now, what exactly are the requirements and nice-to-haves, and (how) can they be implemented through widgets without leading to cluttered screen real estate? As mentioned above, in order to support drag and drop as well as copy and paste between different browser tabs or even at the operating system level, we can use a technical trick introduced by Live Clipboard: transparent form fields that natively provide “right-click / paste” and similar functionality. For a consistent user experience, this means that we need distinguishable (but unobtrusive) fields for each interactive element. In Paggr, small Semantic Web icons next to widget items and title bars signal the availability of advanced options. They enable:

  • widget filtering
  • copying widget or item identifiers
  • removing items from and adding items to widgets
  • interlinking individual items
  • custom contextual menus

Paggr Widget

The approach of using dedicated interaction zones has desirable side-effects. Non-expert users are less likely to get confused, as the general markup keeps its expected behavior. It also becomes possible to disable the semantic extensions simply by deactivating and hiding the icons. A public dashboard or shared meshup may look and feel just like a normal website.

There are still several unresolved issues left and future iterations could well require a complete re-design, but Paggr is just one of a growing number of consumer-oriented Linked Data systems. After years of hard infrastructure work, the Semantic Web community is finally starting to benefit from the investments. Data-wise, we have probably reached the tipping point already. Even former critics start to make their information available in RDF, efforts like microformats, once regarded as competitors, have become accessible from SPARQL, and services like OpenCalais, Yahoo!’s SearchMonkey, or the Zemanta API are constantly reinforcing the network effects of structured open data. It should only be a matter of months until we are going to see the first fully-fledged Linked Data applications for end-users.

Benjamin Nowack is the developer of Paggr. He runs semsol, a tiny Semantic Web agency in Düsseldorf, Germany.

Discovering SPARQL

By Alex Tucker

This article will feature in Nodalities Magazine, Issue 6

Confessions of a Semantic Web Junkie

Let me start with a confession: I’ve been banging on about the semantic web to anyone who will listen to me for the past nine years. For some apparently deep-seated reason, which some have even labelled perverse, I’ve kept at it despite endless meetings, misunderstandings, off-handed dismissals and blatant refusals to accept what seems to me to be obvious. It’s been a long and sometimes despairing nine years, but looking around now at the state of all things semantic web, one can’t fail to realize that it’s finally taking off and that companies like Talis are fully committed to its success.

During much of this time I’ve been working for various defence organisations showing how using semantic web standards and tools can help solve issues of interoperability — getting different systems to talk to one another.  That the semantic web, and more specifically the semantic web ontology language OWL, can address these kinds of issues shouldn’t be too surprising, given that the US Defense Advanced Research Projects Agency (DARPA) helped fund the process which eventually led to OWL, precisely to solve issues of semantic interoperability.

More recently, I’ve had the pleasure of working with various teams and projects within NATO who have embraced the semantic web ideas and expanded on them in novel ways.  A fundamental shift in attitude in defence, which has in part been driven by the 9/11 Commission’s findings, is that intelligence agencies must move from a ‘need to know’ to a ‘need to share’ mentality.  For me, this shift has echoes in the ‘linked open data’ ideals, and for someone who also bangs on about open source, open standards and all things open, it’s a step in the right direction.  Obviously, sharing information in a military context is a little different to the ideals of linked open data, but the fundamental issues around getting the information out of existing silos and usable by other systems are much the same.

A recent success in one such NATO project has been the decision to move from using a bespoke query protocol for RDF, a sort of ‘query by example’, to using the World Wide Web Consortium’s new semantic web query language, SPARQL.  This might seem obvious, but a few years ago when the protocol was being developed, SPARQL didn’t exist — therein lies another discussion on how organisations need to be more agile to be able to cope with the length of a ‘Web Year’.  As far as the project group are concerned however, this is great news, as they don’t have to come up with a ‘Standardization Agreement’ or STANAG to define their own query protocol and can instead just point to and use an existing standard, SPARQL.

Bonjour

Another standard used by this project is Bonjour (formerly Rendezvous, also known as Zeroconf), Apple’s open standard for no-nonsense network configuration and service discovery, and the technology behind iTunes’ ability to discover and play music from other iTunes applications on different computers. The great thing about using Bonjour for this project is that it’s decentralized: there’s no need to set up a registry of information providers beforehand. Instead, each provider is free to publish their existence and their capabilities, both on a local and wide area network, without setting up any new infrastructure. Each service in a local network can publish its details in any number of domains. The clouds above represent local networks, the arrows represent the act of publishing, or registering service details to a domain, the cylinders represent services, and the ‘.local’ is the domain for the local network. iTunes, for instance, currently publishes itself only to the local area network (unlike the initial iTunes release, where people realised they could share their music with friends over the internet by advertising their iTunes database to their own domain name; how terrible). The domains themselves correspond to the normal domain name system (DNS) names we’re used to, since DNS is really at the heart of the Bonjour protocol.

Services can then be discovered simply by looking for a particular service type in a particular domain; a typical Bonjour service discovery query would ask for all printer services in the local domain.  These service types are given slightly cryptic names (iTunes, for example, uses _daap._tcp), but are all listed on-line at dns-sd.org/ServiceTypes.html for everyone to see and re-use.
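As a rough illustration of what "looking for a service type in a domain" means, DNS-based service discovery browses by asking for records under a name built from the service type and the domain. The sketch below only assembles that name; `browse_name` is a hypothetical helper of my own, not part of any Bonjour API, and a real client would leave this step to its Bonjour library.

```python
def browse_name(service: str, protocol: str = "tcp", domain: str = "local") -> str:
    """Build the DNS name a client would browse for a given service type.

    Hypothetical helper for illustration only; a Bonjour/DNS-SD library
    normally constructs and sends this query for you.
    """
    if protocol not in ("tcp", "udp"):
        raise ValueError("service types are registered under _tcp or _udp")
    # Service and protocol labels carry a leading underscore; the domain follows.
    labels = [f"_{service}", f"_{protocol}"] + domain.strip(".").split(".")
    return ".".join(labels) + "."


if __name__ == "__main__":
    print(browse_name("daap"))                           # iTunes sharing, local network
    print(browse_name("printer", domain="example.org"))  # printers in a wide-area domain
```

The domain argument is what lets the same lookup work on the local network or in any ordinary DNS domain.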

Hello SPARQL

What we’ve done then is create a service type for SPARQL, allowing information providers to publish their SPARQL ‘endpoints’ to be discovered by information consumers.  Technically, the SPARQL service type is _sparql._tcp and is listed at dns-sd.org/ServiceTypes.html along with a short description of the properties which should be published.

In keeping with the RDF mantra that “anyone can say anything about anything”, we’ve tried to ensure that anyone can publish a description of not only their own SPARQL services, but also those of others.  A published SPARQL service record has two properties, one called ‘path’ and one called ‘metadata’, from which a client derives two URLs, the former pointing to the SPARQL service endpoint, and the latter pointing to some arbitrary (RDF/XML encoded) metadata it can fetch. The normal approach would be to just use simple paths (e.g. /sparql and /service-metadata.rdf) for the values of these properties, which would be interpreted as pointing to the discovered host (e.g. http://dbpedia.org/sparql and http://dbpedia.org/service-metadata.rdf).  However, by using a full URL for the value of either property, a service record can be published which points to a service or some metadata on a different host entirely, allowing us to form what are essentially proxy records for existing SPARQL services which don’t (yet) use Bonjour service discovery.
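The resolution rule described above can be sketched in a few lines. This is a minimal sketch under my own assumptions: the function name `resolve_record`, the dictionary shape, and the use of plain `http` as the scheme are all illustrative choices and are not taken from the service type definition.

```python
from urllib.parse import urljoin


def resolve_record(host: str, port: int, txt: dict) -> dict:
    """Turn a discovered service record into endpoint and metadata URLs.

    Assumptions (not from the service type definition): the scheme is
    plain http, and `txt` holds the record's 'path' and 'metadata' values.
    """
    base = f"http://{host}/" if port == 80 else f"http://{host}:{port}/"
    # urljoin keeps a full URL as it is and resolves a simple path
    # against the discovered host.
    return {
        "endpoint": urljoin(base, txt["path"]),
        "metadata": urljoin(base, txt["metadata"]),
    }


if __name__ == "__main__":
    print(resolve_record("dbpedia.org", 80,
                         {"path": "/sparql", "metadata": "/service-metadata.rdf"}))
    # A proxy record: the endpoint lives on a different host from the publisher.
    print(resolve_record("proxy.example.org", 8080,
                         {"path": "http://dbpedia.org/sparql", "metadata": "/meta.rdf"}))
```

Because standard URL joining already treats an absolute URL as overriding the base, proxy records need no special case in the client.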

The Bonjour specifications are firmly geared towards making the user’s life simpler, so for instance while the name of a service is really just a normal DNS name, the specifications insist that it should be a human readable name with proper capitalization, spaces etc., although no dots are allowed.  The specifications also give guidance as to the sorts of properties and values a service record should contain, and are keen that service records be as concise as possible and leave much of the nitty-gritty details of discovering service capability to the main protocol, in our case SPARQL.
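To see why conciseness matters, it helps to look at how the properties travel: in DNS-based service discovery they are carried in a TXT record as a series of "key=value" strings, each prefixed by a single length byte and therefore limited to 255 bytes. The encoder below is only a sketch for illustration; `encode_txt` is a hypothetical name, and in practice a Bonjour library does this packing for you.

```python
def encode_txt(properties: dict) -> bytes:
    """Pack key/value properties into DNS-SD TXT record wire format.

    Hypothetical helper for illustration: each entry becomes one
    length-prefixed "key=value" string of at most 255 bytes.
    """
    out = bytearray()
    for key, value in properties.items():
        entry = f"{key}={value}".encode("utf-8")
        if len(entry) > 255:
            raise ValueError(f"property {key!r} is too long for a TXT entry")
        out.append(len(entry))  # one length byte, then the entry itself
        out += entry
    return bytes(out)


if __name__ == "__main__":
    record = encode_txt({"path": "/sparql", "metadata": "/service-metadata.rdf"})
    print(len(record), record)
```

Every property published this way is sent to each client that looks up the record, which is a good reason to keep the record short and push the detail into the metadata document.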

Into the voID

However, the SPARQL protocol doesn’t have much to say about what to expect of an endpoint other than, “here it is.”  Our approach to this has been to allow a little bit of extra information in the service record, for instance using the ‘vocabs’ property to declare the URIs of any vocabularies the service uses.  To find out more information about a service, a consumer can use the value of the ‘metadata’ property to fetch a fuller service description.  The only restriction is that this service description should be marked up in RDF/XML — what better way to encode metadata?

As to what this service description metadata should contain, well that’s currently left to the provider. In an ideal world, using RDF and OWL to describe a service would mean that it is completely self-describing and that a consumer could fetch the document, fetch any referred ontologies, and figure out what it all means completely automatically. In reality, we at least need some existing vocabulary to refer to, even if the client can infer and interpret meaning using ontologies. For us, that vocabulary is a bespoke ontology which is currently being worked on, but which is good enough for our needs right now.

So I was heartened to be shown a glimpse of voID last year, which upon further reading looks to offer exactly the sort of service description vocabulary needed to complement our Bonjour SPARQL service type. Another great thing about RDF (see, I’m such a zealot) is that an RDF document can easily use multiple vocabularies, and the instance data doesn’t even necessarily need to be linked together, so the service description document can happily accommodate the use of more than one service description vocabulary or ontology without breaking anything. We’ll see what happens, but my vote would currently be for voID to be the suggested minimum requirement.
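To give a flavour of what a consumer might do with such a document, here is a minimal sketch that pulls the SPARQL endpoint and vocabulary URIs out of a small voID description. The sample document, its example.org URIs and the function name `describe` are invented for illustration. The code uses a plain XML parser and only understands this one simple RDF/XML shape, so a real client should use a proper RDF parser instead.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
VOID = "http://rdfs.org/ns/void#"

# Invented sample metadata; the example.org URIs are placeholders.
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:void="http://rdfs.org/ns/void#">
  <void:Dataset rdf:about="http://example.org/dataset">
    <void:sparqlEndpoint rdf:resource="http://example.org/sparql"/>
    <void:vocabulary rdf:resource="http://xmlns.com/foaf/0.1/"/>
  </void:Dataset>
</rdf:RDF>
"""


def describe(rdf_xml: str) -> list:
    """List each void:Dataset with its endpoints and vocabularies.

    Illustration only: handles the simple RDF/XML form used in SAMPLE,
    not every valid serialisation of the same triples.
    """
    root = ET.fromstring(rdf_xml)
    datasets = []
    for ds in root.iter(f"{{{VOID}}}Dataset"):
        def resources(prop):
            # Collect the rdf:resource values of the dataset's direct children.
            return [el.get(f"{{{RDF}}}resource")
                    for el in ds.findall(f"{{{VOID}}}{prop}")]
        datasets.append({
            "dataset": ds.get(f"{{{RDF}}}about"),
            "endpoints": resources("sparqlEndpoint"),
            "vocabularies": resources("vocabulary"),
        })
    return datasets


if __name__ == "__main__":
    print(describe(SAMPLE))
```

Elements from any other vocabulary in the same document are simply ignored here, which mirrors the point that several description vocabularies can sit side by side.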

Why?

There’s currently a list of SPARQL endpoints on the World Wide Web Consortium’s ESW wiki, along with a comment along the lines that the list will probably not get much bigger in the long run.  The comment itself is well reasoned and not necessarily meant to be a negative one, but along with similar comments from peers certainly makes us stand back and wonder at the usefulness of any kind of registry, or even a decentralized set of registries, of SPARQL services.

There’s no doubt that the project I’m working with needs this right now, so it’s certainly useful, but what about when, and if, things get much bigger?  A well known web technologist has published Bonjour service types for HTTP and HTTPS at dns-sd.org, but lately there doesn’t seem to be much take-up for this kind of facility, even if Apple’s Safari web browser has built in support with its Bonjour bookmarks feature.

On the other hand, Google does such a wonderful job, giving us a single endpoint to query the whole of the web, that it’s tempting to think that the semantic web will ultimately be sucked up in a similar fashion into a great big semantic data warehouse in the clouds, and there’ll be no need for anyone to offer up their own SPARQL endpoints.

My own expectation is that we’ll end up with something in between. On the one side there will be one or a few massive Google-like entry points which will, by spidering across the web, perhaps following the linked data trail, suck up much of the semantic web and present it to users in a simple, easy to consume and re-use entry point. But just as when you’re searching for something specific, for instance buying a book, or trawling through your bank statement, you don’t start at Google, I think there are plenty of cases where you’d want to be able to automatically discover SPARQL endpoints which you can access directly and which offer you precisely the details you need, there and then.

The semantic web is different to the (syntactic?) web, in that it makes it much easier to take information from multiple sources and join it together into something new and more useful. Imagine if everyone’s desktop were part of the semantic web, with all the information from emails, photos, music etc. offered up through SPARQL services (take a look at the Nepomuk project for an existing implementation of this). Imagining the sorts of applications we could build on top of the Semantic Desktop is an exciting prospect, but it’s not something I’d expect most people would want to make available through a single Google style data warehouse, even if there were privacy safeguards in place, in the same way that most people don’t necessarily want to share all their photos on Flickr or Picasa.

Using Bonjour to discover SPARQL services means that a user can easily create or select a list of domains where their client applications will look for published endpoints, be they public or private, and given the right credentials they will be allowed to use those services to build the sorts of applications they want, no matter how esoteric: show me a list of all the photos of my son, sorted by the number of teeth he had, using my local photos as well as the photos taken by his grandparents; give me a list of washing machines, along with prices from these suppliers, where the washing machine is no wider than 60cm and has a Which Best Buy rating over 70%.
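For the curious, the washing machine request might look something like the sketch below once an endpoint has been discovered. The `ex:` vocabulary, the endpoint URL and the helper name `query_url` are all invented for illustration, and the ratings and supplier details are left out to keep it short. The only part taken from the standard is that the SPARQL protocol accepts a query as the `query` parameter of an HTTP GET request.

```python
from urllib.parse import urlencode

# Hypothetical vocabulary and property names, for illustration only.
WASHING_MACHINES = """
PREFIX ex: <http://example.org/shop#>
SELECT ?machine ?price WHERE {
  ?machine a ex:WashingMachine ;
           ex:widthCm ?width ;
           ex:price ?price .
  FILTER (?width <= 60)
}
ORDER BY ?price
"""


def query_url(endpoint: str, query: str) -> str:
    """Build the GET URL that sends `query` to a SPARQL endpoint."""
    # Append to an existing query string if the endpoint URL already has one.
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"query": query.strip()})


if __name__ == "__main__":
    # The endpoint here stands in for one found through Bonjour discovery.
    print(query_url("http://example.org/sparql", WASHING_MACHINES))
```

A client could build one such URL per discovered supplier endpoint and merge the results itself.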

Of course, if we start making all this sort of information available through our own SPARQL services, then there are all sorts of issues around trust, privacy, provenance and accuracy, but at the very least we are in control of our own data.

Try It!

Using Bonjour to publish and discover SPARQL services is simplicity itself, and I invite you to take a look at www.floop.org.uk/eagle/discovering-sparql for some command-line examples.

While at the moment none of the SPARQL servers out there publish their details using Bonjour, I’ve created a simple application for Tomcat which can publish any of the services it runs using Bonjour, and have used it to publish Joseki based SPARQL services.  I’m also working on both a SPARQL endpoint and Bonjour publishing for Plone and Zope.  See http://www.floop.org.uk/projects for more details of these nascent projects.

What would be great is if the existing SPARQL servers and triple stores out there could add optional support for registering the details of any endpoints they make available, using Bonjour. It would certainly help the uptake of SPARQL in the projects I’m involved with.

Alex Tucker is a self-employed semantic web consultant, specialising in defence.

Jeff Pollock talks about his new book, The Semantic Web for Dummies

In my latest podcast I talk with Oracle’s Jeff Pollock about his recently published book, The Semantic Web For Dummies.

We discuss the rationale behind the book and its intended audience, before turning to consideration of the impact that Semantic Technologies are having across a range of sectors.

During the conversation, we refer to the following resources;

This conversation was recorded on Tuesday 24 March, 2009.

For other Talis podcasts in this Nodalities series, see here. To subscribe to updates from all of Talis’ podcast series, see here.