Nodalities

From Semantic Web to Web of Data

License

Creative Commons License

RDFa and Linked Data in UK government web-sites

By Mark Birbeck

| This article will feature in Nodalities Magazine, Issue 7

The UK government’s Central Office of Information had a straightforward problem to solve: how could they create a centralised web-site of information that the public could search and access, when the source of that information could be any government department database or any public sector web-site?

For example, different organisations, such as Her Majesty’s Revenue and Customs (HMRC) or the National Health Service (NHS), would each post job vacancies to their own web-sites, but there was no central site that the public could go to, to find all public sector vacancies. This would be a problem at any time, but in the midst of attempts by the government to help people through the recession, it’s crucial to ensure that the public knows what vacancies are available. It might not occur to someone looking for a job as a plumber or an electrician that they should visit the NHS or Army web-sites, so a centralised site could make a big difference.

Similarly, as in most modern democracies, government departments are constantly seeking feedback from the public and interested parties, about specific issues. But as with job vacancies, these consultations are on departmental sites, rather than being available on a central site; from the Department of Energy and Climate Change (DECC) seeking feedback on clean coal, to the Ministry of Justice (MOJ) providing an opportunity for people to comment on prisoners’ voting rights, each department manages its own publication of consultations.

Traditional solutions

Traditional answers to these problems would have been to either (a) impose on each of the departments that they should key their data directly into a new central database (which would in turn drive the central web-site), or (b) create complex communication pipelines that would allow the decentralised databases to communicate with the central system.

And either of these solutions would almost certainly have turned out to have been a non-starter.

The first solution was unlikely to ever get off the ground, because it would have required each department to replace their existing technology with something new. Even if there was agreement on what that technology should be—and that in itself could take an age to resolve—there would have been a need for new development work, retraining of users, porting data from older systems, and so on.

The second ‘traditional’ solution at least has the merit of keeping existing systems intact, but would have required additional interfaces to be created to move the data from the departmental servers to the centre; each department would have had to create an interface between their own system and the central one.

Just getting one department into a situation where they could centralise their information would have been a major undertaking—not only were there lots of departments to consider, but each department was using a different technology to publish their vacancies or consultations to the web. For example, some departments with only a small number of job vacancies would likely use static HTML pages. Other departments, perhaps with larger IT departments, might use ASP.NET or a Java-based system.

Enter RDFa

The RDFa answer to this set of problems is simple—both conceptually, and to implement.

RDFa allows HTML publishers to embed RDF into their pages, thereby using the existing HTTP and HTML infrastructure to publish their information. This simple method of publishing data in turn means that any system can import this data, just by obtaining (or creating) an RDFa parser.

In short, each department can keep their own data management system, and simply add code to their existing web-page publishing step to augment the HTML with the data as RDFa. The central system in turn only needs one import mechanism—something that understands RDFa.
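
To make the "one import mechanism" idea concrete, here is a minimal sketch of the consuming side. It is not a conforming RDFa parser: it assumes only the `about`, `property` and `content` attributes are used, ignores prefix declarations and nesting rules, and the page fragment and the `vacancy:` vocabulary are invented for illustration rather than taken from the COI's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment. The markup and the "vacancy:" vocabulary are
# illustrative assumptions, not the schema used by any real department.
PAGE = """
<div about="http://example.org/vacancies/123">
  <h2 property="dc:title">Electrician</h2>
  <span property="vacancy:employer">NHS</span>
  <meta property="vacancy:closes" content="2009-09-30" />
</div>
"""


class TinyRDFaExtractor(HTMLParser):
    """Collects (subject, property, value) triples from a small RDFa subset."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subject = None   # most recent value of an "about" attribute
        self._pending = None   # (tag, property) whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self._subject = attrs["about"]
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # The value is given explicitly, so emit the triple right away.
            self.triples.append((self._subject, prop, attrs["content"]))
        else:
            # The value is the element's text, so wait for the end tag.
            self._pending = (tag, prop)
            self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._pending is not None and tag == self._pending[0]:
            value = "".join(self._text).strip()
            self.triples.append((self._subject, self._pending[1], value))
            self._pending = None


def extract_triples(html):
    parser = TinyRDFaExtractor()
    parser.feed(html)
    parser.close()
    return parser.triples


for triple in extract_triples(PAGE):
    print(triple)
```

The same function works whether the page came from static HTML, ASP.NET or a Java system, which is the point: the central site only ever sees HTML with attributes.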

Adding this facility to an individual department’s publishing system proved to be very quick and straightforward. But it’s not just UK government departments that are finding it straightforward to add RDFa to their pages. It was interesting to hear at SemTech in June that Google’s rich snippet launch partners (such as Yelp) were able to add RDFa support in “roughly a day”.

RDF publishing techniques

Adding data to web-pages might seem quite an obvious technique, but there are two important things to note here.

First, the COI has to be commended for having the vision to publish RDF at all. Of course, now that Gordon Brown has asked for Sir Tim Berners-Lee’s help in making government data publicly available, it seems pretty obvious—indeed it may even become fashionable! But the COI were planning this project at least a year ago, and at that time RDF was by no means a done deal (and you could say it’s still not).

But the second important thing is that even after deciding to publish RDF, it’s still not immediately obvious that the solution should involve RDFa, especially not a year ago.

The usual means of publishing RDF is to provide a distinct source of data in the form of RDF/XML (and perhaps other formats, too, such as N3). If there is an HTML version it usually exists for the purpose of describing the data itself. In other words, the RDF/XML format is primary, which means that anyone who is publishing HTML pages but wants to publish RDF as well, will need to add an extra piece of infrastructure that exists alongside their web-pages.

RDFa turns this on its head, and says that the HTML page is the data. One and the same page can be read as an HTML page, or as an RDF page, which in turn means that the changes required to the existing publication system are minimal. The COI once again showed its far-sightedness by adopting this technique.

Turtles all the way down

But the benefits of RDFa don’t just stop there. Firstly, because the data is being published via HTTP and HTML, it’s possible for anyone to read the same data, not just the centralised web-site that was being planned. This means that third party job vacancy sites, for example, could import vacancies from relevant departments, to add to their databases. In fact, one of the main drivers for the consultations project was to try to help improve the accuracy of an already existing web-site (set up by a member of the public) that used ‘screen-scraping’ to try to keep up with the available consultations; RDFa provides much more accurate information.

In addition, the centralised web-site will not only import RDFa but publish it too. This means that third-party servers are also able to import some or all of the centralised data into their own sites.

And thirdly, by using RDFa the sites could provide information to search applications such as SearchMonkey.

As more servers both consume RDFa from one set of servers, and publish RDFa again to a variety of other servers, we enter the exciting world of Linked Data, and it’s ‘turtles all the way down’.
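
The consume-then-republish loop described above can be sketched in a few lines. This is an illustration under assumptions: the triples are simple (subject, property, value) tuples, the department feeds and URIs are made up, and the output uses only the `about` and `property` attributes rather than the full RDFa syntax.

```python
from collections import defaultdict
from html import escape

# Hypothetical triples already imported from two departmental sites.
DECC = [("http://example.org/decc/consultations/1", "dc:title", "Clean coal")]
MOJ = [("http://example.org/moj/consultations/7", "dc:title", "Voting rights")]


def group_by_subject(triples):
    """Group (property, value) pairs under the resource they describe."""
    grouped = defaultdict(list)
    for subject, prop, value in triples:
        grouped[subject].append((prop, value))
    return grouped


def render_rdfa(triples):
    """Re-publish imported triples as an HTML fragment carrying RDFa attributes."""
    parts = []
    for subject, props in group_by_subject(triples).items():
        parts.append(f'<div about="{escape(subject)}">')
        for prop, value in props:
            parts.append(f'  <span property="{escape(prop)}">{escape(value)}</span>')
        parts.append("</div>")
    return "\n".join(parts)


print(render_rdfa(DECC + MOJ))
```

Because the output is itself HTML with embedded data, a third-party site could feed it straight back into its own importer.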

Conclusion

By using RDFa to address the challenge of making distributed data available in one place, the COI avoided having to make changes to each department’s systems. But once each department is publishing RDFa, it becomes possible for third parties to consume that information however they see fit. Such a flexible architecture is crucial in the age of open government, and is a cornerstone of linked open data.

Mark is managing director of Backplane Ltd. (http://webBackplane.com/), a London-based company involved in a number of RDFa/linked data projects for UK government departments. He is the original proposer of RDFa.

Linking Data and Semantics at O’Reilly

By Gavin Carothers and Charles Greer

|This article features in Nodalities Magazine, Issue 6

O’Reilly Media lives on the cutting edge. We coined terms such as Web 2.0, created the first commercial website in 1993, and exist to “spread the knowledge of innovators.” With our evangelists, conference presenters, authors, and bloggers all communicating and catalyzing new ideas, many believe that O’Reilly must be just as technologically innovative in our own operations. However, O’Reilly employs about 200 people but only half a dozen developers, so naturally ideas are thrown at our developers faster than it is possible to implement them. We’ve been known to refer to this tension between our public position on the cutting edge and internal expectation to live up to what we preach as “gaping wound tech.” Any time someone had a new idea or a new product to launch that didn’t quite fit into existing systems, we found some way to shoehorn it in, with a quick Perl script or some clever custom SQL. As we did this, more and more of our work became preventing our systems from collapsing under the weight of those one-off ETLs and scripts. The cost of simply keeping track of which scripts were using what bit of transformed data and where that data came from had become so high as to be unsustainable. We’d accrued so much design debt that only the most radical of approaches could save us from being crushed by the weight of our inherited code.

Of course, we didn’t really know that at the time. Today we have a Linked Data, Semantic, RESTful, URI-based, highly buzz-wordy solution mostly by accident and through ruthless pragmatism. Instead of embracing the ideas of the Semantic Web at the outset, we arrived at the Semantic Web because it was the only solution. We thought we were traveling down two completely unrelated roads. We started down the first while trying to replace a Java Bean Shell script that copied book content to a few different places. The other road began when we wanted to know what color to make the border of a PDF. The first would lead to an Atom Publishing Protocol server and clients, the second to our modeling all product metadata in RDF and opening that to the public.

As it turns out, the two roads weren’t so unrelated after all. RDF is designed to handle modeling information in a distributed manner and provides the underpinnings for the actual metadata we store, aggregate, and use. AtomPub’s RESTful interface is ideally designed for managing individual chunks of all this distributed data over time and provides programs and people a simple, standard interface for publishing, accessing, and updating it. As we progressed down each path, we were making (often unknowingly) major progress in generating linked data and semantics, the two pillars of the Semantic Web.

The RESTful Road

In 2005, soon after O’Reilly launched a custom book publishing platform, we discovered that we’d deferred a hard question. We didn’t know how to make sure that we could easily add new books as they came down the production pipeline. The canonical representation of nearly all O’Reilly titles is DocBook files. Historically, these DocBook files were scattered across many filesystems, transformed by people using one-off scripts, and arbitrarily transmitted using FTP to other filesystems. We simply didn’t have a way of addressing fundamental questions like “Where is the latest, cleanest copy of a book’s markup?” Tracking down the best representation of a book’s content was a laborious, error-prone task.
Around the same time we ran into this, we noticed Tim Bray’s superb presentations about the then-draft Atom Publishing Protocol. The architecture proposed by RESTful advocates like Bray and embodied by what would become RFC 5023 gave us the ability to store an atomic chunk of data, assign it a URI and access and update it through a standard interface.

  • A book’s “source code”, the DocBook markup
  • The print book, as an ISBN
  • The table of contents
  • An HTML, PDF or other representation generated from the source
  • Whatever Tim O’Reilly or the business folks asked for next

O’Reilly’s SafariU was a business venture that implemented these kinds of transformations of content, but didn’t expose anything but its own web browser interface. When considering how to leverage SafariU’s technologies in the business as a whole, we arrived at this:

This atom:entry is the “latest, cleanest copy of a book’s markup” and its URI is the canonical location for this content. Additionally, the entry provides different views of the content using 17 distinct <link/> elements. We had embraced the linked data idea: Noun = URI. Around the same time, we realized that while we needed a way to address various available formats of content, we also required a place to store and maintain our digital assets. By implementing the Atom Publishing Protocol we established a generic way to maintain our assets, as Nouns, over time. Now that systems could reliably find and update our content using URIs, it became painfully apparent that we still had a major uphill battle: how to do the same thing for product metadata?
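
For readers who have not seen an Atom entry used this way, the following sketch builds one with a handful of link elements. It is an illustration only and not a reconstruction of our actual entry: the URIs, title, timestamp and media types are invented, and only three alternate views are shown rather than seventeen.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("atom", ATOM_NS)


def atom(tag):
    """Return the namespace-qualified name of an Atom element."""
    return f"{{{ATOM_NS}}}{tag}"


def build_entry(entry_uri, title, updated, views):
    """Build an atom:entry whose links point at alternate views of one noun."""
    entry = ET.Element(atom("entry"))
    ET.SubElement(entry, atom("id")).text = entry_uri
    ET.SubElement(entry, atom("title")).text = title
    ET.SubElement(entry, atom("updated")).text = updated
    # The "edit" link is where an AtomPub client sends updates.
    ET.SubElement(entry, atom("link"), rel="edit", href=entry_uri)
    for media_type, href in views:
        ET.SubElement(entry, atom("link"), rel="alternate", type=media_type, href=href)
    return entry


# All values below are hypothetical placeholders.
BASE = "http://example.org/books/9780596529260"
entry = build_entry(
    BASE,
    "Example Book",
    "2009-07-01T00:00:00Z",
    [
        ("application/xml", BASE + "/docbook"),
        ("text/html", BASE + "/html"),
        ("application/pdf", BASE + "/pdf"),
    ],
)
print(ET.tostring(entry, encoding="unicode"))
```

Each view gets its own URI, while the entry URI stays the one canonical place to find and update the content.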

A similar problem existed when dealing with metadata. Distinct applications were completely unintegrated and focused only on the browser and human users. They provided no visibility into their data for other systems.

rdf:isNeat

“Can our PDFs have the same branding and colors as the printed books?” —Marketing Person
“Sure! How hard can it be?” —Innocent Developer

At this point in our journey we have more than 900 titles in the AtomPub repository and addressable by URI. We’ve (unknowingly) hit a significant Linked Data milestone and everything is progressing well. Dynamically creating a PDF from these entries is as easy as running our DocBook-XSL customization for the correct series to produce XSL-FO and then rendering that XSL-FO into PDF. The only problem was discovering which series (In a Nutshell, Animal Guide, Missing Manual) the content fell under. At that point all progress stopped.
Our definitive source of book and product information is the Product Database (67,000+ lines of Perl, C++, SQL, and a dozen other languages). The database and web application have their own home-rolled “XML Format,” as I’m sure many other companies have had. Based directly on the column names from the SQL database, our Book XML was a quick and very dirty way of getting our centralized relational data out into the world as XML. A host of new client applications grew around this new access to product data, but we quickly saw the problems of reusing an ad hoc, undefined, schema-generated format. The XML service was also incredibly slow.

<IPFamily>

<Book>
<product_id>5549</product_id>
<parent_product_id>6380</parent_product_id>
<imprint_id>1</imprint_id>
<product_status_id>5</product_status_id>
<product_type_id>10</product_type_id>
<isbn>0596515618</isbn>
<isbn13>9780596515614</isbn13>

<final_date>2003-07-02</final_date> <!-- Actually the day the last QC phase ended -->

...


As you can see from the snippet above, clients had to deal with knowing exactly what imprint 1 (O’Reilly Media, Inc.) and product type 10 (PDF) meant. Each client kept mappings of these magic values in order to make the data understandable. Those mappings broke, of course, whenever new product types and imprints were added. Even more dangerously, because the semantics of the XML were totally unspecified, element names were opaque and sometimes actively misleading. We might have redesigned the format to include more data and added more and more fields to it, but this wasn’t an explicitly designed schema, just something generated from the SQL. On the road to exposing this data more cleanly we tried everything. Remodeling the SQL to be more relational didn’t offer much benefit and we still couldn’t tell what the column names meant. Sitting down and trying to write up a data dictionary was a great exercise, but it went out of date almost immediately. We experimented with JSON-based CouchDB prototypes, but those had the same issue as the SQL with missing meaning. Our Subversion repository is littered with RELAX NG, XML Schema, and Schematron documents, each an attempt to create a new XML-based format. Somehow they never got finished as we discovered we either had to define everything or try to design for extensibility. We knew we didn’t have the time to create our own Book Metadata Standard. We wanted defined semantics.
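
To show why the magic values hurt, here is a sketch of what a typical client had to do. The two lookup tables contain only the values named above; the function name, the trimmed-down Book fragment and the product type 11 are hypothetical, added purely for illustration.

```python
import xml.etree.ElementTree as ET

# Every client carried its own copy of tables like these.
IMPRINTS = {"1": "O'Reilly Media, Inc."}
PRODUCT_TYPES = {"10": "PDF"}

# A cut-down, hypothetical Book record using field names from the snippet.
KNOWN_BOOK = """<Book>
  <imprint_id>1</imprint_id>
  <product_type_id>10</product_type_id>
  <isbn13>9780596515614</isbn13>
</Book>"""

# The same record with a product type this client has never heard of.
NEW_BOOK = KNOWN_BOOK.replace(
    "<product_type_id>10</product_type_id>",
    "<product_type_id>11</product_type_id>",
)


def describe(book_xml):
    """Translate the numeric IDs in a Book record into readable labels."""
    book = ET.fromstring(book_xml)
    imprint_id = book.findtext("imprint_id")
    type_id = book.findtext("product_type_id")
    return {
        "isbn13": book.findtext("isbn13"),
        "imprint": IMPRINTS.get(imprint_id, f"unknown imprint {imprint_id}"),
        "product_type": PRODUCT_TYPES.get(type_id, f"unknown product type {type_id}"),
    }


print(describe(KNOWN_BOOK))
print(describe(NEW_BOOK))
```

The second call is the failure mode: the data is valid, but the client cannot say what it means until someone updates its private table.
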
There is at least one obvious XML vocabulary for a publisher looking to capture book metadata: ONIX. Unfortunately, the ONIX standard is archaic, with obscure element names like b004 (ISBN) and g343 (PrizeJury, obviously). (Footnote: Yes, these are the short versions and a longer set of names is also allowed. However, many of the most important vendors only support the short versions.) We did consider ONIX for a time, but then we noticed that every vendor we sent ONIX to treated the fields a bit differently. Even with pages and pages of specification there wasn’t any agreement on what elements were important or what they meant. Using ONIX as a format would not solve our semantic deficiency; we still wouldn’t know what the “columns” meant.
In the process of trying to create an XML format we asked a number of people in the company how to find the Publication Date for a book. The answer was surprisingly complex. The value was computed independently by each of the ETL hydras, with subtly different implementations that had evolved with particular client needs. O’Reilly isn’t a huge company with layer upon layer of bureaucracy; most questions can be quickly answered with a chat at a desk or an email to the other coast. Imagine our surprise, then, at the results of the Publication Date poll. Most people were confident that one of five dates was the right date, but disagreed on which of the five it was. Retail Availability Date, Actual In Stock Date, Estimated In Stock Date, etc. each had its backers. What was really going on was that we had discovered the subtly different needs that each business unit had. The strategy we could most easily support? Consensus on a public standard. As we’ve learned so many times, we needed to go outside the company to find the correct solution. Public standards, specifications, and ontologies could save us from ourselves.
Enter: Dublin Core. We couldn’t define our own format or use the industry standard (ONIX), nor could we agree on what a publication date was. Our only choice was to go borrow/steal some other group’s ideas. It turns out that our problems had already been solved by the library community. The Dublin Core Metadata Initiative created standards, guidelines, and examples for storing and sharing basic, essential metadata. We had a way out: here was a group of people who’d already done a great deal of thinking for us.
Of course, they hadn’t done all our thinking for us. Mapping all of our old data into well-designed and well-documented Dublin Core, MARC Relators, FOAF, or any other ontology was going to be hard. So we didn’t do it. Instead we mapped the whole of our old, horrible, ugly mess into an undefined ontology called the “Product Database Legacy Ontology.” We then moved some of the more obvious items like title and author into Dublin Core and waited. Only once we had a proven need for a new data point in a real application would we go through the process of researching, defining, cleaning, and moving it into a modern, public ontology. For those following along closely: no, trim color isn’t yet in the public or internal metadata. As it turns out, no one really wanted it. At least, not yet.
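
The "move it only when someone needs it" approach can be sketched as a predicate rewrite over triples. This is a simplified illustration: the legacy namespace URI, the legacy property names and the sample values are assumptions, while the two Dublin Core terms are the standard `title` and `creator` properties.

```python
# Hypothetical namespace standing in for the Product Database Legacy Ontology.
LEGACY = "http://example.org/pdb-legacy#"
DCTERMS = "http://purl.org/dc/terms/"

# Only properties with a proven need have been researched and moved so far.
MIGRATED = {
    LEGACY + "title": DCTERMS + "title",
    LEGACY + "author": DCTERMS + "creator",
}


def migrate(triples):
    """Rewrite migrated predicates and leave every other legacy predicate alone."""
    return [(s, MIGRATED.get(p, p), o) for s, p, o in triples]


BOOK = "urn:x-domain:oreilly.com:product:9780596529260.BOOK"
legacy_triples = [
    (BOOK, LEGACY + "title", "An Example Title"),   # sample value, not real data
    (BOOK, LEGACY + "author", "A. N. Author"),      # sample value, not real data
    (BOOK, LEGACY + "final_date", "2003-07-02"),    # still undefined, so untouched
]

for triple in migrate(legacy_triples):
    print(triple)
```

Adding a newly understood property later is a one-line change to the mapping, and nothing that still reads the legacy predicates has to change in the meantime.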

All Together Now

Since Gavin’s first frenzied port of product metadata to an RDF model, we’ve been able to negotiate changing requirements, establish data validation and control rules, and bring on new applications with little time spent on data modeling. In other words, meeting our immediate need of a centralized, validated data store of high agility and performance has paid off several times over in deploying new software systems for the rapidly changing company.
One example of the intertwining of Linked Data and Semantics is our Electronic Media distribution system, which lets customers download ebooks, PDFs, videos and the like. Book descriptions, titles, authors’ names, cover images, even the help text provided on the Electronic Media page: all of it is simply linked data, built from RDF relationships. When we want to change the help text or a category label, we change it in one document, and everything else in the RDF graph referencing it changes within moments as well. Just following links pays off.
Previously, the buttons that let a customer add a book to our shopping cart were generated by a system that used nightly ETLs nicknamed “the sync”. So new products would have to be prepped for release the night before. We gave special care to their timely appearance in the morning. Alas, they frequently did not appear as hoped, as the ETLs that made up “the sync” had to run on a very precise nightly schedule or we had to take manual corrective action. Now, a reasonably simple HTML template bound to the RDF for a book generates “Buy Buttons” in near realtime without an ETL in sight.
The greatest challenge of updating our legacy IT infrastructure hasn’t been replacing the ETLs or synchronization. It’s been achieving consensus on the meaning of data elements. In the past, data maintainers might adjust the title of a book to change how retailers present it. Then our website’s title would change (the next day), and we would have to bring resources to bear on reconciling the meaning of “title.” By using Dublin Core for our title element, we’ve established what to expect from those who change the value. It’s simpler to make sure people enter particular kinds of data, and then ask for help to extend or change requirements for downstream apps. The publicly available ontologies, we hope, will help everyone communicate more effectively about business needs and shared data points. So far the results are encouraging.

In the Public Eye

Having built several of our own applications using our new RDF metadata and our initial linked data APIs, we thought it might be a good idea to let someone else have a crack at it too and see what they made of it. It took us two weeks to develop the O’Reilly Product Metadata Interface, a simple layer on top of the Deli. A caching proxy preserves the reliability needed by our own applications, while a predicate filter prevents private information from leaking to the public. A bit more about how you can access it can be found at http://labs.oreilly.com/opmi.html or you can just dive right in by giving it an ISBN, for example: http://opmi.oreilly.com/product/9780596529260.
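
A predicate filter of the kind mentioned above can be very small. The sketch below is an assumption about the general shape, not the OPMI implementation: it keeps an allow-list of public predicates and drops every triple whose predicate is not on it. The private predicate and the sample values are invented, and the caching proxy is not shown.

```python
DCTERMS = "http://purl.org/dc/terms/"
INTERNAL = "http://example.org/internal#"   # hypothetical private namespace

# Only predicates on this allow-list are ever shown to the public.
PUBLIC_PREDICATES = frozenset({DCTERMS + "title", DCTERMS + "creator"})


def public_view(triples, allowed=PUBLIC_PREDICATES):
    """Return only the triples whose predicate is explicitly marked public."""
    return [triple for triple in triples if triple[1] in allowed]


BOOK = "http://example.org/product/9780596529260"
all_triples = [
    (BOOK, DCTERMS + "title", "An Example Title"),   # sample value
    (BOOK, INTERNAL + "unit_cost", "4.20"),          # sample private value
]

print(public_view(all_triples))
```

An allow-list fails safe: a predicate added to the internal store tomorrow stays private until someone decides otherwise.
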
Sharing our work with the public forced us to be much more deliberate and rigorous about our data, but also exposed some simple blunders. On the day we launched the service we waited for the praise to come in and finally saw a tweet! Someone is using… Oh wait:

“OPMI’s book identifiers aren’t resolvable. Sigh.” —Jeni Tennison

“Of course they’re resolvable,” we thought. “You just have to parse the URN and understand how to pass the URN to… oh, yeah good point.” In the process of implementation, we’d forgotten Tim Berners-Lee’s second rule of Linked Data:

2. Use HTTP URIs so that people can look up those names.

At the start of the process we’d talked about using some sort of identifier for our products. But that conversation had taken place before we really had all the RDF and Linked Data applications working, so at the time there wasn’t any point nor could anyone see the need for a resolvable identifier. Within a few hours of making the data public, the need became blindingly apparent. Part of embracing “anyone can say anything about anything” is that anyone needs to be able to find the anything they want to talk about. And when you’ve got a statement to make, it’s remarkably handy to be able to quickly find out what else has been said. “I loved urn:x-domain:oreilly.com:product:9780596529260.BOOK” is a bit hard to figure out. “I hated http://purl.oreilly.com/product/9780596529260.BOOK” is a lot better.
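
The difference between the two identifiers is mechanical, as the sketch below suggests. The rewrite rule is inferred from the two examples in the paragraph above and is an assumption; it is not a description of how the production system performs the mapping.

```python
URN_PREFIX = "urn:x-domain:oreilly.com:product:"
HTTP_PREFIX = "http://purl.oreilly.com/product/"


def urn_to_http(identifier):
    """Turn an opaque product URN into an HTTP URI that a client can look up."""
    if not identifier.startswith(URN_PREFIX):
        raise ValueError(f"not a product URN: {identifier!r}")
    return HTTP_PREFIX + identifier[len(URN_PREFIX):]


print(urn_to_http("urn:x-domain:oreilly.com:product:9780596529260.BOOK"))
```

The URN form needs out-of-band knowledge like this function before anyone can follow it; the HTTP form needs nothing but a web client.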

Streams, Pools and Reservoirs

By Leigh Dodds
| This article features in Nodalities Magazine, Issue 6

As we start to move past the current boot-strapping phase of the semantic web in which we are constructing the web of linked data, it’s useful to begin discussing what other features and infrastructure we need in order to support sustainable usage of this huge and growing data set: what services can be offered over linked data? Do we need to consider how to provide quality of service, stability and longevity to the data, or does the sheer scale of the web make these moot points?

In order to answer these questions it’s useful to compare the ongoing development of the linked data web with that of the web itself.

A Brief History Lesson

There have been several phases of activity in the development of the web. While in truth these phases were of different durations, overlapped with one another, and happened at different rates within different communities, we have essentially gone through the following basic steps.

Firstly we concentrated on just getting stuff online. The early web was a new medium for document and data exchange and so was at its core a simple publishing device used as a collaborative space between small communities. But as the amount of content and the size and breadth of those communities grew, the emphasis shifted towards linking: tying content together to create (initially hand-crafted) indexes of the web and knit the available content into a greater whole.

The second, manual linking phase was quickly supplemented by a third phase of automated linking between content: search engines. A search engine is simply a way to quickly create a link-base based on some search criteria. The crawling and indexing of the document web by web crawlers allows users to quickly construct links to content of potential interest.

The fourth phase of the web’s development has been triggered by the commoditisation of search and the need for search engines to differentiate themselves and offer additional value-added services. Search engines now tailor features towards particular uses or types of content (Google Image Search; Google Scholar); offer value-added features that capitalise on their ability to analyse the structure and traffic flows across the web (PageRank and similar indexing improvements; Google Trends); expand the audience for content (Google Translate); and enable community-driven customisation of the search experience (Google Custom Search; Yahoo SearchMonkey, etc.).

No doubt there will be subsequent phases of development, and the perspective of history will let us tease out common strands of development, some of which will already be happening. But if we look at the recent, rapid development of the linked data cloud, we can already see that the same pattern is being repeated.

History Recapitulated

There has been RDF data available on the web for many years, used by a limited community of researchers. This slow accumulation of content – echoing the first phase of content publishing on the document web – has been replaced by a rapid increase in data publishing encouraged through the Linking Open Data (LOD) project. By providing clear pragmatic guidance and instructions on how to publish data for the semantic web, that project has enabled us to accelerate our transition through that first content publishing phase. But it has also, crucially, encouraged the linking together of data sets (Phase 2).

This linking has to a great extent been manual. Not in the sense that members of the LOD community are manually entering data to link datasets together, but rather at the level of looking for opportunities to link together datasets, encouraging data publishers to co-ordinate and inter-relate their data, and attempting to organically grow the linked data web by targeting datasets that would usefully annotate or extend the current Linked Data Cloud.

The rapid growth of the Linked Data Cloud means that this “manual” phase will soon be over: there will be sufficient momentum behind the semantic web that increasing amounts of data will become available and no single community will be able (or need) to shepherd its development. The focus will shift towards the subject specific communities who will instead co-ordinate at a more local level. Semantic web search engines will also become a reality.

Semantic Web search engines need to be distinguished from semantically enabled search engines. The latter use techniques like natural language parsing and improved understanding of document semantics in order to provide an improved search experience for humans. A Semantic Web search engine should offer infrastructure for machines. This Third Phase is also beginning to take place. Simple semantic web search engines like Swoogle and Sindice provide a way for machines to construct link bases, based on some simple expressions of what data is of relevance, in order to find data that is of interest to a particular user, community, or within the context of a particular application. And crucially this can be done without having to always crawl or navigate over the entire linked data web. This process can be commoditised just as it has been with the web of documents.

Co-Evolution of the Web Infrastructure

Given the strong concordance between the phases of development of the document and linked data web, it is reasonable to make some predictions on how semantic web search engines, and additional supporting infrastructure, are likely to evolve by comparing them with the development of human search engines. For each of the specialisations and value-added features listed earlier it’s possible to see an equivalent for the machine-readable web:

Document Web | Semantic Web Infrastructure | Description
Google Image Search | Type Searching | Ability to discover resources of a particular type, e.g. Person, Review, Book
Google Translate | Vocabulary Normalisation | Application of simple inferencing to expose data in more vocabularies than made available by the publisher
Google Custom Search | Community Constructed Data Sets and Indexes | Ability to create and manipulate custom subsets of the linked data cloud
Google Trends | Linked Data Analysis & Publishing Trends | Identifying new data sources; new vocabularies; clusters of data; data analysis

These last two are particularly interesting as they suggest the need to be able to easily aggregate, combine and analyse aspects of the linked data cloud. This infrastructure will need to be able to support the community in working with data in a variety of ways, allowing data to flow and be collected where it is needed. Introducing a metaphor for this process might help highlight some of the processes and their consequences.
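
The first two rows of the table above, type searching and vocabulary normalisation, are simple enough to sketch over an in-memory set of triples. This is an illustration under assumptions: a real service would run over an index rather than a Python set, the sample resources are invented, and the single sub-property rule shown is just one example of "simple inferencing".

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
DCTERMS_TITLE = "http://purl.org/dc/terms/title"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"

# Illustrative rule table: each key is treated as a sub-property of its value.
SUB_PROPERTY_OF = {DCTERMS_TITLE: DC_TITLE}

# Hypothetical sample data.
TRIPLES = {
    ("http://example.org/people/alice", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/books/1", RDF_TYPE, "http://example.org/vocab#Book"),
    ("http://example.org/books/1", DCTERMS_TITLE, "An Example Title"),
}


def find_by_type(triples, type_uri):
    """Type searching: every resource declared to be of the given type."""
    return sorted({s for s, p, o in triples if p == RDF_TYPE and o == type_uri})


def normalise(triples, sub_property_of):
    """Vocabulary normalisation: also state each fact using the broader property."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in sub_property_of:
            inferred.add((s, sub_property_of[p], o))
    return inferred


print(find_by_type(TRIPLES, FOAF_PERSON))
print(len(normalise(TRIPLES, SUB_PROPERTY_OF)))
```

After normalisation, a consumer that only understands the older title property can still find the book, even though the publisher never used that property.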

Flowing Data

Data is like water and flows of data are like streams. These streams of data can arise from any number of different sources: from a person entering data into a system; from a click stream generated as a side-effect of web browsing; application events; or generated from real-world sensor measurements. There are already many ways that we can tap into these data streams, using web-based query APIs, messaging systems like XMPP, or syndication protocols like Atom and RSS.

While these streams of data are already supporting a huge range of different applications and use cases, they are inherently limited: a stream has no memory. If historical context is required, e.g. to support more complex querying and reporting, then each consuming application must collect and store the data. We can think of these collections of data as pools; each stream of data on the web may feed any number of different application-specific pools.
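
The distinction between a stream and a pool maps neatly onto a generator and a collection. The sketch below is only an analogy in code, with invented sensor readings: once the generator has been consumed its values are gone, so any question about history has to be answered from the pool that collected them.

```python
from collections import deque


def sensor_stream(readings):
    """A stream: yields each reading once and keeps no history."""
    for reading in readings:
        yield reading


class Pool:
    """A pool: an application-specific store fed by a stream."""

    def __init__(self, maxlen=None):
        self._items = deque(maxlen=maxlen)

    def collect(self, stream):
        for item in stream:
            self._items.append(item)

    def average(self):
        return sum(self._items) / len(self._items)


stream = sensor_stream([1, 2, 3])   # hypothetical readings
pool = Pool()
pool.collect(stream)

print(pool.average())   # historical query, answered from the pool
print(list(stream))     # the stream itself has nothing left to give
```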

A pool of data provides extra flexibility, but comes at the cost of requiring each consuming application to maintain its own infrastructure to hold copies of that data. Even if each source of data provides direct access to its own pool, e.g. by exposing a web-based query interface onto its database, or by exposing linked data, there are still unnecessary overheads. Each data provider must provide their own scalable infrastructure and support a rich set of data access options.

If we start building large pools of data, within a community supported infrastructure, then we have a reservoir. A reservoir is a pool of data that is maintained by and services a specific community. Reservoirs allow issues such as quality of service (reliable supply of water) and infrastructure costs (building of pipelines) to be solved at a community level.

It’s possible to argue that the web already consists of streams, pools, and reservoirs, but there is a distinct difference between a web based on semantic web technology and a web constructed of a mixture of XML documents or similar formats: like water, at the molecular level, all RDF is the same; it’s all triples. Unlike alternatives, RDF data is more easily pooled and collected and so is much more amenable to explorations of shared infrastructure. Like a relational database, an RDF triple-store can contain a huge variety of different kinds of data. But unlike a relational database, an RDF triple-store has the potential for the aggregate to be much more than the sum of its parts. The seeds of convergence are built in, through reliance at the most fundamental level on a global naming system (URIs) and standardised ways to state equivalence and relationships between resources.
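To make the "it's all triples" point concrete, here is a minimal sketch in Python. It assumes triples are modelled as plain (subject, predicate, object) tuples, ignores blank nodes, and uses hypothetical example.org resource URIs; it is an illustration rather than how any particular triple-store works.

```python
# Two independently maintained pools of triples. The example.org URIs are
# hypothetical; the predicates are Dublin Core terms.
BOOK = "http://example.org/book/1"

pool_a = {
    (BOOK, "http://purl.org/dc/terms/title", "Weaving the Web"),
}
pool_b = {
    (BOOK, "http://purl.org/dc/terms/title", "Weaving the Web"),
    (BOOK, "http://purl.org/dc/terms/creator", "http://example.org/person/tbl"),
}

# Because both pools use the same global names (URIs), pooling them is just a
# set union: duplicate statements collapse and new ones accumulate.
reservoir = pool_a | pool_b


def describe(triples, subject):
    """Return every (predicate, object) pair known about a subject."""
    return {(p, o) for s, p, o in triples if s == subject}
```

No schema mapping is needed before the merge; describe(reservoir, BOOK) returns statements contributed by both pools.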

In the real world, reservoirs do more than supply a community with water. The aggregate has its own uses: water skiing or hydro-electric power generation for example. And the same will be true of semantic web data reservoirs: large collections of data can be analysed and re-purposed in ways that are not possible – or at least not achievable without a great deal of repeated, redundant integration effort – using other techniques. The reservoir itself can be the source of new facts and new streams of data derived from analysis of its contents.

Flowing Data through the Talis Platform

The goal of the Talis Platform is to support the growth of the Linked Data ecosystem by providing the infrastructure to support the creation of pools of data. For additional background, see my article “Enabling the Linked Data Ecosystem” from Nodalities issue 5.

At present the Platform provides a range of services that allow data to be easily streamed into and out of Platform stores, allowing data to be easily pooled in order to benefit from greater context. Data can be pushed directly into the Platform and we are exploring methods of supporting other forms of data ingestion to make it easier and more natural to begin to accumulate data sets within the Platform.

The core search service, which produces its results in RSS, allows the creation of simple data streams, while the SPARQL interface supports more complex data extraction methods. The Augmentation service provides an interesting twist on these conventional approaches, providing a means for any RSS 1.0 feed to be automatically enriched with extra metadata by feeding it through a Platform data store. This means of interaction is like fishing for data: it is possible to serendipitously find and extract data, capturing it as extra context to items in an RSS feed, without having to deal with writing SPARQL queries or constructing a keyword search. There are many more methods and modes of data extraction that will be added to the Platform to add to these existing services; this is just the beginning.

But the Talis Platform is intended to provide much more than just the ability to work with pools of data. The bigger vision is to support the creation of true data reservoirs, and enable many different ways of manipulating and analysing their contents in order to discover new facts and bring new context to that data. Creation of these larger pools of content will need to be made sustainable for the communities that are creating them, and deriving value from them. Sustainability covers a wide range of issues that go beyond just commercial issues: quality and range of services are additional factors, as are forms of governance, trust and quality that relate to the data sets themselves. The Platform is intended to address all of these issues.

To take a small example, the experimental “store groups” feature that was released at the end of last year provides a simple method for combining datasets, without requiring that data to be completely loaded or copied into a single database. The store groups feature will ultimately support a range of services over the constituent data sets, allowing each pool of data to remain intact whilst still contributing to the whole; this will be important to support the new forms of governance that are beginning to emerge around datasets on the Linked Data web.

Discovering SPARQL

By Alex Tucker

| This article will feature in Nodalities Magazine, Issue 6

Confessions of a Semantic Web Junkie

Let me start with a confession: I’ve been banging on about the semantic web to anyone who will listen to me for the past nine years.  For some apparently deep seated reason, which some have even labelled perverse, I’ve kept at it despite endless meetings, misunderstandings, off-handed dismissals and blatant refusals to accept what seems to me to be obvious.  It’s been a long and sometimes despairing nine years, but looking around now at the state of all things semantic web, one can’t fail to realize that it’s finally taking off and that companies like Talis are fully committed to its success.

During much of this time I’ve been working for various defence organisations showing how using semantic web standards and tools can help solve issues of interoperability — getting different systems to talk to one another.  That the semantic web, and more specifically the semantic web ontology language OWL, can address these kinds of issues shouldn’t be too surprising, given that the US Defense Advanced Research Projects Agency (DARPA) helped fund the process which eventually led to OWL, precisely to solve issues of semantic interoperability.

More recently, I’ve had the pleasure of working with various teams and projects within NATO who have embraced the semantic web ideas and expanded on them in novel ways.  A fundamental shift in attitude in defence, which has in part been driven by the 9/11 Commission’s findings, is that intelligence agencies must move from a ‘need to know’ to a ‘need to share’ mentality.  For me, this shift has echoes in the ‘linked open data’ ideals, and for someone who also bangs on about open source, open standards and all things open, it’s a step in the right direction.  Obviously, sharing information in a military context is a little different to the ideals of linked open data, but the fundamental issues around getting the information out of existing silos and usable by other systems are much the same.

A recent success in one such NATO project has been the decision to move from using a bespoke query protocol for RDF, a sort of ‘query by example’, to using the World Wide Web Consortium’s new semantic web query language, SPARQL.  This might seem obvious, but a few years ago when the protocol was being developed, SPARQL didn’t exist — therein lies another discussion on how organisations need to be more agile to be able to cope with the length of a ‘Web Year’.  As far as the project group are concerned however, this is great news, as they don’t have to come up with a ‘Standardization Agreement’ or STANAG to define their own query protocol and can instead just point to and use an existing standard, SPARQL.

Bonjour

Another standard used by this project is Bonjour (formerly Zeroconf or Rendezvous), Apple’s open standard for no-nonsense network configuration and service discovery, the technology behind iTunes’ ability to discover and play music from other iTunes applications on different computers. The great thing about using Bonjour for this project is that it’s decentralized: there’s no need to set up the registry of information providers beforehand. Instead, each provider is free to publish their existence and their capabilities, both on a local and wide area network, without setting up any new infrastructure. Each service in a local network can publish its details in any number of domains. The clouds above represent local networks, the arrows represent the act of publishing, or registering service details to a domain, the cylinders represent services, and the ‘.local’ is the domain for the local network.  iTunes, for instance, currently publishes itself only to the local area network (which is different to the initial iTunes release where people realised they could share their music with friends over the internet by advertising their iTunes database to their own domain name, how terrible).  The domains themselves correspond to the normal domain name system (DNS) names we’re used to, since DNS is really at the heart of the Bonjour protocol.

Services can then be discovered simply by looking for a particular service type in a particular domain; a typical Bonjour service discovery query would ask for all printer services in the local domain.  These service types are given slightly cryptic names (iTunes, for example, uses _daap._tcp), but are all listed on-line at dns-sd.org/ServiceTypes.html for everyone to see and re-use.

Hello SPARQL

What we’ve done then is create a service type for SPARQL, allowing information providers to publish their SPARQL ‘endpoints’ to be discovered by information consumers.  Technically, the SPARQL service type is _sparql._tcp and is listed at dns-sd.org/ServiceTypes.html along with a short description of the properties which should be published.
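As a rough illustration of what "looking for a particular service type in a particular domain" means, the sketch below builds the DNS name a client would query to enumerate services, following the DNS-SD convention of joining the service type to the domain. The helper name browse_name is hypothetical, and the sketch only constructs the name; it does not perform any network lookup.

```python
def browse_name(service_type, domain="local"):
    """Return the fully qualified name queried to enumerate service instances.

    DNS-SD clients ask for PTR records at <service type>.<domain>.
    """
    return f"{service_type.strip('.')}.{domain.strip('.')}."


# SPARQL endpoints on the local network, and iTunes shares in a wide-area domain.
local_sparql = browse_name("_sparql._tcp")
wide_area_daap = browse_name("_daap._tcp", "example.org")
```

The same function covers both local and wide-area discovery, since only the domain part changes.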

In keeping with the RDF mantra that “anyone can say anything about anything”, we’ve tried to ensure that anyone can publish a description of not only their own SPARQL services, but also those of others.  A published SPARQL service record has two properties, one called ‘path’ and one called ‘metadata’, from which a client derives two URLs, the former pointing to the SPARQL service endpoint, and the latter pointing to some arbitrary (RDF/XML encoded) metadata it can fetch. The normal approach would be to just use simple paths (e.g. /sparql and /service-metadata.rdf) for the values of these properties, which would be interpreted as pointing to the discovered host (e.g. http://dbpedia.org/sparql and http://dbpedia.org/service-metadata.rdf).  However, by using a full URL for the value of either property, a service record can be published which points to a service or some metadata on a different host entirely, allowing us to form what are essentially proxy records for existing SPARQL services which don’t (yet) use Bonjour service discovery.
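The way a client might turn a discovered record into the two URLs can be sketched as follows. This is a minimal sketch under assumptions: the function name and arguments are hypothetical, the record's properties are taken to be already parsed into a dictionary, and plain HTTP is assumed for the discovered host.

```python
from urllib.parse import urljoin


def resolve_service_urls(host, port, properties):
    """Derive (endpoint URL, metadata URL) from a discovered service record.

    Simple paths are resolved against the discovered host; full URLs are
    returned unchanged, which is what makes proxy records possible.
    """
    # Assumption: services speak plain HTTP; omit the default port for tidy URLs.
    base = f"http://{host}/" if port == 80 else f"http://{host}:{port}/"
    endpoint = urljoin(base, properties["path"])
    metadata = urljoin(base, properties["metadata"])
    return endpoint, metadata


# A record published by the host itself, using simple paths.
direct = resolve_service_urls(
    "dbpedia.org", 80,
    {"path": "/sparql", "metadata": "/service-metadata.rdf"},
)

# A proxy record: published on one host, but pointing at another entirely.
proxy = resolve_service_urls(
    "proxy.example.org", 8080,
    {"path": "http://dbpedia.org/sparql", "metadata": "/dbpedia-metadata.rdf"},
)
```

Relying on standard relative-URL resolution means the client needs no special case for proxy records: a full URL simply overrides the discovered host.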

The Bonjour specifications are firmly geared towards making the user’s life simpler, so for instance while the name of a service is really just a normal DNS name, the specifications insist that it should be a human readable name with proper capitalization, spaces etc., although no dots are allowed.  The specifications also give guidance as to the sorts of properties and values a service record should contain, and are keen that service records be as concise as possible and leave much of the nitty-gritty details of discovering service capability to the main protocol, in our case SPARQL.

Into the voID

However, the SPARQL protocol doesn’t have much to say about what to expect of an endpoint other than, “here it is.”  Our approach to this has been to allow a little bit of extra information in the service record, for instance using the ‘vocabs’ property to declare the URIs of any vocabularies the service uses.  To find out more information about a service, a consumer can use the value of the ‘metadata’ property to fetch a fuller service description.  The only restriction is that this service description should be marked up in RDF/XML — what better way to encode metadata?

As to what this service description metadata should contain, well that’s currently left to the provider.  In an ideal world, using RDF and OWL to describe a service would mean that it is completely self-describing and that a consumer could fetch the document, fetch any referred ontologies, and figure out what it all means completely automatically.  In reality, we at least need some existing vocabulary to refer to, even if the client can infer and interpret meaning using ontologies.  For us, that vocabulary is a bespoke ontology which is currently being worked on, but which is good enough for our needs right now.

So I was heartened to be shown a glimpse of voID last year, which upon further reading looks to offer exactly the sort of service description vocabulary needed to complement our Bonjour SPARQL service type.  Another great thing about RDF (see, I’m such a zealot), is that an RDF document can easily use multiple vocabularies, and the instance data doesn’t even necessarily need to be linked together, so the service description document can happily accommodate the use of  more than one service description vocabulary or ontology without breaking anything.  We’ll see what happens, but my vote would currently be for voID to be the suggested minimum requirement.

Why?

There’s currently a list of SPARQL endpoints on the World Wide Web Consortium’s ESW wiki, along with a comment along the lines that the list will probably not get much bigger in the long run.  The comment itself is well reasoned and not necessarily meant to be a negative one, but along with similar comments from peers certainly makes us stand back and wonder at the usefulness of any kind of registry, or even a decentralized set of registries, of SPARQL services.

There’s no doubt that the project I’m working with needs this right now, so it’s certainly useful, but what about when, and if, things get much bigger?  A well known web technologist has published Bonjour service types for HTTP and HTTPS at dns-sd.org, but lately there doesn’t seem to be much take-up for this kind of facility, even if Apple’s Safari web browser has built in support with its Bonjour bookmarks feature.

On the other hand, Google does such a wonderful job, giving us a single endpoint to query the whole of the web, that it’s tempting to think that the semantic web will ultimately be sucked up in a similar fashion into a great big semantic data warehouse in the clouds, and there’ll be no need for anyone to offer up their own SPARQL endpoints.

My own expectation is that we’ll end up with something in between.  On the one side there will be one or a few massive Google like entry points which will, by spidering across the web, perhaps following the linked data trail, suck up much of the semantic web and present it to users in a simple, easy to consume and re-use entry point.  But just as when you’re searching for something specific, for instance buying a book, or trawling through your bank statement, you don’t start at Google, I think there are plenty of cases where you’d want to be able to automatically discover SPARQL endpoints which you can access directly and which offer you precisely the details you need, there and then.

The semantic web is different to the (syntactic?) web, in that it makes it much easier to take information from multiple sources and join it together into something new and more useful.  Imagine if everyone’s desktop were part of the semantic web, with all the information from emails, photos, music etc. offered up through SPARQL services (take a look at the Nepomuk project for an existing implementation of this).  Imagining the sorts of applications we could build on top of the Semantic Desktop is an exciting prospect, but it’s not something I’d expect most people would want to make available through a single Google style data warehouse, even if there were privacy safeguards in place, in the same way that most people don’t necessarily want to share all their photos on Flickr or Picasa.

Using Bonjour to discover SPARQL services means that a user can easily create or select a list of domains where their client applications will look for published endpoints, be they public or private, and given the right credentials they will be allowed to use those services to build the sorts of applications they want, no matter how esoteric: show me a list of all the photos of my son, sorted by the number of teeth he had, using my local photos as well as the photos taken by his grandparents; give me a list of washing machines, along with prices from these suppliers, where the washing machine is no wider than 60cm and has a Which Best Buy rating over 70%.
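To give a flavour of the washing machine request, here is one way it might be phrased in SPARQL and sent to a discovered endpoint. Everything under the ex: prefix is a hypothetical vocabulary invented for this sketch, as is the endpoint URL; the only standard part is the SPARQL protocol convention of passing the query text in a "query" parameter.

```python
from urllib.parse import urlencode

# Hypothetical vocabulary: ex:WashingMachine, ex:widthCm, ex:whichRating, etc.
WASHING_MACHINE_QUERY = """\
PREFIX ex: <http://example.org/shop#>
SELECT ?machine ?supplier ?price
WHERE {
  ?machine a ex:WashingMachine ;
           ex:widthCm ?width ;
           ex:whichRating ?rating .
  ?offer ex:product ?machine ;
         ex:supplier ?supplier ;
         ex:price ?price .
  FILTER (?width <= 60 && ?rating > 70)
}
ORDER BY ?price
"""


def sparql_get_url(endpoint, query):
    """Build a SPARQL protocol GET URL by adding the URL-encoded query parameter."""
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"query": query})


# The endpoint would normally come from service discovery; this one is made up.
url = sparql_get_url("http://example.org/sparql", WASHING_MACHINE_QUERY)
```

A client could issue the same request to each endpoint it discovers and merge the results, which is where the common triple model pays off.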

Of course, if we start making all this sort of information available through our own SPARQL services, then there are all sorts of issues around trust, privacy, provenance and accuracy, but at the very least we are in control of our own data.

Try It!

Using Bonjour to publish and discover SPARQL services is simplicity itself, and I invite you to take a look at www.floop.org.uk/eagle/discovering-sparql for some command-line examples.

While at the moment none of the SPARQL servers out there publish their details using Bonjour, I’ve created a simple application for Tomcat which can publish any of the services it runs using Bonjour, and have used it to publish Joseki based SPARQL services.  I’m also working on both a SPARQL endpoint and Bonjour publishing for Plone and Zope.  See http://www.floop.org.uk/projects for more details of these nascent projects.

What would be great is if the existing SPARQL servers and triple stores out there could add optional support for registering the details of any endpoints they make available, using Bonjour.  It would certainly help the uptake of SPARQL in the projects I’m involved with.

Alex Tucker is a self-employed semantic web consultant, specialising in defence.

A conference comes of age: a review of the 7th International Semantic Web Conference (ISWC2008)

| This post will feature in Nodalities Magazine, issue 5.

What are the factors that indicate a coming of age? An increased
self-awareness perhaps, or an acceptance and understanding of a broad
range of views, even if they contradict your own? If these factors do
indicate a certain maturity, then I would argue that the International
Semantic Web Conference series has come of age.

Last year’s event in Busan, Korea felt like a watershed moment, with
an increasing focus on practical applications that exploited Semantic
Web technologies, in addition to the highly theoretical papers
typically seen at events of this sort. This year’s conference in
Karlsruhe, Germany, and the seventh in the series overall, maintained
this momentum. But more so than previous years I detected a subtle
change in the mood of the conference. In addition to a tangible sense
of excitement that the Semantic Web was getting ready for the
mainstream, I detected a certain pluralism within the community,
manifested as a greater openness to divergent views and an increase in
attention to topics that might have previously been overlooked.

This willingness to express and accept divergent views was apparent to
me no more so than in the panel titled “An OWL too far?”. This
discussion saw senior members of the Semantic Web community openly
challenge each other’s views on the proposed second version of OWL, the
Web Ontology Language. Perhaps the views held by the likes of Stefan
Decker, Frank van Harmelen and Ian Horrocks have always been divergent
on this issue, but seeing the differences of opinion aired so openly
was a new experience for me. Far from indicating a damaging lack of
unity in the field, I read this as a clear sign that the community can
engage in open and constructive debate without throwing the toys out
of the pram.

Earlier in the week I had sat on a similarly provocatively titled
panel in the OWL Experiences and Directions workshop – titled “How
might OWL fail?”. As a relative outsider I decided to focus on the OWL
community’s need to improve its marketing and demonstrate its
relevance to the wider world, and expected a degree of hostility to
this message. Instead I sensed a slight deflation at the criticism
that was quickly followed by a desire to engage with the problem and
actively address it.

Perhaps the most powerful sign of how far the Semantic Web community
has come was in the entries to the annual Semantic Web Challenge. This
year the contest had two tracks: the Open Track, which is analogous to
the regular challenge in previous years and has a more established set
of judging criteria; and the Billion Triples Track, an attempt to
stimulate people to generate value from and add value to increasingly
large data sets, with the definition of what constitutes “value” being
more open-ended.

The quality in both tracks was exceptionally high, but one feature
that ran through most of the finalists struck me in particular – the
emphasis on the user experience. Previous challenges have always
attracted user-oriented applications as well as backend technologies,
but this year felt different. Whether the application was supporting
personal aggregation of one’s distributed information, as in the Open
Track winner Paggr; enabling location-oriented browsing of the
Semantic Web on a mobile phone, as in DBpedia Mobile, which took
second place in the Open Track; or providing structured browsing over
billions of RDF triples, as in SemaPlorer, winner of the Billion
Triples Track; the vast majority of entries recognised the need to
both add value to the data *and* provide a compelling user experience
over this.

For me this indicates not just an awareness but an acceptance on the
part of the Semantic Web community that no amount of research and
development at the backend will make a difference if clear user
benefits are not delivered. If this serves as evidence that the ISWC
series has come of age, then I would argue that along with it so has
the Semantic Web community at large. It may have taken some time, but
I have no doubt that this maturity has been earned.

Issue 2 of Nodalities Magazine is now available

Issue 2 of Nodalities Magazine is now available online. For those who have signed up to the free subscription, your printed copy is in the mail.

Items this month include:

  • Blue Oceans – Ian Davis and Zach Beauvais discuss the ‘Blue Ocean’ opportunity facing those who embrace the Semantic Web
  • Social Networking – Garlik CEO Tom Ilube introduces the notion of ‘social verification’
  • Environment – David Peterson puts semantic technologies to work in the fight against Climate Change
  • Predictable Mavericks – Talis CEO Dave Errington looks back at the company’s past, and forward to a semantically powered future
  • Open World Thinking – Nadeem Shabir argues that Semantic Web developers need to see the world differently
  • Dow Jones and Thomson Reuters – Read transcripts of recent conversations with these factual information powerhouses, and learn how the Semantic Web is being put to work.

Talis launches Nodalities Magazine

I’m pleased to be able to announce the launch of our new magazine, Nodalities Magazine. Issue 1 is available now, both online and in print. Subscription is free, and subscribers will be sent each new issue as it is printed.

Articles this month include Tom Heath’s look ahead to next week’s Linked Data on the Web workshop in Beijing, Mills Davis on the value of Web 3.0, Nadeem Shabir on the development of a commercial Semantic Web application, and the full transcript of the recent interview with Sir Tim Berners-Lee.