Nodalities

From Semantic Web to Web of Data

License

Creative Commons License

Archive for the 'Nodalities Magazine' Category

Might Semantic Technologies permit meaningful Brand relationships?

| This post will appear in Nodalities Magazine, Issue 7

by Paul Miller

Much has been written about growing Enterprise use of social media (usually Twitter, these days) to successfully track and mitigate customer complaints. Many have been quick to spot that the disproportionately high cost of satisfying (or, more cynically, silencing) these early adopters is unlikely to scale effectively as an increasingly large cohort of customers moves onto these services, and it must remain an open question as to whether ComcastCares and its peers can survive any move to the mainstream in recognisable form.

It appears, though, that Enterprise engagement in the social sphere changes the game far more significantly than merely enabling a select few twitterati to jump the Customer Support queue, and that this change is worth effort and investment in order to ensure that it does scale. What’s actually happening is that a relationship is being enabled between a brand and those that Seth Godin might recognise as its tribe; a relationship in which interactions are no longer driven predominantly by the desire to seek redress. Rather than only raising those issues serious enough for us to have written letters or endured telephone muzak in the past, we now comment on issues at the periphery of a brand. Collectively, we’ve moved from simply complaining about the worst failures of companies, their products and their employees, toward emitting an impressive stream of FYIs. Individually insignificant, and possibly unimportant, together these light touches on and around a brand build into an ever-changing and valuable commentary that brands and the corporations they front would do well to take notice of. The minor niggles about an otherwise exemplary service, the human touches that made us smile, the odd inconsistencies in a polished persona; none are enough to make us pick up the phone, but we comment upon them endlessly in Twitter, Facebook, FriendFeed and elsewhere, and by tapping into this fundamentally honest stream of consciousness there is much for those about whom we comment to learn. Good companies probably already know about fundamental failings in a product long before their customer support operation melts down under the weight of complaints or their quarterly sales targets are seriously under-achieved. Do they have as good a handle on the things we love? Do they have a clue about the minor gripes of customers outside their pre-launch polling groups? 
Do they know about the gut reaction to a colour, a touch, a smell, or a careless word that persuaded a likely prospect to buy a technically or aesthetically inferior product from the competition instead? All this and more is there for the taking in the stream of online chatter freely directed their way.

Semantic Technologies aren’t often directly associated with the worlds of Marketing and Commerce, yet individuals such as Eric Hillerbrand and Scott Brinker are hard at work to show just what might be possible when the experiences of the Semantic Web are applied to this space. Brands are no longer owned by the companies in whose name they were created. Increasingly, ownership of various forms is being asserted by the multitude of stakeholders with effort and attention invested in the brand. They care about it, they care about what it says about them, and they play a clear role in the brand’s evolution whether its managers want them to or not.

Brands need to engage in this conversation, as we are beginning to see them do, but they also need to discover the means to cost-effectively monitor and engage with a potential flood of third party reaction whilst using the Business Intelligence tools available to them in nimbly shaping public opinion to their advantage wherever possible.

I spoke with Scott Brinker last year, to explore his—then nascent—views on Semantic Marketing, and look forward to hearing his latest thoughts at the Semantic Technology Conference in San Jose in June.

More recently, Eric Hillerbrand talked about some of his ideas with respect to ‘Social Commerce,’ and the ways in which commercial organisations might seek to strengthen and exploit relationships with their customers, aided by a range of semantic technologies.

We’re just beginning to grasp the realities of a world in which tightly controlled and fiercely guarded brand attributes become increasingly permeable. For those companies with the confidence and foresight to loosen their grip, whilst simultaneously exploiting the wealth of data and new opportunities to engage, there is much to be gained. For the dinosaurs that hang on to ‘their’ brand in spite of the world around them, there is everything to lose.

Linked Data and the Public Domain

We love data at Talis and we want as much of it to be freely reusable as possible. In fact, because we wanted to see even more reusable data we recently launched the Talis Connected Commons offering completely free hosting of public domain data. We believe that dedicating data to the public domain is the best way to ensure that data is universally reusable and remixable. When data is public domain it means that it can be reused automatically without needing to check terms and conditions or track the source of every statement to provide attribution. These kinds of things act as friction to reuse, wasting energy that could be better spent creating inspiring things.

We also firmly believe that, in the future, there will be a significant role for other forms of data licensing, including commercial access. We will support those efforts too when the time comes, but today the Linked Data web needs more and better data that is freely accessible.

Licensing vs Waivers

You are probably familiar with the process of licensing a creative work, most likely through the great job that Creative Commons have been doing in recent years. The concept of waivers is less well known, but it is highly relevant to the reuse of linked data.

Whenever you create something you have automatic rights over it granted to you. The best known of these rights is copyright, which gives you the exclusive right to make copies of your creative work. There are many other rights which can be held over intellectual property such as design rights, trade marks, registered designs, performers rights, trade secrets, database rights, publication rights and many more.

Licensing is the process of granting others limited use of rights you possess. For example, when you license your copyright you are granting specific people a limited right to make copies without having to ask you first. Licensing of one right does not affect your possession of the others. For example you could grant the right to copy your work but retain the right to perform it. Creative Commons licenses are mostly concerned with copyright, but they do not usually deal with the other rights such as database rights or trade secrets.

Waivers, on the other hand, are a voluntary relinquishment of a right. If you waive your exclusive copyright over a work then you are explicitly allowing other people to copy it and you will have no claim over their use of it in that way. It gives users of your work huge freedom and confidence that they will not be pursued for license fees in the future.

The Licensing Problem

In general factual data does not convey any copyrights, but it may be subject to other rights such as trade mark or, in many jurisdictions, database right. Because factual data is not usually subject to copyright, the standard Creative Commons licenses are not applicable: you can’t grant the exclusive right to copy the facts if that right isn’t yours to give. It also means you cannot add conditions such as share-alike.

There isn’t a Creative Commons license for every possible right and there probably can’t be because of the huge variation in rights granted in different jurisdictions around the world. Also, when we start to look at licensing compilations of data we find that the situation becomes complex because you have to consider both the database and its contents separately. For example a database of articles would be subject to database right over the whole collection and individual copyrights for each article, quite possibly belonging to many different owners. The Open Data Commons has addressed this particular example with its Open Database License and Database Contents License (based on work originally donated by Talis). If a standard license doesn’t exist then you need to hire lawyers and write one for yourself – a potentially huge cost.

Our collective goal for a successful Linked Data web has to be to protect consumers of the data: the people who are remixing many different sources of data. Our intentions may be very honourable, but people need certainty if they are to build enduring value on data. Creative Commons licenses are irrevocable so even if you lose control over your work through some misfortune, the people reusing it will be protected forever. Imagine this scenario: you allow people to use data you have collated but your company goes bankrupt and the rights to the data collection are sold by the liquidators. If you hadn’t licensed your rights explicitly then every one of your users could be liable to be sued by the new rights holder!

This is where waivers of rights can help. By explicitly waiving your rights over your data you are giving your users the best guarantee of safety that you can. Even if you lost control of the data collection, subsequent owners could not pursue your users because the rights you held have already been waived.

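There are two waivers of rights that can be applied to datasets:

  • The Open Data Commons Public Domain Dedication and Licence (PDDL)
  • The Creative Commons CC0 waiver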

Both of these waivers can be used for data intended for submission to the Talis Connected Commons.

Community Norms

When you apply a waiver like CC0 you are relinquishing all your rights over the work to the fullest extent possible under the law. That means that you cannot force people to attribute you or stop them from making commercial use of your work.

The preferred approach is to attach a set of community norms to the work. These are like a code of conduct for use of the work and are usually self-policing. They are not legally enforceable but form part of the ethical or professional requirements for participating in a community. The best known example of community norms is the set of citation standards used in the academic community. Citing pre-existing work is not legally enforceable but those who abuse the norms can find themselves excluded from the academic community.

The Open Data Commons has published a set of attribution and share-alike norms which asks that users of the data:

  • Share work derived from the data.
  • Give credit to the original data publisher.
  • Point others at the source of the data.
  • Publish in open formats.
  • Avoid using digital rights management.

How to Declare Your Waiver

To declare your waiver in a machine readable way, you should first create a voID description of your dataset. VoID, or Vocabulary of Interlinked Datasets, is a vocabulary designed to describe key attributes of your dataset. We created a waiver RDF vocabulary that can be used with voID to declare any waiver of rights and the community norms around a dataset.

In this example we describe a dataset using the void:Dataset class and provide it with a dc:title as a minimal human readable description. You should add other descriptive properties as necessary (some suggestions can be found in the voID guide).

We then use the wv:waiver property (defined in the waiver RDF vocabulary) to link the dataset to the Open Data Commons PDDL waiver. We use the wv:declaration property to include a human-readable declaration of the waiver. This is purely informational, but can immediately be used by a person examining the voID description. Finally we use the wv:norms property to link the dataset to the community norms we suggest for it, in this case the ODC Attribution and Share-alike norms.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:wv="http://vocab.org/waiver/terms/"
  xmlns:void="http://rdfs.org/ns/void#">
  <void:Dataset rdf:about="{{uri of your dataset}}">
    <dc:title>{{name of dataset}}</dc:title>
    <wv:waiver rdf:resource="http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/"/>
    <wv:norms rdf:resource="http://www.opendatacommons.org/norms/odc-by-sa/" />
    <wv:declaration>
      To the extent possible under law, {{your name or organisation}} has waived all
      copyright and related or neighboring rights to {{name of dataset}}
    </wv:declaration>
  </void:Dataset>
</rdf:RDF>

Alternatively if you were to choose the CC0 waiver without any particular norms then you should use the following RDF:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:wv="http://vocab.org/waiver/terms/"
  xmlns:void="http://rdfs.org/ns/void#">
  <void:Dataset rdf:about="{{uri of your dataset}}">
    <dc:title>{{name of dataset}}</dc:title>
    <wv:waiver rdf:resource="http://creativecommons.org/publicdomain/zero/1.0/"/>
    <wv:declaration>
      To the extent possible under law, {{your name or organisation}} has waived all
      copyright and related or neighboring rights to {{name of dataset}}
    </wv:declaration>
  </void:Dataset>
</rdf:RDF>

These examples show that it is very simple to declare your waiver. However, before you do so be sure to read carefully what rights you are irrevocably giving up. For example you would most likely be waiving your publicity and privacy rights, so if your image is included in the dataset you could not later complain that someone is using it in a way you do not approve of. If you are worried about how your work will be used, if you want to legally require attribution, or if you don’t want people to make money off of your work, then you should not use a waiver and instead seek legal advice on the creation of a data license specific to your needs.
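
The two declarations above differ only in the waiver URI and the optional norms link, so they could be produced by one small helper. The following is a minimal sketch in Python using the standard library's ElementTree. The function name describe_waiver and its parameters are hypothetical and not part of voID or the waiver vocabulary; the namespace and waiver URIs are copied from the examples above.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/terms/"
WV = "http://vocab.org/waiver/terms/"
VOID = "http://rdfs.org/ns/void#"

# Waiver and norms URIs as used in the RDF/XML examples above.
PDDL = "http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"
ODC_BY_SA = "http://www.opendatacommons.org/norms/odc-by-sa/"

# Ask ElementTree to serialise with the same prefixes as the examples.
for prefix, uri in (("rdf", RDF), ("dc", DC), ("wv", WV), ("void", VOID)):
    ET.register_namespace(prefix, uri)


def describe_waiver(dataset_uri, title, holder, waiver_uri, norms_uri=None):
    """Return an RDF/XML voID description declaring a waiver (hypothetical helper)."""
    root = ET.Element(f"{{{RDF}}}RDF")
    dataset = ET.SubElement(
        root, f"{{{VOID}}}Dataset", {f"{{{RDF}}}about": dataset_uri}
    )
    ET.SubElement(dataset, f"{{{DC}}}title").text = title
    ET.SubElement(dataset, f"{{{WV}}}waiver", {f"{{{RDF}}}resource": waiver_uri})
    # Community norms are optional: the CC0 example above declares none.
    if norms_uri is not None:
        ET.SubElement(dataset, f"{{{WV}}}norms", {f"{{{RDF}}}resource": norms_uri})
    ET.SubElement(dataset, f"{{{WV}}}declaration").text = (
        f"To the extent possible under law, {holder} has waived all "
        f"copyright and related or neighboring rights to {title}"
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values; substitute your own dataset URI, title and name.
    print(describe_waiver("http://example.org/dataset", "Example Data",
                          "Example Org", PDDL, ODC_BY_SA))
```

Generating the description does not change what is being waived, so the caution above about reading the waiver text still applies.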

Linking Data and Semantics at O’Reilly

By Gavin Carothers and Charles Greer

|This article features in Nodalities Magazine, Issue 6

O’Reilly Media lives on the cutting edge. We coined terms such as Web 2.0, created the first commercial website in 1993, and exist to “spread the knowledge of innovators.” With our evangelists, conference presenters, authors, and bloggers all communicating and catalyzing new ideas, many believe that O’Reilly must be just as technologically innovative in our own operations. However, O’Reilly employs about 200 people but only half a dozen developers, so naturally ideas are thrown at our developers faster than it is possible to implement them. We’ve been known to refer to this tension between our public position on the cutting edge and internal expectation to live up to what we preach as “gaping wound tech.” Any time someone had a new idea or a new product to launch that didn’t quite fit into existing systems, we found some way to shoehorn it in, with a quick Perl script or some clever custom SQL. As we did this, more and more of our work became preventing our systems from collapsing under the weight of those one-off ETLs and scripts. The cost of simply keeping track of which scripts were using what bit of transformed data and where that data came from had become so high as to be unsustainable. We’d accrued so much design debt that only the most radical of approaches could save us from being crushed by the weight of our inherited code.

Of course, we didn’t really know that at the time. Today we have a Linked Data, Semantic, RESTful, URI-based, highly buzz-wordy solution mostly by accident and through ruthless pragmatism. Instead of embracing the ideas of the Semantic Web at the outset, we arrived at the Semantic Web because it was the only solution. We thought we were traveling down two completely unrelated roads. We started down the first while trying to replace a Java Bean Shell script that copied book content to a few different places. The other road began when we wanted to know what color to make the border of a PDF. The first would lead to an Atom Publishing Protocol server and clients, the second to our modeling all product metadata in RDF and opening that to the public.

As it turns out, the two roads weren’t so unrelated after all. RDF is designed to handle modeling information in a distributed manner and provides the underpinnings for the actual metadata we store, aggregate, and use. AtomPub’s RESTful interface is ideally designed for managing individual chunks of all this distributed data over time and provides programs and people a simple, standard interface for publishing, accessing, and updating it. As we progressed down each path, we were making (often unknowingly) major progress in generating linked data and semantics, the two pillars of the Semantic Web.

The RESTful Road

In 2005, soon after O’Reilly launched a custom book publishing platform, we discovered that we’d deferred a hard question. We didn’t know how to make sure that we could easily add new books as they came down the production pipeline. The canonical representation of nearly all O’Reilly titles is DocBook files. Historically, these DocBook files were scattered across many filesystems, transformed by people using one-off scripts, and arbitrarily transmitted using FTP to other filesystems. We simply didn’t have a way of addressing fundamental questions like “Where is the latest, cleanest copy of a book’s markup?” Tracking down the best representation of a book’s content was a laborious, error-prone task.

Around the same time we ran into this, we noticed Tim Bray’s superb presentations about the then-draft Atom Publishing Protocol. The architecture proposed by RESTful advocates like Bray and embodied by what would become RFC 5023 gave us the ability to store an atomic chunk of data, assign it a URI, and access and update it through a standard interface.

  • A book’s “source code”, the DocBook markup
  • The print book, as an ISBN
  • The table of contents
  • An HTML, PDF or other representation generated from the source
  • Whatever Tim O’Reilly or the business folks asked for next

O’Reilly’s SafariU was a business venture that implemented these kinds of transformations of content, but didn’t expose anything but its own web browser interface. When considering how to leverage SafariU’s technologies in the business as a whole, we arrived at this:

This atom:entry is the “latest, cleanest copy of a book’s markup” and its URI is the canonical location for this content. Additionally, the entry provides different views of the content using 17 distinct <link/> elements. We had embraced the linked data idea Noun = URI. Around the same time, we realized that while we needed a way to address various available formats of content, we also required a place to store and maintain our digital assets. By implementing the Atom Publishing Protocol we established a generic way to maintain our assets, as Nouns, over time. Now that systems could reliably find and update our content using URIs, it became painfully apparent that we still had a major uphill battle: how to do the same thing for product metadata?
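
As a rough illustration of the idea only (this is not O’Reilly’s actual entry), an atom:entry that exposes several views of one piece of content through <link/> elements could be assembled as follows. The helper name build_entry, the example.org URIs, the title and the choice of just three links are all made up for the sketch; the Atom namespace and the rel values "edit" and "alternate" come from the Atom and AtomPub specifications.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("atom", ATOM)


def build_entry(entry_uri, title, updated, views):
    """Build an atom:entry whose links point at other views of the same content.

    `views` is a list of (rel, media_type, href) tuples. Hypothetical helper.
    """
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_uri
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    # rel="edit" is the AtomPub link a client uses to retrieve and update the entry.
    ET.SubElement(entry, f"{{{ATOM}}}link", {"rel": "edit", "href": entry_uri})
    for rel, media_type, href in views:
        ET.SubElement(
            entry,
            f"{{{ATOM}}}link",
            {"rel": rel, "type": media_type, "href": href},
        )
    return entry


if __name__ == "__main__":
    # All URIs below are placeholders, not real O'Reilly locations.
    entry = build_entry(
        "http://example.org/books/example-book",
        "Example Book",
        "2009-01-01T00:00:00Z",
        [
            ("alternate", "application/pdf", "http://example.org/books/example-book.pdf"),
            ("alternate", "text/html", "http://example.org/books/example-book.html"),
        ],
    )
    print(ET.tostring(entry, encoding="unicode"))
```

The point of the sketch is the shape: one URI names the content, and every other representation is reachable by following a link from it.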

A similar problem existed when dealing with metadata. Distinct applications were completely unintegrated and focused only on the browser and human users. They provided no visibility into their data for other systems.

rdf:isNeat

“Can our PDFs have the same branding and colors as the printed books?” —Marketing Person
“Sure! How hard can it be?” —Innocent Developer

At this point in our journey we have more than 900 titles in the AtomPub repository and addressable by URI. We’ve (unknowingly) hit a significant Linked Data milestone and everything is progressing well. Dynamically creating a PDF from these entries is as easy as running our DocBook-XSL customization for the correct series to produce XSL-FO and then rendering that XSL-FO into PDF. The only problem was discovering which series (In a Nutshell, Animal Guide, Missing Manual) the content fell under. At that point all progress stopped.

Our definitive source of book and product information is the Product Database (67,000+ lines of Perl, C++, SQL, and a dozen other languages). The database and web application has its own home-rolled “XML Format,” as I’m sure many other companies have had. Based directly on the column names from the SQL database, our Book XML was a quick and very dirty way of getting our centralized relational data out into the world as XML. A host of new client applications grew around this new access to product data, but we quickly saw the problems of reusing an ad hoc, undefined, schema-generated format. The XML service was also incredibly slow.

<IPFamily>

<Book>
<product_id>5549</product_id>
<parent_product_id>6380</parent_product_id>
<imprint_id>1</imprint_id>
<product_status_id>5</product_status_id>
<product_type_id>10</product_type_id>
<isbn>0596515618</isbn>
<isbn13>9780596515614</isbn13>

<final_date>2003-07-02</final_date> <!-- Actually the day the last QC phase ended -->

...


As you can see from the snippet above, clients had to deal with knowing exactly what imprint 1 (O’Reilly Media, Inc.) and product type 10 (PDF) meant. Each client kept mappings of these magic values in order to make the data understandable. Those mappings broke, of course, whenever new product types and imprints were added. Even more dangerously, because the semantics of the XML were totally unspecified, element names were opaque and sometimes actively misleading. We might have redesigned the format to include more data and added more and more fields to it but this wasn’t an explicitly designed schema, just something generated from the SQL. On the road to exposing this data more cleanly we tried everything. Remodeling the SQL to be more relational didn’t offer much benefit and we still couldn’t tell what the column names meant. Sitting down and trying to write up a data dictionary was a great exercise, but it became out of date almost immediately. We experimented with JSON-based CouchDB prototypes, but those had the same issue as the SQL with missing meaning. Our Subversion repository is littered with Relax-NG, XML Schema, and Schematron documents to create a new XML-based format. Somehow they never got finished as we discovered we either had to define everything or try to design for extensibility. We knew we didn’t have the time to create our own Book Metadata Standard. We wanted defined semantics.
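
To make the magic-value problem concrete, here is a small sketch of the kind of lookup table each client had to maintain. The two known mappings (imprint 1 and product type 10) are the ones named above; the function name decode, the dictionary layout and the unknown product type 11 are hypothetical.

```python
# Per-client copies of the "magic value" tables. Only these two entries are
# taken from the text; a real client would have carried many more.
IMPRINTS = {1: "O'Reilly Media, Inc."}
PRODUCT_TYPES = {10: "PDF"}


def decode(record):
    """Translate raw ids from the Book XML into readable values.

    Raises KeyError when the database gains an id this client has never seen,
    which is how these mappings broke in practice.
    """
    return {
        "imprint": IMPRINTS[record["imprint_id"]],
        "product_type": PRODUCT_TYPES[record["product_type_id"]],
        "isbn13": record["isbn13"],
    }


if __name__ == "__main__":
    known = {"imprint_id": 1, "product_type_id": 10, "isbn13": "9780596515614"}
    print(decode(known))

    # A product type added upstream (11 is an invented id) breaks the client.
    try:
        decode({"imprint_id": 1, "product_type_id": 11, "isbn13": "9780596515614"})
    except KeyError as missing:
        print("unknown id:", missing)
```

Every client carrying its own copy of these tables is what made the format fragile: the meaning lived in the clients rather than in the data.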

There is at least one obvious XML vocabulary for a publisher looking to capture book metadata: ONIX. Unfortunately, the ONIX standard is archaic, with obscure element names like b004 (ISBN) and g343 (PrizeJury, obviously). (Footnote: Yes, these are the short versions and a longer set of names is also allowed. However, many of the most important vendors only support the short versions.) We did consider ONIX for a time, but then we noticed that every vendor we sent ONIX to treated the fields a bit differently. Even with pages and pages of specification there wasn’t any agreement on what elements were important or what they meant. Using ONIX as a format would not solve our semantic deficiency: we still wouldn’t know what the “columns” meant.

In the process of trying to create an XML format we asked a number of people in the company how to find the Publication Date for a book. The answer was surprisingly complex. The value was computed independently by each of the ETL hydras, with subtly different implementations that had evolved with particular client needs. O’Reilly isn’t a huge company with layer upon layer of bureaucracy; most questions can be quickly answered with a chat at a desk or an email to the other coast. Imagine our surprise, then, at the results of the Publication Date poll. Most people were confident that one of five dates was the right date, but disagreed on which of the five it was. Retail Availability Date, Actual In Stock Date, Estimated In Stock Date, etc. each had its backers. What was really going on was that we had discovered the subtly different needs that each business unit had. The strategy we could most easily support? Consensus on a public standard. As we’ve learned so many times, we needed to go outside the company to find the correct solution. Public standards, specifications, and ontologies could save us from ourselves.

Enter: Dublin Core. We couldn’t define our own format or use the industry standard (ONIX), nor could we agree on what a publication date was. Our only choice was to go borrow/steal some other group’s ideas. It turns out that our problems had already been solved by the library community. The Dublin Core Metadata Initiative created standards, guidelines, and examples for storing and sharing basic, essential metadata. We had a way out: here was a group of people who’d already done a great deal of thinking for us.

Of course, they hadn’t done all our thinking for us. Mapping all of our old data into well-designed and well-documented Dublin Core, MARC Relators, FOAF, or any other ontology was going to be hard. So we didn’t do it. Instead we mapped the whole of our old, horrible, ugly mess into an undefined ontology called the “Product Database Legacy Ontology.” We then moved some of the more obvious items like title and author into Dublin Core and waited. Only once we had a proven need for a new data point in a real application would we go through the process of researching, defining, cleaning, and moving it into a modern, public ontology. For those following along closely: no, trim color isn’t yet in the public or internal metadata. As it turns out, no one really wanted it. At least, not yet.
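
The incremental approach described above can be sketched as a lookup with a fallback: columns that have been deliberately moved get a public predicate, and everything else lands in the legacy ontology unchanged. This is only an illustration under stated assumptions. The legacy namespace URI, the column names title and author, the example subject URI, and the choice of dc:creator for authors are hypothetical; the Dublin Core terms namespace is the real one.

```python
DC = "http://purl.org/dc/terms/"
# Hypothetical namespace standing in for the "Product Database Legacy Ontology".
LEGACY = "http://example.org/product-database-legacy#"

# Columns that have been researched and moved into a public ontology so far.
# New entries are added only when a real application needs them.
PROMOTED = {
    "title": DC + "title",
    "author": DC + "creator",
}


def row_to_triples(subject, row):
    """Turn one legacy database row into (subject, predicate, value) triples."""
    triples = []
    for column, value in row.items():
        # Fall back to an undefined legacy predicate named after the column.
        predicate = PROMOTED.get(column, LEGACY + column)
        triples.append((subject, predicate, value))
    return triples


if __name__ == "__main__":
    row = {"title": "Example Book", "author": "A. Writer", "final_date": "2003-07-02"}
    for triple in row_to_triples("http://example.org/product/1", row):
        print(triple)
```

Promoting another column later means adding one entry to the table, which keeps the cost of each modelling decision tied to a proven need.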

All Together Now

Since Gavin’s first frenzied port of product metadata to an RDF model, we’ve been able to negotiate changing requirements, establish data validation and control rules, and bring on new applications with little time spent on data modeling. In other words, meeting our immediate need of a centralized, validated data store of high agility and performance has paid off several times over in deploying new software systems for the rapidly changing company.

One example of the intertwining of Linked Data and Semantics is our Electronic Media distribution system, which lets customers download ebooks, PDFs, videos and the like. Book descriptions, titles, author names, cover images, even the help text provided on the Electronic Media page are simply linked data, built from RDF relationships. When we want to change the help text or a category label, we change it in one document, and everything else in the RDF graph referencing it changes within moments as well. Just following links pays off.

Previously, the buttons that let a customer add a book to our shopping cart were generated by a system that used nightly ETLs nicknamed “the sync”. So new products would have to be prepped for release the night before. We gave special care to their timely appearance in the morning. Alas, they frequently did not appear as hoped, as the ETLs that made up “the sync” had to run on a very precise nightly schedule or we had to take manual corrective action. Now, a reasonably simple HTML template bound to the RDF for a book generates “Buy Buttons” in near realtime without an ETL in sight.

The greatest challenge of updating our legacy IT infrastructure hasn’t been replacing the ETLs or synchronization. It’s been achieving consensus on the meaning of data elements. In the past, data maintainers might adjust the title of a book to change how retailers present it. Then our website’s title would change (the next day), and we would have to bring resources to bear on reconciling the meaning of “title.” By using Dublin Core for our title element, we’ve established what to expect from those who change the value. It’s simpler to make sure people enter particular kinds of data, and then ask for help to extend or change requirements for downstream apps. The publicly available ontologies, we hope, will help everyone communicate more effectively about business needs and shared data points. So far the results are encouraging.

In the Public Eye

Having built several of our own applications using our new RDF metadata and our initial linked data APIs, we thought it might be a good idea to let someone else have a crack at it too and see what they made of it. It took us two weeks to develop the O’Reilly Product Metadata Interface, a simple layer on top of the Deli. A caching proxy preserves the reliability needed by our own applications, while a predicate filter prevents private information from leaking to the public. A bit more about how you can access it can be found at http://labs.oreilly.com/opmi.html or you can just dive right in by giving it an ISBN, e.g. http://opmi.oreilly.com/product/9780596529260.
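
A predicate filter of the kind mentioned above can be very small. The sketch below assumes an allowlist (only explicitly approved predicates are published), which is one plausible design and not necessarily the one O’Reilly used; the function name filter_public, the example.org URIs and the internal cost predicate are invented for the example.

```python
def filter_public(triples, public_predicates):
    """Keep only triples whose predicate has been approved for publication.

    An allowlist fails closed: a newly added internal predicate stays private
    until someone decides to expose it.
    """
    return [triple for triple in triples if triple[1] in public_predicates]


if __name__ == "__main__":
    # Hypothetical data: one public fact and one internal-only fact.
    book = "http://example.org/product/1"
    triples = [
        (book, "http://purl.org/dc/terms/title", "Example Book"),
        (book, "http://example.org/internal#unitCost", "4.20"),
    ]
    public = {"http://purl.org/dc/terms/title"}
    print(filter_public(triples, public))
```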
Sharing our work with the public forced us to be much more deliberate and rigorous about our data, but also exposed some simple blunders. On the day we launched the service we waited for the praise to come in and finally saw a tweet! Someone is using… Oh wait:

“OPMI’s book identifiers aren’t resolvable. Sigh.” —Jeni Tennison

“Of course they’re resolvable,” we thought. “You just have to parse the URN and understand how to pass the URN to… oh, yeah good point.” In the process of implementation, we’d forgotten Tim Berners-Lee’s second rule of Linked Data:

2. Use HTTP URIs so that people can look up those names.

At the start of the process we’d talked about using some sort of identifier for our products. But that conversation had taken place before we really had all the RDF and Linked Data applications working, so at the time there wasn’t any point nor could anyone see the need for a resolvable identifier. Within a few hours of making the data public, the need became blindingly apparent. Part of embracing “anyone can say anything about anything” is that anyone needs to be able to find the anything they want to talk about. And when you’ve got a statement to make, it’s remarkably handy to be able to quickly find out what else has been said. “I loved urn:x-domain:oreilly.com:product:9780596529260.BOOK” is a bit hard to figure out. “I hated http://purl.oreilly.com/product/9780596529260.BOOK” is a lot better.
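
The difference between the two identifiers is mechanical, which is part of why the oversight was so easy to make. A minimal sketch of the rewrite, using only the two prefixes quoted above (the function name to_http_uri is hypothetical, and this is not presented as the code O’Reilly actually ran):

```python
URN_PREFIX = "urn:x-domain:oreilly.com:product:"
HTTP_PREFIX = "http://purl.oreilly.com/product/"


def to_http_uri(identifier):
    """Rewrite a product URN as an HTTP URI; leave anything else untouched."""
    if identifier.startswith(URN_PREFIX):
        return HTTP_PREFIX + identifier[len(URN_PREFIX):]
    return identifier


if __name__ == "__main__":
    print(to_http_uri("urn:x-domain:oreilly.com:product:9780596529260.BOOK"))
```

With the URN, every consumer needs to know this rule privately; with the HTTP URI, the rule is simply "look it up".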

Streams, Pools and Reservoirs

by Leigh Dodds
| this article features in Nodalities Magazine, issue 6

As we start to move past the current boot-strapping phase of the semantic web in which we are constructing the web of linked data, it’s useful to begin discussing what other features and infrastructure we need in order to support sustainable usage of this huge and growing data set: what services can be offered over linked data? Do we need to consider how to provide quality of service, stability and longevity to the data, or does the sheer scale of the web make these moot points?

In order to answer this question it’s useful to compare the ongoing development of the linked data web with that of the web itself.

A Brief History Lesson

There have been several phases of activity in the development of the web. While in truth, these phases were of different duration, overlapped with one another, and have happened at different rates within different communities, essentially we have gone through the following basic steps.

Firstly we concentrated on just getting stuff on line. The early web was a new medium for document and data exchange and so was at its core a simple publishing device used as a collaborative space between small communities. But as the amount of content and the size and breadth of those communities grew, the emphasis shifted towards linking: tying content together to create – initially hand-crafted – indexes of the web and knit the available content into a greater whole.

The second, manual linking phase was quickly supplemented by a third phase of automated linking between content: search engines. A search engine is simply a way to quickly create a link-base based on some search criteria. The crawling and indexing of the document web by web crawlers allows users to quickly construct links to content of potential interest.

A fourth phase of the web’s development has been triggered by the commoditisation of search and the need for search engines to differentiate themselves and offer additional value-added services. Search engine features are now tailored towards particular uses or types of content (Google Image Search; Google Scholar); offer value-added features that capitalise on the ability of search engines to analyse the structure and traffic flows across the web (PageRank and similar indexing improvements; Google Trends); expand the audience for content (Google Translate); and enable community-driven customisation of the search experience (Google Custom Search; Yahoo Search Monkey, etc).

No doubt there will be subsequent phases of development, and the perspective of history will let us tease out common strands of development some of which will already be happening. But if we look at the recent, rapid development of the linked data cloud, we can already see that the same pattern is being repeated.

History Recapitulated

There has been RDF data available on the web for many years, used by a limited community of researchers. This slow accumulation of content – echoing the first phase of content publishing on the document web – has been replaced by a rapid increase in data publishing encouraged through the Linking Open Data (LOD) project. By providing clear pragmatic guidance and instructions on how to publish data for the semantic web, that project has enabled us to accelerate our transition through that first content publishing phase. But it has also, crucially, encouraged the linking together of data sets (Phase 2).

This linking has to a great extent been manual. Not in the sense that members of the LOD community are manually entering data to link datasets together, but rather at the level of looking for opportunities to link together datasets, encouraging data publishers to co-ordinate and inter-relate their data, and by attempting to organically grow the linked data web by targeting datasets that would usefully annotate or extend the current Linked Data Cloud.

The rapid growth of the Linked Data Cloud means that this “manual” phase will soon be over: there will be sufficient momentum behind the semantic web that increasing amounts of data will become available and no single community will be able (or need) to shepherd its development. The focus will shift towards the subject-specific communities who will instead co-ordinate at a more local level. Semantic web search engines will also become a reality.

Semantic Web search engines need to be distinguished from semantically enabled search engines. The latter use techniques like natural language parsing and improved understanding of document semantics in order to provide an improved search experience for humans. A Semantic Web search engine should offer infrastructure for machines. This Third Phase is also beginning to take place. Simple semantic web search engines like Swoogle and Sindice provide a way for machines to construct link bases, based on some simple expressions of what data is of relevance, in order to find data that is of interest to a particular user, community, or within the context of a particular application. And crucially this can be done without having to always crawl or navigate over the entire linked data web. This process can be commoditised just as it has with the web of documents.
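As a rough illustration of what “constructing a link base” from a simple expression of relevance might look like for a machine, the sketch below indexes a handful of triples by their rdf:type and looks up every resource of a wanted type. The triples, the example.org URIs and the function names are all made up for illustration; this is not how Swoogle or Sindice are implemented.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"

# Hypothetical triples, standing in for data a crawler has already collected.
TRIPLES = [
    ("http://example.org/people/alice", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/books/1", RDF_TYPE, "http://example.org/vocab#Book"),
    ("http://example.org/people/bob", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/people/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
]


def build_type_index(triples):
    """Map each rdf:type value to a sorted list of the resources declared with it."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            index[obj].add(subject)
    return {rdf_type: sorted(subjects) for rdf_type, subjects in index.items()}


def link_base(index, wanted_type):
    """Return the links (resource URIs) for one type without re-reading the triples."""
    return index.get(wanted_type, [])


INDEX = build_type_index(TRIPLES)
PEOPLE = link_base(INDEX, FOAF_PERSON)
print(PEOPLE)
```

The index is built once and then answers each request directly, which is the point made above: a consumer does not have to crawl the whole linked data web to find the data it cares about.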

Co-Evolution of the Web Infrastructure

Given the strong concordance between the phases of development of the document and linked data web, it is reasonable to make some predictions on how semantic web search engines, and additional supporting infrastructure, are likely to evolve by comparing them with the development of human search engines. For each of the specialisations and value-added features listed earlier it’s possible to see an equivalent for the machine-readable web:

Document Web | Semantic Web Infrastructure | Description
Google Image Search | Type Searching | Ability to discover resources of a particular type: e.g. Person, Review, Book
Google Translate | Vocabulary Normalisation | Application of simple inferencing to expose data in more vocabularies than those made available by the publisher
Google Custom Search | Community Constructed Data Sets and Indexes | Ability to create and manipulate custom subsets of the linked data cloud
Google Trends | Linked Data Analysis & Publishing Trends | Identifying new data sources; new vocabularies; clusters of data; data analysis

These last two are particularly interesting as they suggest the need to be able to easily aggregate, combine and analyse aspects of the linked data cloud. This infrastructure will need to be able to support the community in working with data in a variety of ways, allowing data to flow and be collected where it is needed. Introducing a metaphor for this process might help highlight some of the processes and their consequences.

Flowing Data

Data is like water and flows of data are like streams. These streams of data can arise from any number of different sources: from a person entering data into a system; from a click stream generated as a side-effect of web browsing; application events; or generated from real-world sensor measurements. There are already many ways that we can tap into these data streams, using web-based query APIs, messaging systems like XMPP, or syndication protocols like Atom and RSS.

While these streams of data are already supporting a huge range of different applications and use cases, they are inherently limited: a stream has no memory. If historical context is required, e.g. to support more complex querying and reporting, then each consuming application must collect and store the data. We can think of these collections of data as pools; each stream of data on the web may feed any number of different application-specific pools.
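The difference between a stream and a pool can be sketched in a few lines. In this minimal, assumed model a stream is a generator that yields each event once, and a pool is whatever a consuming application stores so that it can ask historical questions later. The event data and the Pool class are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str]  # (timestamp, payload)


def stream() -> Iterator[Event]:
    """A stream: each event is delivered once and then it is gone."""
    for event in [(1, "a"), (2, "b"), (3, "c")]:
        yield event


class Pool:
    """A pool: an application-specific store that remembers what flowed past."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def collect(self, source: Iterable[Event]) -> None:
        for event in source:
            self.events.append(event)

    def since(self, timestamp: int) -> List[Event]:
        """A historical query, only possible because the pool kept the data."""
        return [event for event in self.events if event[0] >= timestamp]


source = stream()
pool = Pool()
pool.collect(source)

leftover = list(source)  # the stream has no memory: nothing is left to read
history = pool.since(2)  # the pool can still answer questions about the past
```

Every application that wants the history has to run its own version of this collection step, which is the overhead that the following paragraphs describe.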

A pool of data provides extra flexibility, but comes at the cost of requiring each consuming application to maintain its own infrastructure to hold copies of that data. Even if each source of data provides direct access to its own pool, e.g. by exposing a web-based query interface onto its database, or by exposing linked data, there are still unnecessary overheads. Each data provider must provide their own scalable infrastructure and support a rich set of data access options.

If we start building large pools of data, within a community supported infrastructure, then we have a reservoir. A reservoir is a pool of data that is maintained by and services a specific community. Reservoirs allow issues such as quality of service (reliable supply of water) and infrastructure costs (building of pipelines) to be solved at a community level.

It’s possible to argue that the web already consists of streams, pools, and reservoirs, but there is a distinct difference between a web based on semantic web technology and a Web constructed of a mixture of XML documents or similar formats: like water, at the molecular level, all RDF is the same; it’s all triples. Unlike alternatives, RDF data is more easily pooled and collected and so is much more amenable to explorations of shared infrastructure. Like a relational database, an RDF triple-store can contain a huge variety of different kinds of data. But unlike a relational database, an RDF triple-store has the potential for the aggregate to be much more than the sum of its parts. The seeds of convergence are built in, through reliance at the most fundamental level on a global naming system (URIs) and standardised ways to state equivalence and relationships between resources.
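A small sketch may help show why “it’s all triples” matters. Below, two made-up sources are pooled with nothing more than a set union, and a single owl:sameAs statement is enough for a lookup on one URI to return facts published under another. The URIs, the rating vocabulary and the one-hop handling of owl:sameAs are simplifying assumptions, not a full treatment of equivalence.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Two hypothetical publishers describing the same book under different URIs.
SOURCE_A = {
    ("http://a.example/book/1", "http://purl.org/dc/terms/title", "Dune"),
}
SOURCE_B = {
    ("http://b.example/item/42", "http://example.org/vocab#rating", "5"),
    ("http://b.example/item/42", OWL_SAME_AS, "http://a.example/book/1"),
}


def pool(*sources):
    """Pooling RDF is a set union: no schema mapping is needed to combine sources."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def describe(triples, resource):
    """Collect (predicate, object) pairs for a resource and its direct owl:sameAs aliases."""
    aliases = {resource}
    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            if subject == resource:
                aliases.add(obj)
            if obj == resource:
                aliases.add(subject)
    return sorted(
        (predicate, obj)
        for subject, predicate, obj in triples
        if subject in aliases and predicate != OWL_SAME_AS
    )


MERGED = pool(SOURCE_A, SOURCE_B)
FACTS = describe(MERGED, "http://a.example/book/1")
```

After pooling, the title from one source and the rating from the other are both reachable from a single URI, which is a small instance of the aggregate being more than the sum of its parts.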

In the real world, reservoirs do more than supply a community with water. The aggregate has its own uses: water skiing or hydro-electric power generation for example. And the same will be true of semantic web data reservoirs: large collections of data can be analysed and re-purposed in ways that are not possible – or at least not achievable without a great deal of repeated, redundant integration effort – using other techniques. The reservoir itself can be the source of new facts and new streams of data derived from analysis of its contents.

Flowing Data through the Talis Platform

The goal of the Talis Platform is to support the growth of the Linked Data ecosystem by providing the infrastructure to support the creation of pools of data. For additional background, see my article “Enabling the Linked Data Ecosystem” from Nodalities issue 5.

At present the Platform provides a range of services that allow data to be easily streamed into and out of Platform stores, allowing data to be easily pooled in order to benefit from greater context. Data can be pushed directly into the Platform and we are exploring methods of supporting other forms of data ingestion to make it easier and more natural to begin to accumulate data sets within the Platform.

The core search service, which produces its results in RSS, allows the creation of simple data streams, while the SPARQL interface supports more complex data extraction methods. The Augmentation service provides an interesting twist on these conventional approaches, providing a means for any RSS 1.0 feed to be automatically enriched with extra metadata by feeding it through a Platform data store. This means of interaction is like fishing for data: it is possible to serendipitously find and extract data, capturing it as extra context to items in an RSS feed, without having to deal with writing SPARQL queries or constructing a keyword search. There are many more methods and modes of data extraction that will be added to the Platform to add to these existing services; this is just the beginning.

But the Talis Platform is intended to provide much more than just the ability to work with pools of data. The bigger vision is to support the creation of true data reservoirs, and enable many different ways of manipulating and analysing their contents in order to discover new facts and bring new context to that data. Creation of these larger pools of content will need to be made sustainable for the communities that are creating them, and deriving value from them. Sustainability covers a wide range of issues that go beyond just commercial issues: quality and range of services are additional factors, as are forms of governance, trust and quality that relate to the data sets themselves. The Platform is intended to address all of these issues.

To take a small example, the experimental “store groups” feature that was released at the end of last year provides a simple method for combining datasets, without requiring that data to be completely loaded or copied into a single database. The store groups feature will ultimately support a range of services over the constituent data sets, allowing each pool of data to remain intact whilst still contributing to the whole; this will be important to support the new forms of governance that are beginning to emerge around datasets on the Linked Data web.

Linked Data In(ter)action

By Benjamin Nowack

| This article will feature in Nodalities Magazine, Issue 6

In recent months, the Semantic Web community has been accelerating its progress around web-enhanced information and knowledge management. Specifications such as RDF and SPARQL are increasingly applied by developers and organizations, and RDF software is maturing. Even the initial chicken and egg problem around data and applications has now been solved by the Linking Open Data (LOD) project, which is bringing dataset after dataset online, each following recommended practices for simplified information access and repurposing. The time has finally come to move on and create the distributed data applications we have been dreaming of for so long.

Just like the Web’s true innovation was not hypertext as such, but freeing it from isolated CD ROMs, the Semantic Web’s value proposition is not information integration per se, but doing it on a global scale. Network effects will play an important role and have to be considered by application developers. Mashups on a semantic web are not one-off combinations of existing sources and APIs. They will feed their added value back into a self-reinforcing Linked Data Ecosystem, thus enabling chains of applications, with each reaping the benefits of the previous one. RDF developers these days often use terms like “Meshup” or “Hyperdata” to describe the direction they are headed.

Linked Data is all about portability and off-site use: the more an application attracts users, the more it will let them take their data with them and also integrate external sources. With a bit of luck, we will see not one, but a wealth of killer applications, where the “unique selling proposition” is personal and defined by each user individually.

Despite the ongoing advances, some pieces to the puzzle are still missing. This becomes clearer when we correlate the current state of the Linked Data market to a typical information life cycle classification. While we can name solutions for each value-increasing process (Creation, Organization, Utilization, Distribution, Discovery), the Utilization and Application stage represents a bottleneck. Products start to benefit from Linked Data, but few are also re-distributing their internally enriched information. Additionally, the Creation phase today is mostly driven by dedicated efforts such as the LOD project, although data manipulation and enhancement should also be possible while people are interacting with semantic web content.

Linked Data Value Spiral

A few months ago, Talis researcher Tom Heath wrote an inspiring IEEE Internet Computing essay titled “How Will We Interact with the Web of Data?” where he described the upcoming challenges and opportunities in the context of human-computer interaction. He suggested that on a web where the granularity is increased from documents to arbitrary things, user interfaces should treat individual objects as first-class citizens, ideally providing context-specific functionality, direct manipulation, and coherence across personal usage scenarios. Application models that go beyond browsing and which are both universal and user-friendly are an ongoing challenge.

A system that aims at finding a sweet spot between simplicity and standardized interaction is Paggr (paggr.com). The basic idea is to combine successful Web 2.0 solutions and trends with Tim Berners-Lee’s concept of an “RDF Clipboard” for polymorphic data exchange between desktop applications. The required technical trick for copy-by-reference across desktop and web applications was introduced by Ray Ozzie three years ago through his “Live Clipboard”. Around the same time, AJAX and converging browser capabilities mass-enabled interactive HTML elements, and personal portal builders such as Netvibes brought widgets and drag and drop to end-users. The number of open datasets and technical possibilities finally led to a first prototype for building Linked Data Dashboards a few months ago.

The system used Netvibes-like pages with three resizable columns that could be populated with so-called Sparqlets. A Sparqlet is a SPARQL-powered widget, defined by a set of queries and result templates. The output consists of machine-readable HTML which addresses three essential requirements:

  • Widgets can easily be copied to other dashboards, their complete definition is retrievable via HTTP (by de-referencing the widget identifier).
  • Individual items in a widget can be interactively linked to other items, as each element is associated with a URI. This makes semantic drag and drop possible, such as dragging a person representation on a map or an address book widget.
  • Augmented data can instantly be fed back into the personal or public data cloud.
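The idea of a widget “defined by a set of queries and result templates” can be sketched as a small data structure. What follows is a hypothetical model, not Paggr’s actual format: the class, its field names, the example.org URIs and the use of an about attribute to carry each item’s URI are all assumptions made for illustration, and the query is stored but never executed.

```python
from dataclasses import dataclass
from html import escape
from typing import Dict, List


@dataclass
class Sparqlet:
    """Hypothetical widget definition: an identifier, a query and a result template."""

    uri: str            # de-referenceable identifier of the widget itself
    query: str          # SPARQL text (kept as data; not run in this sketch)
    item_template: str  # template applied to each result row

    def render(self, rows: List[Dict[str, str]]) -> str:
        """Render result rows as HTML in which every item carries its own URI."""
        items = [
            self.item_template.format(
                **{key: escape(value, quote=True) for key, value in row.items()}
            )
            for row in rows
        ]
        return '<ul about="{}">\n{}\n</ul>'.format(
            escape(self.uri, quote=True), "\n".join(items)
        )


people = Sparqlet(
    uri="http://example.org/widgets/people",
    query="SELECT ?person ?name WHERE { ?person <http://xmlns.com/foaf/0.1/name> ?name }",
    item_template='<li about="{person}">{name}</li>',
)

# Rows a SPARQL endpoint might have returned for the query above (made up here).
html_output = people.render(
    [{"person": "http://example.org/people/alice", "name": "Alice & Co"}]
)
print(html_output)
```

Because both the widget and each rendered item expose a URI, a script on the page has something concrete to copy, drag or link, which is what the three requirements above rely on.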

Architecture

The prototype received encouraging and very helpful feedback at the International Semantic Web Conference (and even won a prize). We are clearly not ready for the mainstream user yet, but building on established interaction models seems to be a promising acceptance strategy. The next iteration of Paggr is now almost finished and we are looking forward to putting it online. The first public applications will be limited to focused use cases (such as an organizer for conference attendees) as we are still working on certain interface behaviors, but a private alpha phase with fewer restrictions is planned, too.

Linked Data Dashboards face a number of usability challenges. The big question is how to tie the wealth of possibilities to a generic user interface without sacrificing work efficiency. Application convenience often boils down to feature reduction and contextual options, possibly combined with shortcuts for common tasks. To reduce complexity, Paggr lets the user (or app creator) break the theoretically infinite possibilities down into separate dashboards, where options and relations can be further spread across widgets.

The more complicated part starts at the widget level. Semantic drag and drop is often multi-modal. Dragging an event on a calendar does not necessarily mean “Add”; there are many ways to link two persons to each other, etc. Also, working with Linked Data is sometimes like having a backstage pass for a concert: very exciting, but also a bit rough, easily overwhelming, and if you open the wrong door, you can quickly find yourself getting kicked out. Raw data (or equally ugly RDF/HTML dumps) are always just a link away, and application designers will try to carefully shield non-developers from being exposed to things like DBPedia pages. For developers, on the other hand, this equivalent to the early Web’s “view source” feature can be very valuable.

Now, what exactly are the requirements and nice-to-haves, and (how) can they be implemented through widgets without leading to cluttered screen real estate? As mentioned above, in order to support drag and drop as well as copy and paste between different browser tabs or even at the operating system level, we can use a technical trick introduced by Live Clipboard: transparent form fields that natively provide “right-click / paste” and similar functionality. For a consistent user experience, this means that we need distinguishable (but unobtrusive) fields for each interactive element. In Paggr, small Semantic Web icons next to widget items and title bars signal the availability of advanced options. They enable:

  • widget filtering
  • copying widget or item identifiers
  • removing items from and adding items to widgets
  • interlinking individual items
  • custom contextual menus

Paggr Widget

The approach of using dedicated interaction zones has desirable side-effects. Non-expert users are less likely to get confused, as the general markup keeps its expected behavior. It also becomes possible to disable the semantic extensions simply by deactivating and hiding the icons. A public dashboard or shared meshup may look and feel just like a normal website.

There are still several unresolved issues left and future iterations could well require a complete re-design, but Paggr is just one of a growing number of consumer-oriented Linked Data systems. After years of hard infrastructure work, the Semantic Web community is finally starting to benefit from the investments. Data-wise, we have probably reached the tipping point already. Even former critics are starting to make their information available in RDF, efforts like microformats, once regarded as competitors, have become accessible from SPARQL, and services like OpenCalais, Yahoo!’s SearchMonkey, or the Zemanta API are constantly reinforcing the network effects of structured open data. It should only be a matter of months until we see the first fully-fledged Linked Data applications for end-users.

Benjamin Nowack is the developer of Paggr. He runs semsol, a tiny Semantic Web agency in Düsseldorf, Germany.

Discovering SPARQL

By Alex Tucker

| This article will feature in Nodalities Magazine, Issue 6

Confessions of a Semantic Web Junkie

Let me start with a confession: I’ve been banging on about the semantic web to anyone who will listen to me for the past nine years.  For some apparently deep seated reason, which some have even labelled perverse, I’ve kept at it despite endless meetings, misunderstandings, off-handed dismissals and blatant refusals to accept what seems to me to be obvious.  It’s been a long and sometimes despairing nine years, but looking around now at the state of all things semantic web, one can’t fail to realize that it’s finally taking off and that companies like Talis are fully committed to its success.

During much of this time I’ve been working for various defence organisations showing how using semantic web standards and tools can help solve issues of interoperability — getting different systems to talk to one another.  That the semantic web, and more specifically the semantic web ontology language OWL, can address these kinds of issues shouldn’t be too surprising, given that the US Defense Advanced Research Projects Agency (DARPA) helped fund the process which eventually led to OWL, precisely to solve issues of semantic interoperability.

More recently, I’ve had the pleasure of working with various teams and projects within NATO who have embraced the semantic web ideas and expanded on them in novel ways.  A fundamental shift in attitude in defence, which has in part been driven by the 9/11 Commission’s findings, is that intelligence agencies must move from a ‘need to know’ to a ‘need to share’ mentality.  For me, this shift has echoes in the ‘linked open data’ ideals, and for someone who also bangs on about open source, open standards and all things open, it’s a step in the right direction.  Obviously, sharing information in a military context is a little different to the ideals of linked open data, but the fundamental issues around getting the information out of existing silos and usable by other systems are much the same.

A recent success in one such NATO project has been the decision to move from using a bespoke query protocol for RDF, a sort of ‘query by example’, to using the World Wide Web Consortium’s new semantic web query language, SPARQL.  This might seem obvious, but a few years ago when the protocol was being developed, SPARQL didn’t exist — therein lies another discussion on how organisations need to be more agile to be able to cope with the length of a ‘Web Year’.  As far as the project group are concerned however, this is great news, as they don’t have to come up with a ‘Standardization Agreement’ or STANAG to define their own query protocol and can instead just point to and use an existing standard, SPARQL.

Bonjour

Another standard used by this project is Bonjour (formerly Zeroconf or Rendezvous), Apple’s open standard for no-nonsense network configuration and service discovery, the technology behind iTunes’ ability to discover and play music from other iTunes applications on different computers. The great thing about using Bonjour for this project is that it’s decentralized: there’s no need to set up the registry of information providers beforehand. Instead, each provider is free to publish their existence and their capabilities, both on a local and wide area network, without setting up any new infrastructure. Each service in a local network can publish its details in any number of domains. In the accompanying diagram, the clouds represent local networks, the arrows represent the act of publishing, or registering service details to a domain, the cylinders represent services, and the ‘.local’ is the domain for the local network. iTunes, for instance, currently publishes itself only to the local area network (which is different to the initial iTunes release where people realised they could share their music with friends over the internet by advertising their iTunes database to their own domain name, how terrible). The domains themselves correspond to the normal domain name system (DNS) names we’re used to, since DNS is really at the heart of the Bonjour protocol.

Services can then be discovered simply by looking for a particular service type in a particular domain; a typical Bonjour service discovery query would ask for all printer services in the local domain.  These service types are given slightly cryptic names (iTunes, for example, uses _daap._tcp), but are all listed on-line at dns-sd.org/ServiceTypes.html for everyone to see and re-use.

Hello SPARQL

What we’ve done then is create a service type for SPARQL, allowing information providers to publish their SPARQL ‘endpoints’ to be discovered by information consumers.  Technically, the SPARQL service type is _sparql._tcp and is listed at dns-sd.org/ServiceTypes.html along with a short description of the properties which should be published.
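Since DNS is at the heart of the protocol, “looking for a service type in a domain” comes down to asking for records under a name built from those two parts. The helper below is a minimal sketch of that naming step only; the function name is made up, and it sends nothing over the network.

```python
def browse_name(service_type: str, domain: str = "local") -> str:
    """Build the DNS name a DNS-SD client queries to enumerate services of one type.

    service_type is something like "_sparql._tcp"; domain is "local" for the
    local network or an ordinary DNS domain for wide-area discovery.
    """
    if not service_type.startswith("_") or not service_type.endswith(("._tcp", "._udp")):
        raise ValueError("expected a type such as '_sparql._tcp'")
    # Strip any leading or trailing dots so the result has exactly one final dot.
    return "{}.{}.".format(service_type, domain.strip("."))


local_sparql = browse_name("_sparql._tcp")
wide_area_daap = browse_name("_daap._tcp", "example.org.")
print(local_sparql)
```

A real client would hand a name like this to a Bonjour or DNS-SD library and receive back the names of the individual service instances published in that domain.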

In keeping with the RDF mantra that “anyone can say anything about anything”, we’ve tried to ensure that anyone can publish a description of not only their own SPARQL services, but also those of others.  A published SPARQL service record has two properties, one called ‘path’ and one called ‘metadata’, from which a client derives two URLs, the former pointing to the SPARQL service endpoint, and the latter pointing to some arbitrary (RDF/XML encoded) metadata it can fetch. The normal approach would be to just use simple paths (e.g. /sparql and /service-metadata.rdf) for the values of these properties, which would be interpreted as pointing to the discovered host (e.g. http://dbpedia.org/sparql and http://dbpedia.org/service-metadata.rdf).  However, by using a full URL for the value of either property, a service record can be published which points to a service or some metadata on a different host entirely, allowing us to form what are essentially proxy records for existing SPARQL services which don’t (yet) use Bonjour service discovery.
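The rule described above, where a simple path is interpreted relative to the discovered host while a full URL is used as-is, matches ordinary relative URL resolution. Here is a minimal sketch using Python’s standard library; the function name, the assumption of plain http and the proxy.example host are illustrative rather than part of the service type’s definition.

```python
from urllib.parse import urljoin


def resolve(host: str, port: int, value: str) -> str:
    """Turn a 'path' or 'metadata' property value into a URL a client can fetch.

    Assumes plain http. A simple path is resolved against the discovered host;
    a full URL replaces it entirely, which is what makes proxy records possible.
    """
    base = "http://{}/".format(host) if port == 80 else "http://{}:{}/".format(host, port)
    return urljoin(base, value)


endpoint = resolve("dbpedia.org", 80, "/sparql")
metadata = resolve("dbpedia.org", 80, "/service-metadata.rdf")
# A proxy record published from another (hypothetical) host.
proxied = resolve("proxy.example", 8080, "http://dbpedia.org/sparql")
```

With this behaviour a publisher on one host can advertise an endpoint that lives somewhere else without the client needing any special handling.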

The Bonjour specifications are firmly geared towards making the user’s life simpler, so for instance while the name of a service is really just a normal DNS name, the specifications insist that it should be a human readable name with proper capitalization, spaces etc., although no dots are allowed.  The specifications also give guidance as to the sorts of properties and values a service record should contain, and are keen that service records be as concise as possible and leave much of the nitty-gritty details of discovering service capability to the main protocol, in our case SPARQL.

Into the voID

However, the SPARQL protocol doesn’t have much to say about what to expect of an endpoint other than, “here it is.”  Our approach to this has been to allow a little bit of extra information in the service record, for instance using the ‘vocabs’ property to declare the URIs of any vocabularies the service uses.  To find out more information about a service, a consumer can use the value of the ‘metadata’ property to fetch a fuller service description.  The only restriction is that this service description should be marked up in RDF/XML — what better way to encode metadata?
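In DNS-based service discovery, properties like these travel in the service’s TXT record as a sequence of length-prefixed “key=value” strings. The sketch below shows that encoding and its inverse. The property values are invented, and giving ‘vocabs’ a single URI is an assumption made here to keep the example simple, not a statement of how the service type defines it.

```python
from typing import Dict


def encode_txt(props: Dict[str, str]) -> bytes:
    """Encode properties as a DNS-SD TXT record: one length byte, then 'key=value'."""
    out = bytearray()
    for key, value in props.items():
        entry = "{}={}".format(key, value).encode("utf-8")
        if len(entry) > 255:
            raise ValueError("a single TXT entry cannot exceed 255 bytes")
        out.append(len(entry))
        out += entry
    return bytes(out)


def decode_txt(data: bytes) -> Dict[str, str]:
    """Decode a TXT record produced by encode_txt back into a dictionary."""
    props = {}
    position = 0
    while position < len(data):
        length = data[position]
        entry = data[position + 1 : position + 1 + length].decode("utf-8")
        key, _, value = entry.partition("=")
        props[key] = value
        position += 1 + length
    return props


record = {
    "path": "/sparql",
    "metadata": "/service-metadata.rdf",
    "vocabs": "http://xmlns.com/foaf/0.1/",  # assumed single-URI value
}
wire = encode_txt(record)
round_tripped = decode_txt(wire)
```

The 255-byte limit per entry is one reason to keep the service record concise and to leave the fuller description to the document behind the ‘metadata’ property.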

As to what this service description metadata should contain, well that’s currently left to the provider. In an ideal world, using RDF and OWL to describe a service would mean that it is completely self-describing and that a consumer could fetch the document, fetch any referred ontologies, and figure out what it all means completely automatically. In reality, we at least need some existing vocabulary to refer to, even if the client can infer and interpret meaning using ontologies. For us, that vocabulary is a bespoke ontology which is currently being worked on, but which is good enough for our needs right now.

So I was heartened to be shown a glimpse of voID last year, which upon further reading looks to offer exactly the sort of service description vocabulary needed to complement our Bonjour SPARQL service type.  Another great thing about RDF (see, I’m such a zealot), is that an RDF document can easily use multiple vocabularies, and the instance data doesn’t even necessarily need to be linked together, so the service description document can happily accommodate the use of  more than one service description vocabulary or ontology without breaking anything.  We’ll see what happens, but my vote would currently be for voID to be the suggested minimum requirement.

Why?

There’s currently a list of SPARQL endpoints on the World Wide Web Consortium’s ESW wiki, along with a comment along the lines that the list will probably not get much bigger in the long run.  The comment itself is well reasoned and not necessarily meant to be a negative one, but along with similar comments from peers certainly makes us stand back and wonder at the usefulness of any kind of registry, or even a decentralized set of registries, of SPARQL services.

There’s no doubt that the project I’m working with needs this right now, so it’s certainly useful, but what about when, and if, things get much bigger?  A well known web technologist has published Bonjour service types for HTTP and HTTPS at dns-sd.org, but lately there doesn’t seem to be much take-up for this kind of facility, even if Apple’s Safari web browser has built in support with its Bonjour bookmarks feature.

On the other hand, Google does such a wonderful job, giving us a single endpoint to query the whole of the web, that it’s tempting to think that the semantic web will ultimately be sucked up in a similar fashion into a great big semantic data warehouse in the clouds, and there’ll be no need for anyone to offer up their own SPARQL endpoints.

My own expectation is that we’ll end up with something in between.  On the one side there will be one or a few massive Google like entry points which will, by spidering across the web, perhaps following the linked data trail, suck up much of the semantic web and present it to users in a simple, easy to consume and re-use entry point.  But just as when you’re searching for something specific, for instance buying a book, or trawling through your bank statement, you don’t start at Google, I think there are plenty of cases where you’d want to be able to automatically discover SPARQL endpoints which you can access directly and which offer you precisely the details you need, there and then.

The semantic web is different to the (syntactic?) web, in that it makes it much easier to take information from multiple sources and join it together into something new and more useful. Imagine if everyone’s desktop were part of the semantic web, with all the information from emails, photos, music etc. offered up through SPARQL services (take a look at the Nepomuk project for an existing implementation of this). Imagining the sorts of applications we could build on top of the Semantic Desktop is an exciting prospect, but it’s not something I’d expect most people would want to make available through a single Google style data warehouse, even if there were privacy safeguards in place, in the same way that most people don’t necessarily want to share all their photos on Flickr or Picasa.

Using Bonjour to discover SPARQL services means that a user can easily create or select a list of domains where their client applications will look for published endpoints, be they public or private, and given the right credentials they will be allowed to use those services to build the sorts of applications they want, no matter how esoteric: show me a list of all the photos of my son, sorted by the number of teeth he had, using my local photos as well as the photos taken by his grandparents; give me a list of washing machines, along with prices from these suppliers, where the washing machine is no wider than 60cm and has a Which Best Buy rating over 70%.

Of course, if we start making all this sort of information available through our own SPARQL services, then there are all sorts of issues around trust, privacy, provenance and accuracy, but at the very least we are in control of our own data.

Try It!

Using Bonjour to publish and discover SPARQL services is simplicity itself, and I invite you to take a look at www.floop.org.uk/eagle/discovering-sparql for some command-line examples.

While at the moment none of the SPARQL servers out there publish their details using Bonjour, I’ve created a simple application for Tomcat which can publish any of the services it runs using Bonjour, and have used it to publish Joseki based SPARQL services.  I’m also working on both a SPARQL endpoint and Bonjour publishing for Plone and Zope.  See http://www.floop.org.uk/projects for more details of these nascent projects.

What would be great is if the existing SPARQL servers and triple stores out there could add optional support for registering the details of any endpoints they make available, using Bonjour.  It would certainly help the uptake of SPARQL in the projects I’m involved with.

Alex Tucker is a self-employed semantic web consultant, specialising in defence.

Building coherence at bbc.co.uk

By Tom Scott and Michael Smethurst

| This article features in Nodalities Magazine, Issue 5

Telling (non-linear) stories

For the past 86 years the BBC has plied its trade as a storytelling organisation. In the world of linear broadcasting we’ve even gotten very good at it. Guiding the audience through complex news story lines, explaining the natural world and interleaving the narrative arcs and plotlines of drama have become our forte. But storytelling in a linear world is different from storytelling in the non-linear, hypertext world of the web.

With the exception of BBC News Online (news.bbc.co.uk) the online world has often been seen as a supporting adjunct to the linear broadcast world. Over the years we’ve commissioned and built sites to provide online support for programmes; but we’ve too often taken our linear storytelling expertise and attempted to replicate the same techniques on the web – with mixed success. Unlike linear broadcast storylines the web doesn’t provide people with a predicted and controlled linear journey. Instead we dip in and out of any given website — following different journeys — to find the information we want at that time.

Many of our programme support sites have been commissioned and developed in isolation. So you see an Archers site and an Eastenders site and a Top Gear site which are internally coherent but which fail to link up other than via editorially determined cross promotions. Want to see who presents Top Gear? No problem, we can do that. Want to see what else those people present? Sorry, can’t do that. By developing self-contained microsites the BBC has produced some good stuff but it has also been unable to reach its full potential because it hasn’t managed to join up all of its resources. By failing to link up the content (on both a data and a user experience level) the stuff we publish can never become greater than the sum of its parts. Without these links we can’t make bbc.co.uk a coherent experience. As a user, it’s very difficult to find everything the BBC has published about any given subject, nor can you easily navigate across BBC domains following a particular semantic thread. For example, you can’t yet navigate from a page about a musician to a page with all the programmes that have played that artist.

So how do you tell stories on a web scale? We could stick with the easy option and try to control ‘user journeys’ across the site. Provide links to where we think the user should go next. But that’s little better than those ‘roll a dice, go to page 30’ Dungeons and Dragons books we all had as kids. We had to recognise that non-linear storytelling puts the narrative arc into the hands of the user. What to read, what to click, where to go next is really up to you. So storylines split and merge, meta-narratives emerge and fracture; ‘user journeys’ slip out of (editorial) control.

All of this comes from the power of the link – back to basics. But we can only provide precisely targeted links at the user experience level if those links exist at a data level. And that’s the difficult part. The organic growth of our sites has been mirrored in the organic growth of our content and data management systems. We currently have a range of systems across the business for managing different bits of content throughout the production chain. And like our public facing sites none of these speak the same language or share the same identifiers. A typical episode of Top Gear might have 6 separate identifiers on its way from scriptwriter to airwaves to archive. Once you’ve solved this problem you hit the problem of multiple identifiers for James May and once you’ve got one canonical James May you’re back to the problem of multiple identifiers for all the other programmes he’s presented…
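
The identifier problem can be pictured as a lookup from each system’s local identifier to one shared, canonical identifier. The sketch below is only an illustration in Python: the system names, the local identifiers and the canonical identifiers are all made up, and it says nothing about how the mapping is actually maintained inside the BBC.

```python
# (system, local identifier) -> canonical identifier.
# Every system name and identifier here is invented for illustration.
ALIASES = {
    ("scripts", "TG/S11/E03"): "episode-0001",
    ("scheduling", "48213"): "episode-0001",
    ("archive", "LMA1234X"): "episode-0001",
    ("scheduling", "P-7781"): "person-0001",
    ("contracts", "MAY-J"): "person-0001",
}


def canonical(system: str, local_id: str):
    """Return the shared identifier for a local one, or None if unknown."""
    return ALIASES.get((system, local_id))


def same_thing(a: tuple, b: tuple) -> bool:
    """True when two (system, local_id) pairs resolve to one canonical id."""
    ca, cb = canonical(*a), canonical(*b)
    return ca is not None and ca == cb


print(same_thing(("scripts", "TG/S11/E03"), ("archive", "LMA1234X")))
```

Only once every system resolves to the same canonical identifier can a link made in one place be followed from another.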

Solving these problems makes for a more linked, more coherent bbc.co.uk. But an internally coherent bbc.co.uk isn’t enough. bbc.co.uk needs to be woven into the rest of the web, not merely on the web. It needs to be linked in to all those other Top Gear / James May pages out there… Luckily the tips, tricks and techniques pioneered by the Linked Data community give us some clues here.

Add into this mix the fact that there’s some data the BBC can never hope to provide. So we know when an artist is played on radio or TV. But we can’t hope to know when they were born, or where they were born, or which bands they’ve been in, or who they’re married to etc. If we want to tell stories around music all this is important data. And we can only get it by tapping into the collective knowledge of the web.

BBC in the web of data

I’d like to claim that when we set out to develop /programmes we had the warm embrace of the semantic web in mind. But that would be a lie. We were however building on very similar philosophical foundations.

In the work leading up to bbc.co.uk/programmes we were all too aware of the importance of persistent web identifiers, permanent URIs and the importance of links as a way to build meaning. To achieve all this we broke with BBC tradition by designing from the domain model up rather than the interface down. The domain model provided us with a set of objects (brands, series, episodes, versions, ondemands, broadcasts etc) and their sometimes tangled interrelationships.

We were also convinced that the value in programme websites lay not in the implicit metadata of the domain model but rather in the way this domain model overlapped and intersected with other domains. As ever the links are more important than the nodes because that’s where the context lives: programmes:segment music:track, programmes:segment food:recipe etc. In this way we could weave new ‘user journeys’ into and out of /programmes, into and out of bbc.co.uk. From archive episodes no longer available online, to a recipe page, to a chef, to another recipe and back to a recent episode. Using well targeted content specific links we could not only escape the dead end content silos that characterised bbc.co.uk but point users back to programmes that would hopefully inform, educate and of course entertain.

Finally we believed in the merits of opening our data and building on top of other people’s open data. When we looked to rebuild bbc.co.uk/music we looked at a number of commercial providers of music metadata. They all did a similar job to MusicBrainz (musicbrainz.org) – similar models, similar data quality etc. But choosing to go with a commercial provider would have precluded our ability to provide any kind of machine friendly (API if you must) views. The decision to publish JSON or vanilla XML or RDF would have been a decision to give the 3rd party business model away. So we went with the open alternative – an open, public domain provider, one that is more in keeping with our public service remit and one that represents better value for money for the license fee payer – which has to be a lesson to someone.

Without ever explicitly talking RDF we’d built a site that complied with Tim Berners-Lee’s four principles for Linked Data:

  • Use URIs as names for things. – CHECK
  • Use HTTP URIs so that people can look up those names. – CHECK
  • When someone looks up a URI, provide useful information. – Well, if we’re only talking HTML, RSS, ATOM, JSON etc. CHECK
  • Include links to other URIs so that they can discover more things. – Again if we’re talking HTML only CHECK

By keeping everything in its right place we’d also built a sane, maintainable, scalable, accessible site that search engines love and could be easily evolved to add new features and functionality. So to anyone considering how best to build websites we’d recommend you throw out the Photoshop and embrace Domain Driven Design and the Linked Data approach every time. Even if you never intend to publish RDF it just works.

Around this time we met by chance with some people from the Linking Open Data community and the two worlds collided. Obviously TBL wasn’t talking only HTML in the last 2 principles but aside from that the parallels were striking. We set about converting our programmes domain model into an RDF ontology which we’ve since published under a Creative Commons License (www.bbc.co.uk/ontologies/programmes/). Which took one person about a week. The trick here isn’t the RDF mapping – it’s having a well thought through and well expressed domain model. And if you’re serious about building web sites that’s something you need anyway. Using this ontology we began to add RDF views to /programmes (e.g. www.bbc.co.uk/programmes/b00f91wz.rdf). Again the work needed was minimal.
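
As a small illustration of what an RDF view means for a client, the Python sketch below derives the .rdf address from a page address, following the pattern of the example URL above, and prepares a request for it without sending anything. The function names are hypothetical, and the Accept header is an assumption about what a polite client might send, not a statement about how the BBC servers respond.

```python
from urllib.request import Request


def rdf_view_url(page_url: str) -> str:
    """Return the address of the RDF view for a /programmes page address."""
    # Follows the pattern of the example above: the same path plus ".rdf".
    return page_url.rstrip("/") + ".rdf"


def rdf_request(page_url: str) -> Request:
    """Build, but do not send, a request for the RDF view."""
    # The Accept header is an assumption made for this sketch.
    return Request(rdf_view_url(page_url),
                   headers={"Accept": "application/rdf+xml"})


req = rdf_request("http://www.bbc.co.uk/programmes/b00f91wz")
print(req.full_url)
```

Passing the prepared request to urllib.request.urlopen would fetch the document; that step is left out here so the sketch runs without a network connection.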

So for those considering the Linked Data approach we’d say that 95% of the work is work you should be doing just to build for the (non-semantic) web. Get the fundamentals right and the leap to the Semantic Web is really more of a hop.

Why bother with RDF?

For all the pages we’ve published we’ve only had a limited success at making this information available for others to use, to hack with and to build new services with. While we’ve not done a very good job of making bbc.co.uk a coherent experience for people, the situation is worse for machines.

It is our belief that rather than publishing proprietary APIs it is better to use the ubiquitous technologies of URIs and HTTP. This approach supports the generative nature of the Web, making it easy for third parties to build with BBC metadata without learning BBC specific APIs and at the same time providing the BBC and its users with immediate benefits.

Services like Flickr, Twitter and the like have in many, many ways followed the same principles we adopted for programmes and music — or if they didn’t then the end results look pretty similar — they are wonderful services. However, if as a third party developer you want to deal with the semantics, accessing the data via the Giant Global Graph to find everything about a certain person, place or topic and you wanted to include data from Flickr then you will need to deal with the specifics of Flickr. I suspect that it wouldn’t be that difficult for Flickr to add RDF representations – if they did then Flickr content would be part of a common way of doing things. We want BBC data to be part of a common way of doing things.

Our hope in making BBC data available as RDF is that we will make it as generative as possible – helping others to do interesting things with our data. The BBC has a public service remit, a remit that means it should look beyond its internal business needs to help create public value around useful technologies and around its content for others to benefit from. The longer term aim of this work is to not only expose BBC data but to ensure that it is contextually linked to the wider web. We have started along this path by linking to Wikipedia (DBpedia in the RDF view) and MusicBrainz from the artist pages but this could be extended for programmes and events.

A conference comes of age: a review of the 7th International Semantic Web Conference (ISWC2008)

| This post will feature in Nodalities Magazine, issue 5.

What are the factors that indicate a coming of age? An increased
self-awareness perhaps, or an acceptance and understanding of a broad
range of views, even if they contradict your own? If these factors do
indicate a certain maturity, then I would argue that the International
Semantic Web Conference series has come of age.

Last year’s event in Busan, Korea felt like a watershed moment, with
an increasing focus on practical applications that exploited Semantic
Web technologies, in addition to the highly theoretical papers
typically seen at events of this sort. This year’s conference in
Karlsruhe, Germany, and the seventh in the series overall, maintained
this momentum. But more so than previous years I detected a subtle
change in the mood of the conference. In addition to a tangible sense
of excitement that the Semantic Web was getting ready for the
mainstream, I detected a certain pluralism within the community,
manifested as a greater openness to divergent views and an increase in
attention to topics that might have previously been overlooked.

This willingness to express and accept divergent views was apparent to
me no more so than in the panel titled “An OWL too far?”. This
discussion saw senior members of the Semantic Web community openly
challenge each other’s views on the proposed second version of OWL, the
Web Ontology Language. Perhaps the views held by the likes of Stefan
Decker, Frank van Harmelen and Ian Horrocks have always been divergent
on this issue, but seeing the differences of opinion aired so openly
was a new experience for me. Far from indicating a damaging lack of
unity in the field, I read this as a clear sign that the community can
engage in open and constructive debate without throwing the toys out
of the pram.

Earlier in the week I had sat on a similarly provocatively titled
panel in the OWL Experiences and Directions workshop – titled “How
might OWL fail?”. As a relative outsider I decided to focus on the OWL
community’s need to improve its marketing and demonstrate its
relevance to the wider world, and expected a degree of hostility to
this message. Instead I sensed a slight deflation at the criticism
that was quickly followed by a desire to engage with the problem and
actively address it.

Perhaps the most powerful sign of how far the Semantic Web community
has come was in the entries to the annual Semantic Web Challenge. This
year the contest had two tracks: the Open Track, which is analogous to
the regular challenge in previous years and has a more established set
of judging criteria; and the Billion Triples Track, an attempt to
stimulate people to generate value from and add value to increasingly
large data sets, with the definition of what constitutes “value” being
more open-ended.

The quality in both tracks was exceptionally high, but one feature
that ran through most of the finalists struck me in particular – the
emphasis on the user experience. Previous challenges have always
attracted user-oriented applications as well as backend technologies,
but this year felt different. Whether the application was supporting
personal aggregation of one’s distributed information, as in the Open
Track winner Paggr; enabling location-oriented browsing of the
Semantic Web on a mobile phone, as in DBpedia Mobile, which took
second place in the Open Track; or providing structured browsing over
billions of RDF triples, as in SemaPlorer, winner of the Billion
Triples Track; the vast majority of entries recognised the need to
both add value to the data *and* provide a compelling user experience
over this.

For me this indicates not just an awareness but an acceptance on the
part of the Semantic Web community that no amount of research and
development at the backend will make a difference if clear user
benefits are not delivered. If this serves as evidence that the ISWC
series has come of age, then I would argue that along with it so has
the Semantic Web community at large. It may have taken some time, but
I have no doubt that this maturity has been earned.

LIBRIS – Linked Library Data

| This post will feature in Nodalities Magazine, Issue 5
By Anders Söderbäck and Martin Malmsten

LIBRIS is the Swedish National Union Catalogue, or, in other words, the main gateway for bibliographic data in Sweden. LIBRIS consists of roughly six million bibliographic records, 20 million library holdings records, and two hundred thousand authority records on authors, titles and subject headings (“Svenska ämnesord”). The LIBRIS system is used for cataloguing by about 170 library organisations. The participating libraries come mainly from the academic sector (i.e. university libraries), but it is also possible to find museums, archives and some public libraries. Last but not least LIBRIS is also the home of the Swedish National bibliography. LIBRIS is created co-operatively, but is hosted and maintained by the National Library of Sweden.

Earlier this year, LIBRIS was published as Linked Data on the web, exposing the entire library state with all its records, links and relations. As far as we know, this is the first union catalogue or national library catalogue to be published in its entirety as linked data. Not counting lcsh.info (which is great, but contains no bibliographic information) it is the first effort by a national library to actually be part of the semantic web. We made this effort using a “data first” strategy, focusing on availability rather than a perfect representation of the database. The reason for this strategy was simple: We did not know the perfect structure for a bibliographic web of data. Neither did we want to spend our time just thinking about this. Libraries are, in our opinion, very good at thinking and rethinking bibliographic data. Actual reworking and restructuring of library data unfortunately seems less common. For us, the best way to get to know a technology—whether MARC records or the Semantic Web—is to get our hands dirty and work with it. Progress comes, we firmly believe, by learning from mistakes and learning to adapt to new environments, not by staying safely at home making up the perfect travel plan.

Learning also comes from talking and discussing with other people, which is why we wanted to provide something to talk about as quickly as possible. Trying to move beyond the library community, we wanted to use ontologies that were not library specific. For this reason we used Dublin Core, SKOS, FOAF and Bibliontology. Where no existing ontology seemed applicable (holdings, frbr relations), we made up our own. Using the identifiers from our database (MARC field 001, for all you library people out there), we created cool HTTP URIs for bibliographic and authority records. Following the four rules of Linked data specified by Sir Tim Berners-Lee, we tried to provide anyone who looked up those URIs with as much useful information and as many links to other URIs as possible. For the most part, this meant links to other resources within the LIBRIS dataset. Since we already had a good bespoke mapping of our Swedish subject headings to the Library of Congress subject headings, we generated approximately twelve thousand links to lcsh.info. In keeping with our desire to move beyond the library community, we also included a handful of links to wikipedia and dbpedia. In a not too distant future, we plan to automatically extend the number of these links to about thirty thousand.
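
The step from a MARC 001 control number to an HTTP URI can be sketched in a few lines of Python. The base address and the two record kinds used here are placeholders chosen for the example; they are not the URI scheme LIBRIS actually publishes.

```python
from urllib.parse import quote

# Hypothetical base address, used only for this sketch.
BASE = "http://example.org/resource"

# Assumed record kinds: bibliographic and authority records.
KINDS = {"bib", "auth"}


def record_uri(kind: str, control_number: str) -> str:
    """Build a stable URI from a record kind and its MARC 001 value."""
    if kind not in KINDS:
        raise ValueError(f"unknown record kind: {kind!r}")
    # Strip padding and percent-encode anything unsafe in a path segment.
    return f"{BASE}/{kind}/{quote(control_number.strip(), safe='')}"


print(record_uri("bib", " 12345 "))
```

Because the URI is derived only from an identifier the catalogue already maintains, it stays stable for as long as the record does.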

All of the above is described in more detail in Making a Library Catalogue Part of the Semantic Web (Malmsten 2008). What is not described in this paper, however, is why we decided to depart into semantic web territory. One could of course attribute this departure to inborn curiosity. As librarians and developers we naturally want to explore new landscapes in the ecology of knowledge. But the reason why we started looking at linked data might just as well be more prosaic. In 2007 we made a major revision of our web OPAC. We did this using an open and collaborative methodology, focusing on user centred design as well as communication with just about anyone having an interest in what we were doing. During this process, we discovered that there was a huge interest in getting access to the data contained in the LIBRIS database in a machine readable way. This interest was not limited to the library community, but was expressed by a lot of people working outside the library space with no experience of library specific protocols such as z39.50 and MARC.

In 2008, having our data available in a machine-readable way feels just as natural to us as making our data available in a human readable interface. In building a library catalogue, our purpose and responsibility is to provide access to library resources. Limiting this access to paths that can only be walked by humans seems to be an unnecessary restriction, which certainly is not in the best interest of our users. The library catalogue might also be considered a resource in itself, and there is no reason why we should not make this resource available for the public just as we do with our books and other resources. Of course, this can be achieved without creating linked data on the semantic web. Why not, for example, use z39.50, SRU/W or OAI-PMH? The answer to this question is that we make our data available in those ways as well. We have also put up an HTTP-based web service which can output LIBRIS data as MARC-XML, Dublin Core, JSON, RIS or MODS. If we for some strange reason ran out of other things to do, we could probably spend the rest of our lives just building APIs. This situation called for a more rational approach.

A trouble with the above mentioned methods of data access is that they only provide information that is present in the original MARC records. No information is given about other linked resources. When looking up a certain resource in our OPAC, a human viewer is provided with a lot of information that is not present to the machine accessing individual records through a search/retrieve protocol. From this perspective, working with RDF and linked data is a lot more rational, since it makes it possible for us to expose the entire state of our library system in a web friendly way. Ed Summers, who is the person at the Library of Congress responsible for creating lcsh.info, describes this very eloquently in a blog post from the beginning of 2008. APIs, Ed writes, “…all differ in their implementation details and require you to digest their API documentation before you can do anything useful. Contrast this with the Web of Data which uses the ubiquitous technologies of URIs and HTTP plus the secret sauce of the RDF triple.” As an alternative to making open APIs for just about every aspect of our data, we hope that allowing our data to be crawled by human and machine alike will allow for new ways of discovery, as well as for people using our data in ways not even imagined by us. The Swedish Government has, as one of the objectives given to us as a National Library, stated that the LIBRIS systems shall be used as a broad information resource for the improvement of information management within research and higher education. Putting as few restrictions as possible on how our data can be used is, we feel, probably the best way to achieve this objective.

When it comes to the improvement of information management, the linking between different datasets made possible by linked data shows a lot of promise for cooperation and interoperability between libraries as well as for bridging the gap between libraries and other knowledge organizations. This is something we have only begun to explore. The aforementioned links to lcsh.info are useful for making inferences about relations not present in our system of subject headings from relations present in LCSH. The value of these kinds of inferences will increase exponentially the bigger the cloud of linked library data grows. While we at this time only have a few links to dbpedia, we hope that the upcoming addition of a substantial number of dbpedia links will provide interesting possibilities for us as well as for others. Libraries have, over the last few hundred years, collected huge amounts of what is usually very good data. Locked up in individual databases, this is good for retrieval of resources belonging to individual libraries. This, it needs to be stated, is not a bad thing! Publishing library databases in RDF but without links to other datasets might be good for individual libraries if only for the opportunity to use SPARQL, which is a query language we have fallen in love with and which we imagine might fulfill any librarian’s desire for good, exhaustive database querying.
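
The kind of inference meant here can be shown with a toy example in Python. Two local headings are linked to headings in another scheme; if that scheme records one as broader than the other, the same relation can be suggested between the local headings. All identifiers and relations below are invented for the sketch and are not taken from Svenska ämnesord or LCSH, and the function name is hypothetical.

```python
# Links from local headings to headings in the external scheme (invented).
LINKS = {
    "local:Hundar": "ext:Dogs",
    "local:Husdjur": "ext:Pets",
}

# "Broader than" relations recorded only in the external scheme (invented).
EXTERNAL_BROADER = {
    "ext:Dogs": {"ext:Pets"},
}


def inferred_broader(local_heading: str) -> set:
    """Suggest broader local headings by way of the external scheme."""
    reverse = {ext: local for local, ext in LINKS.items()}
    external = LINKS.get(local_heading)
    broader = EXTERNAL_BROADER.get(external, set())
    # Keep only external headings that link back to a local heading.
    return {reverse[b] for b in broader if b in reverse}


print(inferred_broader("local:Hundar"))
```

The more headings are linked, the more of these suggestions become available, which is why the value grows with the size of the cloud.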

Linked library data provides amazing opportunities for cooperation about, for example, authority data. This way of making use of each other’s intellectual efforts seems to us an effective way of improving the quality of the individual catalogue, while at the same time improving the quality of the web at large. Authority databases are amazing resources, and probably useful for other purposes than just improving search in the individual OPAC. Exposing library data might also be a good way to get feedback on our data from outside the library community. Visibility and open communication provide good ways of quality improvement, which has been proven again and again in science as well as in society. Libraries might be good data providers and a cloud of linked open library data might give rise to interesting perspectives, exciting new applications and better competition between developers, be they commercial system vendors, ad-funded search engines, or in-house library development teams. Such competition probably ends up benefiting our users, because in the end libraries are not about catalogues, bibliographic formats, databases or OPACS. Libraries are about openness, about dissemination and access to research and cultural heritage, about science, memory and democracy. For this reason, libraries need to stop worrying and embrace web standards. A web of data without library participation is a bad thing, not only for libraries but also for the web.

Enabling the Linked Data Ecosystem

| This post will feature in Nodalities Magazine, Issue 5

The Linked Data web might usefully be viewed as an incremental evolution beyond Web 2.0. Instead of disconnected silos of data accessible only through disconnected custom APIs, we have datasets that are deeply connected to one another using simple web links, allowing applications to “follow their nose” to find additional relevant data about a specific resource. Custom protocols and data formats are the realm of the early web; the future of the web is in an increased emphasis on standards like HTTP, URIs and RDF that ironically have been in use for many years.
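
A minimal sketch of what “following your nose” looks like in code, written in Python. A small in-memory dictionary stands in for the web here: looking up a URI returns a list of (property, linked URI) pairs. The URIs, property names and function name are hypothetical; a real client would dereference each URI over HTTP and parse the RDF it gets back.

```python
from collections import deque


def follow_your_nose(start, lookup, max_resources=10):
    """Breadth-first walk over links, returning URIs in the order visited."""
    visited, order = set(), []
    queue = deque([start])
    while queue and len(order) < max_resources:
        uri = queue.popleft()
        if uri in visited:
            continue
        visited.add(uri)
        order.append(uri)
        # Each description is a list of (property, linked URI) pairs.
        for _prop, linked in lookup(uri):
            if linked not in visited:
                queue.append(linked)
    return order


# Invented data standing in for documents fetched from the web.
WEB = {
    "ex:episode": [("ex:features", "ex:recipe")],
    "ex:recipe": [("ex:createdBy", "ex:chef")],
    "ex:chef": [("ex:created", "ex:recipe"), ("ex:created", "ex:recipe2")],
    "ex:recipe2": [],
}

print(follow_your_nose("ex:episode", lambda uri: WEB.get(uri, [])))
```

The max_resources limit matters in practice: on a well linked web there is always one more link to follow.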

Describing this as a “back to basics” approach wouldn’t be far wrong. Many might argue that RDF is far from simple, but this overlooks the elegance of its core model. Working within the constraints of standard technologies and the web architecture allows for a greater focus on the real drivers behind data publishing: what information do we want to share, and how is it modelled?

Answering those questions should be relatively easy for any organisation. All businesses have useful datasets that their customers and business partners might usefully access; and they have the domain expertise required to structure that data for online reuse. And, should any organisation want some additional creative input, the Linked Data community has also put together a shopping list [1] to highlight some specific datasets of interest. This list is worth reviewing alongside the Linked Data graph [2], to explore both the current state of the Linked Data web and the directions in which it is potentially going to grow.

Beyond the first questions of what and how to share data, there are other issues that need to be considered. These range from internal issues that organisations face in attempting to justify the sharing of data online, through to larger concerns that may impact the Linked Data ecosystem. For the purposes of this article, this ecosystem can be divided up into two main categories: data publishers, who publish and share information online; and data consumers, who make use of these rich datasets.

There is obvious overlap between these two categories: many organisations will fall into both camps, as do we all through our personal contributions to the web. However, for this paper I want to focus primarily on business and organisational participants, and attempt to illustrate the different issues that are relevant to these roles.

Data Publishers’ Perspective

The first issue facing any organisation is how to justify both the initial and ongoing effort required to support the publishing of Linked Data. Depending on existing infrastructure this may range from a relatively small effort to a major engineering task—particularly true if content has to be converted from other formats or new workflows introduced. In “A Call to Arms” in the last issue of Nodalities [3], John Sheridan and Jeni Tennison provided some insight into how to address the technology hurdle by using technologies like RDFa.

But can this effort be made sustainable? Can the initial investment and ongoing costs be recouped? And, if a dataset becomes popular and grows to become very heavily used, can the infrastructure supporting the data publishing scale to match?

The general aim with enabling access to data is that it will foster network effects, and drive increasing traffic and usage towards existing products and services. There are success stories aplenty (Amazon, Ebay, Salesforce, etc) that illustrate that there is real and not imagined potential.

But this justification overlooks some important distinctions. Firstly, for some organisations, e.g. charities and non-governmental organisations, information dissemination is part of their mission and there may not be other chargeable services to which additional traffic may be driven. In this scenario everything must be sustainable from the outset. Secondly, it also overlooks the fact that the data being shared may itself be an asset that can be commoditised. The value of access to raw data, stripped of any bundling application, has never been clearer, or been easier to achieve. New business models are likely to arise around direct access to quality data sources. Simple usage-based models are already prevalent on a number of Web 2.0 services and APIs: the free basic access fosters network effects, while the tiered pricing provides more reliable revenue for the data publisher.

Software as a service and cloud computing models undoubtedly have a role to play in addressing the sustainability and scaling issue, allowing data publishers to build out a publishing infrastructure that will support these operations without significant capital investments. But few of the existing services are really firmly targeted at this particular niche: while computing power and storage are increasingly readily available, support for Linked Data publishing or metered access to resources are not yet common-place.

This is where Talis and the Talis Platform have a distinct offering: by supporting organisations in their initial exploration of Linked Data publishing, with a minimum of initial investment, and a scalable, standards-based infrastructure, it becomes much easier to justify dipping a toe into the “Blue Ocean” (see Nodalities issue 2 [5]).

Data Consumers’ Perspective

Let’s turn now to another aspect of the Linked Data ecosystem, and consider the data consumers’ perspective.

One issue that quickly becomes apparent when integrating an application with a web service or Linked Dataset is the need to move beyond simple “on the fly” data requests, e.g. to compose (“mash-up”) and view data sources in the browser, towards polling and harvesting increasingly large chunks of a Linked Dataset.

What drives this requirement? In part it is a natural consequence and benefit of the close linking of resources: links can be mined to find additional relevant metadata that can be used to enrich an application. The way that the data is exposed, e.g. as inter-related resources, is unlikely always to match the needs of the application developer, who must harvest the data in order to index, process and analyse it so that it best fits the use cases of her application.
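The harvesting described here can be sketched as a breadth-first walk over links. The example below is only an illustration: the URIs are hypothetical, and an in-memory dictionary stands in for real HTTP lookups so that the sketch runs on its own.

```python
from collections import deque

# A toy stand-in for the Linked Data web (hypothetical URIs): each URI maps to
# the description a lookup would return, including links to other resources.
WEB = {
    "http://example.org/book/1": {
        "title": "Book One",
        "links": ["http://example.org/author/a"],
    },
    "http://example.org/author/a": {
        "name": "Author A",
        "links": ["http://example.org/book/1", "http://example.org/book/2"],
    },
    "http://example.org/book/2": {"title": "Book Two", "links": []},
}


def harvest(seed, fetch, max_resources=100):
    """Breadth-first harvest of the resources reachable from a seed URI.

    `fetch` is any callable that returns a description for a URI, or None
    if the URI cannot be resolved.
    """
    seen = {seed}
    queue = deque([seed])
    store = {}
    while queue and len(store) < max_resources:
        uri = queue.popleft()
        description = fetch(uri)
        if description is None:
            continue  # dangling link: nothing to store or follow
        store[uri] = description
        for target in description.get("links", []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return store


# Harvest everything reachable from one book into a local store that the
# application can then index and analyse in whatever shape suits it.
local_copy = harvest("http://example.org/book/1", WEB.get)
```

A real harvester would also need politeness delays, error handling and revisit policies, which is part of why building one is costly for each individual consumer.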

Creating an efficient web-crawling infrastructure is not an easy task, particularly as the Linked Data web continues to grow and the pool of available data expands. Technologies like SPARQL do go some way towards mitigating these issues, as a query language allows for more flexibility in extracting data. However, provision of a stable SPARQL endpoint may be beyond the reach of smaller data publishers, particularly those who are adopting the RDFa approach of instrumenting existing applications with embedded data. Nor does SPARQL address the need to analyse datasets, e.g. to mine the graph in order to generate recommendations, analyse social networks, etc.
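As an illustration of the flexibility a query language offers, the sketch below builds a SPARQL Protocol GET request that asks for a handful of titles in a single round trip, rather than crawling resource by resource. The endpoint address is hypothetical and the query is only an example; no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, used purely for illustration.
ENDPOINT = "http://example.org/sparql"

# Ask for ten resources together with their Dublin Core titles.
QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?book ?title
WHERE { ?book dc:title ?title }
LIMIT 10
"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build the URL for a SPARQL Protocol query sent via HTTP GET.

    The protocol carries the query text in a URL-encoded `query` parameter.
    """
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
```

Fetching `url` from a live endpoint would return only the bindings for `?book` and `?title`, which is the kind of selective extraction that on-the-fly link following cannot easily provide.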

Just as few applications carry out large scale crawling of the web, instead relying on services from a small number of large search engines, it seems reasonable to assume that the Linked Data web will similarly organise around some “true” semantic web search engines that provide data harvesting and acquisition services to machines rather than human users. Issues of trust will also need to be addressed within this community as the Linked Data web matures and becomes an increasingly attractive target for spam and other malicious uses. Inaccuracies and inconsistencies are already showing up.

The Talis Platform aims to address these issues by ultimately providing application developers with ready access to Linked Datasets, avoiding the need for individual users and organisations to repeatedly crawl the web. Value-added services can then be offered across these data sources, allowing features such as graph analysis (e.g. recommendations) to become commodity services available to all. The intention is not to mirror or aggregate the whole Linked Data web, which would be unfeasible, but rather to collate those datasets that are of most value and use to the community, and to shepherd the publishing of new datasets by working closely with data publishers.
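To give a flavour of what graph analysis for recommendations can mean, here is a minimal sketch that mines a small set of statements for "people who liked this also liked" suggestions. It is not a description of how the Talis Platform works; the data, the `likes` predicate and the scoring rule are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical (subject, predicate, object) statements.
TRIPLES = [
    ("alice", "likes", "book1"),
    ("alice", "likes", "book2"),
    ("bob", "likes", "book1"),
    ("bob", "likes", "book3"),
    ("carol", "likes", "book1"),
    ("carol", "likes", "book3"),
]


def recommend(person, triples, predicate="likes"):
    """Suggest items liked by people who share at least one item with `person`.

    Items are ranked by how many such people like them.
    """
    # Group the liked items by person.
    liked = {}
    for subject, pred, obj in triples:
        if pred == predicate:
            liked.setdefault(subject, set()).add(obj)

    mine = liked.get(person, set())
    scores = Counter()
    for other, items in liked.items():
        if other == person or not (mine & items):
            continue  # skip the person themselves and people with nothing in common
        for item in items - mine:
            scores[item] += 1
    return [item for item, _ in scores.most_common()]
```

With the sample data, `recommend("alice", TRIPLES)` returns `["book3"]`, because both of the people who share `book1` with her also like `book3`.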

As an intermediary, the Talis Platform can also address another issue: that of scaling service infrastructure to meet the requirements of data consumers without requiring data publishers to do likewise. It seems likely that data publishers may ultimately choose to “multi-home” their datasets, e.g. publishing directly onto the Linked Data web and also within environments such as the Talis Platform in order to allow consumers more choice in the method of data access.

Conclusions

The bootstrapping phase of the Linked Data web is now behind us. As a community, we need to begin considering the next steps, especially as the available data continues to grow. This article has attempted to illustrate a few of the wide range of issues that we face. While technology development, particularly around key standards like SPARQL, rules and inferencing, and the creation of core vocabularies, will always underpin the growth of the semantic web, it is increasingly issues such as serviceable infrastructure and sustainable business models that will come to the fore.

At Talis we are thinking carefully about the role we might play in addressing those issues and playing our part in enabling the Linked Data ecosystem to flourish.

[1]. http://community.linkeddata.org/MediaWiki/index.php?ShoppingList
[2]. http://richard.cyganiak.de/2007/10/lod/
[3]. http://www.talis.com/nodalities/pdf/nodalities_issue4.pdf
[4]. http://labs.google.com/papers/bigtable.html
[5]. http://www.talis.com/nodalities/pdf/nodalities_issue2.pdf