Nodalities

From Semantic Web to Web of Data

Archive for the 'Nodalities Magazine' Category

Sharing Data on the Web

| This article will appear in Nodalities Magazine, Issue 9.

by Kaitlin Thaney
Program Manager of Science Commons, Creative Commons

In the emerging data web, multiple efforts are working towards the same broad goal of data sharing (e.g., the NeuroCommons, Linked Open Data, and the efforts of the World Wide Web Consortium), but they are still unevenly distributed. Our understanding of the legal, social and technical issues is increasing, but it is still at a very early stage.

This past fall at the International Semantic Web Conference in Chantilly, VA, USA, I joined three other leading minds to lead a tutorial examining some of the legal and social frameworks for sharing data in the emerging data web, focusing on an overview of the need for access, the social issues of applying Free-Libre/Open Source (FLOSS) licenses to data, and the approach we advocate at Creative Commons to help navigate this complex space — converging on the public domain.

Lessons Learned

Creative Commons as an organisation works to make knowledge sharing easy, legal and scalable – with applications in the culture space (music, text, film, art), education (open educational resources, virtual textbooks), and science (biological materials transfer, data sharing, Open Access, semantic web, patents). We maintain an integrated approach, and craft policy and legal tools to lower the barriers to knowledge sharing.

When it comes to data sharing, first and foremost, the information needs to be legally and technically accessible. The Open Access movement has increased awareness of this, using the Creative Commons licensing suite to unlock content, and has seen its share of qualified success. But what to do when the information you want to share and reuse falls outside the protections of copyright?

In short, it’s complicated.

This is where the discussion of legal protections for data gets murky. Knowledge is not always copyrightable – it may be easy to discern the rights associated with journal articles, but what about data, ontologies, annotations, or research statements described in triples?

The emergence, adoption, and use of the free-libre/open licensing regimes has allowed for remix and reuse of software code, music, film, educational resources and scientific research in a way that otherwise would be difficult to achieve.

The success of these licensing approaches has changed the social ethos of licensing: instead of the traditional “all rights reserved” model, licenses are used to make something more free, rather than less.

But from our research, this approach is not ideal for data. The trend towards applying licenses, click-wrap agreements and other sorts of restrictions on scientific data is increasing, but with the undesired consequence of limiting the downstream use of this information, and even at times blocking interoperability. The costs are high, the terms are not always clear, and the protections are not always legally sound, making it very difficult to scale for scientific uses. The result is a high barrier to entry for meaningful analysis, annotation, search, etc. on the mass of currently available data, which continues to grow exponentially, and for integrating that data with the available literature.

We advocate an approach of converging on the public domain, and requesting behaviours often found in the various flavours of free and open licensing through norms – not a legal construct. But first, let’s take a look at some of the issues to be aware of and their social implications to furthering the goal of linked open data.

Attribution v. Citation

Under US Copyright law, “Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.” Since facts are not covered by copyright, attribution – a license obligation – doesn’t seem to apply to ideas or facts either, since those rights are conditional on compliance with terms of the license.

Socially, the scholarly concept of citation is fairly well understood – credit where credit is due. It has long been viewed as an entrenched norm of good scientific practice.

But when it comes to the legalities of both terms and how to enact this behaviour, the devil is in the details, and the two are actually rather different when it comes to enforceability and applications / ramifications in the digital world.

In a copyright license, the word “attribution” is a legal requirement, whereas citation evokes more of a club mentality and social practice. Citation in its sole form is not assured or enforceable in the same way, but that’s not necessarily a downside. Ask yourself this: which one is more important – legal enforcement or credit enforced through professional reputation? Attribution – a relatively narrow legal term that can affect interoperability while at the same time possibly failing to provide what you really want? Or citation – an entrenched scientific norm that asks for credit where credit is due?

Implications of FLOSS toggles and directives on data sharing

These issues emerge when instead of focusing on maximizing interoperability of resources, one applies a property metaphor to data. And in the digital world, that tendency can have quite limiting ramifications to future use of the information, as technology continues to outpace the social components to data sharing.

Misunderstanding the legalities can lead to category errors on the social level, including unintentional infringement or on the other side of the spectrum, choosing not to use the resource for fear of infringement. The intentions are often good – believing that applying a less-restrictive copyright license is ensuring the data can be freely shared, reused, and built upon. But without existing precedent or involving a legal team, these issues make for a problematic area to navigate, creating additional confusion and burdens for the users, as well as data providers.

Let’s look at a few examples to gain a better understanding.

Non-Commercial – When used in the context of data, what is a commercial use of the data web? Is it the extraction of a subset, a query that may touch on the data set, hyperlinking?

Attribution – As detailed above, the definitions of attribution and citation are often conflated. Attribution speaks to the legal requirement triggered by the use of the work. But in the case of linked open data, if one were to run a query involving 30,000 data sources (something that is happening every day at an ever decreasing cost), would they then be required to attribute the contributors for all 30,000 databases? You can see how this unintended consequence of attribution stacking could impose a very daunting task for the researcher.

Share-Alike – This toggle specifies that any derivative product be relicensed under the same terms. In the example above of running a large query, all it would take would be one database licensed with a share-alike provision for the entire derivative work to then be under the same terms and no other license. This leads to compatibility issues.
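
To make the stacking problem concrete, here is a minimal sketch in Python. It is not a legal tool, and its rules are deliberately simplified. The license labels ("CC0", "BY", "BY-SA") and the names `Source` and `obligations` are illustrative assumptions: any label containing "BY" is treated as requiring attribution, and any label containing "SA" as imposing share-alike.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    license: str  # illustrative label only, e.g. "CC0", "BY", "BY-SA"


def obligations(sources):
    """Summarise what a result combining all `sources` would inherit."""
    # Every attribution-bearing source adds one more credit to track.
    must_attribute = [s.name for s in sources if "BY" in s.license]
    # Each distinct share-alike license demands relicensing under itself.
    share_alike = sorted({s.license for s in sources if "SA" in s.license})
    return {
        "attributions_required": len(must_attribute),
        "share_alike_terms": share_alike,
        # Two different share-alike terms cannot both be satisfied.
        "share_alike_conflict": len(share_alike) > 1,
    }


# The 30,000-source query from the text, plus a single share-alike database.
sources = [Source(f"db{i}", "BY") for i in range(30000)]
sources.append(Source("db-sa", "BY-SA"))
summary = obligations(sources)
```

Under these assumptions, `summary` reports 30,001 required attributions and one share-alike term that now governs the whole result. With only "CC0" sources, both lists would be empty.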

There are other external mechanisms and limitations imposed by various jurisdictions and countries that can have a profound effect on data-sharing, especially in terms of international data sharing efforts. These include the sui generis database directive in the European Union, Crown Copyright, “sweat of the brow” and “industrious collection” limitations, trade secrets and unfair competition laws, adding another dimension of complexity to an already complex arena.

After convening a series of meetings, roundtables and other discussions with members of the scientific community, the need emerged for a legally accurate and simple solution, that reduced and/or eliminated the need for one to make the distinction of what’s protected. The conflict between understanding the legal issues and complexities can best be resolved by a two-fold approach: (1) a reconstruction of the public domain and (2) the use of scientific norms to request behaviour through a non-license means.

Converging on the Public Domain (+ Norms)

We believe that the public domain is the best means to achieve maximum interoperability of data with the lowest imposed burdens on the user. This can be achieved through the use of a legal tool – either the Creative Commons CC0 Waiver or the Public Domain Dedication and License (PDDL) – waiving all intellectual property rights and asserting that the provider makes no claims on the data. These tools put the work as closely into the public domain as possible.

This approach calls for data providers to waive all rights necessary for data extraction and re-use (i.e., copyright, sui generis database rights, claims of unfair competition, implied contracts). It also requires that the provider place no additional obligations such as copyleft or share-alike on the information, which could limit downstream use, as discussed above.

Science Commons also crafted the Protocol for Implementing Open Access Data – a protocol for evaluating database terms of use, in hopes of providing a unified framework for users to evaluate if any given database may be integrated with any other database.

The Protocol recommends one request behaviour, such as citation, through norms and terms of use rather than as a legal requirement based on copyright or contracts.

We are aware that different disciplines and jurisdictions call for different approaches, and this is not always a one-size-fits-all solution. By requesting behaviour through norms and terms of use rather than a legal construct, various scientific disciplines have the ability to develop their own norms for citation, allowing for legal certainty without constraining one community to the norms of another.

Final Thoughts

In the early days of the World Wide Web, there weren’t many free-libre licenses available, and after a debate over using GPL for the original web code, CERN chose to put it into the public domain. Getting the law out of the way was key to allow for network effects, and to the success of the Web.

Converge on the public domain and ensure the freedom to integrate. It’s the most scalable solution.

This work is licensed under a Creative Commons Attribution 3.0 License.

DataIncubator: What Is It and What's In It?

by Leigh Dodds

| This article first appeared in Nodalities Magazine, Issue 8.

The Linking Open Data project has had a huge amount of success in bootstrapping the burgeoning Linked Data cloud. There’s now a definite sense of momentum behind the project, and a growing number of organisations are now seriously investigating how their data could further enrich the growing Semantic Web, and how the underlying technologies may help them to innovate and explore new opportunities.

The Linked Data community has rightly begun to look at the next round of challenges: What can we do with all this data? How can it be pressed into service to create new applications? What kinds of frameworks do we need to support consumption of Linked Data? But it is important that we shouldn’t lose sight of the fact that there’s still a huge amount of evangelism to be done and a great deal of data that could and should be part of the web of data. The Linked Data landscape is still not fully mapped out. In short, we need to keep up the process of accumulating, converting, publishing and linking data in as many different subject areas and disciplines as possible.

To date, the bootstrapping process has been supported by a number of community-led projects that convert and re-publish datasets to bring them into the web of data. The recently founded DataIncubator project (http://dataincubator.org) aims to adopt this same “show don’t tell” approach, but with the addition of some best practices and with an eye on long term sustainability.

Sustainability, Repeatability, Reusability

A key goal of the project is to lightly formalise the way these dataset conversions are carried out to make sure they are sustainable, repeatable, and reusable. But why are these particular aspects important?

Firstly, let’s consider sustainability. As usage of the Linked Data cloud grows, we need to make sure that new data being added isn’t going to disappear later, e.g. because a small project website goes offline or because the original project owner loses interest. It is critical that, as serious applications begin to be built against this data, consumers can rely on it. One of the primary ways the project is ensuring sustainability is through making use of the Talis Connected Commons scheme (http://www.talis.com/cc). All of the public domain datasets that are converted and published through the DataIncubator project site are being hosted in the Talis Platform. This takes full advantage of the free data hosting offered under the Connected Commons initiative. Talis is therefore contributing to the sustainability of that data.

The second aspect to consider is repeatability. The first goal is to make sure that the data conversion process is itself repeatable; that is, we can easily re-generate the data to allow for modelling changes, bug fixes, and the ingesting of new data. And not just now when a project is active, but in three years’ time when the project may be picked up and extended by a number of other contributors. Ensuring that each of the incubated datasets is supported by open source code makes this more achievable. Ideally, the original dataset owners will be convinced by the benefits long before a project goes stale, but it’s important to recognise that evangelism can take time and that different industries move at different speeds. There are already a few Linked Data and RDF projects on the web that model and re-publish the same basic dataset in other ways. By trying to build a community around curating the conversion of a dataset and not just the data itself, DataIncubator hopes to avoid these issues.

The final aspect is one that is often overlooked: how can the original dataset owner build on what the community has created? How can the community’s efforts be reused? Reusability is enabled by ensuring that the conversion code is open source and that schemas and modelling design decisions are well documented. This can lower the barrier to entry facing data providers or publishers looking to embrace Semantic Web technology. This is the case particularly where the data conversion is acting on source data (e.g. open, but not linked data). In this case, the data owner may merely need to re-run the data conversion and publish the Linked Data through their own site rather than DataIncubator. This makes adoption much, much easier.

Community Norms

Alongside addressing these procedural aspects of the data conversion process, the DataIncubator project also encourages a number of useful community norms that will hopefully improve the quality of the converted datasets.

The first of these is to ensure that there is a sufficient amount of both linking and attribution. Every dataset within the umbrella project should reference its original sources. This should not take place just at a high level, such as within the corresponding voiD description (http://rdfs.org/ns/void/). Instead, references should be deeper so resources can be associated with, for example, the original web pages that describe them. This ensures that there is a clear path back to the original source of the data. Attribution, in various forms, is an important community norm in its own right, but it is especially important in the context of converting and re-publishing an existing dataset. We want to ensure that the original curators of the data don’t think that the community is trying to appropriate or steal their work. Quite the opposite, we want them to embrace it.

The other norm relates once again to sustainability. Links to the data should be stable, but how do we achieve this if the data will ultimately be removed from the DataIncubator site and moved to another domain? The proposal here is that as data is migrated to its permanent home, redirects will be put into place to ensure that web browsers and semantic web agents can follow the links to their primary source. Every effort will be taken to ensure that links don’t break.
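
The redirect proposal can be sketched as a lookup from an incubated URI prefix to its permanent home. This is a minimal illustration rather than the project's actual mechanism: the URIs in `MOVED` and the `resolve` function are hypothetical, and the choice of HTTP status 301 (Moved Permanently) is an assumption about how such a redirect might be served.

```python
# Hypothetical mapping: incubated URI prefix -> permanent home of the data.
MOVED = {
    "http://dataincubator.org/example/": "http://data.example.org/",
}


def resolve(uri, moved=MOVED):
    """Return (status, location) for a request to `uri`.

    A URI under a migrated prefix gets a permanent redirect to the same
    path on the new domain; anything else is served where it is.
    """
    for old_prefix, new_prefix in moved.items():
        if uri.startswith(old_prefix):
            return 301, new_prefix + uri[len(old_prefix):]
    return 200, uri


status, location = resolve("http://dataincubator.org/example/thing/1")
```

Because the path after the prefix is preserved, every resource keeps a one-to-one successor, so existing links from other datasets continue to lead somewhere meaningful.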

What’s In It?

The DataIncubator project already has a wide range of datasets available:

There’s a lot more that could yet be added to this list. My personal wishlist includes a conversion of the Prelinger Archives (http://www.archive.org/details/prelinger). This is hosted as part of the Internet Archive project and consists of over 2000 industrial, educational, travel, and propaganda videos published from 1903 to the 1970s. The content is completely within the public domain, so it’s just begging to be converted. It would also be a great dataset on which to explore the modelling of media and media annotations in general.

Currently, one domain with very little Linked Data is gaming, in all of its forms. For example there is a vast amount of community curated data about Lego, Lego sets, and Lego models. And what about all of the facts and figures that are routinely collected around online gaming? Data might be available through specific community websites, but what could be built if the data were more open, allowing the community to analyse and re-present this data in new ways?

It strikes me that games and gaming is an area that is ripe for exploration. There are many interesting dimensions to the data, and the communities are very engaged. Many gamers are typically very interested in statistics and data about the games they play. This is just one area of the Linked Data landscape that the DataIncubator project is hoping to help explore.

Trends and Barriers

| This article first appeared in Nodalities Magazine, Issue 7.

If you follow the Nodalities blog, you may have read some of my recent posts discussing the trends boiling up around Web 3.0 (other buzzwords are available). The Mobile Web and upgraded connectivity in general; the rise of ubiquitous computing from chips in every product imaginable; Linked Data and the “Semantic Web” as an organising platform for this rising tide of data: these are three very broad trends seeing a lot of media attention presently. From where I’m standing, I tend to see the next great turning point of the Web as a convergence of some of these trends, and see it as a rise in the importance of and reliance upon data itself and data tools generally.

The mobile web is bringing new sorts of information to people, and they can make use of this info wherever they happen to be because of advances in devices and connectivity. As phones and web-enabled devices get better, so too do the chips we seem to have embedded all over the place, and we can now begin to have a clearer picture of what we do through the information we gather from our heaters, cars, and pedometers. Also, as more objects become connected, the grunt-work of number-crunching and storage is becoming commoditised into big, efficient, utility-like cloud services, which host and work with our collected information much more effectively than the gadget in your hand could ever hope to do. Others, like us here at Talis, talk about the Semantic Web, which allows for an evolution from a bunch of connected documents to the explicit connections between bits of information.

Also fermenting in this mix is a strengthening trend of political transparency and a public, shared ownership of social data. Barack Obama’s new administration has clearly made this a priority with the launch and work around data.gov; and in the UK, Sir Tim Berners-Lee himself has been appointed to a Parliamentary advisory role. There is growing pressure to be able to have access to public data, and to see it as belonging to the nation’s people rather than being legitimately filed away in the great, locked bureau of the capitols.

So, picking up two fairly obvious trends here, Social, Public Data and Linked Data, it would seem to follow that people would begin to have access to previously unavailable information in usable, linked forms. And it’s certainly beginning, as articles elsewhere in this magazine have illustrated. But, what about other chunks of public data? What about when data comes from universities, institutions, scientific foundations and NGOs? What about charities monitoring crime, CO2 emissions and family histories? Wouldn’t these make a useful piece in the web of social data? What resources have the governments themselves got, if they want to make their public-owned data available in a useful format?

These questions form a major part of the thinking behind Talis’ Connected Commons initiative (talis.com/cc). Basically, Talis has made its Semantic Web platform (including data hosting and access tools) available free of charge for any datasets made available to the public. In doing so, we’re hoping to remove the barrier of cost entirely to publishing interesting data in a Linked Data way. One major reason for this is to promote reuse and mashups of this interesting data, and for people to be able to “follow their noses” to the data that completes their projects. But, from a publisher’s perspective, this is important because it removes a major reason not to bother with making data useful as well as public. So, with this, data can be made public and usable, and developers and users get the benefit of public SPARQL endpoints and API access to interesting data.

To keep the data open and public, datasets need to make use of either the Public Domain Dedication and License (PDDL) or Creative Commons’ CC0 waiver. Ian Davis, in his article in this magazine, explains more about waivers and the Connected Commons, and there is a lot more about this particular initiative over on the Talis site (talis.com/platform/cc/faqs/).

In a recent interview with the BBC, Sir Tim said: “This is our data. This is our taxpayers’ money which has created this data, so I would like to be able to see it, please.” I wonder if initiatives such as Connected Commons will begin to remove excuses, hindrances, and obstacles? As public awareness of the importance of access grows, this might become a political issue, as well as a pragmatic one. I hope that in the rush to publish data, and in the ensuing discussion and debate, the users, hackers and developers don’t get sidelined. I think the world is ready for its data back.

The Greatest Challenge Facing IT

by Lee Feigenbaum and Mike Cataldo

| This article features in Nodalities Magazine, Issue 7.

As the old adage goes: Time is money.

Ultimately, information systems are about saving time. One could argue that technology enables analysis that facilitates competitive differentiation or improved product quality, but the fact of the matter is that these things and others could all be done without computers; they would just take much, much longer.

A lot has been said and written about information overload. Ultimately, though, the issue with ever-expanding data is that the data we need becomes hidden in mountains of other data. Typically, these mountains take the form of relational databases where the data is neatly stored in rows and columns, and we find the data in one of two ways. Either we directly look up data by its “address” within the database, or else we use a simple text search. But if we don’t know what table or column the data resides in, we can’t look it up. And as the quantity of data grows, text searching the mountain of data itself yields a mountain of results. Combing through these results then compromises the real benefit of information technology: time savings.

This leads to the greatest challenge facing IT organisations across industries: how to provide users the data they need when they need it, visualised in a way that is understandable and useful. Or put more simply: get the right data, for the right people, at the right time. Traditionally, this is much easier said than done, as the data lives in multiple databases, exists in various formats, and no user interface exists to present the information in a way that is helpful to the user.

Typically, the approach to solving these problems involves some sort of data warehouse. Atop the warehouse, we’d probably deploy a business intelligence (BI) solution to surface the answers to common queries to the people who need them.

Another tactic might be to install a document management system that stores documents in a central repository, where employees can use search and basic metadata to better locate individual pieces of information.

Or we might build a portal to allow people to view the right data from multiple silos in a timely fashion. By defining a collection of portlets as views into specific sources of data, we can provide a one-stop location for people to view information from business-critical data sources.

Pursuing any of these typical solutions means spending 6-18 months at a time solving a single problem. And even worse, all of these approaches are doomed to obsolescence from the start. As requirements change, the fixed schemas and the complex ETL processes inherent to data warehouses must be recreated from scratch. The canned queries and views that define BI- and portal-based approaches must be constantly re-evaluated. And the limited search and query capabilities of a document management system mean that new requirements demand a new installation.

In short, traditional approaches all suffer from the dreaded Shampoo Syndrome: the only workable long-term solution is to constantly lather, rinse, and repeat. And when we do, we just create another mountain of data, another place where what we really need can hide.

The solution is to find data by its meaning rather than its location

The key to eliminating many of the inefficiencies of today’s information technology solutions is to access data by its meaning—what it is—rather than its location—where it is. With meaning, we can quickly find what we need simply by describing what it is. This enables information to be shared and consumed at the data level, a paradigm known as data collaboration.

With data collaboration, the data is much more granular, more accessible, and more consumable. In contrast, data warehouse, BI, and portal solutions, in addition to contact tracking (CRM), supply-chain management (SCM), employee management (HR), and all-in-one enterprise bundles (ERP), all fall into the category of data containment. While these applications (commonly known as data silos) excel in capturing extremely structured data, they make it almost impossible to get the data out to be re-used by other users and in other applications.

Document management systems, on the other hand, attempt to make information more shareable, but essentially end up creating many mini-silos in the form of Word documents, PDFs, Excel spreadsheets, or Web pages. This is the world of document collaboration, in which information is readily shared, but the data we need is locked within the mini-silo.

Data collaboration is the best of both worlds. By combining the ease of access to information that is the hallmark of document collaboration with the highly structured nature of data from data containment solutions, we can begin to answer the IT challenge. The key to success is to ensure that the meaning of every data element is surfaced so that it can be easily accessed by any person or application that needs it.

Data Collaboration and the Semantic Web

It’s no coincidence that the technology standards developed over the past ten years in support of Tim Berners-Lee’s vision of a Semantic Web are the key elements for building data collaboration solutions. For as with data collaboration, the Semantic Web relies on explicitly capturing the meaning of data. As such, the core Semantic Web standards pave the way for:

  • Flexible, define-as-it-arrives, data structures
  • Explicit relationships that travel with the data
  • Data that is accessed by its definition rather than its address
  • Distributed query
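
As a rough illustration of the first three points, the sketch below stores data as subject, predicate, object triples in plain Python and retrieves it by description rather than by table or column. The `ex:` identifiers and the `match` helper are invented for this example; a real deployment would use an RDF store and SPARQL rather than a list of tuples.

```python
# Each fact is a (subject, predicate, object) triple; there is no fixed schema.
TRIPLES = [
    ("ex:order42", "rdf:type", "ex:Order"),
    ("ex:order42", "ex:customer", "ex:acme"),
    ("ex:acme", "rdf:type", "ex:Customer"),
    ("ex:acme", "ex:region", "ex:EMEA"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Find data by what it is ("anything that is a Customer"), not where it is.
customers = [s for s, _, _ in match(TRIPLES, p="rdf:type", o="ex:Customer")]

# A new kind of fact can arrive later without any schema migration.
TRIPLES.append(("ex:acme", "ex:accountManager", "ex:jsmith"))
about_acme = match(TRIPLES, s="ex:acme")
```

The relationship names travel with the data itself, which is why the appended `ex:accountManager` fact becomes queryable immediately.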

As with all standards, Semantic Web technologies lay the groundwork that makes improvement possible. It is up to application developers to build solutions that make the standards practical.

Practical Data Collaboration to Solve IT’s Challenge

Cambridge Semantics is one of the first companies to develop practical business solution enablers based on Semantic Web standards. In short, the Anzo products allow businesses to layer a semantic fabric over existing data that:

  1. Virtualizes the data so that it is accessible by its description regardless of location.
  2. Lets users create their own views of data.
  3. Fills in the views by traversing the fabric and picking out the relevant information.
  4. Keeps everything in synch by allowing updates that occur anywhere to update information everywhere.

The Right Data…

At the heart of the Anzo suite of products is the Anzo Data Collaboration Server. This acts as a central gateway that provides a consistent interface for applications to read, write, and query RDF data, regardless of the actual source of the data. While RDF provides the flexibility to incorporate new data as it is virtualised, it’s all for naught without the proper adaptors for existing data sources. To facilitate access to the right data, the Anzo Data Collaboration Server can connect to data sources including LDAP directories, HTTP-accessible Linked Data, and standard relational databases.

But perhaps one of the most useful connectors is Cambridge Semantics’ Anzo for Excel. With Anzo for Excel, data inside spreadsheets with arbitrary layouts can be linked into the Anzo Data Collaboration Server. By breaking down the walls of spreadsheet mini-silos, Anzo for Excel weaves information from thousands (or more) spreadsheets scattered across a business, dramatically increasing the availability of the right data.

…For The Right People

Getting the data in front of the right people relies on three things: context, security, and “reach”.

Context. It’s not enough simply to have the right data. People must have access to views of the data that depict exactly what they need to see, whether it be an executive dashboard, a regional summary map, or a customer-by-customer detailed report. Cambridge Semantics’ visualisation product, Anzo on the Web, allows the same information to be rendered in many different ways via semantic lenses. Lenses provide context-appropriate user interfaces to render a particular type of data, meaning that the right people see the right data in the right way.

Security. In many ways, security is the converse of context. While context ensures that the right data surfaces properly to the right people, robust security makes sure data does not surface to the wrong people. The Anzo Data Collaboration Server provides security by layering a role-based access control model atop the semantic fabric. All data access is gated through this security model, which defers to the permissions schemes of legacy data sources where appropriate. The result is that only the right people can ever see (or change) the right data.
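
The general idea of role-based access control can be shown in a few lines. This is a generic sketch, not Anzo's security model or API: the role names, the dataset name, and the `allowed` function are all hypothetical.

```python
# Hypothetical policy: each role maps to a set of (dataset, action) grants.
ROLE_PERMISSIONS = {
    "analyst": {("sales", "read")},
    "manager": {("sales", "read"), ("sales", "write")},
}


def allowed(roles, dataset, action, permissions=ROLE_PERMISSIONS):
    """True if any of the user's roles grants `action` on `dataset`."""
    return any(
        (dataset, action) in permissions.get(role, set())
        for role in roles
    )


can_read = allowed(["analyst"], "sales", "read")
can_write = allowed(["analyst"], "sales", "write")
```

Here an analyst may read the sales data but not change it, and a role that is absent from the policy is granted nothing by default.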

Reach. The right data needs to be able to be brought to the right person, whether that person is a technical staff member, a line-of-business manager, a “power user,” or a senior executive. As such, the software must be within reach of all users, without the need to call on IT. Research analysts must be able to collect and share spreadsheet data themselves. Anzo for Excel reaches these users by allowing spreadsheets to be visually linked with just a few clicks. Supply-chain managers must be able to drill through data on warehouses, suppliers, and distributors on their own terms. Anzo on the Web reaches these users via a simple and customisable faceted browsing paradigm, whereby anyone can add their own filters, add their own lenses, query their data however they like, and save the results to re-run later or share with colleagues.

…At The Right Time

Finally, it’s not enough to just bring the right data to the right people. It also needs to be done in a timely fashion.

First, data access against existing data sources is accomplished via federated (distributed) query. SPARQL is explicitly designed to enable queries that access multiple data sources at once, and the Anzo Data Collaboration Server includes a SPARQL engine that does exactly that. By querying the source data directly, Anzo eliminates the cycle time typically associated with a data warehouse’s ETL processes.

Second, data updates performed via the Anzo Server are broadcast out in real time to anywhere the data resides. This means that if a value is changed in a spreadsheet cell, the value instantly updates anywhere else it appears, including Web pages or within a relational database. This is essential because, as semantic tools are made available to users across the business enterprise, many spreadsheets, Web pages, and databases will share the same piece of data, and users must be able to rely on it with confidence.

Data Collaboration in the Days to Come

Imagine a world in which this challenge has been solved. End users—whether knowledge workers, line of business managers, or executives—can simply draw a picture of what they want to see and then choose the data that should fill in the picture. Within minutes rather than months the right data shows up on the right people’s screens. Now imagine that the data is live as well: you make a correction to the data and your changes are reflected in real-time in whatever legacy database or application the data comes from. You’ve managed to maintain a single source of truth for your key information assets, while still preserving existing investments in legacy systems and applications.

What sounds miraculous is possible today, in software such as Cambridge Semantics’ Anzo. By combining the revolutionary enabling capabilities of Semantic Web standards with solid, practical engineering, we open the door on a completely new paradigm for enterprise software: data collaboration.

Lee Feigenbaum is VP of Technology and Standards at Cambridge Semantics and co-chairs the W3C SPARQL Working Group.

Mike Cataldo is currently CEO of Cambridge Semantics and a veteran of multiple technology start-up companies.

Building A Civic Semantic Web

By Joshua Tauberer
| This article features in Nodalities Magazine, Issue 7

Technology is a new key player in government accountability and transparency. It’s our own defense against the threat of government information overload. Take the U.S. Congress: More than 10,000 bills are on the table for discussion at any given time, and Members of Congress are taking campaign contributions from thousands of sources. How can a representative be accountable if his legislative actions are too numerous to track? How can financial disclosure root out conflicts of interest if the interesting ones are buried deep within piles and piles of records? The threat to transparency isn’t sheer volume, however. It’s the complex network of relationships that makes up the U.S. Congress, and that makes it an interesting case for applying Semantic Web technology.

What the Semantic Web addresses is data isolation, and this is a problem for understanding Congress. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily meshable that MAPLight is possible. The Semantic Web makes this process cheaper by addressing meshability at the core. The more government data that is meshable, the easier it is to investigate connections across independent data sets, research the dynamics of the system, or teach others how Congress works.

Innovating the public’s engagement with Congress by applying technology has been the motivation behind my site www.GovTrack.us, a free congress-tracking tool that I built and have been running since 2004. GovTrack amasses a large XML database of congressional information, including the status of legislation, voting records, and other bits, by screen scraping official government websites that have the data online already but in a less useful form.

If “metadata” is tabular, isolated, and about web resources, the Semantic Web goes far beyond that. It helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in Congress with Members of Congress, what districts they represent, their population demographics, etc. We establish relations like sponsorship, represents, voted, and population across entities of many types. A web lets us ask new questions, and from there transform their answers into visualizations. And because the Semantic Web is a generic platform for all data, I actually think it has the potential to radically and fundamentally transform the way we learn, share information, and live—but that’s still a bit far off.

So for the purposes of my tinkering with the Semantic Web, GovTrack creates an RDF dump of its database (13 million triples) covering bills, politicians, votes and more using a mix of existing schemas and some new ones that I created. I chose URIs for entities in the Linked Open Data tradition: HTTP-dereferenceable URIs that resolve to self-describing RDF/XML about the entity. Two good examples are for Senator John McCain and for H.R. 1, the economic recovery bill passed earlier this year. The HTML pages on GovTrack itself tie in to the RDF world through tags: bill pages include the URI I coined for the bill, for instance.

I also have a sometimes-working-sometimes-not SPARQL endpoint set up, SPARQL being the de facto query language for RDF. SPARQL lets us ask questions of the data, such as how did politicians vote on bills (see example 1). The SPARQL endpoint runs off of a “triple store”, the equivalent of a relational database for the semantic web, which is underlyingly a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. (It uses my own C#/.NET RDF library: http://razor.occams.info/code/semweb.) The RDF/XML returned by dereferencing the URIs is actually auto-generated by redirecting the user to a SPARQL DESCRIBE query (i.e. http://www.rdfabout.com/sparql?query=DESCRIBE+%3Chttp://www.rdfabout.com/rdf/usgov/congress/111/bills/h1%3E) using URL rewriting in Apache (for a robust solution, see my explanation at the end of http://rdfabout.com/demo/census/). For more about GovTrack’s RDF data, see http://www.govtrack.us/developers/rdf.xpd.
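To make the "table of triples" idea concrete, here is a minimal sketch in Python. It is not GovTrack's actual code (that is the C#/.NET library mentioned above): it uses an in-memory SQLite database as a stand-in for MySQL, and the abbreviated identifiers such as bill:sponsor and foaf:name are hypothetical placeholders rather than terms taken from the real schemas.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database described in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

# Hypothetical, abbreviated identifiers; real data would use full URIs.
rows = [
    ("bill:h1", "rdf:type", "bill:HouseBill"),
    ("bill:h1", "bill:congress", "111"),
    ("bill:h1", "bill:sponsor", "person:A"),
    ("person:A", "foaf:name", "Example Sponsor"),
]
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)


def describe(conn, subject):
    """Return every (predicate, object) pair stored for one subject,
    a rough analogue of what a DESCRIBE query might return."""
    cur = conn.execute(
        "SELECT predicate, object FROM triples "
        "WHERE subject = ? ORDER BY predicate",
        (subject,),
    )
    return cur.fetchall()


def sponsor_names(conn):
    """Two triple patterns that share a variable become a self-join:
    ?bill bill:sponsor ?p .  ?p foaf:name ?name ."""
    cur = conn.execute(
        """
        SELECT t1.subject, t2.object
        FROM triples AS t1
        JOIN triples AS t2 ON t2.subject = t1.object
        WHERE t1.predicate = 'bill:sponsor'
          AND t2.predicate = 'foaf:name'
        """
    )
    return cur.fetchall()


print(describe(conn, "bill:h1"))
print(sponsor_names(conn))
```

The point of the sketch is the self-join: each additional triple pattern in a query typically adds another join against the same three-column table, which is one reason a production triple store needs careful indexing.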

When data gets big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my area of the Semantic Web as several clouds. One cloud is the data I generate from GovTrack. Another cloud is data I separately generate about campaign contributions from data files from the government’s Federal Election Commission (FEC): 10 million triples. This cloud relates politicians to election campaigns and elections, campaign donors with zipcodes, and contribution amounts. A third data set is based on the 2000 U.S. Census, 1 billion triples. The census data has population demographics for many geographic levels, including states, congressional districts, and postal zipcodes (actually “ZCTA”s but we can put that aside). (For more, see http://rdfabout.com. Through the Census cloud the data is linked to Geonames and the rest of the Linked Open Data community.)

I’ve related the clouds together so we can take interesting slices through them. The GovTrack data connects to the FEC data through politicians. The Census data connects to the GovTrack data through states and congressional districts (the regions represented by senators and representatives) and to the FEC data through zipcodes. That means we can ask questions that go beyond one data set, such as: what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode, etc.? Once the Semantic Web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the heavy work of meshing two data sets for each new question once the data is already in RDF with connected URIs.
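A toy illustration of why connected URIs make meshing cheap: if each cloud is just a set of triples and the clouds reuse the same identifiers, combining them is a set union, and a cross-cloud question is a walk along shared identifiers. Everything in this sketch (the identifiers, the predicate names, the population figure) is a made-up placeholder, not real GovTrack, FEC, or Census data.

```python
# Each "cloud" is a set of (subject, predicate, object) triples.
# All identifiers and values below are hypothetical placeholders.
govtrack = {
    ("person:A", "represents", "district:X1"),
    ("person:B", "represents", "district:X2"),
}
fec = {
    ("campaign:1", "candidate", "person:A"),
    ("zip:00000", "contributedTo", "campaign:1"),
}
census = {
    ("zip:00000", "population", 1000),
}

# Because the clouds share identifiers, meshing them is a plain set union.
merged = govtrack | fec | census


def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def subjects(graph, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return {s for s, p, o in graph if p == predicate and o == obj}


def donor_zip_populations(graph, person):
    """Follow person -> campaign -> zipcode -> population across the clouds."""
    result = {}
    for campaign in subjects(graph, "candidate", person):
        for zipcode in subjects(graph, "contributedTo", campaign):
            for population in objects(graph, zipcode, "population"):
                result[zipcode] = population
    return result


print(donor_zip_populations(merged, "person:A"))
```

No per-question integration code is needed here: a new question only means a new path through the merged set.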

Figure 1

My dream is to be able to plug in SPARQL queries into visualization websites like Many Eyes, Swivel, and mapping tools and instantly get an answer to my question in a compelling form. For now, some copy-paste is necessary. Let’s take an example. Did a state’s median income predict the votes of senators on H.R. 1, the economic recovery bill? Perhaps the senators from the poorest states, likely the most affected by the economic trouble, were more likely to want economic stimulus. This query takes a path through two of my clouds, depicted in Figure 1. The SPARQL query mimics the picture: each edge corresponds to a statement in the query. Except the real query is more complicated (it’s given at http://www.govtrack.us/developers/rdf.xpd). It is complicated not because RDF or SPARQL are inherently complicated, but because the data model that I chose to represent the information is complicated. That is, I made my data set very detailed and precise, and it takes a precise query to access it properly. If you run it on the SPARQL form on that page, get the results in CSV format, copy them into Excel, and run a correlation test, you’d indeed find a moderate correlation between median income and vote, but in the direction opposite to what we expected. (I know why, but I’ll let you think about it.)
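For readers who would rather skip the Excel step, the correlation test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV column names and the 1 = aye, 0 = nay coding are my own choices, and the four rows are invented purely to exercise the code. They are not the real query results and say nothing about the actual vote.

```python
import csv
import io
import math

# Toy CSV standing in for the query results. The numbers are invented
# placeholders, not real income or vote data.
csv_text = """state,median_income,vote
S1,40000,0
S2,45000,0
S3,50000,1
S4,55000,1
"""


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rows = list(csv.DictReader(io.StringIO(csv_text)))
incomes = [float(row["median_income"]) for row in rows]
votes = [float(row["vote"]) for row in rows]  # assumed coding: 1 = aye, 0 = nay

r = pearson(incomes, votes)
print(round(r, 3))
```

With a yes/no vote coded as 1/0, the Pearson coefficient is the same quantity statisticians call the point-biserial correlation, so its sign tells you whether higher-income states leaned toward aye or nay.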

Figure 2

Another interesting case is whether campaign contributions to congressmen mostly come from their district, or if they get contributions from sources far away. The SPARQL query listed in example 2 extracts the relevant numbers for Rep. Steve Israel from New York: for each zipcode, the total amount of campaign contributions he received from individuals with addresses in that zipcode in the last election. Figure 2 puts these values on a map, with congressional districts overlaid as well. A form where you can submit a SPARQL query like these examples and see the results instantly on a map would be incredible for data investigation.

So what is government transparency, practically speaking? It’s more than just information disclosure. Transparency means the public can get answers to their burning questions. The more questions they can answer from a dataset, the more transparency it provides. We can have more transparency without necessarily more disclosure but instead with the ability to apply better tools. Meshing and querying government datasets with RDF and SPARQL could be a new way to reach new heights of civic engagement and public oversight.

Example 1

Get a table of how senators voted on all of the Senate bills in 2009-2010:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bill: <http://www.rdfabout.com/rdf/schema/usbill/>
PREFIX vote: <http://www.rdfabout.com/rdf/schema/vote/>

# One result row per bill, voter, and the option that voter chose
SELECT ?bill ?voter ?option WHERE {
  ?bill a bill:SenateBill .
  ?bill bill:congress "111" ;
        bill:hadAction [
          a bill:VoteAction ;
          bill:vote [
            vote:hasOption [
              vote:votedBy ?voter ;
              rdfs:label ?option ;
            ]
          ] ;
        ] .
}

Example 2

Get total campaign contributions to Rep. Steve Israel by zipcode:

PREFIX fec: <http://www.rdfabout.com/rdf/schema/usfec/>

SELECT ?zipcode ?value WHERE {
  # The candidate's URI is missing from this listing; the placeholder marks the gap.
  ?campaign fec:candidate <{{uri of candidate}}> .
  ?campaign fec:cycle 2008 .
  ?zipcode fec:zipAggregatedContribution [
    fec:toCampaign ?campaign ;
    fec:amount ?value
  ] .
  ?zipcode fec:zcta ?uri .
}

RDFa and Linked Data in UK government web-sites

By Mark Birbeck

| This article will feature in Nodalities Magazine, Issue 7

The UK government’s Central Office of Information had a straightforward problem to solve: how could they create a centralised web-site of information that the public could search and access, when the source of that information could be any government department database or any public sector web-site?

For example, different organisations, such as Her Majesty’s Revenue and Customs (HMRC) or the National Health Service (NHS) would each post job vacancies to their own web-sites, but there was no central site that the public could go to, to find all public sector vacancies. This would be a problem at any time, but in the midst of attempts by the government to help people through the recession, it’s crucial to ensure that the public knows what vacancies are available. It might not occur to someone looking for a job as a plumber or an electrician that they should visit the NHS or Army web-sites, so a centralised site could make a big difference.

Similarly, as in most modern democracies, government departments are constantly seeking feedback from the public and interested parties, about specific issues. But as with job vacancies, these consultations are on departmental sites, rather than being available on a central site; from the Department of Energy and Climate Change (DECC) seeking feedback on clean coal, to the Ministry of Justice (MOJ) providing an opportunity for people to comment on prisoners’ voting rights, each department manages its own publication of consultations.

Traditional solutions

Traditional answers to these problems would have been to either (a) impose on each of the departments that they should key their data directly into a new central database (which would in turn drive the central web-site), or (b) create complex communication pipelines that would allow the decentralised databases to communicate with the central system.

And either of these solutions would almost certainly have turned out to have been a non-starter.

The first solution was unlikely to ever get off the ground, because it would have required each department to replace their existing technology with something new. Even if there was agreement on what that technology should be—and that in itself could take an age to resolve—there would have been a need for new development work, retraining of users, porting data from older systems, and so on.

The second ‘traditional’ solution at least has the merit of keeping existing systems intact, but would have required additional interfaces to be created to move the data from the departmental servers to the centre; each department would have had to create an interface between their own system and the central one.

Just getting one department into a situation where they could centralise their information would have been a major undertaking—not only were there lots of departments to consider, but each department was using a different technology to publish their vacancies or consultations to the web. For example, some departments with only a small number of job vacancies would likely use static HTML pages. Other departments, perhaps with larger IT departments, might use ASP.NET or a Java-based system.

Enter RDFa

The RDFa answer to this set of problems is simple—both conceptually, and to implement.

RDFa allows HTML publishers to embed RDF into their pages, thereby using the existing HTTP and HTML infrastructure to publish their information. This simple method of publishing data in turn means that any system can import this data, just by obtaining (or creating) an RDFa parser.

In short, each department can keep their own data management system, and simply add code to their existing web-page publishing step to augment the HTML with the data as RDFa. The central system in turn only needs one import mechanism—something that understands RDFa.
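As a rough sketch of what the central import step might look like, the Python below pulls property/value pairs out of an HTML fragment using only the standard library. It is deliberately far short of a conforming RDFa parser (it ignores subjects, prefix resolution, rel/href links, and nested markup), and the sample page, its dc:title and dc:date properties, and its values are hypothetical rather than taken from any real departmental site.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect (property, value) pairs from elements with a `property` attribute.

    A simplification, not a conforming RDFa parser: subjects (`about`),
    prefix resolution, `rel`/`href`, and nested elements are all ignored.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property whose text content is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # An explicit content attribute overrides the visible text.
            self.pairs.append((prop, attrs["content"]))
        else:
            self._current = prop
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is not None:
            self.pairs.append((self._current, "".join(self._text).strip()))
            self._current = None


def extract(html_text):
    """Return the list of (property, value) pairs found in html_text."""
    parser = PropertyExtractor()
    parser.feed(html_text)
    return parser.pairs


# Hypothetical vacancy page fragment; vocabulary and values are made up.
page = """
<div xmlns:dc="http://purl.org/dc/terms/">
  <h2 property="dc:title">Electrician</h2>
  <span property="dc:date" content="2009-07-01">1 July 2009</span>
  <p>Apply online.</p>
</div>
"""

print(extract(page))
```

The same page still renders normally in a browser, which is the point: the department's publishing step gains a few attributes, and the central site needs only this one kind of importer.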

Adding this facility to an individual department’s publishing system proved to be very quick and straightforward. But it’s not just UK government departments that are finding it straightforward to add RDFa to their pages. It was interesting to hear at SemTech in June that Google’s rich snippet launch partners (such as Yelp) were able to add RDFa support in “roughly a day”.

RDF publishing techniques

Adding data to web-pages might seem quite an obvious technique, but there are two important things to note here.

First, the COI has to be commended for having the vision to publish RDF at all. Of course, now that Gordon Brown has asked for Sir Tim Berners-Lee’s help in making government data publicly available, it seems pretty obvious—indeed it may even become fashionable! But the COI were planning this project at least a year ago, and at that time RDF was by no means a done deal (and you could say it’s still not).

But the second important thing is that even after deciding to publish RDF, it’s still not immediately obvious that the solution should involve RDFa, especially not a year ago.

The usual means of publishing RDF is to provide a distinct source of data in the form of RDF/XML (and perhaps other formats, too, such as N3). If there is an HTML version it usually exists for the purpose of describing the data itself. In other words, the RDF/XML format is primary, which means that anyone who is publishing HTML pages but wants to publish RDF as well, will need to add an extra piece of infrastructure that exists alongside their web-pages.

RDFa turns this on its head, and says that the HTML page is the data. One and the same page can be read as an HTML page, or as an RDF page, which in turn means that the changes required to the existing publication system are minimal. The COI once again showed its far-sightedness by adopting this technique.

Turtles all the way down

But the benefits of RDFa don’t just stop there. Firstly, because the data is being published via HTTP and HTML, it’s possible for anyone to read the same data, not just the centralised web-site that was being planned. This means that third party job vacancy sites, for example, could import vacancies from relevant departments, to add to their databases. In fact, one of the main drivers for the consultations project was to try to help improve the accuracy of an already existing web-site (set up by a member of the public) that used ‘screen-scraping’ to try to keep up with the available consultations—RDFa provides much more accurate information.

In addition, the centralised web-site will not only import RDFa but publish it too. This means that third-party servers are also able to import some or all of the centralised data, into their own sites.

And thirdly, by using RDFa the sites could provide information to search applications such as SearchMonkey.

As more servers both consume RDFa from one set of servers, and publish RDFa again to a variety of other servers, we enter the exciting world of Linked Data, and it’s ‘turtles all the way down’.

Conclusion

By using RDFa to address the challenge of making distributed data available in one place, the COI avoided having to make changes to each department’s systems. But once each department is publishing RDFa, it becomes possible for third parties to consume that information however they see fit. Such a flexible architecture is crucial in the age of open government, and is a cornerstone of linked open data.

Mark is managing director of Backplane Ltd. (http://webBackplane.com/), a London-based company involved in a number of RDFa/linked data projects for UK government departments. He is the original proposer of RDFa.

Might Semantic Technologies permit meaningful Brand relationships?

| This post will appear in Nodalities Magazine, Issue 7

by Paul Miller

Much has been written about growing Enterprise use of social media (usually Twitter, these days) to successfully track and mitigate customer complaints. Many have been quick to spot that the disproportionately high cost of satisfying (or, more cynically, silencing) these early adopters is unlikely to scale effectively as an increasingly large cohort of customers move onto these services, and it must remain an open question as to whether ComcastCares and its peers can survive any move to the mainstream in recognisable form.

It appears, though, that Enterprise engagement in the social sphere changes the game far more significantly than merely enabling a select few twitterati to jump the Customer Support queue, and that this change is worth effort and investment in order to ensure that it does scale. What’s actually happening is that a relationship is being enabled between a brand and those that Seth Godin might recognise as its tribe; a relationship in which interactions are no longer driven predominantly by the desire to seek redress. Rather than only raising those issues serious enough for us to have written letters or endured telephone muzak in the past, we now comment on issues at the periphery of a brand. Collectively, we’ve moved from simply complaining about the worst failures of companies, their products and their employees, toward emitting an impressive stream of FYIs.

Individually insignificant, and possibly unimportant, together these light touches on and around a brand build into an ever-changing and valuable commentary that brands and the corporations they front would do well to take notice of. The minor niggles about an otherwise exemplary service, the human touches that made us smile, the odd inconsistencies in a polished persona; none are enough to make us pick up the phone, but we comment upon them endlessly in Twitter, Facebook, FriendFeed and elsewhere, and by tapping into this fundamentally honest stream of consciousness there is much for those about whom we comment to learn.

Good companies probably already know about fundamental failings in a product long before their customer support operation melts down under the weight of complaints or their quarterly sales targets are seriously under-achieved. Do they have as good a handle on the things we love? Do they have a clue about the minor gripes of customers outside their pre-launch polling groups? Do they know about the gut reaction to a colour, a touch, a smell, or a careless word that persuaded a likely prospect to buy a technically or aesthetically inferior product from the competition instead? All this and more is there for the taking in the stream of online chatter freely directed their way.

Semantic Technologies aren’t often directly associated with the worlds of Marketing and Commerce, yet individuals such as Eric Hillerbrand and Scott Brinker are hard at work to show just what might be possible when the experiences of the Semantic Web are applied to this space. Brands are no longer owned by the companies in whose name they were created. Increasingly, ownership of various forms is being asserted by the multitude of stakeholders with effort and attention invested in the brand. They care about it, they care about what it says about them, and they play a clear role in the brand’s evolution whether its managers want them to or not.

Brands need to engage in this conversation, as we are beginning to see them do, but they also need to discover the means to cost-effectively monitor and engage with a potential flood of third party reaction whilst using the Business Intelligence tools available to them in nimbly shaping public opinion to their advantage wherever possible.

I spoke with Scott Brinker last year, to explore his—then nascent—views on Semantic Marketing, and look forward to hearing his latest thoughts at the Semantic Technology Conference in San Jose in June.

More recently, Eric Hillerbrand talked about some of his ideas with respect to ‘Social Commerce,’ and the ways in which commercial organisations might seek to strengthen and exploit relationships with their customers, aided by a range of semantic technologies.

We’re just beginning to grasp the realities of a world in which tightly controlled and fiercely guarded brand attributes become increasingly permeable. For those companies with the confidence and foresight to loosen their grip, whilst simultaneously exploiting the wealth of data and new opportunities to engage, there is much to be gained. For the dinosaurs that hang on to ‘their’ brand in spite of the world around them, there is everything to lose.

Linked Data and the Public Domain

We love data at Talis and we want as much of it to be freely reusable as possible. In fact, because we wanted to see even more reusable data we recently launched the Talis Connected Commons offering completely free hosting of public domain data. We believe that dedicating data to the public domain is the best way to ensure that data is universally reusable and remixable. When data is public domain it means that it can be reused automatically without needing to check terms and conditions or track the source of every statement to provide attribution. These kinds of things act as friction to reuse, wasting energy that could be better spent creating inspiring things.

We also firmly believe that, in the future, there will be a significant role for other forms of data licensing, including commercial access. We will support those efforts too when the time comes but today the Linked Data web needs more and better data that is freely accessible.

Licensing vs Waivers

You are probably familiar with the process of licensing a creative work, most likely through the great job that Creative Commons have been doing in recent years. However, the concept of waivers is less well known but highly relevant to reuse of linked data.

Whenever you create something you have automatic rights over it granted to you. The best known of these rights is copyright, which gives you the exclusive right to make copies of your creative work. There are many other rights which can be held over intellectual property such as design rights, trade marks, registered designs, performers rights, trade secrets, database rights, publication rights and many more.

Licensing is the process of granting others limited use of rights you possess. For example, when you license your copyright you are granting specific people a limited right to make copies without having to ask you first. Licensing of one right does not affect your possession of the others. For example you could grant the right to copy your work but retain the right to perform it. Creative Commons licenses are mostly concerned with copyright, but they do not usually deal with the other rights such as database rights or trade secrets.

Waivers, on the other hand, are a voluntary relinquishment of a right. If you waive your exclusive copyright over a work then you are explicitly allowing other people to copy it and you will have no claim over their use of it in that way. It gives users of your work huge freedom and confidence that they will not be pursued for license fees in the future.

The Licensing Problem

In general factual data does not convey any copyrights, but it may be subject to other rights such as trade mark or, in many jurisdictions, database right. Because factual data is not usually subject to copyright, the standard Creative Commons licenses are not applicable: you can’t grant the exclusive right to copy the facts if that right isn’t yours to give. It also means you cannot add conditions such as share-alike.

There isn’t a Creative Commons license for every possible right and there probably can’t be because of the huge variation in rights granted in different jurisdictions around the world. Also, when we start to look at licensing compilations of data we find that the situation becomes complex because you have to consider both the database and its contents separately. For example, a database of articles would be subject to database right over the whole collection and individual copyrights for each article, quite possibly belonging to many different owners. The Open Data Commons has addressed this particular example with its Open Database License and Database Contents License (based on work originally donated by Talis). If a standard license doesn’t exist then you need to hire lawyers and write one for yourself – a potentially huge cost.

Our collective goal for a successful Linked Data web has to be to protect consumers of the data: the people who are remixing many different sources of data. Our intentions may be very honourable, but people need certainty if they are to build enduring value on data. Creative Commons licenses are irrevocable so even if you lose control over your work through some misfortune, the people reusing it will be protected forever. Imagine this scenario: you allow people to use data you have collated but your company goes bankrupt and the rights to the data collection are sold by the liquidators. If you hadn’t licensed your rights explicitly then every one of your users could be liable to be sued by the new rights holder!

This is where waivers of rights can help. By explicitly waiving your rights over your data you are giving your users the best guarantee of safety that you can. Even if you lost control of the data collection, subsequent owners could not pursue your users because the rights you held have already been waived.

There are two waivers of rights that can be applied to datasets:

  • Creative Commons CC0
  • Open Data Commons Public Domain Dedication and Licence (PDDL)

Both of these waivers can be used for data intended for submission to the Talis Connected Commons.

Community Norms

When you apply a waiver like CC0 you are relinquishing all your rights over the work to the fullest extent possible under the law. That means that you cannot force people to attribute you or stop them from making commercial use of your work.

The preferred approach is to attach a set of community norms to the work. These are like a code of conduct for use of the work and are usually self-policing. They are not legally enforceable but form part of the ethical or professional requirements for participating in a community. The best known example of community norms is the citation standards used in the academic community. Citing pre-existing work is not legally enforceable but those who abuse the norms can find themselves excluded from the academic community.

The Open Data Commons has published a set of attribution and share-alike norms which ask that users of the data:

  • Share work derived from the data.
  • Give credit to the original data publisher.
  • Point others at the source of the data.
  • Publish in open formats.
  • Avoid using digital rights management.

How to Declare Your Waiver

To declare your waiver in a machine readable way, you should first create a voID description of your dataset. VoID, or Vocabulary of Interlinked Datasets, is a vocabulary designed to describe key attributes of your dataset. We created a waiver RDF vocabulary that can be used with voID to declare any waiver of rights and the community norms around a dataset.

In this example we describe a dataset using the void:Dataset class and provide it with a dc:title as a minimal human readable description. You should add other descriptive properties as necessary (some suggestions can be found in the voID guide).

We then use the wv:waiver property (defined in the waiver RDF vocabulary) to link the dataset to the Open Data Commons PDDL waiver. We use the wv:declaration property to include a human-readable declaration of the waiver. This is purely informational, but can immediately be used by a person examining the voID description. Finally we use the wv:norms property to link the dataset to the community norms we suggest for it, in this case the ODC Attribution and Share-alike norms.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:wv="http://vocab.org/waiver/terms/"
  xmlns:void="http://rdfs.org/ns/void#">
  <void:Dataset rdf:about="{{uri of your dataset}}">
    <dc:title>{{name of dataset}}</dc:title>
    <wv:waiver rdf:resource="http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/"/>
    <wv:norms rdf:resource="http://www.opendatacommons.org/norms/odc-by-sa/" />
    <wv:declaration>
      To the extent possible under law, {{your name or organisation}} has waived all
      copyright and related or neighboring rights to {{name of dataset}}
    </wv:declaration>
  </void:Dataset>
</rdf:RDF>

Alternatively, if you choose the CC0 waiver without any particular norms, you should use the following RDF:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:wv="http://vocab.org/waiver/terms/"
  xmlns:void="http://rdfs.org/ns/void#">
  <void:Dataset rdf:about="{{uri of your dataset}}">
    <dc:title>{{name of dataset}}</dc:title>
    <wv:waiver rdf:resource="http://creativecommons.org/publicdomain/zero/1.0/"/>
    <wv:declaration>
      To the extent possible under law, {{your name or organisation}} has waived all
      copyright and related or neighboring rights to {{name of dataset}}
    </wv:declaration>
  </void:Dataset>
</rdf:RDF>

These examples show that it is very simple to declare your waiver. However, before you do so be sure to read carefully what rights you are irrevocably giving up. For example you would most likely be waiving your publicity and privacy rights, so if your image is included in the dataset you could not later complain that someone is using it in a way you do not approve of. If you are worried about how your work will be used, if you want to legally require attribution, or if you don’t want people to make money off of your work, then you should not use a waiver and instead seek legal advice on the creation of a data license specific to your needs.
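
Because the declaration is machine readable, a consumer can read the waiver and norms back out of a voID description. The following is a minimal sketch using only Python's standard library; it assumes the RDF/XML is laid out exactly like the examples above (it is not a general RDF parser), and the dataset URI, title, and publisher name in the sample are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NS = {
    "rdf": RDF,
    "dc": "http://purl.org/dc/terms/",
    "wv": "http://vocab.org/waiver/terms/",
    "void": "http://rdfs.org/ns/void#",
}

# Sample input: the CC0 example above with hypothetical placeholder values.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:wv="http://vocab.org/waiver/terms/"
  xmlns:void="http://rdfs.org/ns/void#">
  <void:Dataset rdf:about="http://example.org/dataset">
    <dc:title>Example Dataset</dc:title>
    <wv:waiver rdf:resource="http://creativecommons.org/publicdomain/zero/1.0/"/>
    <wv:declaration>
      To the extent possible under law, Example Org has waived all
      copyright and related or neighboring rights to Example Dataset
    </wv:declaration>
  </void:Dataset>
</rdf:RDF>"""


def read_waivers(rdf_xml):
    """Return one dict per void:Dataset with its waiver, norms and declaration."""
    root = ET.fromstring(rdf_xml)
    results = []
    for dataset in root.findall("void:Dataset", NS):

        def resource(tag):
            # Read the rdf:resource attribute of a child element, if present.
            element = dataset.find(tag, NS)
            return None if element is None else element.get(f"{{{RDF}}}resource")

        declaration = dataset.findtext("wv:declaration", default=None, namespaces=NS)
        results.append({
            "dataset": dataset.get(f"{{{RDF}}}about"),
            "title": dataset.findtext("dc:title", default=None, namespaces=NS),
            "waiver": resource("wv:waiver"),
            "norms": resource("wv:norms"),  # None when no norms are declared
            # Collapse the line breaks and indentation inside the declaration.
            "declaration": None if declaration is None else " ".join(declaration.split()),
        })
    return results


waivers = read_waivers(SAMPLE)
```

Here `waivers[0]["norms"]` is `None`, which matches the CC0 example that declares a waiver without any particular norms.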

Linking Data and Semantics at O’Reilly

By Gavin Carothers and Charles Greer

|This article features in Nodalities Magazine, Issue 6

O’Reilly Media lives on the cutting edge. We coined terms such as Web 2.0, created the first commercial website in 1993, and exist to “spread the knowledge of innovators.” With our evangelists, conference presenters, authors, and bloggers all communicating and catalyzing new ideas, many believe that O’Reilly must be just as technologically innovative in our own operations. However, O’Reilly employs about 200 people but only half a dozen developers, so naturally ideas are thrown at our developers faster than it is possible to implement them. We’ve been known to refer to this tension between our public position on the cutting edge and internal expectation to live up to what we preach as “gaping wound tech.” Any time someone had a new idea or a new product to launch that didn’t quite fit into existing systems, we found some way to shoehorn it in, with a quick Perl script or some clever custom SQL. As we did this, more and more of our work became preventing our systems from collapsing under the weight of those one-off ETLs and scripts. The cost of simply keeping track of which scripts were using what bit of transformed data, and where that data came from, had become unsustainably high. We’d accrued so much design debt that only the most radical of approaches could save us from being crushed by the weight of our inherited code.

Of course, we didn’t really know that at the time. Today we have a Linked Data, Semantic, RESTful, URI-based, highly buzz-wordy solution mostly by accident and through ruthless pragmatism. Instead of embracing the ideas of the Semantic Web at the outset, we arrived at the Semantic Web because it was the only solution. We thought we were traveling down two completely unrelated roads. We started down the first while trying to replace a Java Bean Shell script that copied book content to a few different places. The other road began when we wanted to know what color to make the border of a PDF. The first would lead to an Atom Publishing Protocol server and clients, the second to our modeling all product metadata in RDF and opening that to the public.

As it turns out, the two roads weren’t so unrelated after all. RDF is designed to handle modeling information in a distributed manner and provides the underpinnings for the actual metadata we store, aggregate, and use. AtomPub’s RESTful interface is ideally designed for managing individual chunks of all this distributed data over time and provides programs and people a simple, standard interface for publishing, accessing, and updating it. As we progressed down each path, we were making (often unknowingly) major progress in generating linked data and semantics, the two pillars of the Semantic Web.

The RESTful Road

In 2005, soon after O’Reilly launched a custom book publishing platform, we discovered that we’d deferred a hard question. We didn’t know how to make sure that we could easily add new books as they came down the production pipeline. The canonical representation of nearly all O’Reilly titles is DocBook files. Historically, these DocBook files were scattered across many filesystems, transformed by people using one-off scripts, and arbitrarily transmitted using FTP to other filesystems. We simply didn’t have a way of addressing fundamental questions like “Where is the latest, cleanest copy of a book’s markup?” Tracking down the best representation of a book’s content was a laborious, error-prone task.
Around the same time we ran into this, we noticed Tim Bray’s superb presentations about the Atom Publishing Protocol, then still in draft form. The architecture proposed by RESTful advocates like Bray and embodied by what would become RFC 5023 gave us the ability to store an atomic chunk of data, assign it a URI, and access and update it through a standard interface. Such a chunk could be:

  • A book’s “source code”, the DocBook markup
  • The print book, as an ISBN
  • The table of contents
  • An HTML, PDF or other representation generated from the source
  • Whatever Tim O’Reilly or the business folks asked for next

O’Reilly’s SafariU was a business venture that implemented these kinds of transformations of content, but didn’t expose anything but its own web browser interface. When considering how to leverage SafariU’s technologies in the business as a whole, we arrived at this:

This atom:entry is the “latest, cleanest copy of a book’s markup” and its URI is the canonical location for this content. Additionally, the entry provides different views of the content using 17 distinct <link/> elements. We had embraced the linked data idea: Noun = URI. Around the same time, we realized that while we needed a way to address various available formats of content, we also required a place to store and maintain our digital assets. By implementing the Atom Publishing Protocol we established a generic way to maintain our assets, as Nouns, over time. Now that systems could reliably find and update our content using URIs, it became painfully apparent that we still had a major uphill battle: how to do the same thing for product metadata?
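
To make the "Noun = URI" idea concrete, here is a small sketch of an Atom entry whose link elements point at other views of the same content. It is an illustration only, not a reconstruction of O’Reilly’s actual entry: every URI, the title, and the timestamp are hypothetical placeholders, and only two alternate views are shown rather than 17.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialise Atom as the default namespace


def build_entry(entry_uri, title, updated, views):
    """Build an atom:entry whose links point at other views of one book."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_uri
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    # In AtomPub (RFC 5023), rel="edit" marks the URI used to fetch and update the entry.
    ET.SubElement(entry, f"{{{ATOM}}}link", {"rel": "edit", "href": entry_uri})
    # Each alternate link is another representation of the same noun.
    for media_type, href in views:
        ET.SubElement(
            entry,
            f"{{{ATOM}}}link",
            {"rel": "alternate", "type": media_type, "href": href},
        )
    return entry


# All values below are hypothetical placeholders.
entry = build_entry(
    "http://example.org/books/example-book",
    "Example Book",
    "2009-01-01T00:00:00Z",
    [
        ("application/pdf", "http://example.org/books/example-book.pdf"),
        ("text/html", "http://example.org/books/example-book.html"),
    ],
)
xml_text = ET.tostring(entry, encoding="unicode")
```

A client that knows only the entry URI can fetch it, follow the link with the media type it wants, and never needs to know which filesystem holds the file.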

A similar problem existed when dealing with metadata. Distinct applications were completely unintegrated and focused only on the browser and human users. They provided no visibility into their data for other systems.

rdf:isNeat

“Can our PDFs have the same branding and colors as the printed books?” —Marketing Person
“Sure! How hard can it be?” —Innocent Developer

At this point in our journey we have more than 900 titles in the AtomPub repository and addressable by URI. We’ve (unknowingly) hit a significant Linked Data milestone and everything is progressing well. Dynamically creating a PDF from these entries is as easy as running our DocBook-XSL customization for the correct series to produce XSL-FO and then rendering that XSL-FO into PDF. The only problem was discovering which series (In a Nutshell, Animal Guide, Missing Manual) the content fell under. At that point all progress stopped.
Our definitive source of book and product information is the Product Database (67,000+ lines of Perl, C++, SQL, and a dozen other languages). The database and web application have their own home-rolled “XML Format,” as I’m sure many other companies have had. Based directly on the column names from the SQL database, our Book XML was a quick and very dirty way of getting our centralized relational data out into the world as XML. A host of new client applications grew around this new access to product data, but we quickly saw the problems of reusing an ad hoc, undefined, schema-generated format. The XML service was also incredibly slow.

<IPFamily>

<Book>
<product_id>5549</product_id>
<parent_product_id>6380</parent_product_id>
<imprint_id>1</imprint_id>
<product_status_id>5</product_status_id>
<product_type_id>10</product_type_id>
<isbn>0596515618</isbn>
<isbn13>9780596515614</isbn13>

<final_date>2003-07-02</final_date> <!-- Actually the day the last QC phase ended -->

...


As you can see from the snippet above, clients had to deal with knowing exactly what imprint 1 (O’Reilly Media, Inc.) and product type 10 (PDF) meant. Each client kept mappings of these magic values in order to make the data understandable. Those mappings broke, of course, whenever new product types and imprints were added. Even more dangerously, because the semantics of the XML were totally unspecified, element names were opaque and sometimes actively misleading. We might have redesigned the format to include more data and added more and more fields to it, but this wasn’t an explicitly designed schema, just something generated from the SQL. On the road to exposing this data more cleanly we tried everything. Remodeling the SQL to be more relational didn’t offer much benefit and we still couldn’t tell what the column names meant. Sitting down and trying to write up a data dictionary was a great exercise, but it became out of date almost immediately. We experimented with JSON-based CouchDB prototypes, but those had the same issue as the SQL with missing meaning. Our Subversion repository is littered with Relax-NG, XML Schema, and Schematron documents from attempts to create a new XML-based format. Somehow they never got finished, as we discovered we either had to define everything or try to design for extensibility. We knew we didn’t have the time to create our own Book Metadata Standard. We wanted defined semantics.
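
The magic-value problem described above can be sketched in a few lines. This is a hypothetical client, not code from any O’Reilly system: the lookup tables contain only the two meanings given in the text (imprint 1 and product type 10), and the sample document is a trimmed, well-formed version of the snippet.

```python
import xml.etree.ElementTree as ET

# Client-side copies of the "magic values". Only these two meanings are
# given in the article; every client had to keep tables like this in sync.
IMPRINTS = {1: "O'Reilly Media, Inc."}
PRODUCT_TYPES = {10: "PDF"}

SAMPLE = """<Book>
  <imprint_id>1</imprint_id>
  <product_type_id>10</product_type_id>
  <isbn13>9780596515614</isbn13>
</Book>"""


def describe_book(book_xml):
    """Translate the opaque numeric ids into something a person can read."""
    book = ET.fromstring(book_xml)
    imprint_id = int(book.findtext("imprint_id"))
    type_id = int(book.findtext("product_type_id"))
    return {
        "isbn13": book.findtext("isbn13"),
        # Any id added to the database after this client was written is unknown here.
        "imprint": IMPRINTS.get(imprint_id, f"unknown imprint {imprint_id}"),
        "product_type": PRODUCT_TYPES.get(type_id, f"unknown product type {type_id}"),
    }


known = describe_book(SAMPLE)
# Simulate a product type that was added to the database later.
newer = describe_book(SAMPLE.replace(">10<", ">11<"))
```

The second call shows the failure mode: the XML is still valid, but the client can no longer say what it means.
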
There is at least one obvious XML vocabulary for a publisher looking to capture book metadata: ONIX. Unfortunately, the ONIX standard is archaic, with obscure element names like b004 (ISBN) and g343 (PrizeJury, obviously). (Footnote: Yes, these are the short versions and a longer set of names is also allowed. However, many of the most important vendors only support the short versions.) We did consider ONIX for a time, but then we noticed that every vendor we sent ONIX to treated the fields a bit differently. Even with pages and pages of specification there wasn’t any agreement on what elements were important or what they meant. Using ONIX as a format would not solve our semantic deficiency; we still wouldn’t know what the “columns” meant.
In the process of trying to create an XML format we asked a number of people in the company how to find the Publication Date for a book. The answer was surprisingly complex. The value was computed independently by each of the ETL hydras, with subtly different implementations that had evolved with particular client needs. O’Reilly isn’t a huge company with layer upon layer of bureaucracy; most questions can be quickly answered with a chat at a desk or an email to the other coast. Imagine our surprise, then, at the results of the Publication Date poll. Most people were confident that one of five dates was the right date, but disagreed on which of the five it was. Retail Availability Date, Actual In Stock Date, Estimated In Stock Date, etc. each had its backers. What we had really discovered was that each business unit had subtly different needs. The strategy we could most easily support? Consensus on a public standard. As we’ve learned so many times, we needed to go outside the company to find the correct solution. Public standards, specifications, and ontologies could save us from ourselves.
Enter: Dublin Core. We couldn’t define our own format or use the industry standard (ONIX), nor could we agree on what a publication date was. Our only choice was to go borrow/steal some other group’s ideas. It turns out that our problems had already been solved by the library community. The Dublin Core Metadata Initiative created standards, guidelines, and examples for storing and sharing basic, essential metadata. We had a way out: here was a group of people who’d already done a great deal of thinking for us.
Of course, they hadn’t done all our thinking for us. Mapping all of our old data into well-designed and well-documented Dublin Core, MARC Relators, FOAF, or any other ontology was going to be hard. So we didn’t do it. Instead we mapped the whole of our old, horrible, ugly mess into an undefined ontology called the “Product Database Legacy Ontology.” We then moved some of the more obvious items like title and author into Dublin Core and waited. Only once we had a proven need for a new data point in a real application would we go through the process of researching, defining, cleaning, and moving it into a modern, public ontology. For those following along closely: no, trim color isn’t yet in the public or internal metadata. As it turns out, no one really wanted it. At least, not yet.
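
The incremental migration described above can be sketched as a single lookup with a fallback. This is an illustration under stated assumptions, not our production code: the legacy namespace URI is a hypothetical placeholder, and mapping "author" to the Dublin Core term dc:creator is an assumption made for the example.

```python
DC = "http://purl.org/dc/terms/"
# Hypothetical namespace standing in for the "Product Database Legacy Ontology".
LEGACY = "http://example.org/product-database-legacy#"

# Only fields with a proven need have been promoted to a public vocabulary.
# Promoting another field later means adding one entry here.
PROMOTED = {
    "title": DC + "title",
    "author": DC + "creator",  # assumed mapping, for illustration only
}


def to_triples(subject, row):
    """Turn one legacy database row into (subject, predicate, object) triples."""
    triples = []
    for column, value in row.items():
        if value is None:
            continue  # no statement to make about an empty column
        # Unpromoted columns keep their old, undefined meaning in the legacy namespace.
        predicate = PROMOTED.get(column, LEGACY + column)
        triples.append((subject, predicate, value))
    return triples


book = "http://example.org/product/example-book"  # hypothetical identifier
triples = to_triples(book, {
    "title": "Example Book",
    "author": "A. Writer",
    "final_date": "2003-07-02",
    "trim_color": None,
})
```

Nothing is lost in the move: unmapped columns such as final_date remain available under the legacy namespace until someone has a real need to define them properly.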

All Together Now

Since Gavin’s first frenzied port of product metadata to an RDF model, we’ve been able to negotiate changing requirements, establish data validation and control rules, and bring on new applications with little time spent on data modeling. In other words, meeting our immediate need of a centralized, validated data store of high agility and performance has paid off several times over in deploying new software systems for the rapidly changing company.
One example of the intertwining of Linked Data and Semantics is our Electronic Media distribution system, which lets customers download ebooks, PDFs, videos and the like. Book descriptions, titles, authors’ names, cover images, even the help text provided on the Electronic Media page are simply linked data, built from RDF relationships. When we want to change the help text or a category label, we change it in one document, and everything else in the RDF graph referencing it changes within moments as well. Just following links pays off.
Previously, the buttons that let a customer add a book to our shopping cart were generated by a system that used nightly ETLs nicknamed “the sync”. So new products would have to be prepped for release the night before. We gave special care to their timely appearance in the morning. Alas, they frequently did not appear as hoped, as the ETLs that made up “the sync” had to run in a very precise nightly schedule or we had to take manual corrective action. Now, a reasonably simple HTML template bound to the RDF for a book generates “Buy Buttons” in near realtime without an ETL in sight.
The greatest challenge of updating our legacy IT infrastructure hasn’t been replacing the ETLs or synchronization. It’s been achieving consensus on the meaning of data elements. In the past, data maintainers might adjust the title of a book to change how retailers present it. Then our website’s title would change (the next day), and we would have to bring resources to bear on reconciling the meaning of “title.” By using dc:title for our title element, we’ve established what to expect from those who change the value. It’s simpler to make sure people enter particular kinds of data, and then ask for help to extend or change requirements for downstream apps. The publicly available ontologies, we hope, will help everyone communicate more effectively about business needs and shared data points. So far the results are encouraging.

In the Public Eye

Having built several of our own applications using our new RDF metadata and our initial linked data APIs, we thought it might be a good idea to let someone else have a crack at it too and see what they made of it. It took us two weeks to develop the O’Reilly Product Metadata Interface, a simple layer on top of the Deli. A caching proxy preserves the reliability needed by our own applications, while a predicate filter prevents private information from leaking to the public. A bit more about how you can access it can be found at http://labs.oreilly.com/opmi.html or you can just dive right in by giving it an ISBN, e.g. http://opmi.oreilly.com/product/9780596529260.
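
A predicate filter of the kind mentioned above can be very small. The sketch below is not the OPMI implementation: it assumes an allowlist design (the text does not say whether the real filter allows or blocks), and the internal predicate and product URI are hypothetical.

```python
# Predicates considered safe to publish. An allowlist is an assumption here:
# it means a newly added internal predicate stays private by default.
PUBLIC_PREDICATES = frozenset({
    "http://purl.org/dc/terms/title",
    "http://purl.org/dc/terms/creator",
})


def public_view(triples, allowed=PUBLIC_PREDICATES):
    """Keep only the triples whose predicate is on the public allowlist."""
    return [triple for triple in triples if triple[1] in allowed]


product = "http://example.org/product/example-book"  # hypothetical identifier
internal_graph = [
    (product, "http://purl.org/dc/terms/title", "Example Book"),
    (product, "http://example.org/internal#unitCost", "4.20"),  # hypothetical private data
]
published = public_view(internal_graph)
```

Because every statement in RDF has the same three-part shape, the filter needs no knowledge of what kind of product it is looking at.
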
Sharing our work with the public forced us to be much more deliberate and rigorous about our data, but also exposed some simple blunders. On the day we launched the service we waited for the praise to come in and finally saw a tweet! Someone is using… Oh wait:

“OPMI’s book identifiers aren’t resolvable. Sigh.” —Jeni Tennison

“Of course they’re resolvable,” we thought. “You just have to parse the URN and understand how to pass the URN to… oh, yeah good point.” In the process of implementation, we’d forgotten Tim Berners-Lee’s second rule of Linked Data:

2. Use HTTP URIs so that people can look up those names.

At the start of the process we’d talked about using some sort of identifier for our products. But that conversation had taken place before we really had all the RDF and Linked Data applications working, so at the time there wasn’t any point nor could anyone see the need for a resolvable identifier. Within a few hours of making the data public, the need became blindingly apparent. Part of embracing “anyone can say anything about anything” is that anyone needs to be able to find the anything they want to talk about. And when you’ve got a statement to make, it’s remarkably handy to be able to quickly find out what else has been said. “I loved urn:x-domain:oreilly.com:product:9780596529260.BOOK” is a bit hard to figure out. “I hated http://purl.oreilly.com/product/9780596529260.BOOK” is a lot better.
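
The difference between the two identifiers above is only a prefix, so the translation can be sketched in a few lines. The rule below is inferred from those two example identifiers alone and is an assumption, not a description of how O’Reilly actually performs the mapping.

```python
URN_PREFIX = "urn:x-domain:oreilly.com:product:"
HTTP_PREFIX = "http://purl.oreilly.com/product/"


def urn_to_http(urn):
    """Rewrite a product URN as an HTTP URI that a client could look up."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a product URN: {urn!r}")
    # Keep the local part (for example "9780596529260.BOOK") unchanged.
    return HTTP_PREFIX + urn[len(URN_PREFIX):]


resolvable = urn_to_http("urn:x-domain:oreilly.com:product:9780596529260.BOOK")
```

The point of the second Linked Data rule is that consumers should never need a function like this: the published identifier should already be the HTTP form.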

Streams, Pools and Reservoirs

by Leigh Dodds
| this article features in Nodalities Magazine, issue 6

As we start to move past the current boot-strapping phase of the semantic web, in which we are constructing the web of linked data, it’s useful to begin discussing what other features and infrastructure we need in order to support sustainable usage of this huge and growing data set: what services can be offered over linked data? Do we need to consider how to provide quality of service, stability and longevity to the data, or does the sheer scale of the web make these moot points?

In order to answer these questions it’s useful to compare the ongoing development of the linked data web with that of the web itself.

A Brief History Lesson

There have been several phases of activity in the development of the web. While in truth these phases were of different durations, overlapped with one another, and happened at different rates within different communities, essentially we have gone through the following basic steps.

Firstly we concentrated on just getting stuff online. The early web was a new medium for document and data exchange and so was at its core a simple publishing device used as a collaborative space between small communities. But as the amount of content and the size and breadth of those communities grew, the emphasis shifted towards linking: tying content together to create (initially hand-crafted) indexes of the web and knit the available content into a greater whole.

The second, manual linking phase was quickly supplemented by a third phase of automated linking between content: search engines. A search engine is simply a way to quickly create a link-base based on some search criteria. The crawling and indexing of the document web by web crawlers allows users to quickly construct links to content of potential interest.

A fourth phase of the web’s development has been triggered by the commoditisation of search and the need for search engines to differentiate themselves and offer additional value-added services. Search engine features are now tailored towards particular uses or types of content (Google Image Search; Google Scholar); offer value-added features that capitalise on the ability of search engines to analyse the structure and traffic flows across the web (PageRank and similar indexing improvements; Google Trends); expand the audience for content (Google Translate); and enable community-driven customisation of the search experience (Google Custom Search; Yahoo Search Monkey, etc).

No doubt there will be subsequent phases of development, and the perspective of history will let us tease out common strands of development, some of which will already be happening. But if we look at the recent, rapid development of the linked data cloud, we can already see that the same pattern is being repeated.

History Recapitulated

There has been RDF data available on the web for many years, used by a limited community of researchers. This slow accumulation of content – echoing the first phase of content publishing on the document web – has been replaced by a rapid increase in data publishing encouraged through the Linking Open Data (LOD) project. By providing clear pragmatic guidance and instructions on how to publish data for the semantic web, that project has enabled us to accelerate our transition through that first content publishing phase. But it has also, crucially, encouraged the linking together of data sets (Phase 2).

This linking has to a great extent been manual. Not in the sense that members of the LOD community are manually entering data to link datasets together, but rather at the level of looking for opportunities to link together datasets, encouraging data publishers to co-ordinate and inter-relate their data, and by attempting to organically grow the linked data web by targeting datasets that would usefully annotate or extend the current Linked Data Cloud.

The rapid growth of the Linked Data Cloud means that this “manual” phase will soon be over: there will be sufficient momentum behind the semantic web that increasing amounts of data will become available and no single community will be able (or need) to shepherd its development. The focus will shift towards the subject specific communities who will instead co-ordinate at a more local level. Semantic web search engines will also become a reality.

Semantic Web search engines need to be distinguished from semantically enabled search engines. The latter use techniques like natural language parsing and improved understanding of document semantics in order to provide an improved search experience for humans. A Semantic Web search engine should offer infrastructure for machines. This Third Phase is also beginning to take place. Simple semantic web search engines like Swoogle and Sindice provide a way for machines to construct link bases, based on some simple expressions of what data is of relevance, in order to find data that is of interest to a particular user, community, or within the context of a particular application. And crucially this can be done without having to always crawl or navigate over the entire linked data web. This process can be commoditised just as it has with the web of documents.

Co-Evolution of the Web Infrastructure

Given the strong concordance between the phases of development of the document and linked data webs, it is reasonable to make some predictions about how semantic web search engines, and additional supporting infrastructure, are likely to evolve by comparing them with the development of human search engines. For each of the specialisations and value-added features listed earlier it’s possible to see an equivalent for the machine-readable web:

Document Web | Semantic Web Infrastructure | Description
Google Image Search | Type Searching | Ability to discover resources of a particular type: e.g. Person, Review, Book
Google Translate | Vocabulary Normalisation | Application of simple inferencing to expose data in more vocabularies than those made available by the publisher
Google Custom Search | Community Constructed Data Sets and Indexes | Ability to create and manipulate custom subsets of the linked data cloud
Google Trends | Linked Data Analysis & Publishing Trends | Identifying new data sources; new vocabularies; clusters of data; data analysis

These last two are particularly interesting as they suggest the need to be able to easily aggregate, combine and analyse aspects of the linked data cloud. This infrastructure will need to be able to support the community in working with data in a variety of ways, allowing data to flow and be collected where it is needed. Introducing a metaphor for this process might help highlight some of the processes and their consequences.

Flowing Data

Data is like water and flows of data are like streams. These streams of data can arise from any number of different sources: from a person entering data into a system; from a click stream generated as a side-effect of web browsing; application events; or generated from real-world sensor measurements. There are already many ways that we can tap into these data streams, using web-based query APIs, messaging systems like XMPP, or syndication protocols like Atom and RSS.

While these streams of data are already supporting a huge range of different applications and use cases, they are inherently limited: a stream has no memory. If historical context is required, e.g. to support more complex querying and reporting, then each consuming application must collect and store the data. We can think of these collections of data as pools; each stream of data on the web may feed any number of different application-specific pools.
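
The distinction between a stream and a pool can be shown with a toy example. This is only a sketch of the metaphor: the names `stream` and `Pool` and the sample readings are invented for illustration and do not correspond to any particular API.

```python
def stream(events):
    """A stream hands over each item once and keeps no history."""
    for event in events:
        yield event


class Pool:
    """A pool is an application's own stored copy of what has flowed past."""

    def __init__(self):
        self._by_source = {}

    def collect(self, source, items):
        # Draining the stream into storage is what gives the pool its memory.
        self._by_source.setdefault(source, []).extend(items)

    def history(self, source):
        return list(self._by_source.get(source, []))


readings = stream([18.5, 19.0, 19.4])  # e.g. hypothetical sensor measurements
pool = Pool()
pool.collect("sensor", readings)

leftover = list(readings)         # the stream itself is now exhausted
stored = pool.history("sensor")   # the pool can still answer historical queries
```

Every application that wants history has to run its own version of `Pool`, which is the duplicated cost the next paragraphs discuss.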

A pool of data provides extra flexibility, but comes at the cost of requiring each consuming application to maintain its own infrastructure to hold copies of that data. Even if each source of data provides direct access to its own pool, e.g. by exposing a web-based query interface onto its database, or by exposing linked data, there are still unnecessary overheads. Each data provider must provide their own scalable infrastructure and support a rich set of data access options.

If we start building large pools of data, within a community supported infrastructure, then we have a reservoir. A reservoir is a pool of data that is maintained by and services a specific community. Reservoirs allow issues such as quality of service (reliable supply of water) and infrastructure costs (building of pipelines) to be solved at a community level.

It’s possible to argue that the web already consists of streams, pools, and reservoirs, but there is a distinct difference between a web based on semantic web technology and a web constructed of a mixture of XML documents or similar formats: like water, at the molecular level, all RDF is the same; it’s all triples. Unlike alternatives, RDF data is more easily pooled and collected and so is much more amenable to explorations of shared infrastructure. Like a relational database, an RDF triple-store can contain a huge variety of different kinds of data. But unlike a relational database, an RDF triple-store has the potential for the aggregate to be much more than the sum of its parts. The seeds of convergence are built in, through reliance at the most fundamental level on a global naming system (URIs) and standardised ways to state equivalence and relationships between resources.
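
A short sketch can show why triples pool so easily. It models a graph as a plain Python set of (subject, predicate, object) tuples rather than using a real triple-store, and all example.org URIs are hypothetical; owl:sameAs is the standard OWL property for stating that two URIs name the same thing.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def pool(*sources):
    """Pooling RDF is set union: every source is just a set of triples."""
    merged = set()
    for triples in sources:
        merged |= set(triples)
    return merged


def describe_resource(graph, resource):
    """Collect (predicate, object) pairs for a resource and its owl:sameAs aliases."""
    aliases = {resource}
    changed = True
    while changed:  # keep following equivalence links until no new alias appears
        changed = False
        for s, p, o in graph:
            if p == OWL_SAME_AS and (s in aliases) != (o in aliases):
                aliases |= {s, o}
                changed = True
    return {(p, o) for s, p, o in graph if s in aliases and p != OWL_SAME_AS}


TITLE = "http://purl.org/dc/terms/title"
PRICE = "http://example.org/shop#price"  # hypothetical predicate

publisher_data = {("http://example.org/pub/book1", TITLE, "Example Book")}
retailer_data = {
    ("http://example.org/shop/item9", PRICE, "10"),
    ("http://example.org/shop/item9", OWL_SAME_AS, "http://example.org/pub/book1"),
}

reservoir = pool(publisher_data, retailer_data)
facts = describe_resource(reservoir, "http://example.org/pub/book1")
```

Neither source knew about the other’s schema, yet after pooling, a question about the publisher’s book also returns the retailer’s price. That is the sense in which the aggregate is more than the sum of its parts.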

In the real world, reservoirs do more than supply a community with water. The aggregate has its own uses: water skiing or hydro-electric power generation for example. And the same will be true of semantic web data reservoirs: large collections of data can be analysed and re-purposed in ways that are not possible – or at least not achievable without a great deal of repeated, redundant integration effort – using other techniques. The reservoir itself can be the source of new facts and new streams of data derived from analysis of its contents.

Flowing Data through the Talis Platform

The goal of the Talis Platform is to support the growth of the Linked Data ecosystem by providing the infrastructure to support the creation of pools of data. For additional background, see my article “Enabling the Linked Data Ecosystem” from Nodalities issue 5.

At present the Platform provides a range of services that allow data to be easily streamed into and out of Platform stores, allowing data to be easily pooled in order to benefit from greater context. Data can be pushed directly into the Platform and we are exploring methods of supporting other forms of data ingestion to make it easier and more natural to begin to accumulate data sets within the Platform.

The core search service, which produces its results in RSS, allows the creation of simple data streams, while the SPARQL interface supports more complex data extraction methods. The Augmentation service provides an interesting twist on these conventional approaches, providing a means for any RSS 1.0 feed to be automatically enriched with extra metadata by feeding it through a Platform data store. This means of interaction is like fishing for data: it is possible to serendipitously find and extract data, capturing it as extra context to items in an RSS feed, without having to deal with writing SPARQL queries or constructing a keyword search. There are many more methods and modes of data extraction that will be added to the Platform to add to these existing services; this is just the beginning.

But the Talis Platform is intended to provide much more than just the ability to work with pools of data. The bigger vision is to support the creation of true data reservoirs, and enable many different ways of manipulating and analysing their contents in order to discover new facts and bring new context to that data. Creation of these larger pools of content will need to be made sustainable for the communities that are creating them, and deriving value from them. Sustainability covers a wide range of issues that go beyond just commercial issues: quality and range of services are additional factors, as are forms of governance, trust and quality that relate to the data sets themselves. The Platform is intended to address all of these issues.

To take a small example, the experimental “store groups” feature, released at the end of last year, provides a simple method for combining datasets, without requiring that data to be completely loaded or copied into a single database. The store groups feature will ultimately support a range of services over the constituent data sets, allowing each pool of data to remain intact whilst still contributing to the whole; this will be important to support the new forms of governance that are beginning to emerge around datasets on the Linked Data web.