Nodalities

From Semantic Web to Web of Data

Author Archive

European Summer School

Talis is delighted to be one of the sponsors of the 8th European Summer School on Ontological Engineering and the Semantic Web (SSSW 2011). There will be more about this in coming posts, but just to start off:

We are sponsoring it for a very simple reason. The mix of theoretical, practical and collaboration skills used by all the students involved from across Europe directly corresponds to how we work at Talis. It’s an environment of support and challenge, contribution and connection that has proved beneficial for all involved over the years. Talis is proud to contribute and participate to further the aims of the community.

Talis is a small and ambitious company of like-minded, motivated people. A phrase we often use here is Human Scale. Culturally, what we mean by that is that we like working closely with people we all know, whether as employees of Talis or (more likely) as partners collaborating over time in joint endeavours.

We want to grow our company and contribute to the communities we belong to. We know that it is by fostering relationships with others driven by the same passion to collaborate and learn that we can build on the ambitions we have for ourselves and for the communities we belong to. One particular aspect of the Summer School is this same notion of social connectedness, a personal network of trusted relationships that challenge and enhance the experience for everyone.

LOD Around-the-Clock (LATC)

Guest post by Lin Clark and Michael Hausenblas, DERI

In this, the Petabyte Age, technologists have a growing obsession with data, and with Big Data in particular. But data isn’t just the province of trained specialists anymore. Data is changing the way scientists research and the way that journalists investigate; the way government officials report their progress and the way citizens participate in their own governance.

The challenge that all of these accidental technologists face is how to surface data and bring data together in meaningful ways. As Google’s chief economist Hal Varian has said, the scarce factor is no longer the data, which is essentially free and ubiquitous, but now the “scarce factor is the ability to understand that data and extract value from it.”

The emerging Web of Linked Data is the largest source of this data (multi-domain, real-world and real-time data) that currently exists. As data integration and information quality assessment increasingly depend on the availability of large amounts of real-world data, these new technologists are going to need to find ways to connect to the Linked Open Data (LOD) cloud.

With the explosive growth of the LOD cloud, which has doubled in size every 10 months since 2007, utilising this global data space in a real-world setup has proved challenging: the links between LOD sources remain few in number and low in quality, and there is no well-documented and cohesive set of tools that enables individuals and organisations to easily produce and consume Linked Data.

A new project aims to change this, making it easier to connect to the LOD cloud by offering support to data owners, Web developers who build applications with Linked Data, and small and medium enterprises that want to benefit from the lightweight data integration possibilities of Linked Data.

LATC to the Rescue

The new LOD Around-the-Clock (LATC) project kicked off on September 13-14, 2010 at the Digital Enterprise Research Institute in Galway, Ireland. LATC brings together a team of Linked Data researchers and practitioners from DERI (National University of Ireland Galway), Vrije Universiteit Amsterdam, Freie Universität Berlin, Institut für Angewandte Informatik, and Talis.

This team will support the production and consumption of Linked Data by providing:

  1. A recommended tools library for publishing and consuming Linked Data, supplementing documentation for the tools, and free implementation support for large-scale data publishers and consumers. Tools include the D2R Server for publishing relational databases on the Semantic Web, the Drupal CMS and related publishing and consumption tools, and others.
  2. A 24/7 interlinking platform (see Fig. 1) that acquires new data and creates links between existing datasets in the LOD cloud.
  3. Publication of new large-scale LOD datasets with data from governmental departments and other organizations. The focus will be on EU level datasets such as CORDIS, the European Patent Office, and Eurostat.

Fig. 1: LATC structure diagram

Homepage: http://latc-project.eu/
Twitter: @latcproject
Duration: 09/2010 - 08/2012
Total cost: 1.19 M€
EU contribution: 1.06 M€
Further information:
Dr. Michael Hausenblas
IDA Business Park, Galway, Ireland
Tel. +353 91 495730
michael.hausenblas@deri.org

In addition to the core team, a large Advisory Committee with more than 30 members will participate in the LATC activities and connect the Linked Data community to LATC’s recommended tools library and support services. Organizations on the Advisory Committee are entitled to support from the project and thus will be in a position to give feedback to improve the support services. The Advisory Committee includes governmental organisations such as the UK Office of Public Sector Information and the European Environment Agency; researchers and practitioners such as the University of Manchester, University of Economics Prague, Vulcan Inc., CTIC Technological Center, the Open Knowledge Foundation; and standardisation bodies, including W3C (Tim Berners-Lee). The LATC partners will also liaise with other EC projects and related activities, including LOD2, PlanetData, SEALS, datalift.org, Semic.EU, OKFN, and the Pedantic Web group.

LATC organises and supports a number of community events, including tutorials at the International Semantic Web Conference 2010 in Shanghai, China, as well as the Open Government Data Camp, London.

LATC is a Support Action funded under the European Commission FP7 ICT Work Programme, within the Intelligent Information Management objective (ICT-2009.4.3).

Sharing Data on the Web

This article will appear in Nodalities Magazine, Issue 9.

by Kaitlin Thaney
Program Manager of Science Commons, Creative Commons

In the emerging data web, there have been multiple efforts working towards the same broad goal of data sharing (e.g., the NeuroCommons, Linked Open Data, efforts of the World Wide Web Consortium), but they are still unevenly distributed. Our understanding of the legal, social and technical issues is increasing, but is still at a very early stage.

This past fall at the International Semantic Web Conference in Chantilly, VA, USA, I joined three other leading minds to lead a tutorial examining some of the legal and social frameworks for sharing data in the emerging data web, focusing on an overview of the need for access, the social issues of applying Free-Libre/Open Source (FLOSS) licenses to data, and the approach we advocate at Creative Commons to help navigate this complex space — converging on the public domain.

Lessons Learned

Creative Commons as an organisation works to make knowledge sharing easy, legal and scalable – with applications in the culture space (music, text, film, art), education (open educational resources, virtual textbooks), and science (biological materials transfer, data sharing, Open Access, semantic web, patents). We maintain an integrated approach, and craft policy and legal tools to lower the barriers to knowledge sharing.

When it comes to data sharing, first and foremost, the information needs to be legally and technically accessible. The Open Access movement has increased awareness of this, using the Creative Commons licensing suite to unlock content, and has seen its share of qualified success. But what to do when the information you want to share and reuse falls outside the protections of copyright?

In short, it’s complicated.

This is where the discussion of legal protections for data gets murky. Knowledge is not always copyrightable – it may be easy to discern the rights associated with journal articles, but what about data, ontologies, annotations, or research statements described in triples?

The emergence, adoption, and use of the free-libre/open licensing regimes has allowed for remix and reuse of software code, music, film, educational resources and scientific research in a way that otherwise would be difficult to achieve.

The success of these licensing approaches has changed the social ethos of licensing: the traditional “all rights reserved” model is instead used to make something more free, rather than less.

But from our research, this approach is not ideal for data. The trend towards applying licenses, click-wrap agreements and other sorts of restrictions on scientific data is increasing, but with the undesired consequence of limiting the downstream use of this information, and even at times blocking interoperability. The costs are high, the terms are not always clear, nor the protections always legally sound, making it very difficult to scale for scientific uses. The result is a high barrier to entry for meaningful analysis, annotation, search, etc. on the mass of data currently available, which continues to grow exponentially, and for integrating that data with the available literature.

We advocate an approach of converging on the public domain, and requesting behaviours often found in the various flavours of free and open licensing through norms – not a legal construct. But first, let’s take a look at some of the issues to be aware of and their social implications to furthering the goal of linked open data.

Attribution v. Citation

Under US Copyright law, “Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.” Since facts are not covered by copyright, attribution – a license obligation – doesn’t seem to apply to ideas or facts either, since those rights are conditional on compliance with terms of the license.

Socially, the scholarly concept of citation is fairly well understood – credit where credit is due. It has long been viewed as an entrenched norm of good scientific practice.

But when it comes to the legalities of both terms and how to enact this behaviour, the devil is in the details, and the two are actually rather different when it comes to enforceability and applications / ramifications in the digital world.

In a copyright license, the word “attribution” is a legal requirement, whereas citation evokes more of a club mentality and social practice. Citation in its sole form is not assured or enforceable in the same way, but that’s not necessarily a downside. Ask yourself this, which one is more important – legal enforcement or credit enforced through professional reputation? Attribution – a relatively narrow legal term that can affect interoperability while at the same time possibly failing to provide what you really want? Or citation – an entrenched scientific norm that asks for credit where credit is due.

Implications of FLOSS toggles and directives on data sharing

These issues emerge when instead of focusing on maximizing interoperability of resources, one applies a property metaphor to data. And in the digital world, that tendency can have quite limiting ramifications to future use of the information, as technology continues to outpace the social components to data sharing.

Misunderstanding the legalities can lead to category errors on the social level, including unintentional infringement or on the other side of the spectrum, choosing not to use the resource for fear of infringement. The intentions are often good – believing that applying a less-restrictive copyright license is ensuring the data can be freely shared, reused, and built upon. But without existing precedent or involving a legal team, these issues make for a problematic area to navigate, creating additional confusion and burdens for the users, as well as data providers.

Let’s look at a few examples to gain a better understanding.

Non-Commercial – When used in the context of data, what is a commercial use of the data web? Is it the extraction of a subset, a query that may touch on the data set, hyperlinking?

Attribution – As detailed above, the definitions of attribution and citation are often conflated. Attribution speaks to the legal requirement triggered by the use of the work. But in the case of linked open data, if one were to run a query involving 30,000 data sources (something that is happening every day at an ever decreasing cost), would they then be required to attribute the contributors for all 30,000 databases? You can see how this unintended consequence of attribution stacking could impose a very daunting task for the researcher.

Share-Alike – This toggle specifies that any derivative product be relicensed under the same terms. In the example above of running a large query, all it would take would be one database licensed with a share-alike provision for the entire derivative work to then be under the same terms and no other license. This leads to compatibility issues.
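The stacking effect in these examples can be made concrete with a small sketch. The Python below is illustrative only: the `Source` record, the license labels and the `obligations` function are hypothetical names, and the rules are a deliberate simplification of the toggles discussed above, not a model of any real license text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """One database touched by a query (hypothetical record for illustration)."""
    name: str
    attribution: bool = False        # credit is a legal condition of use
    share_alike: bool = False        # derivatives must carry the same license
    license_id: str = "public-domain"


def obligations(sources):
    """Collect the obligations a combined result inherits from its sources."""
    must_attribute = [s.name for s in sources if s.attribution]
    share_alike_licenses = {s.license_id for s in sources if s.share_alike}
    return {
        "attribution_count": len(must_attribute),
        "required_licenses": sorted(share_alike_licenses),
        # Two different share-alike licenses cannot both be the only license.
        "conflict": len(share_alike_licenses) > 1,
    }


# A query touching 30,000 attribution-licensed sources plus two share-alike ones.
sources = [Source(f"db{i}", attribution=True, license_id="by") for i in range(30000)]
sources.append(Source("x", attribution=True, share_alike=True, license_id="by-sa"))
sources.append(Source("y", share_alike=True, license_id="other-sa"))

result = obligations(sources)
print(result["attribution_count"], result["required_licenses"], result["conflict"])
```

Under these assumptions, a single share-alike source is enough to fix the license of the whole result, and a second, different one leaves no license that satisfies both.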

There are other external mechanisms and limitations imposed by various jurisdictions and countries that can have a profound effect on data-sharing, especially in terms of international data sharing efforts. These include the sui generis database directive in the European Union, Crown Copyright, “sweat of the brow” and “industrious collection” limitations, trade secrets and unfair competition laws, adding another dimension of complexity to an already complex arena.

After convening a series of meetings, roundtables and other discussions with members of the scientific community, the need emerged for a legally accurate and simple solution, that reduced and/or eliminated the need for one to make the distinction of what’s protected. The conflict between understanding the legal issues and complexities can best be resolved by a two-fold approach: (1) a reconstruction of the public domain and (2) the use of scientific norms to request behaviour through a non-license means.

Converging on the Public Domain (+ Norms)

We believe that the public domain is the best means to achieve maximum interoperability of data with the lowest imposed burdens on the user. This can be achieved through the use of a legal tool – either the Creative Commons CC0 Waiver or the Public Domain Dedication and License (PDDL) – waiving all intellectual property rights and asserting that the provider makes no claims on the data. These tools put the work as close to the public domain as possible.

It calls for data providers to waive all rights necessary for data extraction and re-use (i.e., copyright, sui generis database rights, claims of unfair competition, implied contracts). It also requires that the provider place no additional obligations such as copyleft or share-alike on the information, which could limit downstream use, as discussed above.

Science Commons also crafted the Protocol for Implementing Open Access Data – a protocol for evaluating database terms of use, in hopes of providing a unified framework for users to evaluate if any given database may be integrated with any other database.

The Protocol recommends one request behaviour, such as citation, through norms and terms of use rather than as a legal requirement based on copyright or contracts.

We are aware that different disciplines and jurisdictions call for different approaches, and this is not always a one-size-fits-all solution. By requesting behaviour through norms and terms of use rather than a legal construct, various scientific disciplines have the ability to develop their own norms for citation, allowing for legal certainty without constraining one community to the norms of another.

Final Thoughts

In the early days of the World Wide Web, there weren’t many free-libre licenses available, and after a debate over using GPL for the original web code, CERN chose to put it into the public domain. Getting the law out of the way was key to allow for network effects, and to the success of the Web.

Converge on the public domain and ensure the freedom to integrate. It’s the most scalable solution.

This work is licensed under a Creative Commons Attribution 3.0 License.

DataIncubator: What Is It and What's In It?

by Leigh Dodds

This article first appeared in Nodalities Magazine, Issue 8.

The Linking Open Data project has had a huge amount of success in bootstrapping the burgeoning Linked Data cloud. There’s now a definite sense of momentum behind the project, and a growing number of organisations are now seriously investigating how their data could further enrich the growing Semantic Web, and how the underlying technologies may help them to innovate and explore new opportunities.

The Linked Data community has rightly begun to look at the next round of challenges: What can we do with all this data? How can it be pressed into service to create new applications? What kinds of frameworks do we need to support consumption of Linked Data? But it is important that we shouldn’t lose sight of the fact that there’s still a huge amount of evangelism to be done and a great deal of data that could and should be part of the web of data. The Linked Data landscape is still not fully mapped out. In short, we need to keep up the process of accumulating, converting, publishing and linking data in as many different subject areas and disciplines as possible.

To date, the bootstrapping process has been supported by a number of community-led projects that convert and re-publish datasets to bring them into the web of data. The recently founded DataIncubator project (http://dataincubator.org) aims to adopt this same “show don’t tell” approach, but with the addition of some best practices and with an eye on long-term sustainability.

Sustainability, Repeatability, Reusability

A key goal of the project is to lightly formalise the way these dataset conversions are carried out to make sure they are sustainable, repeatable, and reusable. But why are these particular aspects important?

Firstly, let’s consider sustainability. As usage of the Linked Data cloud grows, we need to make sure that new data being added isn’t going to disappear later, e.g. because a small project website goes offline or because the original project owner loses interest. It is critical, as serious applications begin to be built against this data, that consumers can rely on it. One of the primary ways the project is ensuring sustainability is through making use of the Talis Connected Commons scheme (http://www.talis.com/cc). All of the public domain datasets that are converted and published through the DataIncubator project site are being hosted in the Talis Platform. This takes full advantage of the free data hosting offered under the Connected Commons initiative. Talis is therefore contributing to the sustainability of that data.

The second aspect to consider is repeatability. The first goal is to make sure that the data conversion process is itself repeatable; that is, we can easily re-generate the data to allow for modelling changes, bug fixes, and the ingesting of new data. And not just now when a project is active, but in three years’ time when the project may be picked up and extended by a number of other contributors. Ensuring that each of the incubated datasets is supported by open source code makes this more achievable. Ideally, the original dataset owners will be convinced by the benefits long before a project goes stale, but it’s important to recognise that evangelism can take time and that different industries move at different speeds. There are already a few Linked Data and RDF projects on the web that model and re-publish the same basic dataset in other ways. By trying to build a community around curating the conversion of a dataset and not just the data itself, DataIncubator hopes to avoid these issues.
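As a rough illustration of what a repeatable conversion can look like, here is a minimal Python sketch that turns a small CSV extract into N-Triples. It is not DataIncubator's actual tooling: the `BASE` namespace, the column names and the choice of `rdfs:label` are all assumptions made for the example.

```python
import csv
import io

BASE = "http://example.org/id/"                        # hypothetical namespace
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"   # rdfs:label


def escape(text):
    """Escape characters that are special inside an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def convert(csv_text):
    """Turn rows with 'id' and 'name' columns into sorted N-Triples lines."""
    rows = csv.DictReader(io.StringIO(csv_text))
    triples = {
        f'<{BASE}{row["id"]}> <{LABEL}> "{escape(row["name"])}" .'
        for row in rows
    }
    # Sorting makes the output independent of input order, so re-runs diff cleanly.
    return sorted(triples)


source = "id,name\n2,Brick Set B\n1,Brick Set A\n"
for line in convert(source):
    print(line)
```

Because the conversion is a pure function of the source file, anyone holding the code and the source data can regenerate the same output after a modelling change or bug fix.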

The final aspect is one that is often overlooked: how can the original dataset owner build on what the community has created? How can the community’s efforts be reused? Reusability is enabled by ensuring that the conversion code is open source and that schemas and modelling design decisions are well documented. This can lower the barrier to entry facing data providers or publishers looking to embrace Semantic Web technology. This is the case particularly where the data conversion is acting on source data (e.g. open, but not linked data). In this case, the data owner may merely need to re-run the data conversion and publish the Linked Data through their own site rather than DataIncubator. This makes adoption much, much easier.

Community Norms

Alongside addressing these procedural aspects of the data conversion process, the DataIncubator project also encourages a number of useful community norms that will hopefully improve the quality of the converted datasets.

The first of these is to ensure that there is a sufficient amount of both linking and attribution. Every dataset within the umbrella project should reference its original sources. This should not take place just at a high level, such as within the corresponding VoID description (http://rdfs.org/ns/void/). Instead, references should be deeper so resources can be associated with, for example, the original web pages that describe them. This ensures that there is a clear path back to the original source of the data. Attribution, in various forms, is an important community norm in its own right, but it is especially important in the context of converting and re-publishing an existing dataset. We want to ensure that the original curators of the data don’t think that the community is trying to appropriate or steal their work. Quite the opposite: we want them to embrace it.
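A minimal Python sketch of the two levels of attribution might look like the following. The choice of `dcterms:source` for the dataset-level link and `foaf:page` for the resource-level link is an assumption made for illustration, as are all of the example URIs; the project itself may model this differently.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VOID_DATASET = "http://rdfs.org/ns/void#Dataset"
DCT_SOURCE = "http://purl.org/dc/terms/source"
FOAF_PAGE = "http://xmlns.com/foaf/0.1/page"


def attribution_triples(dataset_uri, source_home, resources):
    """Yield (subject, predicate, object) URI triples that point back to the origin."""
    # Dataset-level attribution: the high-level statement.
    yield (dataset_uri, RDF_TYPE, VOID_DATASET)
    yield (dataset_uri, DCT_SOURCE, source_home)
    # Resource-level attribution: each resource links to the page it came from.
    for resource_uri, original_page in sorted(resources.items()):
        yield (resource_uri, FOAF_PAGE, original_page)


# Hypothetical dataset and source site.
triples = list(attribution_triples(
    "http://incubator.example.org/films",
    "http://source.example.org/",
    {
        "http://incubator.example.org/films/1": "http://source.example.org/film?id=1",
        "http://incubator.example.org/films/2": "http://source.example.org/film?id=2",
    },
))
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```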

The other norm relates once again to sustainability. Links to the data should be stable, but how do we achieve this if the data will ultimately be removed from the DataIncubator site and moved to another domain? The proposal here is that as data is migrated to its permanent home, redirects will be put into place to ensure that web browsers and semantic web agents can follow the links to their primary source. Every effort will be taken to ensure that links don’t break.
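The redirect proposal can be sketched as a simple prefix mapping. This Python is a hypothetical illustration, not the project's implementation: the `MOVED` table, both URI prefixes and the `resolve` function are invented for the example, and the use of HTTP status 301 (Moved Permanently) is an assumption about how such a move would be signalled.

```python
MOVED = {
    # old incubator prefix -> permanent home (both hypothetical)
    "http://incubator.example.org/sets/": "http://data.example.com/sets/",
}


def resolve(uri):
    """Return (status, location) for a requested URI."""
    for old_prefix, new_prefix in MOVED.items():
        if uri.startswith(old_prefix):
            # 301 tells browsers and semantic web agents the move is permanent.
            return 301, new_prefix + uri[len(old_prefix):]
    # Not migrated: serve the resource from where it is.
    return 200, uri


print(resolve("http://incubator.example.org/sets/42"))
print(resolve("http://incubator.example.org/films/7"))
```

Keeping the path after the prefix unchanged means every existing link maps to exactly one new location.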

What’s In It?

The DataIncubator project already has a wide range of datasets available:

There’s a lot more that could yet be added to this list. My personal wishlist includes a conversion of the Prelinger Archives (http://www.archive.org/details/prelinger). This is hosted as part of the Internet Archive project and consists of over 2000 industrial, educational, travel, and propaganda videos published from 1903 to the 1970s. The content is completely within the public domain, so it’s just begging to be converted. It would also be a great dataset on which to explore the modelling of media and media annotations in general.

Currently, one domain with very little Linked Data is gaming, in all of its forms. For example, there is a vast amount of community-curated data about Lego, Lego sets, and Lego models. And what about all of the facts and figures that are routinely collected around online gaming? Data might be available through specific community websites, but what could be built if the data were more open, allowing the community to analyse and re-present this data in new ways?

It strikes me that games and gaming is an area that is ripe for exploration. There are many interesting dimensions to the data, and the communities are very engaged. Many gamers are typically very interested in statistics and data about the games they play. This is just one area of the Linked Data landscape that the DataIncubator project is hoping to help explore.

The Greatest Challenge Facing IT

by Lee Feigenbaum and Mike Cataldo

This article features in Nodalities Magazine, Issue 7.

As the old adage goes: Time is money.

Ultimately, information systems are about saving time. One could argue that technology enables analysis that facilitates competitive differentiation or improved product quality, but the fact of the matter is that these things and others could all be done without computers; they would just take much, much longer.

A lot has been said and written about information overload. Ultimately, though, the issue with ever-expanding data is that the data we need becomes hidden in mountains of other data. Typically, these mountains take the form of relational databases where the data is neatly stored in rows and columns, and we find the data in one of two ways. Either we directly look up data by its “address” within the database, or else we use a simple text search. But if we don’t know what table or column the data resides in, we can’t look it up. And as the quantity of data grows, text searching the mountain of data itself yields a mountain of results. Combing through these results then compromises the real benefit of information technology: time savings.

This leads to the greatest challenge facing IT organisations across industries: how to provide users the data they need when they need it, visualised in a way that is understandable and useful. Or put more simply: get the right data, for the right people, at the right time. Traditionally, this is much easier said than done, as the data lives in multiple databases, exists in various formats, and no user interface exists to present the information in a way that is helpful to the user.

Typically, the approach to solving these problems involves some sort of data warehouse. Atop the warehouse, we’d probably deploy a business intelligence (BI) solution to surface the answers to common queries to the people who need them.

Another tactic might be to install a document management system that stores documents in a central repository, where employees can use search and basic metadata to better locate individual pieces of information.

Or we might build a portal to allow people to view the right data from multiple silos in a timely fashion. By defining a collection of portlets as views into specific sources of data, we can provide a one-stop location for people to view information from business-critical data sources.

Pursuing any of these typical solutions means spending 6-18 months at a time solving a single problem. And even worse, all of these approaches are doomed to obsolescence from the start. As requirements change, the fixed schemas and the complex ETL processes inherent to data warehouses must be recreated from scratch. The canned queries and views that define BI- and portal-based approaches must be constantly re-evaluated. And the limited search and query capabilities of a document management system mean that new requirements demand a new installation.

In short, traditional approaches all suffer from the dreaded Shampoo Syndrome: the only workable long-term solution is to constantly lather, rinse, and repeat. And when we do, we just create another mountain of data, another place where what we really need can hide.

The solution is to find data by its meaning rather than its location

The key to eliminating many of the inefficiencies of today’s information technology solutions is to access data by its meaning—what it is—rather than its location—where it is. With meaning, we can quickly find what we need simply by describing what it is. This enables information to be shared and consumed at the data level, a paradigm known as data collaboration.

With data collaboration, the data is much more granular, more accessible, and more consumable. In contrast, data warehouse, BI, and portal solutions, in addition to contact tracking (CRM), supply-chain management (SCM), employee management (HR), and all-in-one enterprise bundles (ERP), all fall into the category of data containment. While these applications (commonly known as data silos) excel in capturing extremely structured data, they make it almost impossible to get the data out to be re-used by other users and in other applications.

Document management systems, on the other hand, attempt to make information more shareable, but essentially end up creating many mini-silos in the form of Word documents, PDFs, Excel spreadsheets, or Web pages. This is the world of document collaboration, in which information is readily shared, but the data we need is locked within the mini-silo.

Data collaboration is the best of both worlds. By combining the ease of access to information that is the hallmark of document collaboration with the highly structured nature of data from data containment solutions, we can begin to answer the IT challenge. The key to success is to ensure that the meaning of every data element is surfaced so that it can be easily accessed by any person or application that needs it.

Data Collaboration and the Semantic Web

It’s no coincidence that the technology standards developed over the past ten years in support of Tim Berners-Lee’s vision of a Semantic Web are the key elements for building data collaboration solutions. For as with data collaboration, the Semantic Web relies on explicitly capturing the meaning of data. As such, the core Semantic Web standards pave the way for:

  • Flexible, define-as-it-arrives, data structures
  • Explicit relationships that travel with the data
  • Data that is accessed by its definition rather than its address
  • Distributed query
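The first three points can be illustrated with a toy triple store. This Python sketch is an assumption-laden simplification: it uses plain tuples and prefixed strings such as `ex:acme` as stand-ins for RDF terms, and `add` and `match` are hypothetical helpers, not part of any Semantic Web library.

```python
triples = set()


def add(subject, predicate, obj):
    """Record one fact; no table or column has to exist beforehand."""
    triples.add((subject, predicate, obj))


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


add("ex:acme", "rdf:type", "ex:Customer")
add("ex:acme", "ex:region", "ex:emea")
# A new kind of fact arrives later; no schema migration is needed.
add("ex:acme", "ex:accountManager", "ex:jo")

# Access by definition: "everything that is a customer", not "row 7 of table X".
customers = [s for s, _, _ in match(predicate="rdf:type", obj="ex:Customer")]
print(customers)
print(match(subject="ex:acme"))
```

The relationship names travel with each fact, so a consumer needs only the pattern it cares about, not knowledge of where the fact was stored.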

As with all standards, Semantic Web technologies lay the groundwork that makes improvement possible. It is up to application developers to build solutions that make the standards practical.

Practical Data Collaboration to Solve IT’s Challenge

Cambridge Semantics is one of the first companies to develop practical business solution enablers based on Semantic Web standards. In short, the Anzo products allow businesses to layer a semantic fabric over existing data that:

  1. Virtualizes the data so that it is accessible by its description regardless of location.
  2. Lets users create their own views of data.
  3. Fills in the views by traversing the fabric and picking out the relevant information.
  4. Keeps everything in sync by allowing updates that occur anywhere to update information everywhere.

The Right Data…

At the heart of the Anzo suite of products is the Anzo Data Collaboration Server. This acts as a central gateway that provides a consistent interface for applications to read, write, and query RDF data, regardless of the actual source of the data. While RDF provides the flexibility to incorporate new data as it is virtualised, it’s all for naught without the proper adaptors for existing data sources. To facilitate access to the right data, the Anzo Data Collaboration Server can connect to data sources including LDAP directories, HTTP-accessible Linked Data, and standard relational databases.

But perhaps one of the most useful connectors is Cambridge Semantics’ Anzo for Excel. With Anzo for Excel, data inside spreadsheets with arbitrary layouts can be linked into the Anzo Data Collaboration Server. By breaking down the walls of spreadsheet mini-silos, Anzo for Excel weaves information from thousands (or more) spreadsheets scattered across a business, dramatically increasing the availability of the right data.

…For The Right People

Getting the data in front of the right people relies on three things: context, security, and “reach”.

Context. It’s not enough simply to have the right data. People must have access to views of the data that depict exactly what they need to see, whether it be an executive dashboard, a regional summary map, or a customer-by-customer detailed report. Cambridge Semantics’ visualisation product, Anzo on the Web, allows the same information to be rendered in many different ways via semantic lenses. Lenses provide context-appropriate user interfaces to render a particular type of data, meaning that the right people see the right data in the right way.

Security. In many ways, security is the converse of context. While context ensures that the right data surfaces properly to the right people, robust security makes sure data does not surface to the wrong people. The Anzo Data Collaboration Server provides security by layering a role-based access control model atop the semantic fabric. All data access is gated through this security model, which defers to the permissions schemes of legacy data sources where appropriate. The result is that only the right people can ever see (or change) the right data.

Reach. The right data needs to be able to be brought to the right person, whether that person is a technical staff member, a line-of-business manager, a “power user,” or a senior executive. As such, the software must be within reach of all users, without the need to call on IT. Research analysts must be able to collect and share spreadsheet data themselves. Anzo for Excel reaches these users by allowing spreadsheets to be visually linked with just a few clicks. Supply-chain managers must be able to drill through data on warehouses, suppliers, and distributors on their own terms. Anzo on the Web reaches these users via a simple and customisable faceted browsing paradigm, whereby anyone can add their own filters, add their own lenses, query their data however they like, and save the results to re-run later or share with colleagues.

…At The Right Time

Finally, it’s not enough to just bring the right data to the right people. It also needs to be done in a timely fashion.

First, data access against existing data sources is accomplished via federated (distributed) query. SPARQL is explicitly designed to enable queries that access multiple data sources at once, and the Anzo Data Collaboration Server includes a SPARQL engine that does exactly that. By querying the source data directly, Anzo eliminates the cycle time typically associated with a data warehouse’s ETL processes.

Second, data updates performed via the Anzo Server are broadcast out in real-time to anywhere the data resides. This means that if a value is changed in a spreadsheet cell, the value instantly updates anywhere else it appears, including Web pages or within a relational database. This is essential because, as semantic tools become available to users across the enterprise, many spreadsheets, Web pages, and databases will share the same piece of data and must be able to do so with confidence.
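The broadcast idea can be illustrated with a generic publish/subscribe sketch. This is not the Anzo API: the `DataHub` class, the `warehouse:42` identifier, and the `stockLevel` property are all hypothetical names chosen for illustration, and two dictionaries stand in for a spreadsheet cell and a Web page.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout; a generic publish/subscribe sketch,
# not the interface of any real product.
Listener = Callable[[str, str, str], None]


class DataHub:
    """Stores the current value of each (subject, property) pair and
    notifies every registered listener when that value changes."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, str], str] = {}
        self._listeners: Dict[Tuple[str, str], List[Listener]] = defaultdict(list)

    def subscribe(self, subject: str, prop: str, listener: Listener) -> None:
        self._listeners[(subject, prop)].append(listener)

    def update(self, subject: str, prop: str, value: str) -> None:
        # Record the new value, then push it to everyone who shows it.
        self._values[(subject, prop)] = value
        for listener in self._listeners[(subject, prop)]:
            listener(subject, prop, value)


# Two stand-ins for places where the same value is displayed.
spreadsheet_cell = {}
web_page = {}

hub = DataHub()
hub.subscribe("warehouse:42", "stockLevel",
              lambda s, p, v: spreadsheet_cell.update({p: v}))
hub.subscribe("warehouse:42", "stockLevel",
              lambda s, p, v: web_page.update({p: v}))

# One update reaches both consumers.
hub.update("warehouse:42", "stockLevel", "117")
print(spreadsheet_cell, web_page)
```

The point of the design is that the writer of a value never needs to know who is displaying it; each consumer registers its own interest once.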

Data Collaboration in the Days to Come

Imagine a world in which this challenge has been solved. End users—whether knowledge workers, line of business managers, or executives—can simply draw a picture of what they want to see and then choose the data that should fill in the picture. Within minutes rather than months the right data shows up on the right people’s screens. Now imagine that the data is live as well: you make a correction to the data and your changes are reflected in real-time in whatever legacy database or application the data comes from. You’ve managed to maintain a single source of truth for your key information assets, while still preserving existing investments in legacy systems and applications.

What sounds miraculous is possible today, in software such as Cambridge Semantics’ Anzo. By combining the revolutionary enabling capabilities of Semantic Web standards with solid, practical engineering, we open the door on a completely new paradigm for enterprise software: data collaboration.

Lee Feigenbaum is VP of Technology and Standards at Cambridge Semantics and co-chairs the W3C SPARQL Working Group.

Mike Cataldo is currently CEO of Cambridge Semantics and a veteran of multiple technology start-up companies.


Building A Civic Semantic Web

By Joshua Tauberer
| This article features in Nodalities Magazine, Issue 7

Technology is a new key player in government accountability and transparency. It’s our own defense against the threat of government information overload. Take the U.S. Congress: More than 10,000 bills are on the table for discussion at any given time, and Members of Congress are taking campaign contributions from thousands of sources. How can a representative be accountable if his legislative actions are too numerous to track? How can financial disclosure root out conflicts of interest if the interesting ones are buried deep within piles and piles of records? The threat to transparency isn’t sheer volume, however. It’s the complex network of relationships that makes up the U.S. Congress, and that makes it an interesting case for applying Semantic Web technology.

What the Semantic Web addresses is data isolation, and this is a problem for understanding Congress. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily meshable that MAPLight is possible. The Semantic Web makes this process cheaper by addressing meshability at the core. The more government data that is meshable, the easier it is to investigate connections across independent data sets, research the dynamics of the system, or teach others how Congress works.

Innovating the public’s engagement with Congress by applying technology has been the motivation behind my site www.GovTrack.us, a free congress-tracking tool that I built and have been running since 2004. GovTrack amasses a large XML database of congressional information, including the status of legislation, voting records, and other bits, by screen scraping official government websites that have the data online already but in a less useful form.

If “metadata” is tabular, isolated, and about web resources, the Semantic Web goes far beyond that. It helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in Congress with Members of Congress, what districts they represent, their population demographics, etc. We establish relations like sponsorship, represents, voted, and population across entities of many types. A web lets us ask new questions and, from there, transform their answers into visualizations. And because the Semantic Web is a generic platform for all data, I actually think it has the potential to radically and fundamentally transform the way we learn, share information, and live, but that’s still a bit far off.

So for the purposes of my tinkering with the Semantic Web, GovTrack creates an RDF dump of its database (13 million triples) covering bills, politicians, votes and more using a mix of existing schemas and some new ones that I created. I chose URIs for entities in the Linked Open Data tradition, HTTP-dereferenceable URIs that resolve to self-describing RDF/XML about the entity. Two good examples are for Senator John McCain and for H.R. 1, the economic recovery bill passed earlier this year. The HTML pages on GovTrack itself tie in to the RDF world through tags: bill pages include the URI I coined for the bill, for instance.

I also have a sometimes-working-sometimes-not SPARQL endpoint set up, SPARQL being the de facto query language for RDF. SPARQL lets us ask questions of the data, such as how did politicians vote on bills (see example 1). The SPARQL endpoint runs off of a “triple store”, the equivalent of a relational database for the semantic web, which is underlyingly a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. (It uses my own C#/.NET RDF library: http://razor.occams.info/code/semweb.) The RDF/XML returned by dereferencing the URIs is actually auto-generated by redirecting the user to a SPARQL DESCRIBE query (i.e. http://www.rdfabout.com/sparql?query=DESCRIBE+%3Chttp://www.rdfabout.com/rdf/usgov/congress/111/bills/h1%3E) using URL rewriting in Apache (for a robust solution, see my explanation at the end of http://rdfabout.com/demo/census/). For more about GovTrack’s RDF data, see http://www.govtrack.us/developers/rdf.xpd.
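To make the “table of triples” idea concrete, here is a minimal sketch using SQLite in place of MySQL. It is not GovTrack’s actual schema or code: the table name, the `example:` predicates, and the literal values are assumptions for illustration. Only the bill URI and the endpoint address come from the text above. The second function shows how a DESCRIBE request URL of the kind quoted above can be assembled.

```python
import sqlite3
from urllib.parse import quote_plus

# A triple store reduced to its simplest form: one table, three columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

# The bill URI is the one quoted in the text; the predicates and values
# are made up for this sketch.
BILL = "http://www.rdfabout.com/rdf/usgov/congress/111/bills/h1"
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (BILL, "rdf:type", "example:Bill"),
        (BILL, "example:title", "Example title"),
        ("http://example.org/other", "rdf:type", "example:Bill"),
    ],
)


def describe(uri):
    """Return every (predicate, object) pair stored for one subject."""
    rows = conn.execute(
        "SELECT predicate, object FROM triples "
        "WHERE subject = ? ORDER BY predicate",
        (uri,),
    )
    return rows.fetchall()


def describe_url(endpoint, uri):
    """Build the URL of a SPARQL DESCRIBE request for a URI."""
    query = "DESCRIBE <" + uri + ">"
    # Keep ':' and '/' readable; spaces become '+', '<' and '>' are escaped.
    return endpoint + "?query=" + quote_plus(query, safe=":/")


print(describe(BILL))
print(describe_url("http://www.rdfabout.com/sparql", BILL))
```

A real triple store would add indexes on each column and usually intern URIs as integers, but the shape of the data is the same.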

When data gets big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my area of the Semantic Web as several clouds. One cloud is the data I generate from GovTrack. Another cloud is data I separately generate about campaign contributions from data files from the government’s Federal Election Commission (FEC): 10 million triples. This cloud relates politicians to election campaigns and elections, campaign donors with zipcodes, and contribution amounts. A third data set is based on the 2000 U.S. Census, 1 billion triples. The census data has population demographics for many geographic levels, including states, congressional districts, and postal zipcodes (actually “ZCTA”s but we can put that aside). (For more, see http://rdfabout.com. Through the Census cloud the data is linked to Geonames and the rest of the Linked Open Data community.)

I’ve related the clouds together so we can take interesting slices through them. The GovTrack data connects to the FEC data through politicians. The Census data connects to the GovTrack data through states and congressional districts (the regions represented by senators and representatives) and to the FEC data through zipcodes. That means we ask questions that go beyond one data set such as: what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode, etc.? Once the Semantic Web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through heavy work of meshing two data sets for each new question once the data is already in RDF with connected URIs.
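A toy version of this linking shows why shared URIs make new questions cheap. Every URI, predicate name, and number below is a placeholder invented for illustration, not data from GovTrack, the FEC, or the Census.

```python
# Three small "clouds" of (subject, predicate, object) triples.
# All identifiers and figures are placeholders.
govtrack = [
    ("ex:people/A", "ex:represents", "ex:district/NY-2"),
    ("ex:people/B", "ex:represents", "ex:district/CA-12"),
]
fec = [
    ("ex:people/A", "ex:totalContributions", 1000),
    ("ex:people/B", "ex:totalContributions", 2500),
]
census = [
    ("ex:district/NY-2", "ex:population", 650000),
    ("ex:district/CA-12", "ex:population", 640000),
]


def objects(triples, subject, predicate):
    """All objects matching a subject and predicate within one cloud."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def district_profile(person):
    """Follow shared URIs across the three clouds for one politician."""
    rows = []
    for district in objects(govtrack, person, "ex:represents"):
        for population in objects(census, district, "ex:population"):
            for total in objects(fec, person, "ex:totalContributions"):
                rows.append((person, district, population, total))
    return rows


print(district_profile("ex:people/A"))
```

No mapping table between the data sets is needed: the politician URI joins GovTrack to the FEC data, and the district URI joins GovTrack to the Census data.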

Figure 1

My dream is to be able to plug in SPARQL queries into visualization websites like Many Eyes, Swivel, and mapping tools and instantly get an answer to my question in a compelling form. For now, some copy-paste is necessary. Let’s take an example. Did a state’s median income predict the votes of senators on H.R. 1, the economic recovery bill? Perhaps the senators from the poorest states, likely the most affected by the economic trouble, were more likely to want economic stimulus. This query takes a path through two of my clouds, depicted in Figure 1. The SPARQL query mimics the picture: each edge corresponds to a statement in the query. Except the real query is more complicated (it’s given at http://www.govtrack.us/developers/rdf.xpd). It is complicated not because RDF or SPARQL are inherently complicated, but because the data model that I chose to represent the information is complicated. That is, I made my data set very detailed and precise, and it takes a precise query to access it properly. If you run it on the SPARQL form on that page, get the results in CSV format, copy them into Excel, and run a correlation test, you’d indeed find a moderate correlation between median income and vote, but in the direction opposite to what we expected. (I know why, but I’ll let you think about it.)
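For readers who would rather script the last step than paste into Excel, here is a minimal sketch of the correlation test run on CSV output. The column names, the “Aye” label, and the sample rows are assumptions made for illustration; the rows are placeholders, not real census or vote figures.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlate_csv(csv_text, income_column, vote_column):
    """Correlate median income with a vote coded as 1 (aye) or 0 (other)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    incomes = [float(row[income_column]) for row in rows]
    votes = [1.0 if row[vote_column] == "Aye" else 0.0 for row in rows]
    return pearson(incomes, votes)


# Placeholder rows only; substitute the CSV returned by the SPARQL form.
sample = """state,median_income,vote
S1,40000,No
S2,45000,Aye
S3,60000,Aye
S4,65000,No
"""

r = correlate_csv(sample, "median_income", "vote")
print(round(r, 3))
```

A coefficient near +1 or -1 indicates a strong linear relationship and a value near 0 indicates none; the sign gives the direction.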

Figure 2

Another interesting case is whether campaign contributions to congressmen mostly come from their district, or if they get contributions from sources far away. The SPARQL query listed in example 2 extracts the relevant numbers for Rep. Steve Israel from New York: for each zipcode, the total amount of campaign contributions he received from individuals with addresses in that zipcode in the last election. Figure 2 puts these values on a map, with congressional districts overlaid as well. A form where you can submit a SPARQL query like these examples and see the results instantly on a map would be incredible for data investigation.

So what is government transparency, practically speaking? It’s more than just information disclosure. Transparency means the public can get answers to their burning questions. The more questions they can answer from a dataset, the more transparency it provides. We can have more transparency without necessarily more disclosure but instead with the ability to apply better tools. Meshing and querying government datasets with RDF and SPARQL could be a new way to reach new heights of civic engagement and public oversight.

Example 1

Get a table of how senators voted on all of the Senate bills in 2009-2010:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bill: <http://www.rdfabout.com/rdf/schema/usbill/>
PREFIX vote: <http://www.rdfabout.com/rdf/schema/vote/>

SELECT ?bill ?voter ?option WHERE {
  ?bill a bill:SenateBill .
  ?bill bill:congress "111" ;
        bill:hadAction [
          a bill:VoteAction ;
          bill:vote [
            vote:hasOption [
              vote:votedBy ?voter ;
              rdfs:label ?option ;
            ]
          ] ;
        ] .
}

Example 2

Get total campaign contributions to Rep. Steve Israel by zipcode:

PREFIX fec: <http://www.rdfabout.com/rdf/schema/usfec/>

SELECT ?zipcode ?value WHERE {
  ?campaign fec:candidate .
  ?campaign fec:cycle 2008 .
  ?zipcode fec:zipAggregatedContribution [
    fec:toCampaign ?campaign ;
    fec:amount ?value
  ] .
  ?zipcode fec:zcta ?uri .
}


RDFa and Linked Data in UK government web-sites

By Mark Birbeck

| This article will feature in Nodalities Magazine, Issue 7

The UK government’s Central Office of Information had a straightforward problem to solve: how could they create a centralised web-site of information that the public could search and access, when the source of that information could be any government department database or any public sector web-site?

For example, different organisations, such as Her Majesty’s Revenue and Customs (HMRC) or the National Health Service (NHS) would each post job vacancies to their own web-sites, but there was no central site that the public could go to, to find all public sector vacancies. This would be a problem at any time, but in the midst of attempts by the government to help people through the recession, it’s crucial to ensure that the public knows what vacancies are available. It might not occur to someone looking for a job as a plumber or an electrician that they should visit the NHS or Army web-sites, so a centralised site could make a big difference.


Similarly, as in most modern democracies, government departments are constantly seeking feedback from the public and interested parties, about specific issues. But as with job vacancies, these consultations are on departmental sites, rather than being available on a central site; from the Department of Energy and Climate Change (DECC) seeking feedback on clean coal, to the Ministry of Justice (MOJ) providing an opportunity for people to comment on prisoners’ voting rights, each department manages its own publication of consultations.

Traditional solutions

Traditional answers to these problems would have been to either (a) impose on each of the departments that they should key their data directly into a new central database (which would in turn drive the central web-site), or (b) create complex communication pipelines that would allow the decentralised databases to communicate with the central system.

And either of these solutions would almost certainly have turned out to have been a non-starter.

The first solution was unlikely to ever get off the ground, because it would have required each department to replace their existing technology with something new. Even if there was agreement on what that technology should be—and that in itself could take an age to resolve—there would have been a need for new development work, retraining of users, porting data from older systems, and so on.

The second ‘traditional’ solution at least has the merit of keeping existing systems intact, but would have required additional interfaces to be created to move the data from the departmental servers to the centre; each department would have had to create an interface between their own system and the central one.

Just getting one department into a situation where they could centralise their information would have been a major undertaking—not only were there lots of departments to consider, but each department was using a different technology to publish their vacancies or consultations to the web. For example, some departments with only a small number of job vacancies would likely use static HTML pages. Other departments, perhaps with larger IT departments, might use ASP.NET or a Java-based system.

Enter RDFa

The RDFa answer to this set of problems is simple—both conceptually, and to implement.

RDFa allows HTML publishers to embed RDF into their pages, so using the HTTP and HTML infrastructure to publish their information. This simple method of publishing data in turn means that any system can import this data, just by obtaining (or creating) an RDFa parser.

In short, each department can keep their own data management system, and simply add code to their existing web-page publishing step to augment the HTML with the data as RDFa. The central system in turn only needs one import mechanism—something that understands RDFa.
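To give a feel for how little the central importer needs, here is a minimal sketch that pulls triples out of an HTML page. It handles only a small subset of RDFa (`about`, `property`, and `content`, with no nesting or prefix resolution), so it is an illustration rather than a conforming RDFa parser. The `ex:` property names and the vacancy page are hypothetical.

```python
from html.parser import HTMLParser


class SimpleRDFaParser(HTMLParser):
    """Collects (subject, property, value) triples from a small RDFa subset.

    Only `about`, `property`, and `content` are understood, and elements
    carrying `property` are assumed not to contain nested markup.
    """

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subject = None
        self._pending = None  # property still waiting for its text value
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self._subject = attrs["about"]
        if "property" in attrs:
            if "content" in attrs:
                # The value is given explicitly in the attribute.
                self.triples.append(
                    (self._subject, attrs["property"], attrs["content"]))
            else:
                # The value is the element's text; collect it until the end tag.
                self._pending = attrs["property"]
                self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._pending is not None:
            value = "".join(self._text).strip()
            self.triples.append((self._subject, self._pending, value))
            self._pending = None


# A hypothetical departmental vacancy page fragment.
page = """
<div about="http://example.org/vacancies/123">
  <h2 property="ex:title">Electrician</h2>
  <span property="ex:location">Leeds</span>
  <meta property="ex:closingDate" content="2009-09-30" />
</div>
"""

parser = SimpleRDFaParser()
parser.feed(page)
for triple in parser.triples:
    print(triple)
```

The same page still renders as ordinary HTML in a browser, and the department’s publishing step only had to add a few attributes.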

Adding this facility to an individual department’s publishing system proved to be very quick and straightforward. But it’s not just UK government departments that are finding it straightforward to add RDFa to their pages. It was interesting to hear at SemTech in June that Google’s rich snippet launch partners (such as Yelp) were able to add RDFa support in “roughly a day”.

RDF publishing techniques

Adding data to web-pages might seem quite an obvious technique, but there are two important things to note here.

First, the COI has to be commended for having the vision to publish RDF at all. Of course, now that Gordon Brown has asked for Sir Tim Berners-Lee’s help in making government data publicly available, it seems pretty obvious—indeed it may even become fashionable! But the COI were planning this project at least a year ago, and at that time RDF was by no means a done deal (and you could say it’s still not).

But the second important thing is that even after deciding to publish RDF, it’s still not immediately obvious that the solution should involve RDFa, especially not a year ago.

The usual means of publishing RDF is to provide a distinct source of data in the form of RDF/XML (and perhaps other formats, too, such as N3). If there is an HTML version it usually exists for the purpose of describing the data itself. In other words, the RDF/XML format is primary, which means that anyone who is publishing HTML pages but wants to publish RDF as well, will need to add an extra piece of infrastructure that exists alongside their web-pages.

RDFa turns this on its head, and says that the HTML page is the data. One and the same page can be read as an HTML page, or as an RDF page, which in turn means that the changes required to the existing publication system are minimal. The COI once again showed its far-sightedness by adopting this technique.

Turtles all the way down

But the benefits of RDFa don’t just stop there. Firstly, because the data is being published via HTTP and HTML, it’s possible for anyone to read the same data, not just the centralised web-site that was being planned. This means that third party job vacancy sites, for example, could import vacancies from relevant departments, to add to their databases. In fact, one of the main drivers for the consultations project was to try to help improve the accuracy of an already existing web-site (set up by a member of the public) that used ‘screen-scraping’ to try to keep up with the available consultations; RDFa provides much more accurate information.

In addition, the centralised web-site will not only import RDFa but publish it too. This means that third-party servers are also able to import some or all of the centralised data, into their own sites.

And thirdly, by using RDFa the sites could provide information to search applications such as SearchMonkey.

As more servers both consume RDFa from one set of servers, and publish RDFa again to a variety of other servers, we enter the exciting world of Linked Data, and it’s ‘turtles all the way down’.

Conclusion

By using RDFa to address the challenge of making distributed data available in one place, the COI avoided having to make changes to each department’s systems. But once each department is publishing RDFa, it becomes possible for third parties to consume that information however they see fit. Such a flexible architecture is crucial in the age of open government, and is a cornerstone of linked open data.

Mark is managing director of Backplane Ltd. (http://webBackplane.com/), a London-based company involved in a number of RDFa/linked data projects for UK government departments. He is the original proposer of RDFa.

Might Semantic Technologies permit meaningful Brand relationships?

| This post will appear in Nodalities Magazine, Issue 7

by Paul Miller

Much has been written about growing Enterprise use of social media (usually Twitter, these days) to successfully track and mitigate customer complaint. Many have been quick to spot that the disproportionately high cost of satisfying (or, more cynically, silencing) these early adopters is unlikely to scale effectively as an increasingly large cohort of customers move onto these services, and it must remain an open question as to whether ComcastCares and its peers can survive any move to the mainstream in recognisable form.

It appears, though, that Enterprise engagement in the social sphere changes the game far more significantly than merely enabling a select few twitterati to jump the Customer Support queue, and that this change is worth effort and investment in order to ensure that it does scale. What’s actually happening is that a relationship is being enabled between a brand and those that Seth Godin might recognise as its tribe; a relationship in which interactions are no longer driven predominantly by the desire to seek redress. Rather than only raising those issues serious enough for us to have written letters or endured telephone muzak in the past, we now comment on issues at the periphery of a brand. Collectively, we’ve moved from simply complaining about the worst failures of companies, their products and their employees, toward emitting an impressive stream of FYIs. Individually insignificant, and possibly unimportant, together these light touches on and around a brand build into an ever-changing and valuable commentary that brands and the corporations they front would do well to take notice of. The minor niggles about an otherwise exemplary service, the human touches that made us smile, the odd inconsistencies in a polished persona; none are enough to make us pick up the phone, but we comment upon them endlessly in Twitter, Facebook, FriendFeed and elsewhere, and by tapping into this fundamentally honest stream of consciousness there is much for those about whom we comment to learn. Good companies probably already know about fundamental failings in a product long before their customer support operation melts down under the weight of complaints or their quarterly sales targets are seriously under-achieved. Do they have as good a handle on the things we love? Do they have a clue about the minor gripes of customers outside their pre-launch polling groups? 
Do they know about the gut reaction to a colour, a touch, a smell, or a careless word that persuaded a likely prospect to buy a technically or aesthetically inferior product from the competition instead? All this and more is there for the taking in the stream of online chatter freely directed their way.

Semantic Technologies aren’t often directly associated with the worlds of Marketing and Commerce, yet individuals such as Eric Hillerbrand and Scott Brinker are hard at work to show just what might be possible when the experiences of the Semantic Web are applied to this space. Brands are no longer owned by the companies in whose name they were created. Increasingly, ownership of various forms is being asserted by the multitude of stakeholders with effort and attention invested in the brand. They care about it, they care about what it says about them, and they play a clear role in the brand’s evolution whether its managers want them to or not.

Brands need to engage in this conversation, as we are beginning to see them do, but they also need to discover the means to cost-effectively monitor and engage with a potential flood of third party reaction whilst using the Business Intelligence tools available to them in nimbly shaping public opinion to their advantage wherever possible.

I spoke with Scott Brinker last year, to explore his—then nascent—views on Semantic Marketing, and look forward to hearing his latest thoughts at the Semantic Technology Conference in San Jose in June.

More recently, Eric Hillerbrand talked about some of his ideas with respect to ‘Social Commerce,’ and the ways in which commercial organisations might seek to strengthen and exploit relationships with their customers, aided by a range of semantic technologies.

We’re just beginning to grasp the realities of a world in which tightly controlled and fiercely guarded brand attributes become increasingly permeable. For those companies with the confidence and foresight to loosen their grip, whilst simultaneously exploiting the wealth of data and new opportunities to engage, there is much to be gained. For the dinosaurs that hang on to ‘their’ brand in spite of the world around them, there is everything to lose.

Interesting semantic web stuff

By Tom Scott
| This guest post originally appeared on Tom Scott’s blog; republished under CreativeCommons License, and with kind permission of the author.

It’s starting to feel like the world has suddenly woken up to the whole Linked Data thing, and that’s clearly a very, very good thing. Not only are Google (and Yahoo!) now using RDFa but a whole bunch of other things are going on, all rather exciting. Below is a round-up of some of the best. But if you don’t know what I’m talking about you might like to start off with TimBL’s talk at TED.

TimBL is working with the UK Cabinet Office (as an advisor) to make our information more open and accessible on the web [cabinetoffice.gov.uk]
The blog states that he’s working on:

  • overseeing the creation of a single online point of access and work with departments to make this part of their routine operations.
  • helping to select and implement common standards for the release of public data
  • developing Crown Copyright and ‘Crown Commons’ licenses and extending these to the wider public sector
  • driving the use of the internet to improve consultation processes.
  • working with the Government to engage with the leading experts internationally working on public data and standards

The Guardian has an article on the appointment.

Closer to home there have been a few interesting developments

Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections [pdf]
Our paper at this year’s European Semantic Web Conference (ESWC2009) looks at how the BBC has adopted semantic web technologies, including DBpedia, to help provide a better, more coherent user experience. It won best paper of the in-use track – congratulations to Silver and Georgie.

The BBC has announced a couple of SPARQL endpoints, hosted by Talis and OpenLink [welcomebackstage.com]
Both platforms allow you to search and query the BBC data in a number of different ways, including SPARQL — the standard query language for semantic web data. If you’re not familiar with SPARQL, the Talis folk have published a tutorial that uses some NASA data.

A social semantic BBC? [slideshare]
Nice presentation from Simon and Ben on how social discovery of content could work… “show me the radio programmes my friends have listened to, show me the stuff my friends like that I’ve not seen” all built on people’s existing social graph. People meet content via activity.

PricewaterhouseCoopers’ spring technology forecast focuses on Linked Data [pwc.com]
“Linked Data is all about supply and demand. On the demand side, you gain access to the comprehensive data you need to make decisions. On the supply side, you share more of your internal data with partners, suppliers, and—yes—even the public in ways they can take the best advantage of. The Linked Data approach is about confronting your data silos and turning your information management efforts in a different direction for the sake of scalability. It is a component of the information mediation layer enterprises must create to bridge the gap between strategy and operations… The term “Semantic Web” says more about how the technology works than what it is. The goal is a data Web, a Web where not only documents but also individual data elements are linked.”
Including an interview with me!

You should also check out…

sameas.org a service to help link up equivalent URIs
It helps you to find co-references between different data sets. Interestingly it’s also licensed under CC0 which means all copyright and related or neighboring rights are waived.


Image: “Semantic Web Rubik’s Cube” by dullhunk, CC License, via flickr

Linking Data and Semantics at O’Reilly

By Gavin Carothers and Charles Greer

| This article features in Nodalities Magazine, Issue 6

O’Reilly Media lives on the cutting edge. We coined terms such as Web 2.0, created the first commercial website in 1993, and exist to “spread the knowledge of innovators.” With our evangelists, conference presenters, authors, and bloggers all communicating and catalyzing new ideas, many believe that O’Reilly must be just as technologically innovative in our own operations. However, O’Reilly employs about 200 people but only half a dozen developers, so naturally ideas are thrown at our developers faster than it is possible to implement them. We’ve been known to refer to this tension between our public position on the cutting edge and internal expectation to live up to what we preach as “gaping wound tech.” Any time someone had a new idea or a new product to launch that didn’t quite fit into existing systems, we found some way to shoehorn it in, with a quick Perl script or some clever custom SQL. As we did this, more and more of our work became preventing our systems from collapsing under the weight of those one-off ETLs and scripts. The cost of simply keeping track of which scripts were using what bit of transformed data and where that data came from had become so high as to be unsustainable. We’d accrued so much design debt that only the most radical of approaches could save us from being crushed by the weight of our inherited code.

Of course, we didn’t really know that at the time. Today we have a Linked Data, Semantic, RESTful, URI-based, highly buzz-wordy solution mostly by accident and through ruthless pragmatism. Instead of embracing the ideas of the Semantic Web at the outset, we arrived at the Semantic Web because it was the only solution. We thought we were traveling down two completely unrelated roads. We started down the first while trying to replace a Java Bean Shell script that copied book content to a few different places. The other road began when we wanted to know what color to make the border of a PDF. The first would lead to an Atom Publishing Protocol server and clients, the second to our modeling all product metadata in RDF and opening that to the public.

As it turns out, the two roads weren’t so unrelated after all. RDF is designed to handle modeling information in a distributed manner and provides the underpinnings for the actual metadata we store, aggregate, and use. AtomPub’s RESTful interface is ideally designed for managing individual chunks of all this distributed data over time and provides programs and people a simple, standard interface for publishing, accessing, and updating it. As we progressed down each path, we were making (often unknowingly) major progress in generating linked data and semantics, the two pillars of the Semantic Web.

The RESTful Road

In 2005, soon after O’Reilly launched a custom book publishing platform, we discovered that we’d deferred a hard question. We didn’t know how to make sure that we could easily add new books as they came down the production pipeline. The canonical representation of nearly all O’Reilly titles is DocBook files. Historically, these DocBook files were scattered across many filesystems, transformed by people using one-off scripts, and arbitrarily transmitted using FTP to other filesystems. We simply didn’t have a way of addressing fundamental questions like “Where is the latest, cleanest copy of a book’s markup?” Tracking down the best representation of a book’s content was a laborious, error-prone task.
Around the same time we ran into this problem, we noticed Tim Bray’s superb presentations about the Atom Publishing Protocol, then still in draft form. The architecture proposed by REST advocates like Bray and embodied by what would become RFC 5023 gave us the ability to store an atomic chunk of data, assign it a URI, and access and update it through a standard interface.

  • A book’s “source code”, the DocBook markup
  • The print book, as an ISBN
  • The table of contents
  • An HTML, PDF or other representation generated from the source
  • Whatever Tim O’Reilly or the business folks asked for next

O’Reilly’s SafariU was a business venture that implemented these kinds of transformations of content, but didn’t expose anything but its own web browser interface. When considering how to leverage SafariU’s technologies in the business as a whole, we arrived at this:

This atom:entry is the “latest, cleanest copy of a book’s markup” and its URI is the canonical location for this content. Additionally, the entry provides different views of the content using 17 distinct <link/> elements. We had embraced the linked data idea Noun = URI. Around the same time, we realized that while we needed a way to address various available formats of content, we also required a place to store and maintain our digital assets. By implementing the Atom Publishing Protocol we established a generic way to maintain our assets, as Nouns, over time. Now that systems could reliably find and update our content using URIs, it became painfully apparent that we still had a major uphill battle: how to do the same thing for product metadata?
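The idea of one entry exposing several views through <link/> elements can be sketched roughly as follows. This is only an illustration under stated assumptions: the example.com URIs, the title, and the three media types are hypothetical, the real entry carried 17 links, and a fully valid Atom entry needs further elements (such as an author) that are omitted here.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)


def build_entry(entry_uri, title, updated, views):
    """Build an atom:entry whose <link/> elements point at views of one book.

    `views` maps a media type to the URI of that representation.
    """
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_uri
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    # The entry's own URI is the canonical location of the content.
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", rel="self", href=entry_uri)
    for media_type, href in views.items():
        ET.SubElement(
            entry, f"{{{ATOM_NS}}}link",
            rel="alternate", type=media_type, href=href,
        )
    return entry


# Hypothetical URIs and values, for illustration only.
BASE = "http://example.com/atom/books/9780596529260"
entry = build_entry(
    BASE,
    "Example Book",
    "2009-01-01T00:00:00Z",
    {
        "application/xml": BASE + "/source",
        "application/pdf": BASE + ".pdf",
        "text/html": BASE + ".html",
    },
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

A client that knows only the entry URI can read the links and pick the representation it needs, without knowing where any file lives.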

A similar problem existed when dealing with metadata. Distinct applications were completely unintegrated and focused only on the browser and human users. They provided no visibility into their data for other systems.

rdf:isNeat

“Can our PDFs have the same branding and colors as the printed books?” —Marketing Person
“Sure! How hard can it be?” —Innocent Developer

At this point in our journey we have more than 900 titles in the AtomPub repository and addressable by URI. We’ve (unknowingly) hit a significant Linked Data milestone and everything is progressing well. Dynamically creating a PDF from these entries is as easy as running our DocBook-XSL customization for the correct series to produce XSL-FO and then rendering that XSL-FO into PDF. The only problem was discovering which series (In a Nutshell, Animal Guide, Missing Manual) the content fell under. At that point all progress stopped.
Our definitive source of book and product information is the Product Database (67,000+ lines of Perl, C++, SQL, and a dozen other languages). The database and web application have their own home-rolled “XML Format,” as I’m sure many other companies have had. Based directly on the column names from the SQL database, our Book XML was a quick and very dirty way of getting our centralized relational data out into the world as XML. A host of new client applications grew around this new access to product data, but we quickly saw the problems of reusing an ad hoc, undefined, schema-generated format. The XML service was also incredibly slow.

<IPFamily>

<Book>
<product_id>5549</product_id>
<parent_product_id>6380</parent_product_id>
<imprint_id>1</imprint_id>
<product_status_id>5</product_status_id>
<product_type_id>10</product_type_id>
<isbn>0596515618</isbn>
<isbn13>9780596515614</isbn13>

<final_date>2003-07-02</final_date> <!-- Actually the day the last QC phase ended -->

...


As you can see from the snippet above, clients had to deal with knowing exactly what imprint 1 (O’Reilly Media, Inc.) and product type 10 (PDF) meant. Each client kept mappings of these magic values in order to make the data understandable. Those mappings broke, of course, whenever new product types and imprints were added. Even more dangerously, because the semantics of the XML were totally unspecified, element names were opaque and sometimes actively misleading. We might have redesigned the format to include more data and added more and more fields to it, but this wasn’t an explicitly designed schema, just something generated from the SQL. On the road to exposing this data more cleanly we tried everything. Remodeling the SQL to be more relational didn’t offer much benefit and we still couldn’t tell what the column names meant. Sitting down and trying to write up a data dictionary was a great exercise, but it became out of date almost immediately. We experimented with JSON-based CouchDB prototypes, but those had the same issue as the SQL with missing meaning. Our Subversion repository is littered with Relax-NG, XML Schema, and Schematron documents written to create a new XML-based format. Somehow they never got finished, as we discovered we either had to define everything or try to design for extensibility. We knew we didn’t have the time to create our own Book Metadata Standard. We wanted defined semantics.
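The magic-value problem is easy to reproduce in miniature. The sketch below is hypothetical: the lookup tables and the `describe` helper are invented for illustration, and only imprint 1 and product type 10 come from the text. It shows how a client that carries its own private mapping fails as soon as an unknown ID arrives.

```python
# Hypothetical client-side lookup tables. Only imprint 1 and product type 10
# are named in the article; each real client kept its own copy of such maps.
IMPRINTS = {"1": "O'Reilly Media, Inc."}
PRODUCT_TYPES = {"10": "PDF"}


def describe(record):
    """Turn the opaque IDs from the Book XML into human-readable labels."""
    return {
        "imprint": IMPRINTS[record["imprint_id"]],
        "product_type": PRODUCT_TYPES[record["product_type_id"]],
    }


known = describe({"imprint_id": "1", "product_type_id": "10"})

# A product type added upstream after this client was written (ID 11 is
# made up): the lookup fails because the meaning lives in the client,
# not in the data.
try:
    describe({"imprint_id": "1", "product_type_id": "11"})
    broke = False
except KeyError:
    broke = True

print(known, broke)
```

Every client with its own copy of these tables has to be found and patched whenever the database gains a new type, which is the maintenance burden the paragraph above describes.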
There is at least one obvious XML vocabulary for a publisher looking to capture book metadata: ONIX. Unfortunately, the ONIX standard is archaic, with obscure element names like b004 (ISBN) and g343 (PrizeJury, obviously). (Footnote: Yes, these are the short versions and a longer set of names is also allowed. However, many of the most important vendors only support the short versions.) We did consider ONIX for a time, but then we noticed that every vendor we sent ONIX to treated the fields a bit differently. Even with pages and pages of specification there wasn’t any agreement on what elements were important or what they meant. Using ONIX as a format would not solve our semantic deficiency; we still wouldn’t know what the “columns” meant.
In the process of trying to create an XML format we asked a number of people in the company how to find the Publication Date for a book. The answer was surprisingly complex. The value was computed independently by each of the ETL hydras, with subtly different implementations that had evolved with particular client needs. O’Reilly isn’t a huge company with layer upon layer of bureaucracy; most questions can be quickly answered with a chat at a desk or an email to the other coast. Imagine our surprise, then, at the results of the Publication Date poll. Most people were confident that one of five dates was the right date, but disagreed on which of the five it was. Retail Availability Date, Actual In Stock Date, Estimated In Stock Date, etc. each had its backers. What we had really discovered was that each business unit had subtly different needs. The strategy we could most easily support? Consensus on a public standard. As we’ve learned so many times, we needed to go outside the company to find the correct solution. Public standards, specifications, and ontologies could save us from ourselves.
Enter: Dublin Core. We couldn’t define our own format or use the industry standard (ONIX), nor could we agree on what a publication date was. Our only choice was to borrow/steal some other group’s ideas. It turns out that our problems had already been solved by the library community. The Dublin Core Metadata Initiative created standards, guidelines, and examples for storing and sharing basic, essential metadata. We had a way out: here was a group of people who’d already done a great deal of thinking for us.
Of course, they hadn’t done all our thinking for us. Mapping all of our old data into well-designed and well-documented Dublin Core, MARC Relators, FOAF, or any other ontology was going to be hard. So we didn’t do it. Instead we mapped the whole of our old, horrible, ugly mess into an undefined ontology called the “Product Database Legacy Ontology.” We then moved some of the more obvious items like title and author into Dublin Core and waited. Only once we had a proven need for a new data point in a real application would we go through the process of researching, defining, cleaning, and moving it into a modern, public ontology. For those following along closely: no, trim color isn’t yet in the public or internal metadata. As it turns out, no one really wanted it. At least, not yet.
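The "map everything to a legacy ontology, promote fields only when needed" strategy can be sketched with plain tuples. Several things here are assumptions made for illustration: the legacy namespace URI, the subject identifier, the sample values, and the choice of the Dublin Core terms `title` and `creator` (the text says only that title and author moved into Dublin Core).

```python
DCTERMS = "http://purl.org/dc/terms/"
# Hypothetical namespace standing in for the "Product Database Legacy Ontology".
LEGACY = "http://example.com/ns/pdb-legacy#"

# Columns that have been researched and promoted to a public ontology so far.
# Which Dublin Core terms were actually chosen is an assumption.
PROMOTED = {
    "title": DCTERMS + "title",
    "author": DCTERMS + "creator",
}


def to_triples(subject, row):
    """Map every column of a legacy row to a (subject, predicate, object) triple.

    Promoted columns get a public predicate; everything else keeps its old
    column name under the legacy namespace, untouched and undefined.
    """
    return [
        (subject, PROMOTED.get(column, LEGACY + column), value)
        for column, value in row.items()
    ]


# Made-up row; `final_date` is one of the column names from the Book XML.
row = {"title": "Example Book", "author": "A. Writer", "final_date": "2003-07-02"}
triples = to_triples("urn:example:product:1", row)
for triple in triples:
    print(triple)
```

Promoting another field later means adding one entry to `PROMOTED`, which matches the "wait for a proven need" approach described above.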

All Together Now

Since Gavin’s first frenzied port of product metadata to an RDF model, we’ve been able to negotiate changing requirements, establish data validation and control rules, and bring on new applications with little time spent on data modeling. In other words, meeting our immediate need of a centralized, validated data store of high agility and performance has paid off several times over in deploying new software systems for the rapidly changing company.
One example of the intertwining of Linked Data and Semantics is our Electronic Media distribution system, which lets customers download ebooks, PDFs, videos and the like. Book descriptions, titles, authors’ names, cover images, even the help text provided on the Electronic Media page: all of it is simply linked data, built from RDF relationships. When we want to change the help text or a category label, we change it in one document, and everything else in the RDF graph referencing it changes within moments as well. Just following links pays off.
Previously, the buttons that let a customer add a book to our shopping cart were generated by a system that used nightly ETLs nicknamed “the sync”. New products therefore had to be prepped for release the night before, and we gave special care to their timely appearance in the morning. Alas, they frequently did not appear as hoped, as the ETLs that made up “the sync” had to run on a very precise nightly schedule or we had to take manual corrective action. Now, a reasonably simple HTML template bound to the RDF for a book generates “Buy Buttons” in near real time without an ETL in sight.
The greatest challenge of updating our legacy IT infrastructure hasn’t been replacing the ETLs or synchronization. It’s been achieving consensus on the meaning of data elements. In the past, data maintainers might adjust the title of a book to change how retailers present it. Then our website’s title would change (the next day), and we would have to bring resources to bear on reconciling the meaning of “title.” By using a Dublin Core term for our title element, we’ve established what to expect from those who change the value. It’s simpler to make sure people enter particular kinds of data, and then ask for help to extend or change requirements for downstream apps. The publicly available ontologies, we hope, will help everyone communicate more effectively about business needs and shared data points. So far the results are encouraging.

In the Public Eye

Having built several of our own applications using our new RDF metadata and our initial linked data APIs, we thought it might be a good idea to let someone else have a crack at it too and see what they made of it. It took us two weeks to develop the O’Reilly Product Metadata Interface, a simple layer on top of the Deli. A caching proxy preserves the reliability needed by our own applications, while a predicate filter prevents private information from leaking to the public. A bit more about how you can access it can be found at http://labs.oreilly.com/opmi.html or you can just dive right in by giving it an ISBN, e.g. http://opmi.oreilly.com/product/9780596529260.
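A predicate filter of the kind mentioned above might look like the following sketch. It is not the OPMI implementation: the allowlist approach, the legacy namespace, the `unitCost` predicate, and the sample triples are all assumptions made for illustration.

```python
DCTERMS = "http://purl.org/dc/terms/"
LEGACY = "http://example.com/ns/pdb-legacy#"  # hypothetical internal namespace

# Hypothetical allowlist: only predicates listed here are ever served publicly.
PUBLIC_PREDICATES = {DCTERMS + "title", DCTERMS + "creator"}


def public_view(triples):
    """Keep only the triples whose predicate is explicitly marked public."""
    return [t for t in triples if t[1] in PUBLIC_PREDICATES]


# Made-up internal graph for one product.
internal = [
    ("urn:example:product:1", DCTERMS + "title", "Example Book"),
    ("urn:example:product:1", DCTERMS + "creator", "A. Writer"),
    ("urn:example:product:1", LEGACY + "unitCost", "3.17"),
]
exposed = public_view(internal)
print(exposed)
```

An allowlist is used here instead of a blocklist so that a newly added internal predicate stays private until someone decides to publish it.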
Sharing our work with the public forced us to be much more deliberate and rigorous about our data, but also exposed some simple blunders. On the day we launched the service we waited for the praise to come in and finally saw a tweet! Someone is using… Oh wait:

“OPMI’s book identifiers aren’t resolvable. Sigh.” —Jeni Tennison

“Of course they’re resolvable,” we thought. “You just have to parse the URN and understand how to pass the URN to… oh, yeah good point.” In the process of implementation, we’d forgotten Tim Berners-Lee’s second rule of Linked Data:

2. Use HTTP URIs so that people can look up those names.

At the start of the process we’d talked about using some sort of identifier for our products. But that conversation had taken place before we really had all the RDF and Linked Data applications working, so at the time there wasn’t any point, nor could anyone see the need for a resolvable identifier. Within a few hours of making the data public, the need became blindingly apparent. Part of embracing “anyone can say anything about anything” is that anyone needs to be able to find the anything they want to talk about. And when you’ve got a statement to make, it’s remarkably handy to be able to quickly find out what else has been said. “I loved urn:x-domain:oreilly.com:product:9780596529260.BOOK” is a bit hard to figure out. “I hated http://purl.oreilly.com/product/9780596529260.BOOK” is a lot better.
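The two identifiers quoted above differ only in their prefix, so the translation a client would otherwise have to guess at can be sketched in a few lines. The mapping rule is inferred from that single pair of examples and is an assumption, as is the `urn_to_http` helper name.

```python
URN_PREFIX = "urn:x-domain:oreilly.com:product:"
HTTP_PREFIX = "http://purl.oreilly.com/product/"


def urn_to_http(urn):
    """Rewrite a product URN as an HTTP URI that a client can look up.

    Assumes the local part (e.g. "9780596529260.BOOK") carries over unchanged.
    """
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a product URN: {urn!r}")
    return HTTP_PREFIX + urn[len(URN_PREFIX):]


uri = urn_to_http("urn:x-domain:oreilly.com:product:9780596529260.BOOK")
print(uri)
```

With URNs, every consumer needs to know a rule like this before it can fetch anything. With HTTP URIs, the identifier is already the address.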