Nodalities

From Semantic Web to Web of Data

Archive for the 'Nodalities Magazine' Category

Nodalities Issue 12 – now available

Issue 12 of Nodalities is now available for download.

In this issue, some rather practical things that Linked Data is good at solving are being put to use saving lives: quite literally, as Bart van Leeuwen explains in our cover story. Simple ideas joining up public data and GIS devices are helping the Amsterdam fire service get their equipment to the scenes of fires more quickly and safely.

Elsewhere, Martin Belam, an information architect at the Guardian, tells us about their approach to Linked Data, and what it means to them. Talis’ Leigh Dodds outlines some of the challenges and opportunities of Linked Data in an evolving world in his article. Also supporting Linked Data research is the multi-organisational LATC Project, which is introduced in this issue. And finally, Tim Hodson discusses a very practical approach to starting with Linked Data, and may also discuss eating an elephant.

You can subscribe to Nodalities for free here and read previous issues here.

European Summer School

Talis is delighted to be one of the sponsors of the 8th European Summer School on Ontological Engineering and the Semantic Web (SSSW 2011). There will be more about this in coming posts, but just to start off:

We are sponsoring it for a very simple reason. The mix of theoretical, practical and collaboration skills used by all the students involved from across Europe directly corresponds to how we work at Talis. It’s an environment of support and challenge, contribution and connection that has proved beneficial for all involved over the years. Talis is proud to contribute and participate to further the aims of the community.

Talis is a small and ambitious company of likeminded, motivated people. A phrase we often use here is Human Scale. Culturally what we mean by that is we like working closely with people who we all know, whether as employees of Talis or (more likely) over time collaborating as partners in joint endeavours.

We want to grow our company and contribute to the communities we belong to. We know that it is by fostering relationships with others driven by the same passion to collaborate and learn that we can build on the ambitions we have for ourselves and for the communities we belong to. One particular aspect of the Summer School is this same notion of social connectedness, a personal network of trusted relationships that challenge and enhance the experience for everyone.

Thanksgiving for Open Government

On the eve of the American Thanksgiving holiday, millions of people travel to spend time with friends and family. Before I share a meal with relatives, I contemplate the connection between the first Thanksgiving and the emerging Open Government movement.

The “First Thanksgiving” celebration in the US was a feast shared by 53 starving pilgrims who survived a brutal winter in New England, and 90 Native Americans.  The Native Americans knew how to manage their land and waters to provide sufficient fish, meat, vegetables and fruit.

The connection between the first American Thanksgiving and Open Government has to do with adapting to a new world by sharing information.  Four hundred years ago,  the Native Americans shared information on seeds, crops and planting conditions, helping the pilgrims survive.  Today, sharing information via the Web is helping us to better understand climate conditions, our health care options and issues impacting our local community.

Last week I joined about 250 people at the first International Open Government Conference, hosted by the US Department of Commerce in Washington DC.  Approximately half the conference delegates were from government, the balance from academia and the private sector.  The speakers discussed Open Government projects underway in the US, UK, Australia, New Zealand and Brazil. Speakers shared success stories and areas for future development.  The common theme: democratizing public sector data and driving innovation.  Jonas Rabinovitch from the United Nations Department of Economic and Social Affairs highlighted several eGov strategies in developing nations.  Mr. Rabinovitch noted that all but three UN member nations have a basic Web presence, many offer online forms and some provide the ability to perform transactions via the Web.

Given the conference was hosted in the US Department of Commerce, data.gov featured prominently.  “The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.”  Seven countries have stood up Open Government sites in the last 18 months, including UK, US, Australia, New Zealand, Canada and Finland.  Government administrators are seeking to restore public trust and establish an environment of transparency, participation and collaboration with the public.

The US Administration launched its Open Government Initiative in April 2009. In the last two years, I’ve watched the US Executive Branch begin to move from a “need to know” to a “need to share” culture. This cultural transition, and thus this Open Government Conference, was truly historic. The conference underscored to me that we all, regardless of our political views and affiliation, live in a highly interconnected global economy, underpinned by the World Wide Web.

Respected advisors on Open Government initiatives including Professor Jim Hendler of Rensselaer Polytechnic Institute and Sir Tim Berners-Lee, Director of the World Wide Web Consortium, agreed that public participation and collaboration will be key to the success of Open Government initiatives.  I believe that more conferences like this one and the Open Government Data Camp 2010 held in London last week, drawing delegates from a variety of disciplines, from several countries, will do a great deal to reinvigorate civic engagement and economic growth from the ground up.

Government employees are responding to mandates to publish content to Open Government websites. Data.gov was launched in April 2009 with 47 data sets. Vivek Kundra, U.S. Chief Information Officer, stated that data.gov has in excess of 300,000 data sets as of November 2010. A large portion of the data.gov data sets are geospatial information, which is an opportunity for scientists and entrepreneurs to build tools for analysis and visualization of this valuable data. The UK Government has published over 4,600 data sets, including many from Great Britain’s national mapping agency, Ordnance Survey, providing the most accurate and up-to-date geographic data for the UK.

“The stakes are high for our interlinked global economy.” Dr. Robert Schaefer, Deputy Project Scientist from Johns Hopkins University Applied Physics Lab, gave a compelling presentation on the need for mechanisms to make sense of published data as Linked Open Data. Publishing the content in RDF is not sufficient; providing context on what the data implies is also necessary. Better tools for analysts and scientists to extract meaning from Linked Open Data will allow critical information on climate change and space weather, for example, to be more readily understood by policy makers. Professor Schaefer stated the implications for climate change are serious, wide-ranging and urgent. Current CO2 emissions are higher than the Intergovernmental Panel on Climate Change “worst case” scenario. Billions of people may experience serious consequences from climate change. Professor Schaefer reiterated the need to get started as soon as possible. “When the water from the sea rises, millions of people will have to move.” This international conference will hopefully stimulate cooperation between the public and private sectors. It is a critical step in making data accessible and providing decision support tools for space weather and climate change.

Mr. Kundra acknowledged we have much more to do to improve the quality of published data sets. He said, “when I’m able to perform analytics on the fly, grounded on quality data, we will have achieved success.” Delegates were encouraged by Mr. Kundra and other speakers to build out communities of interest, led by individuals rather than government agencies. The US Government is regularly launching challenges (see http://www.challenge.gov) with modest cash prizes, targeting citizens to gain insights on how we, the people, not government, can solve problems ranging from education on childhood obesity to sustainable urban housing that respects the environment.

Beth Simone Noveck, United States Deputy Chief Technology Officer for Open Government, leads President Obama’s Open Government Initiative.  Based at the White House Office of Science and Technology Policy, she is an expert on technology and institutional innovation. Ms. Noveck stated that “the Open Government Initiative is not transparency for transparency’s sake.  It is through participation and collaboration with academia and the public sector that there is value.”  Creating partnerships to use Open Government Data for important and unforeseen uses is empowering individuals with the ability to make better decisions and affect our quality of life.

We are in the very early stages of making Open Government data available as Linked Data. Even at this early stage, there are many good reasons to support Open Government initiatives, including accountability in spending, improved health care provision, and addressing climate change and space weather, which affect the world’s population. The international data exchange standards are now in place. While experts will continue to refine the technical underpinnings and best practices will evolve, the citizen-led movement, assisted by government, is truly underway.

Bright young geeks are increasingly involved in American civic life through non-profit organizations like Code for America. Passionate entrepreneurs like Dan Melton show that being super bright and engaged at a grassroots level in government is both hip and necessary. Code for America recruited twenty “fellows” from 362 applicants to get involved in city projects in 2011. One example discussed was the Boston Project, whose idea is to bring info on students together and create interesting applications leveraging federal census content, student data, transit info, city and state data.

Each month new mobile applications and social networking solutions are made available.  These are not expensive, government top down initiatives, rather, they are coming from the ground up by military personnel, students, local government officials, publishers, scientists and citizens who value transparent government.  An interesting mobile app for Android, iPhone and the iPad was unveiled for the New York Senate.  It is a real-time constituent mobile dashboard to the legislative process allowing citizens to connect with Senators, find and comment on bills, review votes and transcripts.

Academics are doing innovative research. Grad students and post-docs are rapidly prototyping what the new world of open data will look like. An increasing number of software companies, including my employer Talis, are producing lightweight platforms and cloud computing solutions. Thousands of smart people have been creating the foundation of the Linked Data “ecosystem” in the form of International Data Standards and best practices over the last fifteen years, largely through the important work of the World Wide Web Consortium (W3C).

The availability of improved development tools is seen as a requirement for widespread proliferation of semantically enabled applications; however, people are leveraging international standards such as RDF for Linked Data, content sharing models, well-documented licensing models, and existing best practices. Fully 25% of the applications shipped on a new Apple iPhone use government produced content.

I believe there are significant opportunities for commercial software firms to produce services and products to visualize data sets, find related data sets and, most importantly, provide mechanisms as easy to use as the early Web to publish machine and human readable data as Linked Data. There is a burgeoning information economy rapidly forming around provision of public and private data mixed together in novel ways. I believe that in 2011, truly useful tools for Web developers to create compelling Linked Data applications will be available for use with Open Government data.

We should all acknowledge that data will never be 100% perfect. Real data is dirty, face it. Yes, concerns will linger about misinterpretation and inappropriate mashups until people gain experience in making informed decisions based on the data presented. Be patient and don’t expect it to be perfect on day one or even year one. Allow best practices to emerge from the ground up, by communities of interest. Issues of data quality, provenance, context and important elements such as units of measure will all be addressed as Linked Data becomes more mainstream. Harvard Business School published a blueprint for use of open government data. The W3C provides lots of useful guidance on eGovernment and Linked Data activities.

The early American pilgrims experienced miscalculations in weather and agriculture, but they eventually figured out how to plant seeds correctly and increase their potential for a bountiful harvest. Through information sharing and discussion by informed citizens, the US evolved a free and democratic form of government that is admired by millions of people around the world.

I’m optimistic that the citizens of the world will leverage Open Government initiatives for positive outcomes.  The more our governments support openness and transparency through Open Government initiatives, the more we, the people, can solve issues that matter at the community-level or on a global level.  The stakes are high and we should be grateful and cooperate to harness the power of Open Government data and the Web.  We are defining our history, as well as our future, today.

“Linked Data” at the Guardian

Nodalities Magazine article by Martin Belam.

During October at Guardian News & Media we announced a change in our Open Platform Content API. For the first time, developers and users could query our database of over 1 million content items by using the common external identifiers of a MusicBrainz ID or an ISBN number. It is our first step into the world of ‘Linked Data’.

The Open Platform Content API was launched as a beta in 2009, and earlier this year was launched as a commercial product, allowing partners to re-use Guardian & Observer content in a variety of different ways. There is, for example, a WordPress plugin that easily allows you to include Guardian content in your blog, and developers have built applications like a bespoke recipe search on top of the data. It is a unique proposition amongst news organisations on the web, and as well as the Content API itself, the Open Platform also includes publishing the source data behind Guardian journalism on the Data Store, and providing a search engine for Government datasets from around the world.

Why linked data at Guardian News & Media?

The addition of linked data to the API is the culmination of a great deal of work behind the scenes to get the data prepared, and to work out the right way to make it available. Personally, I had been struck the first time I saw the linked open data cloud diagram that none of the bubbles represented any of the UK’s traditional print news organisations. With our combined centuries of experience sifting, collating, organising and publishing information, it seemed to me that they should in fact be occupying a central position on that map. The principles of linked open data also chime with the over-riding principles we have about our web presence at Guardian.co.uk. We strive to be ‘of the web’, not just on the web. That means reaching out and embracing external services and data, and our intention is to have permanent, predictable URLs for all of our content.

The first challenge to implementing this was to pick stable and reliable external datasets that would form a permanent and meaningful relationship with our content. We decided that a focus on distinct cultural entities would work, and avoided the messiness of trying to decide whether a story was ‘about’ something, or whether it just ‘mentioned’ something. MusicBrainz IDs and ISBN numbers seemed like datatypes we could work with.

The domain model of our content already had a concept of an ‘external reference’ that can be added to a tag or a factbox or an article. We have previously used that to link articles to a page about a specific film, or to link a sports match report to game statistics provided by a third party like Opta. The obvious route was therefore to expose these ‘external references’ in our API.

MusicBrainz IDs

With MusicBrainz IDs, we did not attempt to tag all of our music story archive. There are around 42,000 music content items currently on our site, and to accurately add MusicBrainz IDs to them would be an arduous task. Fortunately, because of our domain model, we had a shortcut to tagging this content. All of the items in our database are given tags. These indicate the type of content (e.g. article, audio, video), the tone of content (e.g. news, comment, review, obituary), the contributor who produced the content, and keywords representing the subject the content is about. In the Music section, we have around 600 of the artists we write about most frequently who exist as keyword tags. The quickest route to adding MusicBrainz data was to add it to these artist keyword tags. The actual job of tagging was achieved via the rather dull mechanism of filling in a Google Docs spreadsheet, although developer Daithi Ó Crualaoich built a tool to help us. He came up with a quick browser-based hack that simultaneously put the same search string across our music tags and across MusicBrainz, and matched the outcome. A script then uploaded this to our database.
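The matching step described above can be sketched as follows. This is an illustration under stated assumptions, not the tool Daithi Ó Crualaoich built: the `normalise` and `match_tags` helpers are hypothetical, the candidate dictionary stands in for MusicBrainz search results, and exact matching on a normalised name is an assumed rule. The one ID shown is the Shirley Bassey ID quoted elsewhere in this article.

```python
import unicodedata


def normalise(name):
    """Lower-case a name, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def match_tags(artist_tags, candidates):
    """Pair each keyword tag with a MusicBrainz ID when the names agree.

    candidates maps an artist name to an ID. Tags that match nothing
    map to None so that a person can review them by hand.
    """
    by_name = {normalise(name): mbid for name, mbid in candidates.items()}
    return {tag: by_name.get(normalise(tag)) for tag in artist_tags}


# Hypothetical inputs: two keyword tags and one search result.
tags = ["Shirley  Bassey", "An Unmatched Artist"]
results = {"shirley bassey": "05ec70a5-3858-4346-a649-fda0a297b8c1"}

matches = match_tags(tags, results)
for tag, mbid in matches.items():
    print(tag, "->", mbid or "needs manual review")
```

Leaving unmatched tags as `None` keeps a human in the loop, which fits the spreadsheet-based review the article describes.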

ISBN numbers

ISBN numbers were another obvious choice for us. The majority of our book reviews on the web feature a ‘fact box’, giving details of the publication and a corresponding link through to our book store to make a purchase. This ‘fact box’ frequently includes the ISBN number of the publication, and so exposing them as a search criterion was not a massive undertaking. Nevertheless, as with our music content, we do not have universal coverage. At the time of launch around 2,500 reviews out of a possible total of 17,000 had ISBNs attached to them. This is part of the production process now, and so all reviews going forward should have the ISBN added.

API query types

The Open Platform supports a range of ways to query this data, and you can find a guide at: http://gu.com/p/2k6ay. Obviously you can query the API looking for a specific reference, so a query for reference=musicbrainz/05ec70a5-3858-4346-a649-fda0a297b8c1 will return content about Shirley Bassey. Additionally, you can get a list of content which has a MusicBrainz or ISBN attached to it, so reference-type=musicbrainz|isbn will give you content from the API which has a MusicBrainz OR an ISBN added to it. Adding the ‘show-references’ parameter will return a block in your API responses that includes MusicBrainz IDs or ISBN numbers for any item within the list. If you’ve not used the Guardian’s API before, you can get a feel for how it works by using our browser based API explorer.
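The query styles described above can be sketched as URL construction in Python. This is a minimal sketch rather than an official client: only `reference`, `reference-type` and `show-references` come from the text, while the base URL, the `api-key` parameter name, the `all` value for `show-references` and the `build_query` helper are assumptions made for illustration. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed endpoint; the article does not state the base URL.
BASE_URL = "https://content.guardianapis.com/search"


def build_query(api_key, **params):
    """Return a Content API search URL for the given parameters.

    Python keyword names use underscores, so reference_type becomes
    the reference-type parameter named in the article.
    """
    query = {name.replace("_", "-"): value for name, value in params.items()}
    query["api-key"] = api_key  # assumed parameter name for the key
    # Keep "/" and "|" readable, matching the examples in the article.
    return BASE_URL + "?" + urlencode(query, safe="/|")


# Content about one specific artist (Shirley Bassey in the article).
by_reference = build_query(
    "YOUR-KEY",
    reference="musicbrainz/05ec70a5-3858-4346-a649-fda0a297b8c1",
)

# Content carrying either a MusicBrainz ID or an ISBN, with the
# reference block included in each result.
by_type = build_query(
    "YOUR-KEY",
    reference_type="musicbrainz|isbn",
    show_references="all",  # assumed value
)

print(by_reference)
print(by_type)
```

The printed URLs can be pasted into a browser or the API explorer to compare against the guide linked above.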

‘Linked data’ formats

It does seem that as soon as you put the words ‘linked’, ‘open’ and ‘data’ into the same sentence, you automatically invoke a debate about what formats are appropriate to use. At the present time we are making these persistent external IDs available alongside our content items in both XML and JSON formats. And yes, that does mean that we have steered away from RDFa and SPARQL.

From our point of view there is a clear reasoning behind this. We try to work in a lightweight and agile way, and providing the data in this format was the simplest way to meet our immediate requirements. We are trying to concentrate on making more metadata available. If we were to decide to invest in triple-stores and implement a SPARQL endpoint first, then I’d wager that we would still be waiting to dip our toe into the water.

Moreover, it would be wrong to commit our editorial production colleagues to tagging up all our content with this extra layer of semantic data, if we can’t show the benefits. It is my hope that by incrementally releasing extra layers of linked data through our API, in a simple way, we can see what works and what doesn’t, and what types of data interest people and inspire them to develop applications using the data.

As I’ve personally argued before, particularly in response to Tom Coates’ recent call for “Death to the Semantic Web”, I’m entirely agnostic about formats myself. What I think is most important is that we provide consistent, RESTful, predictable, persistent hooks into Guardian.co.uk content, in as many ways as possible, with the right licence for re-use.

What next?

We are now evaluating where else we can add value to our API with joins to external datasets. Again we will aim to be pragmatic—tagging the most amount of data with the least amount of effort. And we also want to listen to the linked data community—what are the data joins that would be most useful to external developers?

Martin Belam is an information architect at the Guardian newspaper.

Sharing Data on the Web

| This article will appear in Nodalities Magazine, Issue 9.

by Kaitlin Thaney
Program Manager of Science Commons, Creative Commons


In the emerging data web, there have been multiple efforts working towards the same broad goal of data sharing (i.e., the NeuroCommons, Linked Open Data, efforts of the World Wide Web Consortium), but they are still unevenly distributed. Our understanding of the legal, social and technical issues is increasing, but is still at a very early stage.

This past fall at the International Semantic Web Conference in Chantilly, VA, USA, I joined three other leading minds to lead a tutorial examining some of the legal and social frameworks for sharing data in the emerging data web, focusing on an overview of the need for access, the social issues of applying Free-Libre/Open Source (FLOSS) licenses to data, and the approach we advocate at Creative Commons to help navigate this complex space — converging on the public domain.

Lessons Learned

Creative Commons as an organisation works to make knowledge sharing easy, legal and scalable – with applications in the culture space (music, text, film, art), education (open educational resources, virtual textbooks), and science (biological materials transfer, data sharing, Open Access, semantic web, patents). We maintain an integrated approach, and craft policy and legal tools to lower the barriers to knowledge sharing.

When it comes to data sharing, first and foremost, the information needs to be legally and technically accessible. The Open Access movement has increased awareness of this, using the Creative Commons licensing suite to unlock content, and has seen its share of qualified success. But what to do when the information you want to share and reuse falls outside the protections of copyright?

In short, it’s complicated.

This is where the discussion of legal protections for data gets murky. Knowledge is not always copyrightable – it may be easy to discern the rights associated with journal articles, but what about data, ontologies, annotations, or research statements described in triples?

The emergence, adoption, and use of the free-libre/open licensing regimes has allowed for remix and reuse of software code, music, film, educational resources and scientific research in a way that otherwise would be difficult to achieve.

The success of these licensing approaches has changed the social ethos of licensing, using the traditional “all rights reserved” model to make something more free, rather than less.

But from our research, this approach is not ideal for data. The trend towards applying licenses, click-wrap agreements and other sorts of restrictions on scientific data is increasing, but with the undesired consequence of limiting the downstream use of this information, and even at times blocking interoperability. The costs are high, the terms are not always clear, nor the protections always legally sound, making it very difficult to scale for scientific uses. The result is a high barrier to entry to do meaningful analysis, annotation, search, etc. on the mass of data available currently that’s continuing to grow exponentially, and integrating with the literature available.

We advocate an approach of converging on the public domain, and requesting behaviours often found in the various flavours of free and open licensing through norms – not a legal construct. But first, let’s take a look at some of the issues to be aware of and their social implications to furthering the goal of linked open data.

Attribution v. Citation

Under US Copyright law, “Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.” Since facts are not covered by copyright, attribution – a license obligation – doesn’t seem to apply to ideas or facts either, since those rights are conditional on compliance with terms of the license.

Socially, the scholarly concept of citation is fairly well understood – credit where credit is due. It has long been viewed as an entrenched norm of good scientific practice.

But when it comes to the legalities of both terms and how to enact this behaviour, the devil is in the details, and the two are actually rather different when it comes to enforceability and applications / ramifications in the digital world.

In a copyright license, the word “attribution” is a legal requirement, whereas citation evokes more of a club mentality and social practice. Citation in its sole form is not assured or enforceable in the same way, but that’s not necessarily a downside. Ask yourself this, which one is more important – legal enforcement or credit enforced through professional reputation? Attribution – a relatively narrow legal term that can affect interoperability while at the same time possibly failing to provide what you really want? Or citation – an entrenched scientific norm that asks for credit where credit is due.

Implications of FLOSS toggles and directives on data sharing

These issues emerge when instead of focusing on maximizing interoperability of resources, one applies a property metaphor to data. And in the digital world, that tendency can have quite limiting ramifications to future use of the information, as technology continues to outpace the social components to data sharing.

Misunderstanding the legalities can lead to category errors on the social level, including unintentional infringement or on the other side of the spectrum, choosing not to use the resource for fear of infringement. The intentions are often good – believing that applying a less-restrictive copyright license is ensuring the data can be freely shared, reused, and built upon. But without existing precedent or involving a legal team, these issues make for a problematic area to navigate, creating additional confusion and burdens for the users, as well as data providers.

Let’s look at a few examples to gain a better understanding.

Non-Commercial – When used in the context of data, what is a commercial use of the data web? Is it the extraction of a subset, a query that may touch on the data set, hyperlinking?

Attribution – As detailed above, the definitions of attribution and citation are often conflated. Attribution speaks to the legal requirement triggered by the use of the work. But in the case of linked open data, if one were to run a query involving 30,000 data sources (something that is happening every day at an ever decreasing cost), would they then be required to attribute the contributors for all 30,000 databases? You can see how this unintended consequence of attribution stacking could impose a very daunting task for the researcher.

Share-Alike – This toggle specifies that any derivative product be relicensed under the same terms. In the example above of running a large query, all it would take would be one database licensed with a share-alike provision for the entire derivative work to then be under the same terms and no other license. This leads to compatibility issues.
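The attribution-stacking and share-alike concerns above can be made concrete with a small sketch. It is a deliberately simplified model, assuming each data source carries a single license label; the labels, the `Source` type and the `obligations` helper are hypothetical and say nothing about how any real license is interpreted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    license: str  # hypothetical labels: "public-domain", "by", "by-sa"


def obligations(sources):
    """Summarise what a query result inherits from the sources it touched."""
    # Every attribution-style license adds one more party to credit.
    must_attribute = [s.name for s in sources if s.license in ("by", "by-sa")]
    # A single share-alike source is enough to bind the whole result.
    share_alike = any(s.license == "by-sa" for s in sources)
    return {"attribute": must_attribute, "share_alike": share_alike}


# A query over 30,000 attribution-licensed sources plus one share-alike source.
touched = [Source(f"db-{i}", "by") for i in range(30_000)]
touched.append(Source("db-sa", "by-sa"))

summary = obligations(touched)
print(len(summary["attribute"]), "sources to attribute")
print("whole result is share-alike:", summary["share_alike"])

# The same query over public domain sources inherits no obligations.
open_summary = obligations([Source(f"pd-{i}", "public-domain") for i in range(30_000)])
print(len(open_summary["attribute"]), "sources to attribute")
```

The contrast between the two summaries is the point of the examples: obligations accumulate with every licensed source, while public domain sources add none.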

There are other external mechanisms and limitations imposed by various jurisdictions and countries that can have a profound effect on data-sharing, especially in terms of international data sharing efforts. These include the sui generis database directive in the European Union, Crown Copyright, “sweat of the brow” and “industrious collection” limitations, trade secrets and unfair competition laws, adding another dimension of complexity to an already complex arena.

After convening a series of meetings, roundtables and other discussions with members of the scientific community, the need emerged for a legally accurate and simple solution, that reduced and/or eliminated the need for one to make the distinction of what’s protected. The conflict between understanding the legal issues and complexities can best be resolved by a two-fold approach: (1) a reconstruction of the public domain and (2) the use of scientific norms to request behaviour through a non-license means.

Converging on the Public Domain (+ Norms)

We believe that the public domain is the best means to achieve maximum interoperability of data with the lowest imposed burdens on the user. This can be achieved through the use of a legal tool – either the Creative Commons CC0 Waiver or the Public Domain Dedication and License (PDDL) – waiving all intellectual property rights and asserting that the provider makes no claims on the data. These tools put the work as closely into the public domain as possible.

It calls for data providers to waive all rights necessary for data extraction and re-use (i.e., copyright, sui generis database rights, claims of unfair competition, implied contracts). It also requires that the provider place no additional obligations such as copyleft or share-alike on the information, which could limit downstream use, as discussed above.

Science Commons also crafted the Protocol for Implementing Open Access Data – a protocol for evaluating database terms of use, in hopes of providing a unified framework for users to evaluate if any given database may be integrated with any other database.

The Protocol recommends one request behaviour, such as citation, through norms and terms of use rather than as a legal requirement based on copyright or contracts.

We are aware that different disciplines and jurisdictions call for different approaches, and this is not always a one-size-fits-all solution. With requesting behaviour through norms and terms of use rather than a legal construct, various scientific disciplines have the ability to develop their own norms for citation, allowing for legal certainty without constraining one community to the norms of another.

Final Thoughts

In the early days of the World Wide Web, there weren’t many free-libre licenses available, and after a debate over using GPL for the original web code, CERN chose to put it into the public domain. Getting the law out of the way was key to allow for network effects, and to the success of the Web.

Converge on the public domain and ensure the freedom to integrate. It’s the most scalable solution.

This work is licensed under a Creative Commons Attribution 3.0 License.

Resources

DataIncubator: What Is It and What's In It?

by Leigh Dodds

| this article first appeared in Nodalities Magazine, issue 8

The Linking Open Data project has had a huge amount of success in bootstrapping the burgeoning Linked Data cloud. There’s now a definite sense of momentum behind the project, and a growing number of organisations are now seriously investigating how their data could further enrich the growing Semantic Web, and how the underlying technologies may help them to innovate and explore new opportunities.

The Linked Data community has rightly begun to look at the next round of challenges: What can we do with all this data? How can it be pressed into service to create new applications? What kinds of frameworks do we need to support consumption of Linked Data? But it is important that we don’t lose sight of the fact that there’s still a huge amount of evangelism to be done and a great deal of data that could and should be part of the web of data. The Linked Data landscape is still not fully mapped out. In short, we need to keep up the process of accumulating, converting, publishing and linking data in as many different subject areas and disciplines as possible.

To date, the bootstrapping process has been supported by a number of community-led projects that convert and re-publish datasets to bring them into the web of data. The recently founded DataIncubator project (http://dataincubator.org) aims to adopt this same “show don’t tell” approach, but with the addition of some best practices and with an eye on long-term sustainability.

Sustainability, Repeatability, Reusability

A key goal of the project is to lightly formalise the way these dataset conversions are carried out to make sure they are sustainable, repeatable, and reusable. But why are these particular aspects important?

Firstly, let’s consider sustainability. As usage of the Linked Data cloud grows, we need to make sure that new data being added isn’t going to disappear later, e.g. because a small project website goes offline, or because the original project owner loses interest. It is critical, as serious applications begin to be built against this data, that consumers can rely on it. One of the primary ways the project is ensuring sustainability is through making use of the Talis Connected Commons scheme (http://www.talis.com/cc). All of the public domain datasets that are converted and published through the DataIncubator project site are being hosted in the Talis Platform. This takes full advantage of the free data hosting offered under the Connected Commons initiative. Talis is therefore contributing to the sustainability of that data.

The second aspect to consider is repeatability. The first goal is to make sure that the data conversion process is itself repeatable, that is, that we can easily re-generate the data to allow for modelling changes, bug fixes, and the ingesting of new data. And not just now when a project is active, but in three years’ time when the project may be picked up and extended by a number of other contributors. Ensuring that each of the incubated datasets is supported by open source code makes this more achievable. Ideally, the original dataset owners will be convinced by the benefits long before a project goes stale, but it’s important to recognise that evangelism can take time and that different industries move at different speeds. There are already a few Linked Data and RDF projects on the web that model and re-publish the same basic dataset in other ways. By trying to build a community around curating the conversion of a dataset and not just the data itself, DataIncubator hopes to avoid these issues.

The final aspect is one that is often overlooked: how can the original dataset owner build on what the community has created? How can the community’s efforts be reused? Reusability is enabled by ensuring that the conversion code is open source and that schemas and modelling design decisions are well documented. This can lower the barrier to entry facing data providers or publishers looking to embrace Semantic Web technology. This is the case particularly where the data conversion is acting on source data (e.g. open, but not linked data). In this case, the data owner may merely need to re-run the data conversion and publish the Linked Data through their own site rather than DataIncubator. This makes adoption much, much easier.

Community Norms

Alongside addressing these procedural aspects of the data conversion process, the DataIncubator project also encourages a number of useful community norms that will hopefully improve the quality of the converted datasets.

The first of these is to ensure that there is a sufficient amount of both linking and attribution. Every dataset within the umbrella project should reference its original sources. This should not take place just at a high level, such as within the corresponding VoID description: http://rdfs.org/ns/void/. Instead, references should be deeper so resources can be associated with, for example, the original web pages that describe them. This ensures that there is a clear path back to the original source of the data. Attribution, in various forms, is an important community norm in its own right, but it is especially important in the context of converting and re-publishing an existing dataset. We want to ensure that the original curators of the data don’t think that the community is trying to appropriate or steal their work. Quite the opposite, we want them to embrace it.

The other norm relates once again to sustainability. Links to the data should be stable, but how do we achieve this if the data will ultimately be removed from the DataIncubator site and moved to another domain? The proposal here is that as data is migrated to its permanent home, redirects will be put into place to ensure that web browsers and semantic web agents can follow the links to their primary source. Every effort will be taken to ensure that links don’t break.
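As a rough illustration of the redirect idea, the sketch below maps a URI on the incubator site to its new home while keeping the rest of the URI unchanged. It is only a sketch under assumptions: the two domain names are hypothetical, and the choice of an HTTP 301 (permanent) redirect is an assumption, since the proposal above does not name a status code.

```python
from typing import Optional, Tuple

# Hypothetical mapping from an incubator URI prefix to a dataset's permanent home.
MOVED_PREFIXES = {
    "http://example.dataincubator.org/": "http://data.example.org/",
}


def redirect_for(uri: str) -> Optional[Tuple[int, str]]:
    """Return (301, new_location) if the URI has migrated, otherwise None."""
    for old_prefix, new_prefix in MOVED_PREFIXES.items():
        if uri.startswith(old_prefix):
            # Keep the remainder of the URI intact so every resource
            # maps to exactly one new location.
            return 301, new_prefix + uri[len(old_prefix):]
    return None


if __name__ == "__main__":
    print(redirect_for("http://example.dataincubator.org/sets/42"))
```

A web server or URL-rewriting rule would apply the same prefix mapping, so that both browsers and semantic web agents following an old link arrive at the migrated resource.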

What’s In It?

The DataIncubator project already has a wide range of datasets available:

There’s a lot more that could yet be added to this list. My personal wishlist includes a conversion of the Prelinger Archives (http://www.archive.org/details/prelinger). This is hosted as part of the Internet Archive project and consists of over 2000 industrial, educational, travel, and propaganda videos published from 1903 to the 1970s. The content is completely within the public domain, so it’s just begging to be converted. It would also be a great dataset on which to explore the modelling of media and media annotations in general.

Currently, one domain with very little Linked Data is gaming, in all of its forms. For example, there is a vast amount of community-curated data about Lego, Lego sets, and Lego models. And what about all of the facts and figures that are routinely collected around online gaming? Data might be available through specific community websites, but what could be built if the data were more open, allowing the community to analyse and re-present this data in new ways?

It strikes me that games and gaming is an area that is ripe for exploration. There are many interesting dimensions to the data, and the communities are very engaged. Many gamers are typically very interested in statistics and data about the games they play. This is just one area of the Linked Data landscape that the DataIncubator project is hoping to help explore.

Trends and Barriers

|This article first appeared in Nodalities Magazine, Issue 7

If you follow the Nodalities blog, you may have read some of my recent posts discussing the trends boiling up around Web 3.0 (other buzzwords are available). The Mobile Web and upgraded connectivity in general; the rise of ubiquitous computing from chips in every product imaginable; Linked Data and the “Semantic Web” as an organising platform for this rising tide of data: these are three very broad trends seeing a lot of media attention presently. From where I’m standing, I tend to see the next great turning point of the Web as a convergence of some of these trends, and see it as a rise in the importance of and reliance upon data itself and data tools generally.

The mobile web is bringing new sorts of information to people, and they can make use of this info wherever they happen to be because of advances in devices and connectivity. As phones and web-enabled devices get better, so too do the chips we seem to have embedded all over the place, and we can now begin to have a clearer picture of what we do through the information we gather from our heaters, cars, and pedometers. Also, as more objects become connected, the grunt-work of number-crunching and storage is becoming commoditised into big, efficient, utility-like cloud services, which host and work with our collected information much more effectively than the gadget in your hand could ever hope to do. Others, like us here at Talis, talk about the Semantic Web, which allows for an evolution from a bunch of connected documents to the explicit connections between bits of information.

Also fermenting in this mix is a strengthening trend of political transparency and a public, shared ownership of social data. Barack Obama’s new administration has clearly made this a priority with the launch and work around data.gov; and in the UK, Sir Tim Berners-Lee himself has been appointed to a Parliamentary advisory role. There is growing pressure to be able to have access to public data, and to see it as belonging to the nation’s people rather than being allowed to be legitimately filed away in the great, locked bureau of the capitols.

So, picking up two fairly obvious trends here (Social, Public Data and Linked Data), it would seem to follow that people would begin to have access to previously unavailable information in usable, linked forms. And it’s certainly beginning, as articles elsewhere in this magazine have illustrated. But, what about other chunks of public data? What about when data comes from universities, institutions, scientific foundations and NGOs? What about charities monitoring crime, CO2 emissions and family histories? Wouldn’t these make a useful piece in the web of social data? What resources have the governments themselves got, if they want to make their public-owned data available in a useful format?

These questions form a major part of the thinking behind Talis’ Connected Commons initiative (talis.com/cc). Basically, Talis has made its Semantic Web platform (including data hosting and access tools) available free of charge for any datasets made available to the public. In doing so, we’re hoping to remove the barrier of cost entirely to publishing interesting data in a Linked Data way. One major reason for this is to promote reuse and mashups of this interesting data, and for people to be able to “follow their noses” to the data that completes their projects. But from a publisher’s perspective this is also important, because it removes a major reason not to bother making data useful, and not merely public. So, with this, data can be made public and usable, and the developers and users get the benefit of public SPARQL endpoints and API access to interesting data.

To keep the data open and public, datasets need to make use of either the Public Domain Dedication and License (PDDL) or Creative Commons’ CC0 waiver. Ian Davis, in his article in this magazine, explains more about waivers and the Connected Commons, and there is a lot more about this particular initiative over on the Talis site (talis.com/platform/cc/faqs/).

In a recent interview with the BBC, Sir Tim said: “This is our data. This is our taxpayers’ money which has created this data, so I would like to be able to see it, please.” I wonder if initiatives such as Connected Commons will begin to remove excuses, hindrances, and obstacles? As public awareness of the importance of access grows, this might become a political issue, as well as a pragmatic one. I hope that in the rush to publish data, and in the ensuing discussion and debate, the users, hackers and developers don’t get sidelined. I think the world is ready for its data back.

The Greatest Challenge Facing IT

by Lee Feigenbaum and Mike Cataldo

|This article features in Nodalities magazine, Issue 7

As the old adage goes: Time is money.

Ultimately, information systems are about saving time. One could argue that technology enables analysis that facilitates competitive differentiation or improved product quality, but the fact of the matter is that these things and others could all be done without computers; they would just take much, much longer.

A lot has been said and written about information overload. Ultimately, though, the issue with ever-expanding data is that the data we need becomes hidden in mountains of other data. Typically, these mountains take the form of relational databases where the data is neatly stored in rows and columns, and we find the data in one of two ways. Either we directly look up data by its “address” within the database, or else we use a simple text search. But if we don’t know what table or column the data resides in, we can’t look it up. And as the quantity of data grows, text searching the mountain of data itself yields a mountain of results. Combing through these results then compromises the real benefit of information technology: time savings.

This leads to the greatest challenge facing IT organisations across industries: how to provide users the data they need when they need it, visualised in a way that is understandable and useful. Or put more simply: get the right data, for the right people, at the right time. Traditionally, this is much easier said than done, as the data lives in multiple databases, exists in various formats, and no user interface exists to present the information in a way that is helpful to the user.

Typically, the approach to solving these problems involves some sort of data warehouse. Atop the warehouse, we’d probably deploy a business intelligence (BI) solution to surface the answers to common queries to the people who need them.

Another tactic might be to install a document management system that stores documents in a central repository, where employees can use search and basic metadata to better locate individual pieces of information.

Or we might build a portal to allow people to view the right data from multiple silos in a timely fashion. By defining a collection of portlets as views into specific sources of data, we can provide a one-stop location for people to view information from business-critical data sources.

Pursuing any of these typical solutions means spending 6-18 months at a time solving a single problem. And even worse, all of these approaches are doomed to obsolescence from the start. As requirements change, the fixed schemas and the complex ETL processes inherent to data warehouses must be recreated from scratch. The canned queries and views that define BI- and portal-based approaches must be constantly re-evaluated. And the limited search and query capabilities of a document management system mean that new requirements demand a new installation.

In short, traditional approaches all suffer from the dreaded Shampoo Syndrome: the only workable long-term solution is to constantly lather, rinse, and repeat. And when we do, we just create another mountain of data, another place where what we really need can hide.

The solution is to find data by its meaning rather than its location

The key to eliminating many of the inefficiencies of today’s information technology solutions is to access data by its meaning—what it is—rather than its location—where it is. With meaning, we can quickly find what we need simply by describing what it is. This enables information to be shared and consumed at the data level, a paradigm known as data collaboration.

With data collaboration, the data is much more granular, more accessible, and more consumable. In contrast, data warehouse, BI, and portal solutions, in addition to contact tracking (CRM), supply-chain management (SCM), employee management (HR), and all-in-one enterprise bundles (ERP), all fall into the category of data containment. While these applications (commonly known as data silos) excel in capturing extremely structured data, they make it almost impossible to get the data out to be re-used by other users and in other applications.

Document management systems, on the other hand, attempt to make information more shareable, but essentially end up creating many mini-silos in the form of Word documents, PDFs, Excel spreadsheets, or Web pages. This is the world of document collaboration, in which information is readily shared, but the data we need is locked within the mini-silo.

Data collaboration is the best of both worlds. By combining the ease of access to information that is the hallmark of document collaboration with the highly structured nature of data from data containment solutions, we can begin to answer the IT challenge. The key to success is to ensure that the meaning of every data element is surfaced so that it can be easily accessed by any person or application that needs it.

Data Collaboration and the Semantic Web

It’s no coincidence that the technology standards developed over the past ten years in support of Tim Berners-Lee’s vision of a Semantic Web are the key elements for building data collaboration solutions. For as with data collaboration, the Semantic Web relies on explicitly capturing the meaning of data. As such, the core Semantic Web standards pave the way for:

  • Flexible, define-as-it-arrives, data structures
  • Explicit relationships that travel with the data
  • Data that is accessed by its definition rather than its address
  • Distributed query

As with all standards, Semantic Web technologies lay the groundwork that makes improvement possible. It is up to application developers to build solutions that make the standards practical.

Practical Data Collaboration to Solve IT’s Challenge

Cambridge Semantics is one of the first companies to develop practical business solution enablers based on Semantic Web standards. In short, the Anzo products allow businesses to layer a semantic fabric over existing data that:

  1. Virtualizes the data so that it is accessible by its description regardless of location.
  2. Lets users create their own views of data.
  3. Fills in the views by traversing the fabric and picking out the relevant information.
  4. Keeps everything in synch by allowing updates that occur anywhere to update information everywhere.

The Right Data…

At the heart of the Anzo suite of products is the Anzo Data Collaboration Server. This acts as a central gateway that provides a consistent interface for applications to read, write, and query RDF data, regardless of the actual source of the data. While RDF provides the flexibility to incorporate new data as it is virtualised, it’s all for naught without the proper adaptors for existing data sources. To facilitate access to the right data, the Anzo Data Collaboration Server can connect to data sources including LDAP directories, HTTP-accessible Linked Data, and standard relational databases.

But perhaps one of the most useful connectors is Cambridge Semantics’ Anzo for Excel. With Anzo for Excel, data inside spreadsheets with arbitrary layouts can be linked into the Anzo Data Collaboration Server. By breaking down the walls of spreadsheet mini-silos, Anzo for Excel weaves information from thousands (or more) spreadsheets scattered across a business, dramatically increasing the availability of the right data.

…For The Right People

Getting the data in front of the right people relies on three things: context, security, and “reach”.

Context. It’s not enough simply to have the right data. People must have access to views of the data that depict exactly what they need to see, whether it be an executive dashboard, a regional summary map, or a customer-by-customer detailed report. Cambridge Semantics’ visualisation product, Anzo on the Web, allows the same information to be rendered in many different ways via semantic lenses. Lenses provide context-appropriate user interfaces to render a particular type of data, meaning that the right people see the right data in the right way.

Security. In many ways, security is the converse of context. While context ensures that the right data surfaces properly to the right people, robust security makes sure data does not surface to the wrong people. The Anzo Data Collaboration Server provides security by layering a role-based access control model atop the semantic fabric. All data access is gated through this security model, which defers to the permissions schemes of legacy data sources where appropriate. The result is that only the right people can ever see (or change) the right data.

Reach. The right data needs to be able to be brought to the right person, whether that person is a technical staff member, a line-of-business manager, a “power user,” or a senior executive. As such, the software must be within reach of all users, without the need to call on IT. Research analysts must be able to collect and share spreadsheet data themselves. Anzo for Excel reaches these users by allowing spreadsheets to be visually linked with just a few clicks. Supply-chain managers must be able to drill through data on warehouses, suppliers, and distributors on their own terms. Anzo on the Web reaches these users via a simple and customisable faceted browsing paradigm, whereby anyone can add their own filters, add their own lenses, query their data however they like, and save the results to re-run later or share with colleagues.

…At The Right Time

Finally, it’s not enough to just bring the right data to the right people. It also needs to be done in a timely fashion.

First, data access against existing data sources is accomplished via federated (distributed) query. SPARQL is explicitly designed to enable queries that access multiple data sources at once, and the Anzo Data Collaboration Server includes a SPARQL engine that does exactly that. By querying the source data directly, Anzo eliminates the cycle time typically associated with a data warehouse’s ETL processes.
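The general idea of a federated query can be sketched in a few lines: each source answers only its own part of the question, and the partial answers are joined on a shared value at query time rather than being copied into a warehouse first. The sketch below is a toy illustration in Python, not SPARQL and not how the Anzo engine works; the two sources, their field names, and their rows are all made up.

```python
# Two independent "sources", each answering only its own part of a question.
# Source names, field names, and rows are hypothetical.

def query_hr_source():
    """Pretend HR system: which employee belongs to which department."""
    return [
        {"employee": "e1", "department": "d1"},
        {"employee": "e2", "department": "d2"},
    ]


def query_finance_source():
    """Pretend finance system: the budget of each department."""
    return [
        {"department": "d1", "budget": 100},
        {"department": "d2", "budget": 250},
    ]


def federated_join(left_rows, right_rows, key):
    """Join the partial results from two sources on a shared key."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    # Merge every left row with each right row that shares the key value.
    return [
        {**left, **right}
        for left in left_rows
        for right in index.get(left[key], [])
    ]


if __name__ == "__main__":
    for row in federated_join(query_hr_source(), query_finance_source(), "department"):
        print(row)
```

Because the join happens when the question is asked, the answer reflects whatever each source holds at that moment, which is the property the text contrasts with a warehouse's ETL cycle.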

Second, data updates performed via the Anzo Server are broadcast out in real-time to anywhere the data resides. This means that if a value is changed in a spreadsheet cell, the value instantly updates anywhere else it appears, including Web pages or within a relational database. This is essential as many spreadsheets, Web pages, and databases will share the same piece of data with confidence as semantic tools are made available to users across the business enterprise.

Data Collaboration in the Days to Come

Imagine a world in which this challenge has been solved. End users—whether knowledge workers, line of business managers, or executives—can simply draw a picture of what they want to see and then choose the data that should fill in the picture. Within minutes rather than months the right data shows up on the right people’s screens. Now imagine that the data is live as well: you make a correction to the data and your changes are reflected in real-time in whatever legacy database or application the data comes from. You’ve managed to maintain a single source of truth for your key information assets, while still preserving existing investments in legacy systems and applications.

What sounds miraculous is possible today, in software such as Cambridge Semantics’ Anzo. By combining the revolutionary enabling capabilities of Semantic Web standards with solid, practical engineering, we open the door on a completely new paradigm for enterprise software: data collaboration.

Lee Feigenbaum is VP of Technology and Standards at Cambridge Semantics and co-chairs the W3C SPARQL Working Group.

Mike Cataldo is currently CEO of Cambridge Semantics and a veteran of multiple technology start-up companies.


Building A Civic Semantic Web

By Joshua Tauberer
| This article features in Nodalities Magazine, Issue 7

Technology is a new key player in government accountability and transparency. It’s our own defense against the threat of government information overload. Take the U.S. Congress: More than 10,000 bills are on the table for discussion at any given time, and Members of Congress are taking campaign contributions from thousands of sources. How can a representative be accountable if his legislative actions are too numerous to track? How can financial disclosure root out conflicts of interest if the interesting ones are buried deep within piles and piles of records? The threat to transparency isn’t sheer volume, however. It’s the complex network of relationships that makes up the U.S. Congress, and that makes it an interesting case for applying Semantic Web technology.

What the Semantic Web addresses is data isolation, and this is a problem for understanding Congress. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily meshable that MAPLight is possible. The Semantic Web makes this process cheaper by addressing meshability at the core. The more government data that is meshable, the easier it is to investigate connections across independent data sets, research the dynamics of the system, or teach others how Congress works.

Innovating the public’s engagement with Congress by applying technology has been the motivation behind my site www.GovTrack.us, a free congress-tracking tool that I built and have been running since 2004. GovTrack amasses a large XML database of congressional information, including the status of legislation, voting records, and other bits, by screen scraping official government websites that have the data online already but in a less useful form.

If “metadata” is tabular, isolated, and about web resources, the Semantic Web goes far beyond that. It helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in Congress with Members of Congress, what districts they represent, their population demographics, etc. We establish relations like sponsorship, represents, voted, and population across entities of many types. A web lets us ask new questions, and from there transform their answers into visualizations. And because the Semantic Web is a generic platform for all data, I actually think it has the potential to radically and fundamentally transform the way we learn, share information, and live, but that’s still a bit far off.

So for the purposes of my tinkering with the Semantic Web, GovTrack creates an RDF dump of its database (13 million triples) covering bills, politicians, votes and more using a mix of existing schemas and some new ones that I created. I chose URIs for entities in the Linked Open Data tradition: HTTP-dereferenceable URIs that resolve to self-describing RDF/XML about the entity. Two good examples are for Senator John McCain and for H.R. 1, the economic recovery bill passed earlier this year. The HTML pages on GovTrack itself tie in to the RDF world through tags: bill pages include the URI I coined for the bill, for instance.

I also have a sometimes-working-sometimes-not SPARQL endpoint set up, SPARQL being the de facto query language for RDF. SPARQL lets us ask questions of the data, such as how politicians voted on bills (see Example 1). The SPARQL endpoint runs off of a “triple store”, the equivalent of a relational database for the semantic web, which under the hood is a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. (It uses my own C#/.NET RDF library: http://razor.occams.info/code/semweb.) The RDF/XML returned by dereferencing the URIs is actually auto-generated by redirecting the user to a SPARQL DESCRIBE query (e.g. http://www.rdfabout.com/sparql?query=DESCRIBE+%3Chttp://www.rdfabout.com/rdf/usgov/congress/111/bills/h1%3E) using URL rewriting in Apache (for a robust solution, see my explanation at the end of http://rdfabout.com/demo/census/). For more about GovTrack’s RDF data, see http://www.govtrack.us/developers/rdf.xpd.
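To make the "table of triples" idea concrete, here is a minimal sketch in Python. It is not the author's C#/.NET library: an in-memory SQLite database stands in for MySQL, the predicates and values in the rows are invented, and the `describe` helper only approximates a DESCRIBE query by returning the triples whose subject is the entity. The `describe_url` helper shows how the DESCRIBE request URL quoted above can be assembled from an entity URI.

```python
import sqlite3
from urllib.parse import quote_plus

# In-memory SQLite stands in for the MySQL database mentioned in the text.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

BILL = "http://www.rdfabout.com/rdf/usgov/congress/111/bills/h1"

# Illustrative rows only: the example.org predicates and values are made up.
db.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (BILL, "http://example.org/title", "Example bill title"),
        (BILL, "http://example.org/sponsor", "http://example.org/people/1"),
        ("http://example.org/people/1", "http://example.org/name", "Example Person"),
    ],
)


def describe(subject):
    """Return every (predicate, object) pair stored for one subject."""
    rows = db.execute(
        "SELECT predicate, object FROM triples WHERE subject = ? ORDER BY predicate",
        (subject,),
    )
    return rows.fetchall()


def describe_url(entity_uri, endpoint="http://www.rdfabout.com/sparql"):
    """Build the URL of a SPARQL DESCRIBE request for one entity."""
    query = f"DESCRIBE <{entity_uri}>"
    # safe=":/" leaves colons and slashes unescaped, matching the URL in the text.
    return f"{endpoint}?query={quote_plus(query, safe=':/')}"


if __name__ == "__main__":
    print(describe(BILL))
    print(describe_url(BILL))
```

A URL-rewriting rule on the web server can then send a request for the entity URI to the address that `describe_url` produces.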

When data gets big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my area of the Semantic Web as several clouds. One cloud is the data I generate from GovTrack. Another cloud is data I separately generate about campaign contributions from data files from the government’s Federal Election Commission (FEC): 10 million triples. This cloud relates politicians to election campaigns and elections, campaign donors with zipcodes, and contribution amounts. A third data set is based on the 2000 U.S. Census, 1 billion triples. The census data has population demographics for many geographic levels, including states, congressional districts, and postal zipcodes (actually “ZCTA”s but we can put that aside). (For more, see http://rdfabout.com. Through the Census cloud the data is linked to Geonames and the rest of the Linked Open Data community.)

I’ve related the clouds together so we can take interesting slices through them. The GovTrack data connects to the FEC data through politicians. The Census data connects to the GovTrack data through states and congressional districts (the regions represented by senators and representatives) and to the FEC data through zipcodes. That means we can ask questions that go beyond one data set, such as: what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode, etc.? Once the Semantic Web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the heavy work of meshing two data sets for each new question once the data is already in RDF with connected URIs.
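The way shared URIs tie the clouds together can be sketched with three tiny lists of triples. Everything in the sketch is hypothetical: the `ex:` identifiers, the predicate names, and the numbers are placeholders rather than values from the real GovTrack, FEC, or Census data. The point is only that a question spanning all three data sets needs no extra meshing step once they use the same identifiers.

```python
# Three tiny "clouds" of (subject, predicate, object) triples.
# Every identifier and value here is a made-up placeholder.
govtrack = [
    ("ex:sen1", "ex:represents", "ex:stateA"),
    ("ex:sen2", "ex:represents", "ex:stateB"),
]
fec = [
    ("ex:sen1", "ex:totalContributions", 1000),
    ("ex:sen2", "ex:totalContributions", 2500),
]
census = [
    ("ex:stateA", "ex:medianIncome", 40000),
    ("ex:stateB", "ex:medianIncome", 55000),
]


def objects(triples, subject, predicate):
    """All objects of triples matching a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def income_and_contributions():
    """Walk politician -> state -> median income, and politician -> contributions."""
    rows = []
    for politician, predicate, state in govtrack:
        if predicate != "ex:represents":
            continue
        for income in objects(census, state, "ex:medianIncome"):
            for total in objects(fec, politician, "ex:totalContributions"):
                rows.append((politician, income, total))
    return rows


if __name__ == "__main__":
    for row in income_and_contributions():
        print(row)
```

Asking a different question means writing a different walk over the same triples, which is the low marginal cost described above.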

Figure 1

My dream is to be able to plug SPARQL queries into visualization websites like Many Eyes, Swivel, and mapping tools and instantly get an answer to my question in a compelling form. For now, some copy-paste is necessary. Let’s take an example. Did a state’s median income predict the votes of senators on H.R. 1, the economic recovery bill? Perhaps the senators from the poorest states, likely the most affected by the economic trouble, were more likely to want economic stimulus. This query takes a path through two of my clouds, depicted in Figure 1. The SPARQL query mimics the picture: each edge corresponds to a statement in the query. The real query, however, is more complicated (it’s given at http://www.govtrack.us/developers/rdf.xpd). It is complicated not because RDF or SPARQL are inherently complicated, but because the data model that I chose to represent the information is complicated. That is, I made my data set very detailed and precise, and it takes a precise query to access it properly. If you run it on the SPARQL form on that page, get the results in CSV format, copy them into Excel, and run a correlation test, you’d indeed find a moderate correlation between median income and vote, but in the direction opposite to what we expected. (I know why, but I’ll let you think about it.)
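For readers who would rather script the last step than paste into Excel, the correlation test can be sketched in Python. This is a sketch under assumptions: the CSV column names (`state`, `median_income`, `vote`) and the vote label `Aye` are hypothetical, and the four sample rows are invented placeholders that say nothing about the real result. The function computes a plain Pearson correlation with the vote coded as 1 or 0.

```python
import csv
import io
import math

# Placeholder CSV standing in for the endpoint's export.
# Column names and every value are invented for illustration.
SAMPLE_CSV = """\
state,median_income,vote
A,10,Aye
B,20,Aye
C,30,No
D,40,No
"""


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def income_vote_correlation(csv_text, yes_label="Aye"):
    """Correlate median income with a yes/no vote coded as 1.0/0.0."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    incomes = [float(row["median_income"]) for row in rows]
    votes = [1.0 if row["vote"] == yes_label else 0.0 for row in rows]
    return pearson(incomes, votes)


if __name__ == "__main__":
    print(income_vote_correlation(SAMPLE_CSV))
```

With the real export in place of `SAMPLE_CSV`, the sign of the returned value gives the direction of the relationship and its magnitude the strength.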

Figure 2

Another interesting case is whether campaign contributions to congressmen mostly come from their district, or if they get contributions from sources far away. The SPARQL query listed in Example 2 extracts the relevant numbers for Rep. Steve Israel from New York: for each zipcode, the total amount of campaign contributions he received from individuals with addresses in that zipcode in the last election. Figure 2 puts these values on a map, with congressional districts overlaid as well. A form where you can submit a SPARQL query like these examples and see the results instantly on a map would be incredible for data investigation.

So what is government transparency, practically speaking? It’s more than just information disclosure. Transparency means the public can get answers to their burning questions. The more questions they can answer from a dataset, the more transparency it provides. We can have more transparency without necessarily more disclosure but instead with the ability to apply better tools. Meshing and querying government datasets with RDF and SPARQL could be a new way to reach new heights of civic engagement and public oversight.

Example 1

Get a table of how senators voted on all of the Senate bills in 2009-2010:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bill: <http://www.rdfabout.com/rdf/schema/usbill/>
PREFIX vote: <http://www.rdfabout.com/rdf/schema/vote/>

SELECT ?bill ?voter ?option WHERE {
  ?bill a bill:SenateBill .
  ?bill bill:congress "111" ;
        bill:hadAction [
          a bill:VoteAction ;
          bill:vote [
            vote:hasOption [
              vote:votedBy ?voter ;
              rdfs:label ?option ;
            ]
          ] ;
        ] .
}

Example 2

Get total campaign contributions to Rep. Steve Israel by zipcode:

PREFIX fec: <http://www.rdfabout.com/rdf/schema/usfec/>

SELECT ?zipcode ?value WHERE {
  ?campaign fec:candidate .
  ?campaign fec:cycle 2008 .
  ?zipcode fec:zipAggregatedContribution [
    fec:toCampaign ?campaign ;
    fec:amount ?value
  ] .
  ?zipcode fec:zcta ?uri .
}


RDFa and Linked Data in UK government web-sites

By Mark Birbeck

This article will feature in Nodalities Magazine, Issue 7

The UK government’s Central Office of Information had a straightforward problem to solve: how could they create a centralised web-site of information that the public could search and access, when the source of that information could be any government department database or any public sector web-site?

For example, different organisations, such as Her Majesty’s Revenue and Customs (HMRC) or the National Health Service (NHS) would each post job vacancies to their own web-sites, but there was no central site that the public could go to, to find all public sector vacancies. This would be a problem at any time, but in the midst of attempts by the government to help people through the recession, it’s crucial to ensure that the public knows what vacancies are available. It might not occur to someone looking for a job as a plumber or an electrician that they should visit the NHS or Army web-sites, so a centralised site could make a big difference.


Similarly, as in most modern democracies, government departments are constantly seeking feedback from the public and interested parties, about specific issues. But as with job vacancies, these consultations are on departmental sites, rather than being available on a central site; from the Department of Energy and Climate Change (DECC) seeking feedback on clean coal, to the Ministry of Justice (MOJ) providing an opportunity for people to comment on prisoners’ voting rights, each department manages its own publication of consultations.

Traditional solutions

Traditional answers to these problems would have been either (a) to require each department to key its data directly into a new central database (which would in turn drive the central web-site), or (b) to create complex communication pipelines that would allow the decentralised databases to communicate with the central system.

Either of these solutions would almost certainly have turned out to be a non-starter.

The first solution was unlikely to ever get off the ground, because it would have required each department to replace their existing technology with something new. Even if there was agreement on what that technology should be—and that in itself could take an age to resolve—there would have been a need for new development work, retraining of users, porting data from older systems, and so on.

The second ‘traditional’ solution at least has the merit of keeping existing systems intact, but would have required additional interfaces to be created to move the data from the departmental servers to the centre; each department would have had to create an interface between their own system and the central one.

Just getting one department into a situation where they could centralise their information would have been a major undertaking—not only were there lots of departments to consider, but each department was using a different technology to publish their vacancies or consultations to the web. For example, some departments with only a small number of job vacancies would likely use static HTML pages. Other departments, perhaps with larger IT departments, might use ASP.NET or a Java-based system.

Enter RDFa

The RDFa answer to this set of problems is simple—both conceptually, and to implement.

RDFa allows HTML publishers to embed RDF in their pages, so that they use the existing HTTP and HTML infrastructure to publish their information. This simple method of publishing data in turn means that any system can import this data, just by obtaining (or creating) an RDFa parser.

In short, each department can keep their own data management system, and simply add code to their existing web-page publishing step to augment the HTML with the data as RDFa. The central system in turn only needs one import mechanism—something that understands RDFa.
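To make the publishing side concrete, here is a minimal sketch of what "adding code to the publishing step" might look like for a department that renders its vacancy pages from a script. The `about`, `typeof`, `property` and `content` attributes are standard RDFa; the vocabulary URI, the `v:` property names and the function itself are hypothetical and are not the ones used in the COI project.

```python
from html import escape

# Hypothetical vocabulary URI, used only for illustration.
VOCAB = "http://example.org/vacancy#"


def render_vacancy(uri, title, location, closing_date):
    """Return an HTML fragment for one vacancy, annotated with RDFa attributes."""
    return (
        f'<div xmlns:v="{VOCAB}" about="{escape(uri)}" typeof="v:Vacancy">\n'
        f'  <h2 property="v:title">{escape(title)}</h2>\n'
        f'  <p>Location: <span property="v:location">{escape(location)}</span></p>\n'
        f'  <p>Closes: <span property="v:closingDate" content="{escape(closing_date)}">'
        f'{escape(closing_date)}</span></p>\n'
        f'</div>'
    )


# Invented vacancy, only to show the output shape.
print(render_vacancy("http://example.org/jobs/1", "Electrician", "Leeds", "2009-09-30"))
```

The visible page is unchanged apart from the extra attributes, which is why the existing publishing system needs so little modification.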

Adding this facility to an individual department’s publishing system proved to be very quick and straightforward. But it’s not just UK government departments that are finding it straightforward to add RDFa to their pages. It was interesting to hear at SemTech in June that Google’s rich snippet launch partners (such as Yelp) were able to add RDFa support in “roughly a day”.

RDF publishing techniques

Adding data to web-pages might seem quite an obvious technique, but there are two important things to note here.

First, the COI has to be commended for having the vision to publish RDF at all. Of course, now that Gordon Brown has asked for Sir Tim Berners-Lee’s help in making government data publicly available, it seems pretty obvious—indeed it may even become fashionable! But the COI were planning this project at least a year ago, and at that time RDF was by no means a done deal (and you could say it’s still not).

But the second important thing is that even after deciding to publish RDF, it’s still not immediately obvious that the solution should involve RDFa, especially not a year ago.

The usual means of publishing RDF is to provide a distinct source of data in the form of RDF/XML (and perhaps other formats, too, such as N3). If there is an HTML version it usually exists for the purpose of describing the data itself. In other words, the RDF/XML format is primary, which means that anyone who is publishing HTML pages but wants to publish RDF as well, will need to add an extra piece of infrastructure that exists alongside their web-pages.

RDFa turns this on its head, and says that the HTML page is the data. One and the same page can be read as an HTML page, or as an RDF page, which in turn means that the changes required to the existing publication system are minimal. The COI once again showed its far-sightedness by adopting this technique.
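As a rough illustration of reading a page as data, the sketch below pulls subject, property and value statements out of an HTML fragment using only Python's standard library. It handles just the `about`, `property` and `content` attributes, leaves prefixed names such as `v:title` unexpanded, and is in no sense a conforming RDFa parser; the sample page and its `v:` property names are invented.

```python
from html.parser import HTMLParser


class TinyRDFaReader(HTMLParser):
    """Collect (subject, property, value) tuples from a small subset of RDFa."""

    # Elements that never have a closing tag, so they are not put on the stack.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.triples = []
        self._stack = []  # one frame per open element

    def _current_subject(self):
        return self._stack[-1]["subject"] if self._stack else None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An `about` attribute sets the subject for this element and its children.
        subject = attrs.get("about") or self._current_subject()
        pending = None
        if "property" in attrs:
            if "content" in attrs:
                # An explicit `content` value wins over the visible text.
                self.triples.append((subject, attrs["property"], attrs["content"]))
            else:
                pending = attrs["property"]
        if tag not in self.VOID:
            self._stack.append({"subject": subject, "property": pending, "text": []})

    def handle_data(self, data):
        # Visible text becomes the value of every enclosing pending property.
        for frame in self._stack:
            if frame["property"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in self.VOID or not self._stack:
            return
        frame = self._stack.pop()
        if frame["property"]:
            value = "".join(frame["text"]).strip()
            self.triples.append((frame["subject"], frame["property"], value))


# Invented page fragment, only for illustration.
PAGE = """
<div about="http://example.org/jobs/1">
  <h2 property="v:title">Electrician</h2>
  <p>Closes: <span property="v:closingDate" content="2009-09-30">30 September</span></p>
</div>
"""

reader = TinyRDFaReader()
reader.feed(PAGE)
reader.close()
for triple in reader.triples:
    print(triple)
```

A browser shows the heading and the closing date as ordinary text, while the reader sees two statements about the same vacancy: one document, two readings.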

Turtles all the way down

But the benefits of RDFa don’t just stop there. Firstly, because the data is being published via HTTP and HTML, it’s possible for anyone to read the same data, not just the centralised web-site that was being planned. This means that third party job vacancy sites, for example, could import vacancies from relevant departments, to add to their databases. In fact, one of the main drivers for the consultations project was to try to help improve the accuracy of an already existing web-site (set up by a member of the public) that used ‘screen-scraping’ to try to keep up with the available consultations; RDFa provides much more accurate information.

In addition, the centralised web-site will not only import RDFa but publish it too. This means that third-party servers are also able to import some or all of the centralised data, into their own sites.

And thirdly, by using RDFa the sites could provide information to search applications such as SearchMonkey.

As more servers both consume RDFa from one set of servers, and publish RDFa again to a variety of other servers, we enter the exciting world of Linked Data, and it’s ‘turtles all the way down’.

Conclusion

By using RDFa to address the challenge of making distributed data available in one place, the COI avoided having to make changes to each department’s systems. But once each department is publishing RDFa, it becomes possible for third parties to consume that information however they see fit. Such a flexible architecture is crucial in the age of open government, and is a cornerstone of linked open data.

Mark is managing director of Backplane Ltd. (http://webBackplane.com/), a London-based company involved in a number of RDFa/linked data projects for UK government departments. He is the original proposer of RDFa.