Nodalities

From Semantic Web to Web of Data

A Year of Open Government Data: Transparency, but also Innovation

Towards the end of 2010, Wikileaks generated many headlines as it published information on the web, causing controversy and leading to talk about politicians hiding information from the public. Reporters and commentators expressed shock or admiration when telling the story of a rogue organisation making governmental information public. What has not been as mainstream is that for the past year or more, governments around the world have been doing something very similar themselves: publishing information online.

Big names like President Obama, Sir Tim Berners-Lee and the headliners at big events like the International Open Government Data Conference favour publishing public data for transparency and benefits to society. This all finally began to take off in 2010. Governments from around the world have been developing their public information strategies, with the launches of data.gov, data.gov.uk and data.govt.nz.

This is all taking place at a time of economic restraint. Dr Martin Read from the UK Cabinet Office’s Efficiency Reform Board explained in a recent interview: “If you are going to improve the efficiency of something, making that change involves risk and innovation … If they get it wrong, they’re hauled up in front of a committee for interrogation.” (moderngov, November 2010) It may seem tricky to justify the expense of big projects like data.gov.uk, and there certainly seems to be a huge amount of pressure.

Nevertheless, governments are proving themselves committed to prioritising data publishing. Towards the end of last year, the UK Prime Minister announced that every item of governmental spending over £25,000 will be published online, and updated monthly. He emphasised the importance of this publication in terms of transparency, inviting the public to scrutinise the data. Interestingly, he also said: “This scrutiny will act as a powerful straightjacket on spending, saving us a lot of money.” So, not only is data publishing seen as a benefit to democracy, but also as a useful way to “flag up waste”.

While that press conference was taking place, developers and civil servants were gathered together elsewhere at the Open Government Data Camp (disclosure, Talis was a sponsor). At the event, much was made of the modelling and tools which have been developed with open data in mind: particularly the Linked Data API, which allows developers from just about any web background to work with data.gov.uk’s data very quickly. Visualisations demonstrated what can be done with well-structured data.

One of the things this high-level data publishing has done is raise the standard for what can be published and developed. Last year, we built a proof-of-concept app for the Department for Business, Innovation and Skills (BIS) to illustrate the potential of applications of this data. A few minutes spent on DEFRA’s UK Climate Projections site shows what can happen when raw data is matched with a plan, and is designed with a citizen in mind. Anyone can check the primary source for their government’s climate policy, and it doesn’t take a climatologist to understand it. A little further development allows fully-fledged applications to be built that are instantly useful: one available on the front page of data.gov.uk lets me download an app that helps me plan my cycle route!

Open government data is probably good for transparency. But it’s also got plenty of potential to seed ideas that add value to this information. Innovators know that there are more people with better ideas outside our organisations than could possibly be in them, so sharing means that those ideas can be developed into products and services that are mutually beneficial. The web industry routinely works with open-source software that’s been at least partly built by others, and this open-source mentality might just be an incredibly useful piece in the public-sector machinery. Open business models work very well with ideas.

2011 promises to be the year when all this data gets put to use. I was recently invited to a press conference at which the Deputy Prime Minister confirmed the UK’s commitment to published data as a priority and even a recognised civil liberty. The story will shift to more local applications of big public data tools. January will see the publication of local authorities’ spending data, and public bodies will be looking to add value to this data, bringing the headlines of open data to life in the places we live.

With a bit of thought into how data is published in the first place, and a plan for encouraging people with good ideas to work with this information, this investment in data publishing could be more than just a tick-box exercise for a political transparency agenda. I hope that this year, it won’t be Wikileaks-level events that get people talking about open data publishing. We should notice it improving services we use, and see whole new applications for the bits and pieces of information that make up our public lives.

Sharing Data on the Web

This article will appear in Nodalities Magazine, Issue 9.

by Kaitlin Thaney
Program Manager of Science Commons, Creative Commons

In the emerging data web, there have been multiple efforts working towards the same broad goal of data sharing (e.g., the NeuroCommons, Linked Open Data, efforts of the World Wide Web Consortium), but these are still unevenly distributed. Our understanding of the legal, social and technical issues is increasing, but is still at a very early stage.

This past fall at the International Semantic Web Conference in Chantilly, VA, USA, I joined three other leading minds to lead a tutorial examining some of the legal and social frameworks for sharing data in the emerging data web, focusing on an overview of the need for access, the social issues of applying Free-Libre/Open Source (FLOSS) licenses to data, and the approach we advocate at Creative Commons to help navigate this complex space — converging on the public domain.

Lessons Learned

Creative Commons as an organisation works to make knowledge sharing easy, legal and scalable – with applications in the culture space (music, text, film, art), education (open educational resources, virtual textbooks), and science (biological materials transfer, data sharing, Open Access, semantic web, patents). We maintain an integrated approach, and craft policy and legal tools to lower the barriers to knowledge sharing.

When it comes to data sharing, first and foremost, the information needs to be legally and technically accessible. The Open Access movement has increased awareness of this, using the Creative Commons licensing suite to unlock content, and has seen its share of qualified success. But what to do when the information you want to share and reuse falls outside the protections of copyright?

In short, it’s complicated.

This is where the discussion of legal protections for data gets murky. Knowledge is not always copyrightable – it may be easy to discern the rights associated with journal articles, but what about data, ontologies, annotations, or research statements described in triples?

The emergence, adoption, and use of the free-libre/open licensing regimes has allowed for remix and reuse of software code, music, film, educational resources and scientific research in a way that otherwise would be difficult to achieve.

The success of these licensing approaches has changed the social ethos of licensing: copyright, traditionally an “all rights reserved” model, is instead used to make something more free, rather than less.

But from our research, this approach is not ideal for data. The trend towards applying licenses, click-wrap agreements and other sorts of restrictions on scientific data is increasing, but with the undesired consequence of limiting the downstream use of this information, and even at times blocking interoperability. The costs are high, the terms are not always clear, nor the protections always legally sound, making it very difficult to scale for scientific uses. The result is a high barrier to entry for anyone wanting to do meaningful analysis, annotation, search, etc. on the exponentially growing mass of available data, or to integrate it with the available literature.

We advocate an approach of converging on the public domain, and requesting behaviours often found in the various flavours of free and open licensing through norms – not a legal construct. But first, let’s take a look at some of the issues to be aware of and their social implications to furthering the goal of linked open data.

Attribution v. Citation

Under US Copyright law, “Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.” Since facts are not covered by copyright, attribution – a license obligation – doesn’t seem to apply to ideas or facts either, since those rights are conditional on compliance with terms of the license.

Socially, the scholarly concept of citation is fairly well understood – credit where credit is due. It has long been viewed as an entrenched norm of good scientific practice.

But when it comes to the legalities of both terms and how to enact this behaviour, the devil is in the details, and the two are actually rather different when it comes to enforceability and applications / ramifications in the digital world.

In a copyright license, the word “attribution” is a legal requirement, whereas citation evokes more of a club mentality and social practice. Citation in its sole form is not assured or enforceable in the same way, but that’s not necessarily a downside. Ask yourself this: which one is more important – legal enforcement or credit enforced through professional reputation? Attribution – a relatively narrow legal term that can affect interoperability while at the same time possibly failing to provide what you really want? Or citation – an entrenched scientific norm that asks for credit where credit is due?

Implications of FLOSS toggles and directives on data sharing

These issues emerge when instead of focusing on maximizing interoperability of resources, one applies a property metaphor to data. And in the digital world, that tendency can have quite limiting ramifications to future use of the information, as technology continues to outpace the social components to data sharing.

Misunderstanding the legalities can lead to category errors on the social level, including unintentional infringement or on the other side of the spectrum, choosing not to use the resource for fear of infringement. The intentions are often good – believing that applying a less-restrictive copyright license is ensuring the data can be freely shared, reused, and built upon. But without existing precedent or involving a legal team, these issues make for a problematic area to navigate, creating additional confusion and burdens for the users, as well as data providers.

Let’s look at a few examples to gain a better understanding.

Non-Commercial – When used in the context of data, what is a commercial use of the data web? Is it the extraction of a subset, a query that may touch on the data set, hyperlinking?

Attribution – As detailed above, the definitions of attribution and citation are often conflated. Attribution speaks to the legal requirement triggered by the use of the work. But in the case of linked open data, if one were to run a query involving 30,000 data sources (something that is happening every day at an ever decreasing cost), would they then be required to attribute the contributors for all 30,000 databases? You can see how this unintended consequence of attribution stacking could impose a very daunting task for the researcher.

Share-Alike – This toggle specifies that any derivative product be relicensed under the same terms. In the example above of running a large query, all it would take would be one database licensed with a share-alike provision for the entire derivative work to then be under the same terms and no other license. This leads to compatibility issues.
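To make the stacking and share-alike effects concrete, here is a toy sketch in Python. It is an illustration under simplifying assumptions, not legal analysis: the `Source` class, the `obligations` function and the license labels ("PD", "BY", "BY-SA", "X-SA") are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    license: str  # hypothetical labels: "PD", "BY", "BY-SA", "X-SA"


def obligations(sources):
    """Return (names that must be attributed, license forced on the result).

    Toy rules assumed here: any label starting with "BY" requires
    attribution, and any label ending in "-SA" forces its own terms
    onto the combined result.
    """
    attribute = [s.name for s in sources if s.license.startswith("BY")]
    share_alike = sorted({s.license for s in sources if s.license.endswith("-SA")})
    if len(share_alike) > 1:
        # Two different share-alike terms cannot both govern one result.
        raise ValueError(f"incompatible share-alike terms: {share_alike}")
    forced = share_alike[0] if share_alike else None
    return attribute, forced


# A query touching 30,000 sources, exactly one of which is share-alike.
sources = [Source(f"db{i}", "BY") for i in range(29_999)]
sources.append(Source("db29999", "BY-SA"))

attribute, forced = obligations(sources)
print(len(attribute), forced)  # prints: 30000 BY-SA
```

A single share-alike source decides the terms of the whole result, and every one of the 30,000 sources still needs its own credit line. Adding a second source under a different share-alike label makes the combination impossible in this model, which is the compatibility problem described above.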

There are other external mechanisms and limitations imposed by various jurisdictions and countries that can have a profound effect on data-sharing, especially in terms of international data sharing efforts. These include the sui generis database directive in the European Union, Crown Copyright, “sweat of the brow” and “industrious collection” limitations, trade secrets and unfair competition laws, adding another dimension of complexity to an already complex arena.

After convening a series of meetings, roundtables and other discussions with members of the scientific community, the need emerged for a legally accurate and simple solution, that reduced and/or eliminated the need for one to make the distinction of what’s protected. The conflict between understanding the legal issues and complexities can best be resolved by a two-fold approach: (1) a reconstruction of the public domain and (2) the use of scientific norms to request behaviour through a non-license means.

Converging on the Public Domain (+ Norms)

We believe that the public domain is the best means to achieve maximum interoperability of data with the lowest imposed burdens on the user. This can be achieved through the use of a legal tool – either the Creative Commons CC0 Waiver or the Public Domain Dedication and License (PDDL) – waiving all intellectual property rights and asserting that the provider makes no claims on the data. These tools put the work as closely into the public domain as possible.

It calls for data providers to waive all rights necessary for data extraction and re-use (i.e., copyright, sui generis database rights, claims of unfair competition, implied contracts). It also requires the provider place no additional obligations such as copyleft or share-alike on the information, which could limit downstream use, as discussed above.

Science Commons also crafted the Protocol for Implementing Open Access Data – a protocol for evaluating database terms of use, in hopes of providing a unified framework for users to evaluate if any given database may be integrated with any other database.

The Protocol recommends one request behaviour, such as citation, through norms and terms of use rather than as a legal requirement based on copyright or contracts.

We are aware that different disciplines and jurisdictions call for different approaches, and this is not always a one-size-fits-all solution. With requesting behaviour through norms and terms of use rather than a legal construct, various scientific disciplines have the ability to develop their own norms for citation, allowing for legal certainty without constraining one community to the norms of another.

Final Thoughts

In the early days of the World Wide Web, there weren’t many free-libre licenses available, and after a debate over using GPL for the original web code, CERN chose to put it into the public domain. Getting the law out of the way was key to allow for network effects, and to the success of the Web.

Converge on the public domain and ensure the freedom to integrate. It’s the most scalable solution.

This work is licensed under a Creative Commons Attribution 3.0 License.

Philip (Flip) Kromer talks about InfoChimps and building a data marketplace

In my latest podcast I talk with Flip Kromer, co-founder of InfoChimps.

We explore the background to InfoChimps, and discuss their aspiration to build a marketplace in which people can contribute and find data – both freely available and commercial.

Open and Closed Case

So, we’ve been banging on about opening up access to public data for a while. Talis has put its money where its mouth is and helped to fund the PDDL to give organisations a legal framework for dedicating their data to the public domain. (We’ll even host open data for free on the Platform under the Connected Commons.) We see the benefits of open data being shared innovation, and many projects exist which make use of this data for scientific, journalistic, entertaining and just plain useful purposes. We’ve been seeing a strong trend towards removing siloes and encouraging reuse of information resources to the point that we’ve begun to create our own jargon around open access. This is great, and even governments are beginning to see the benefit of this with projects like data.gov and Sir Tim Berners-Lee’s advisory appointment in the UK.

But there is an alternate side to this story of opening up and sharing our data. Where there is open, there is an implied “closed” too. Some closed data is absolutely necessary: you wouldn’t argue that your recent financial transactions are data I should have a right to pry into, to take an obvious example. There is a lot of hidden data necessary to run applications and to make a profit, and it is entirely right that this should be the case.

But a recent case here in the UK has illustrated the point that if open data encourages innovation, closing down data can quash it. The Royal Mail recently sent cease and desist letters to the directors of ernestmarples.com, who had been providing online services with a set of APIs to turn UK postcodes into location information. This provision had enabled the building of services which, for example, let people look for jobs in their area, and monitor and map political leaflet claims. The Royal Mail charges £4000 to make use of its official list of postcodes, and wasn’t happy with ernestmarples.com providing postcode data for free. (ernestmarples.com did not license the data, but scraped it from other sites, apparently.) As soon as ernestmarples.com stopped providing the lookup, all the services built on the data were stopped too. So, in effect, the data was enclosed again behind a barrier of a steep paywall and legal action.

There is a lot of discussion about whether the UK postcode data should be free anyway. It was funded by public funds, for one thing, and it only generates around £11 million annually for Royal Mail. The subscription rate is high for startups or non-profits, especially when compared with the ZIP code data in the United States, which I found out only costs $500/year to purchase. {1} It could also be argued that the steep pricing is an archaic throw-back to a time when such services cost a lot to provide, so prices needed to be high in order to recover costs. But this reverse peppercorn rent could no longer be valid, and £4000 must certainly be an order of magnitude (or two) higher than the cost of provision.

There is a lot to discuss about specific datasets like this, and they may need to be tried legally and publicly before all the details are sorted out, but this case is about as illustrative as possible of the principle of encouraging innovation. A single, simple and non-charging service provided a framework for thousands of users for mostly socially-beneficial aims. Imagine the impact if hundreds of source-services had access to postcode data? Perhaps tens of thousands of users could look for employment, or track their local governmental organisations’ progress. Who knows what else might have been developed? It doesn’t take a huge leap of imagination to envision services tailored to your very local locality, does it? Just as easily, though, the enclosure of a single database has cut off a huge network of potential innovation.

The Guardian has covered the story too, if you want more details.

Photo: “Open/Close” by mag3737 via flickr, Creative Commons License

{1} I’m not entirely sure about the licensing of the ZIP code data, but the representative I spoke with at USPS said you can purchase the 5-digit ZIP code product for $500 per year.

Britain 2.5

It’s hardly new for this blog or our community to cover issues of open access and making information useful for users. But, what if we were to begin speaking in terms such as: “A call for transparency,” or subtly replace user with citizen? With little substantive shift of core meaning, the whole message becomes one of rights, responsibilities, and public duty.

I’ve been watching this week as the ember at the heart of this dialogue has been fanned with air-time on mainstream media, and is about to receive its fuel. First, UK Prime Minister Gordon Brown asked Sir Tim Berners-Lee “to help us drive the opening up of access to Government data in the web over the coming months”, appointing him to a special role advising Parliament. In an interview with BBC tech correspondent Rory Cellan-Jones, Sir Tim discussed his position, explaining that he’s pushing for transparency: “This is our data. This is our taxpayers’ money which has created this data, so I would like to be able to see it, please.”

Sir Tim had the audience at the tech-friendly TED conference chanting “Raw Data Now” back in February, and he’s now been invited by a sitting government leader to make this happen.

This week also saw the publication of the Digital Britain report, outlining Parliament’s plans for a more connected future. I must admit, for the record, that I haven’t read all 239 pages of the report (made available via bbc.co.uk); rather, I’ve skimmed it and read several overviews. The gist seems to be that the UK plans to invest in the future of its citizens’ internet connectivity, upgrading existing infrastructure and providing access where there currently isn’t any. This investment will cover both wired broadband provision (with a stated aim of 2Mbps minimum for every household) and wireless, encouraging investment in 3G provision by allowing mobile companies to have their network licenses more permanently. It recommends subsidising development wherever the market can’t provide, seemingly equating net access with public utilities (the PM further clarified his thoughts by saying the Internet is as vital as water or gas). More information on this report can be found on the summary page at the Guardian, on twitter: hashtag #digitalbritain, and Bill Thompson’s tech-centric overview.

All this week needs is a major announcement of something moving entirely to cloud-computing to look a bit like the convergence I blogged about a few days ago ;) .

So, what has this incredible week brought us? It’s a governmental lead on opening up access to data. Their appointment of TBL makes me think that it’s likely we’ll see more and more linked-data projects coming from the public sector (not just access to, but usable, linked data). Over the next few years, the UK plans to improve its infrastructure and incentivize development on communications networks, and they’ve begun to use language suggesting that being part of the network and access to Public data are rights issues.

Sir Tim spoke, in the interview, about beginning with low-hanging fruit: pilot schemes which open up data and watch what happens.

What are you building?

Image: “Sparks”, by Steven Wong via flickr; Creative Commons By, Share Alike License

Web two dot oh plus one, in the cloud, with bells on…

The tech world is telling a story about the Web and computing, and the mainstream media seem to be catching on. They’re hearing about clouds, wikis, and the history of the World-wide Web. The whole thing reads like some sort of legend…

It was an era, long ago, when the folk of Middle Class plugged in their Mo-Dems and listened to arcane, magical sounds as their £120 beige box enabled a blazing 14.4 kb/s connection, and they only had to wait a few minutes to call forth script and from anywhere on earth. It was an age that saw the beginnings of email, where people composed messages and sent them down the phone lines at lightning speeds (unless a packet dropped…). This was the time of Web 1.0.

Then, the web collapsed. No one used the internet any more. Modems became paperweights and millions of metres of ethernet cable were grubbed up to make room for under-floor heat in offices. The world was quiet, and the people of Middle Class forgot what they knew.

Until, there dawned the advent of Web 2.0. People re-learned their former ways, and improved upon the innovations of their fathers. Instead of sites and pages, they began to use “Web Apps” which accomplished Tasks, and they became their masters. The great titan Google was made, and he knew all and directed the world toward knowledge. The elves of the web taught men the ways of blogging and messaging and eventually (when they’d mastered all these things with wiki-training to boot) Social Media and Networking.

Only, that’s not exactly how it happened, is it? Many commentators and Alpha Geeks have divided the story of the web into convenient phases, and they’ve roughly settled around a versioning metaphor common to software. Have a look at your favorite browser, and you’ll see a version number (Safari 4 for me, if you’re interested) which lets you know how many iterations have been and gone before. There are certainly noted differences, and turning points, where people phased out their dependence on one thing for the convenience and utility of something better. Tim O’Reilly, who coined the phrase Web 2.0, wrote a much-linked post in 2005 trying to explain and crystallise some of the trends he was seeing which were different from the first few years of the web. The fact that he had to clarify what he meant, and that it took the non-geek world three years to catch up, testifies to the notion that the change was gradual. It makes me think that we missed out all the .1-.10′s in the version numbers, and many alpha and beta tests along the way.

Now we are engaged in the great Web 3.0, where we are applying the logic of the past to the present and guessing at the future. Only, because no one is actually releasing versions of the web like a good, reliable software company should, the story is much more complicated—and interesting!

There are notable trends, with backers and bloggers riding various waves. But it seems to me that the point of this is a convergence. The mobile web is bringing new sorts of information to people, and they can make use of this info wherever they happen to be because of advances in devices and connectivity. As phones and web-enabled devices get better, so too do the chips we seem to have embedded all over the place, and we can now begin to have a clearer picture of what we do through the information we gather from our heaters, cars, and pedometers. Also, as more objects become connected, the grunt-work of number-crunching and storage is becoming commoditized into big, efficient, utility-like cloud services, which host and work with our collected information much more effectively than the gadget in your hand could ever hope to do. Others, like ourselves, talk about the Semantic Web, which allows for an evolution from a bunch of connected documents to the explicit connections between bits of information.

But, I see a trend there which is common to all candidates: information. The web allowed for information to be shared, then collaboratively worked. Now, I see this information becoming useful in and of itself…as data.

Walt Mossberg talks about Web 3.0 as if it is riding on the backs of mobile and connected devices. And I think it probably is. Tim Berners-Lee recently spoke to the BBC about the future of the web including some incredible future of pixels everywhere, where any surface could display information. He’s also repeatedly talked about the future of the web being semantic (he invented the term, let’s not forget) where Linked Data is the web done right. And who am I to argue with the inventor of the Web?

But I don’t think there’s so much a conflict or competition as a coming together here. If there will be a Web 3.0 (and it seems a likely, media-friendly label), I think it will include all of these trends centred around the focus of data. The connected devices allow us access to cloud-computing and storage (computing and storage of data…). Many chips gather data about ourselves, which we can use to personalise our view on the web of data, and the Linking of this data through semantics lets it all be calculable, programmable, and useful. It kind of reminds me of a computer, you know… The chips and our collective use of web applications are input and sources, and the various devices we use are displays and UI’s onto a massive, scalable CPU in the cloud. Linked Data could be the Operating System, allowing and enabling anything to be connected and programmed.

Web 3.0, to me, is a convergence of the trends, and it’s all about data. It’s not a simple story, and any convenient label is too convenient to be comprehensive, but I’m pretty sure the next things will all centre on our ability to make use of and personalise vast chunks of previously-obtuse data.

Image “#Black rain : Convergence” by FredArmitage via flickr—Creative Commons License.

Announcing the Talis Connected Commons

Here at Talis we’re very pleased and excited to be announcing a new scheme that we’re calling the Talis Connected Commons.

We’ve invested a lot of time and energy over the last few years in evangelising the importance of linked open data. Along the way we’ve funded development of open data licenses to help provide the legal framework to support open data projects, and have followed our own advice and shared data with the communities surrounding our own products. And throughout this time we’ve been hard at work not only building the Talis Platform, but also using its flexibility to re-develop our own products.

We felt it was time to start bringing those two strands together and allow other people to really start using the Platform. For a while now we’ve let a number of developers have access to the platform for the purposes of prototyping and experimentation, but we recognise that for the Platform to become a serious component in the semantic web infrastructure it needs to be offered on a more formal basis. The Talis Connected Commons scheme is the first step towards achieving this, and we think it’s a big one; not only for us, but also for the open data community in general.

True to our desire to see a truly open web of data, under the terms of the Connected Commons scheme Talis is offering free access to the Platform for the purposes of hosting public domain data. And the offer isn’t just limited to free hosting: the data access services, including access to a public SPARQL endpoint, are also freely available.

The terms of the offer are as follows: if you own, or are creating, a public domain dataset then you can store that data in the Platform as RDF, for free. We’re setting an initial cap of 50 million triples on each dataset, but that should be plenty of space in which to collect some really interesting data. To qualify for the scheme, you need to be using either the Open Data Commons Public Domain Dedication and License or the recently launched Creative Commons CC0 license to publish your data. Anyone will then be able to freely access the stored data using the Platform services, without API keys and without usage limits. This means that your data will be wrapped in a ready made API right from the start.
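As a rough illustration, the two qualifying conditions above (an accepted public domain license and the 50 million triple cap) can be written as a simple check. This is only a sketch of the terms as stated in this post; the function name and the short license labels are our own shorthand, not part of any Platform API.

```python
MAX_TRIPLES = 50_000_000          # initial cap per dataset, as stated above
ACCEPTED_LICENSES = {"PDDL", "CC0"}  # shorthand labels, chosen for this example


def qualifies(license_label, triple_count):
    """Return True if a dataset meets both stated conditions of the offer."""
    return license_label in ACCEPTED_LICENSES and triple_count <= MAX_TRIPLES


print(qualifies("CC0", 12_000_000))    # prints: True
print(qualifies("CC-BY", 12_000_000))  # prints: False
```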

The Platform API covers basic data management facilities, through to a configurable search engine and a fully compliant SPARQL endpoint. And with data being delivered in a range of formats including RDF/XML and JSON, there should be something there for everyone to get their teeth into no matter what kind of application you’re building or environment you’re working in.
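For a flavour of what querying a SPARQL endpoint looks like from code, here is a minimal Python sketch that builds (but does not send) a query request in the style of the standard SPARQL protocol, where the query travels in a `query` parameter. The endpoint URL is a placeholder and not a real Platform address, and the JSON results media type in the `Accept` header is the standard SPARQL one rather than anything confirmed for the Platform; check the Connected Commons documentation for the actual details.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint: substitute the SPARQL service URL of your own store.
ENDPOINT = "http://example.org/sparql"

# Ask for any ten triples in the dataset.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_request(endpoint, query):
    """Build an HTTP GET request carrying the query in a `query` parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for results as JSON (standard SPARQL results media type).
    return Request(url, headers={"Accept": "application/sparql-results+json"})


req = build_request(ENDPOINT, QUERY)
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the query; that step is left out here because the endpoint above is only a stand-in.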

For more information on the details of the offer visit the Connected Commons homepage. We’ve prepared a lengthy set of frequently asked questions that should hopefully clarify any other questions you might have. If not, then feel free to send in a comment and we’ll try and address your questions.