Nodalities

From Semantic Web to Web of Data
Sharing Data on the Web

This article will appear in Nodalities Magazine, Issue 9.

by Kaitlin Thaney
Program Manager of Science Commons, Creative Commons


In the emerging data web, multiple efforts are working towards the same broad goal of data sharing (e.g., the NeuroCommons, Linked Open Data, efforts of the World Wide Web Consortium), but they are still unevenly distributed. Our understanding of the legal, social and technical issues is increasing, but is still at a very early stage.

This past fall at the International Semantic Web Conference in Chantilly, VA, USA, I joined three other leading minds to lead a tutorial examining some of the legal and social frameworks for sharing data in the emerging data web, focusing on an overview of the need for access, the social issues of applying Free-Libre/Open Source (FLOSS) licenses to data, and the approach we advocate at Creative Commons to help navigate this complex space — converging on the public domain.

Lessons Learned

Creative Commons as an organisation works to make knowledge sharing easy, legal and scalable – with applications in the culture space (music, text, film, art), education (open educational resources, virtual textbooks), and science (biological materials transfer, data sharing, Open Access, semantic web, patents). We maintain an integrated approach, and craft policy and legal tools to lower the barriers to knowledge sharing.

When it comes to data sharing, first and foremost, the information needs to be legally and technically accessible. The Open Access movement has increased awareness of this, using the Creative Commons licensing suite to unlock content, and has seen its share of qualified success. But what to do when the information you want to share and reuse falls outside the protections of copyright?

In short, it’s complicated.

This is where the discussion of legal protections for data gets murky. Knowledge is not always copyrightable – it may be easy to discern the rights associated with journal articles, but what about data, ontologies, annotations, or research statements described in triples?

The emergence, adoption, and use of the free-libre/open licensing regimes has allowed for remix and reuse of software code, music, film, educational resources and scientific research in a way that otherwise would be difficult to achieve.

The success of these licensing approaches has changed the social ethos of licensing: the machinery behind the traditional “all rights reserved” model is used instead to make something more free, rather than less.

But from our research, this approach is not ideal for data. The trend towards applying licenses, click-wrap agreements and other sorts of restrictions on scientific data is increasing, but with the undesired consequence of limiting the downstream use of this information, and even at times blocking interoperability. The costs are high, the terms are not always clear, and the protections are not always legally sound, making it very difficult to scale for scientific uses. The result is a high barrier to entry for doing meaningful analysis, annotation, search, etc. on the mass of data currently available, which continues to grow exponentially, and for integrating it with the available literature.

We advocate an approach of converging on the public domain, and requesting behaviours often found in the various flavours of free and open licensing through norms – not a legal construct. But first, let’s take a look at some of the issues to be aware of and their social implications to furthering the goal of linked open data.

Attribution v. Citation

Under US Copyright law, “Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.” Since facts are not covered by copyright, attribution – a license obligation – doesn’t seem to apply to ideas or facts either, since those rights are conditional on compliance with terms of the license.

Socially, the scholarly concept of citation is fairly well understood – credit where credit is due. It has long been viewed as an entrenched norm of good scientific practice.

But when it comes to the legalities of both terms and how to enact this behaviour, the devil is in the details, and the two are actually rather different when it comes to enforceability and applications / ramifications in the digital world.

In a copyright license, the word “attribution” is a legal requirement, whereas citation evokes more of a club mentality and social practice. Citation in its sole form is not assured or enforceable in the same way, but that’s not necessarily a downside. Ask yourself this: which one is more important – legal enforcement or credit enforced through professional reputation? Attribution – a relatively narrow legal term that can affect interoperability while at the same time possibly failing to provide what you really want? Or citation – an entrenched scientific norm that asks for credit where credit is due?

Implications of FLOSS toggles and directives on data sharing

These issues emerge when instead of focusing on maximizing interoperability of resources, one applies a property metaphor to data. And in the digital world, that tendency can have quite limiting ramifications to future use of the information, as technology continues to outpace the social components to data sharing.

Misunderstanding the legalities can lead to category errors on the social level, including unintentional infringement or on the other side of the spectrum, choosing not to use the resource for fear of infringement. The intentions are often good – believing that applying a less-restrictive copyright license is ensuring the data can be freely shared, reused, and built upon. But without existing precedent or involving a legal team, these issues make for a problematic area to navigate, creating additional confusion and burdens for the users, as well as data providers.

Let’s look at a few examples to gain a better understanding.

Non-Commercial – When used in the context of data, what is a commercial use of the data web? Is it the extraction of a subset, a query that may touch on the data set, hyperlinking?

Attribution – As detailed above, the definitions of attribution and citation are often conflated. Attribution speaks to the legal requirement triggered by the use of the work. But in the case of linked open data, if one were to run a query involving 30,000 data sources (something that is happening every day at an ever decreasing cost), would they then be required to attribute the contributors for all 30,000 databases? You can see how this unintended consequence of attribution stacking could impose a very daunting task for the researcher.

Share-Alike – This toggle specifies that any derivative product be relicensed under the same terms. In the example above of running a large query, all it would take would be one database licensed with a share-alike provision for the entire derivative work to then be under the same terms and no other license. This leads to compatibility issues.
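To make the attribution-stacking and share-alike examples above concrete, here is a minimal Python sketch. It is an illustration only, not a legal model: the `Source` class, the `summarise` function, and the `"Terms-A"` label are all hypothetical names made up for this example, and real license terms are far more nuanced than two fields can capture.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Source:
    """One database touched by a query (all fields are illustrative)."""
    name: str
    attribution: bool = False          # downstream users must credit this source
    share_alike: Optional[str] = None  # terms any derivative must carry, if any


def summarise(sources: Iterable[Source]) -> dict:
    """Collect the obligations a result derived from all sources would inherit."""
    sources = list(sources)
    credits = [s.name for s in sources if s.attribution]
    required_terms = sorted({s.share_alike for s in sources if s.share_alike})
    return {
        "credits_required": len(credits),
        "result_must_be_licensed_as": required_terms,
        # Two different share-alike terms cannot both be the only license.
        "compatible": len(required_terms) <= 1,
    }


# A query over 30,000 databases: all ask for attribution, one is share-alike.
sources = [Source(f"db{i}", attribution=True) for i in range(29_999)]
sources.append(Source("db-sa", attribution=True, share_alike="Terms-A"))
print(summarise(sources))
```

Under these assumptions, a single share-alike source is enough to dictate the terms of the whole result, and adding a second source with different share-alike terms makes the combination incompatible. A set of sources with no attribution or share-alike requirements, by contrast, yields no stacked obligations at all.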

There are other external mechanisms and limitations imposed by various jurisdictions and countries that can have a profound effect on data-sharing, especially in terms of international data sharing efforts. These include the sui generis database directive in the European Union, Crown Copyright, “sweat of the brow” and “industrious collection” limitations, trade secrets and unfair competition laws, adding another dimension of complexity to an already complex arena.

After convening a series of meetings, roundtables and other discussions with members of the scientific community, the need emerged for a legally accurate and simple solution, that reduced and/or eliminated the need for one to make the distinction of what’s protected. The conflict between understanding the legal issues and complexities can best be resolved by a two-fold approach: (1) a reconstruction of the public domain and (2) the use of scientific norms to request behaviour through a non-license means.

Converging on the Public Domain (+ Norms)

We believe that the public domain is the best means to achieve maximum interoperability of data with the lowest imposed burdens on the user. This can be achieved through the use of a legal tool – either the Creative Commons CC0 Waiver or the Public Domain Dedication and License (PDDL) – waiving all intellectual property rights and asserting that the provider makes no claims on the data. These tools put the work as closely into the public domain as possible.

This approach calls for data providers to waive all rights necessary for data extraction and re-use (e.g., copyright, sui generis database rights, claims of unfair competition, implied contracts). It also requires that the provider place no additional obligations such as copyleft or share-alike on the information, which could limit downstream use, as discussed above.

Science Commons also crafted the Protocol for Implementing Open Access Data – a protocol for evaluating database terms of use, in hopes of providing a unified framework for users to evaluate if any given database may be integrated with any other database.

The Protocol recommends one request behaviour, such as citation, through norms and terms of use rather than as a legal requirement based on copyright or contracts.

We are aware that different disciplines and jurisdictions call for different approaches, and this is not always a one-size-fits-all solution. By requesting behaviour through norms and terms of use rather than a legal construct, various scientific disciplines have the ability to develop their own norms for citation, allowing for legal certainty without constraining one community to the norms of another.

Final Thoughts

In the early days of the World Wide Web, there weren’t many free-libre licenses available, and after a debate over using GPL for the original web code, CERN chose to put it into the public domain. Getting the law out of the way was key to allow for network effects, and to the success of the Web.

Converge on the public domain and ensure the freedom to integrate. It’s the most scalable solution.

This work is licensed under a Creative Commons Attribution 3.0 License.


Martin Belam Talks with Talis

In this Nodalities Podcast, I talk with blogger and Guardian information architect Martin Belam. I’ve run into Martin at a few Linked Data events where the news and media industries have had a high profile (including the recent News Media Summit, and News Innovation conference last year). Martin has an interest in Linked Data, and an interesting perspective on where it fits in with News, both as a tool for journalism and research and as a resource for the industry.

Also mentioned:
Guardian Open Platform

 

We’re excited

The Talis offices, for the past few weeks, have been awash with geeky excitement, the kind of near-giddy excitement that comes with eager expectation. We’ve all been waiting for something important.

For some, this was no doubt augmented with the announcement of Steve’s new iPad; but that’s not what’s gotten us all worked up.

For months, we’ve been looking forward to the launch of data.gov.uk; and last week, the wraps finally came off. The official press release put it:

A major new website has been launched to the public which gives anyone who wants to use it unprecedented and free access to government data in one place.

This doesn’t quite capture the coolness of the launch, for me. Yes, it’s a major new website, and its point is to publish information. But the exciting thing is that this information is being published as data: data that can be used, reused, remixed and enriched. Sir Tim Berners-Lee’s perspective was more exciting:

Making public data available for re-use is about increasing accountability and transparency and letting people create new, innovative ways of using it. Government data should be a public resource. By releasing it, we can unlock new ideas for delivering public services, help communities and society work better, and let talented entrepreneurs and engineers create new businesses and services.

The point is that this public resource is finally getting a home on the web, and an infrastructure to make it not just available, but useful.

The exceptional team behind data.gov.uk have striven to adhere to web standards in its production: including Linked Data as a priority, as Professor Nigel Shadbolt explained:

We are also going to increase the use of ‘Linked Data’ standards, which allows people to provide data in a way that is as flexible and easy-to-use as possible.

Back in November, Leigh Dodds wrote a post explaining how we’ve been involved, and there’s an official Talis Platform press release too. Basically, we’ve been working with the data.gov.uk team to help with the Linked Data part of the site—hosting the SPARQL endpoints and providing consultancy and training, for example.

I can confidently say that we’re very proud of data.gov.uk, the team behind it, and our involvement with it. We’re excited by the prospect of this data being used as raw material for clever people to make interesting, useful, even world-changing things with it. We’ve seen the beginnings and proof-of-concept projects already.

Now comes the really exciting stuff. What are you going to build?

Image: “Yay for happy days!” by le vent le cri via flickr (CC: By)

In conversation with Conrad Wolfram

The subject of this Talking with Talis Nodalities Podcast is Conrad Wolfram, founder and Managing Director of Wolfram Research Europe. He is also Strategic and International Director for Wolfram Research, the organisation founded by his brother Stephen, and responsible for Mathematica software and the WolframAlpha Knowledge Engine.

In our wide-ranging conversation we look at Conrad’s career, the evolution of Wolfram Research and its role in introducing wider access to computational functionality. He takes us through the creation of Mathematica by Stephen and the building of a company based upon maths.

We move on to discuss the WolframAlpha Knowledge Engine, which is built upon Mathematica technology, and how it fits both into the online world and the Wolfram strategy. We close having discussed many issues relevant to the evolution and future of the web.

Photo Copyright © 2009, Conrad Wolfram.

 

Philip (Flip) Kromer talks about InfoChimps and building a data marketplace

In my latest podcast I talk with Flip Kromer, co-founder of InfoChimps.

We explore the background to InfoChimps, and discuss their aspiration to build a marketplace in which people can contribute and find data – both freely available and commercial.

 

Felix Van de Maele talks about Collibra

In my latest podcast I talk with Felix Van de Maele, CEO of Belgian semantic technology company Collibra.

We discuss Collibra, and the problems that many enterprises face in understanding and integrating data held in diverse silos.

 

The Semantic Web and Linked Data – In Action

Following the Online Information Conference 2009, at which I demonstrated live examples of Linked Data in action, I have been asked several times if my presentation had been videoed.

Unfortunately it had not. So I have tried to recreate the presentation, if not the atmosphere, by recording this screencast. In an attempt to find a quiet, uninterrupted environment, I recorded this early on a Sunday morning. I hope therefore you will forgive the odd clink of the first coffee cup of the day.

A slides version of the presentation is also available on SlideShare.

 

Linked Data In Action at Online Information 2009

Today I had the pleasure of delivering a presentation in the Semantic Web track at Online Information in London. Sharing the stage with David Pullinger, Head of Digital Policy, COI, UK Government; John Sheridan, Head of e-Services and Strategy, Office of Public Sector Information; and Eero Hyvonen, Professor, Semantic Media Technology, Helsinki University of Technology, Finland, in a session that followed on from a keynote by Talis CTO Ian Davis, it was my challenge to ‘demonstrate’ Linked Data in action.

After a few opening scene-setting slides I took my life in my hands and opened up a selection of web browsers to embark on a tour of live examples of Linked Data in action, many of which are underpinned by data in Talis Platform stores. Fortunately my 3G connection held up for the duration, and I got to the end without any long waits watching spinning web browser icons.

For those that asked, here is a SlideShare version of my presentation, including screenshots of what I showed, and links to some of the sites I visited.



 

Update: The FanHubz application merging Twitter and BBC programmes data, shown in Ian Davis’ keynote earlier in the day, can be found here: fanhu.bz

Dame Wendy Hall – The Semantic Web Revolution

For this final podcast in the Online Information Conference 2009 series, I caught up with Dame Wendy Hall in a very echoey room at the Royal Society in London.

Wendy is sharing the opening keynote session on the first day of the conference with Nigel Shadbolt.  Their session has the title The Semantic Web Revolution – Unleashing the world’s most valuable information.  This promises to be a great session to catch with a combination of slide-backed presentation, conversation, and Q&A.

In this conversation with Wendy, we explore the approaching step-change evolution that Semantic Web technologies, specifically Linked Data, will bring to the online world over the next few years. It is an evolution of the web that is already in place, not the totally disruptive revolution that the web itself was. Wendy makes it clear that, as with the document web of the mid-nineties prior to the emergence of Google for instance, it is difficult to predict how the data web will turn out and what tools, services, and business models will emerge. Nevertheless, the opportunities for innovation, building on the unleashing of distributed yet interlinked data, will be massive.

An interesting insight into the thinking that will underpin what promises to be a great opening session to an equally good conference.

 

Microsoft Bing’s Antonio Gulli Talks with Talis

Antonio has been working in the area of Web search for eleven years. He has recently joined Microsoft as principal developer manager based in London.

Presenting in the session The realtime web: Discovery vs. Search on the first day of the Online Information 2009 conference, Antonio brings an insight into the challenges of realtime search. Traditional search systems utilise pointers between pages to ascertain relevance and importance. In a realtime environment, those references will not yet have been created. News items from the BBC are obviously of high reputation, but are they as important as the local paper when you are in Iowa?

How do you calculate the relevance of images, or videos, or a stream of information from an event such as an earthquake? How do you calculate the importance of various social services such as Facebook and Twitter? Without giving away the secret recipe behind the way Bing approaches this, his explanations set the scene for these very real challenges.

 