Nodalities

From Semantic Web to Web of Data

License

Creative Commons License

Talis Sponsor Pan-European Open Data Challenge

We are proud to be a Lead Sponsor for the Open Data Challenge being coordinated by Jonathan Gray from the Open Knowledge Foundation and Paul Meller from the Open Forum Academy, under the auspices of the Share PSI initiative.

This is a significant competition, with significant prizes totalling €20,000 for ideas, applications, visualisations and datasets – up to €5,000! 

As you would expect from a Talis-sponsored competition, Linked Data features in the line-up of attributes that entrants should be considering. Following the 5 Star Data principles espoused by Sir Tim Berners-Lee, the more machine-readable, non-proprietary and linked Open Data can be, the lower the barrier to its innovative use. This is especially true in the area of Public Sector Information, where similar or associated data is being published by several organisations or governments. In recognition of this we are, as part of our sponsorship, backing the Talis Award for Linked Data – €1,000 presented for the best use of Linked Data in any of the competition categories.

The competition will run for 60 days, so get your ideas flowing, and developers’ fingers rattling over those keys.

Watch out for a later post, when I will  identify some Linked Open Data that is already available that you could use to build an entry.

Are We Getting A Right to Data?

Friday night – nothing on the TV – I know! I’ll browse through the Protection of Freedoms Bill, currently passing through the UK Parliament. Sad I know, but interesting.

Let’s scroll back in time a bit to November 19th 2010 and a government press conference introduced by a video from Prime Minister David Cameron. The headline story was about the publishing of government spending and contract data, but towards the end of this 109 second short he said the following:

… the most exciting is a new right to data. Which will let people request streams of government information and use it for social or commercial purposes.  Take all this together and we really can make this one of the most open, accountable and transparent governments there is.  Let me end by saying this. You are going to have so much information about what we do, how much of your money we spend doing it, and what the outcome is.  So use it, exploit it, hold us to account.  Together we can set a great example of what a modern democracy ought to look like. (my emphasis)

Obviously to realise this Right to Data there needs to be some legislation, which brings me to the Protection of Freedoms Bill. This is one of those bills which covers all sorts of issues, from rules for destruction of fingerprints and DNA profiles, CCTV camera regulations, detention of terrorist suspects, to freedom of information and data protection. Zooming in on the bits on the topic of the release and publication of datasets held by public authorities, we find a set of clauses that amend the Freedom of Information Act 2000.

Re-use

After some amendments which allow for datasets and provision in electronic form we get this: “the public authority must, so far as reasonably practicable, provide the information to the applicant in an electronic form which is capable of re-use.” Unfortunately there is no definition of the term re-use. It could be argued that a pdf of some tables in a MS Word document could be re-used, whereas I believe the spirit of the legislation should be made more explicit by identifying non-proprietary data formats. I know this would be a tricky job for the parliamentary draftsmen: we would not want to restrict it to formats such as XML and csv, which could age and be replaced by something better that then could not be used because it had not been mentioned in the legislation. Even so, I believe that just using the term ‘re-use’ is far too woolly and open to [mis]interpretation.

What is [not] a dataset

This is one of the areas that raises most concern for me. Check out this wording from the Bill: [Image: extract of the Bill’s text]

I am OK with (a) – data collected as part of an authority doing its job – and (c) – don’t change the data you have collected – publishing that raw data is important. However (b) specifically excludes data that is the product of analysis. Presumably analysis of collected data is one significant way that an authority measures the outcomes of its efforts. Understanding that analysis will help us understand the subsequent decisions and actions they make and take. I assume that there may be some specific reasons that underpin this blanket exclusion of analysis data. If there are, they should be identified, instead of generally throttling the output of useful data that will go a long way to helping with Mr Cameron’s stated ambition for us to be able to see “what the outcome is” of the spending of public money.

Release of datasets for re-use

This is a whole new section (11A) to be added to the 2000 Act to cover the release of datasets. It covers ownership, copyright, and/or database right of the information to be published and states that it should be published under “the licence specified by the Secretary of State in a code of practice issued under section 45”. Section 45 basically puts into the hands of the Secretary of State the definition of the licence(s) data should be published under. As of today the Open Government Licence for public sector information is what is wanted to keep the publishing of information open. However, what is there to stop a future Secretary of State, who has a less open outlook, from replacing it with far more restrictive licences? Do we not need some form of presumption of openness attached to the Secretary of State’s powers as part of this change in legislation?

On the topic of presumptions of openness, the wording of this bill contains phrases such as “unless the authority is satisfied that it is not appropriate for the dataset to be published” and “where reasonably practicable”.  It is clear that many in the public sector are not as enthusiastic about publishing data as the current government position and such vague phrases as these may well be unreasonably used by some in justifying a throttling of the stream of information.   They could easily be used to build in a bureaucratic decision hurdle for each dataset to have to jump, proving its appropriateness and practicality, before publication.  I am sure that it would not be beyond a parliamentary draftsman’s skill to produce wording that means that all will be published, unless a specific objection is raised for an individual dataset, for reasons of excessive effort or data protection reasons.

Up-dated data

Data published by an authority should be published under a scheme, and the following applies here: [Image: extract from the Protection of Freedoms Bill (HC Bill 146)]

How should we interpret “any up-dated version held by the authority of such a dataset”? My interpretation is that once a dataset has been published it shall continue to be published as it changes. The precedent for this is spending data – having published authority spending for January 2011, authorities should be automatically publishing it for February and following months. But what if, in response to a request, an authority publishes the contents of a spreadsheet used to track the amount of salt applied to roads in its area during winter 2010-11 and then uses a different spreadsheet for the following winter? Does the output of that new spreadsheet constitute a new dataset, or an up-date to its predecessor? From the wording in the Bill it is not clear.

Who does it cover?

I probably need a bit of help here from those that understand the public sector better than I do, but I am suspicious that references to the organisations listed in Schedule 1 and “the wider public sector”, do not take the net wide enough to cover some of the data that is relevant to our daily lives but is delivered on behalf of some authorities by third parties.  For example I am aware that recently a large city was not able to inform citizens of their rubbish collection schedules because that data was considered as commercially restricted by their service provider.

 

So in summary, I welcome the commitment to a right to data being realised by streams of government information about what we do, how much of our money is spent doing it, and what the outcomes are. However, I am sceptical as to how effective the measures in the current Protection of Freedoms Bill will be in delivering them. Especially in the light of very recent comments made by the Prime Minister highlighting the "enemies of enterprise" in Whitehall and town halls across the country, attacking what he called the "mad" bureaucracy that holds back entrepreneurs. Those enemies are just the people who might take the wording of this bill as ammunition in their cause.

Whilst being concerned about this topic, I have been wondering why few are commenting on it. Are the majority just taking the press conference statements by David Cameron, and his fellow Ministers, as indications of a battle won, or am I missing something? I promote Sir Tim Berners-Lee’s 5 Star Data as the steps towards a Web of Linked Data – if we don’t get the publishing of public sector data to at least 3 star standard (Available as machine-readable structured data – in non-proprietary format), many of the current ambitions may remain just that, ambitions. That would be a massive missed opportunity.

So are we getting a right to data? – or just some provisions to extend the Freedom of Information Act a bit further in the dataset direction?  I’m not sure.

Personal note: As you may tell from the above, I am no expert on the interpretation of parliamentary legislation, and I have left several unanswered questions hanging in this post.  Any help in clarifying my thinking, confirming or disproving my assumptions, or answering some of those questions, will be gratefully received in comments to this post or your own posted thoughts.

A Year of Open Government Data: Transparency, but also Innovation

Towards the end of 2010, Wikileaks generates many headlines as it publishes information on the web, causing controversy and leading to talk about politicians hiding information from the public. Reporters and commentators express shock or admiration when telling the story of a rogue organisation making governmental information public. What has not been as mainstream is that for the past year or more, governments around the world have been doing something very similar themselves: publishing information online.

Big names like President Obama, Sir Tim Berners-Lee and the headliners at big events like the International Open Government Data Conference favour publishing public data for transparency and benefits to society. This all finally began to take off in 2010. Governments from around the world have been developing their public information strategies, with the launches of data.gov and data.gov.uk and data.govt.nz.

This is all taking place at a time of economic restraint. Dr Martin Read from the UK Cabinet Office’s Efficiency Reform Board explained in a recent interview: “If you are going to improve the efficiency of something, making that change involves risk and innovation  … If they get it wrong, they’re hauled up in front of a committee for interrogation.” (moderngov, November 2010) It may seem tricky to justify the expense of big projects like data.gov.uk, and there certainly seems to be a huge amount of pressure.

Nevertheless, governments are proving themselves committed to prioritising data publishing. Towards the end of last year, the UK Prime Minister announced that every item of governmental spending over £25,000 will be published online, and updated monthly. He emphasised the importance of this publication in terms of transparency, inviting the public to scrutinise the data. Interestingly, he also said: “This scrutiny will act as a powerful straightjacket on spending, saving us a lot of money.” So, not only is data publishing seen as a benefit to democracy, but also as a useful way to “flag up waste”.

While that press conference was taking place, developers and civil servants were gathered together elsewhere at the Open Government Data Camp (disclosure, Talis was a sponsor). At the event, much was made of the modelling and tools which have been developed with open data in mind: particularly the Linked Data API, which allows developers from just about any web background to work with data.gov.uk’s data very quickly. Visualisations demonstrated what can be done with well-structured data.

One of the things this high-level data publishing has done is raise the standard for what can be published and developed. Last year, we built a proof-of-concept app for the Department for Business, Innovation and Skills (BIS) to illustrate the potential of applications of this data. A few minutes spent on DEFRA’s UK Climate Projections site shows what can happen when raw data is matched with a plan, and is designed with a citizen in mind. Anyone can check the primary source for their government’s climate policy, and it doesn’t take a climatologist to understand it. A little further development allows fully-fledged applications to be built that are instantly useful: one available on the front page of data.gov.uk lets me download an app that helps me plan my cycle route!

Open government data is probably good for transparency. But it’s also got plenty of potential to seed ideas that add value to this information. Innovators know that there are more people with better ideas outside our organisations than could possibly be in them, so sharing means that those ideas can be developed into products and services that are mutually beneficial to everyone. The web industry routinely works with open-source software that’s been at least partly built by others, and this open-source mentality might just be an incredibly useful piece in the public-sector machinery. Open business models work very well with ideas.

2011 promises to be the year when all this data gets put to use. I was recently invited to a press conference at which the Deputy Prime Minister confirmed the UK’s commitment to published data as a priority and even a recognised civil liberty. The story will shift to more local applications of big public data tools. January will see the publication of local authorities’ spending data, and public bodies will be looking to add value to this data, bringing the headlines of open data to life in the places we live.

With a bit of thought into how data is published in the first place, and a plan for encouraging people with good ideas to work with this information, this investment in data publishing could be more than just a tick-box exercise for a political transparency agenda. I hope that this year, it won’t be Wikileaks-level events that get people talking about open data publishing. We should notice it improving services we use, and see whole new applications for the bits and pieces of information that make up our public lives.

Linked Data – Coming Together

To quote John ‘Hannibal’ Smith, from that wonderful bit of 1980s TV, “I love it when a plan comes together!”. Of course aficionados of the A-Team will probably remember ‘the plan’ was often only apparent in retrospect, although its general intention was clear from the start.

The adoption of Linked Data, and the realisation of all that potential benefit, is looking a bit like an A-Team episode – the eventual outcome being clear from the start, but with many setbacks, skirmishes to fight, partners to woo, nerves to calm, and teams to lead on the way.

To break the metaphor at this point, I see Linked Data as more of a shared vision than a plan laid out before us. Nevertheless, I think we are starting to see elements of it ‘coming together’.

One very obvious example, is what Ordnance Survey is doing by continuing to open up their location data.  Now that OS have defined a URI for every UK postcode unit [eg. ‘SO16 4GU’ = http://data.ordnancesurvey.co.uk/id/postcodeunit/SO164GU], why would anyone [re-]publishing data in the future not use these identifiers to reference their postcode information?  By that simple step they will be linked in with a wealth of ancillary information about the location – easting/northing, ward, district, county, country, etc.
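As a minimal sketch of how a publisher might adopt these identifiers, the Python below turns a postcode string into a URI following the pattern quoted above. Only the base URI and the `SO16 4GU` example come from the text; the function name `postcode_uri` and the deliberately loose validation rule are assumptions made for illustration, not an Ordnance Survey specification.

```python
import re

# Base URI pattern as quoted in the post.
OS_POSTCODE_BASE = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"


def postcode_uri(postcode: str) -> str:
    """Build an OS-style postcode unit URI from a postcode such as 'SO16 4GU'.

    Hypothetical helper: it strips whitespace and upper-cases the value.
    The length/character check is a loose assumption, not a full validator.
    """
    compact = re.sub(r"\s+", "", postcode).upper()
    if not re.fullmatch(r"[A-Z0-9]{5,7}", compact):
        raise ValueError(f"does not look like a postcode unit: {postcode!r}")
    return OS_POSTCODE_BASE + compact


print(postcode_uri("SO16 4GU"))
```

Storing the URI rather than the bare postcode string is what lets other datasets, and the OS data itself, be joined to the record later without any further matching work.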

Great, I hear you say, but show me an example of what that could lead to! Being lazy, I’ll let the inimitable John Goodwin of the OS do it for me. In his recent, appropriately named “So what can I do with the new Ordnance Survey Linked Data?” post, he shows how, by merging in data from a previous Talis project, produced for the Department for Business, Innovation and Skills, he can deliver a very different way of accessing the same data.

The BIS Research Funding Explorer project took data about UK Government research funding, from several research councils and the Intellectual Property Office, and brought it together in a Linked Data driven application to display UK centres of research excellence.

John explains how, by mixing Linked Data published for that project with OS Linked Data, he has been able to develop a different way of accessing the data. In his prototype application you are presented with a map of the UK showing the regions as defined by the European Union. By clicking on one of the EU regions you are presented with a list of the projects from within that area. He has also added the ability to access by county or District/Unitary Authority. A simple, but effective, way of demonstrating that data, in Linked Data form, from one source can be easily combined with data from another source to deliver benefit.
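The idea of combining two sources through shared identifiers can be sketched in a few lines of Python. This is not John's application or the real data: every record, project name, district and region below is a made-up placeholder, and `projects_by_area` is a hypothetical helper. The only point illustrated is that, because both sources use the same postcode URI, the join needs no fuzzy matching.

```python
from collections import defaultdict

PC = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"

# Source 1 (placeholder data): funded projects, each located by postcode URI.
funding_records = [
    {"project": "Example project A", "postcode_uri": PC + "SO164GU"},
    {"project": "Example project B", "postcode_uri": PC + "SO164GU"},
    {"project": "Example project C", "postcode_uri": PC + "ZZ991ZZ"},
]

# Source 2 (placeholder data): location facts keyed by the same URIs.
location_records = {
    PC + "SO164GU": {"district": "Example District 1", "region": "Example Region X"},
    PC + "ZZ991ZZ": {"district": "Example District 2", "region": "Example Region Y"},
}


def projects_by_area(funding, locations, level):
    """Group project names by an area level ('district' or 'region')."""
    grouped = defaultdict(list)
    for record in funding:
        place = locations.get(record["postcode_uri"])
        if place is None:
            continue  # no location data published for this identifier
        grouped[place[level]].append(record["project"])
    return dict(grouped)


print(projects_by_area(funding_records, location_records, "region"))
```

Switching the `level` argument from `"region"` to `"district"` gives the other view of the same projects, which mirrors the access by region, county or district described above.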

Of course even with this example we are seeing the effect of joining just a couple of jigsaw pieces together.  With Linked Data, such as this from OS, being published at an ever increasing rate, it will not be long before a bigger picture starts to form as more and more data pieces are linked together.

I love it when you can see a plan coming together!

Focus on Local Government Spending

The UK Government Transparency agenda is encouraging Local Government as well as National Government to publish its data as Open Data and Linked Data, reflecting the world leading progress that data.gov.uk has made on these fronts over the last year and a bit.

I am sat in the opening session of the Socitm 2010 conference, in sunny Brighton, whilst writing this. Already it is clear that local government spending is a major issue for the sector. In its broad sense, of how much local authorities can [or cannot] spend, it is providing the background for the whole conference. Not doom and gloom here though. That IT could be seen as a knight in shining armour to help the public sector deliver better services was the encouraging thought proffered by Louisa Preston as she launched the day. In its more narrow sense, the requirement to publish data about all local government spending items over £500 from January 2011 onwards, it gives a focused example of the opportunity for a significant change in thinking and practice by the sector.

As Nodalities readers are well aware, Linked Data tools, techniques, and technologies have massive potential to simplify the publishing, linking, aggregating, and making data work across a web of data.  It is no coincidence that data.gov.uk is making steady valuable progress publishing key data sets in linked data form in the Talis Platform – it is an obvious step.  For many in local government, linked data is something they have never met before.   For them the, traditionally unnatural, step of openly publishing what in the past would have been a private report out of the back of their finance system, is a significant step in itself.

It is the responsibility of those of us, who understand the benefits of taking the extra step beyond just publishing a simple csv file to publish in Linked data form, to make it easy for all authorities to understand and take the combined step of publishing Linked Data from the start.

To that end, we at Talis recently announced a free stores offer for all UK local authorities to publish their spending data as Linked Data.

Traditionally our approach would be to host a free open day to help those in local government understand Linked Data and the benefits to them. Recognising the broader economic climate, and its influence on local government spending in that broader sense, that doesn’t seem to be a good idea.

Many organisations in the sector, not least Socitm (there is a Linked Data session at the conference today) and the Local Government Group, are looking to promote this approach. We are therefore going to work with the sector to promote this message.

To that end we are to participate in the Open Data strand of the free Local by Social online conference, 3 – 9 November, being hosted by LGID.

As well as checking out what looks to be a quality online event, stay tuned to the Talis initiatives in this area.

Linked Open Data and Pavlova

If Sir Tim Berners-Lee can equate Linked Data with a packet of crisps/potato chips, I thought I would take a stab at another food metaphor for this post.

Linked Open Data (LOD) is a concept that many believe they understand. Take yourself to almost any conference that has a connection with data, or the web, or the Internet at the moment, and it will not be long before you see a slide of the Linked Open Data cloud diagram, or of Sir Tim imploring us to give him our raw data now, or if you are very lucky a shot of him doing his imploring whilst stood in front of a shot of the LOD cloud. Simple really: just publish your data as Linked Open Data and all will be wonderful as we move towards the sunlit Semantic Web uplands. Unfortunately life is never that simple – LOD is not a single identifiable thing. As Paul Walk eloquently puts it:

  1. data can be open, while not being linked
  2. data can be linked, while not being open
  3. data which is both open and linked is increasingly viable
  4. the Semantic Web can only function with data which is both open and linked

As with any recipe for success, the majority concentrate on the final result, praising or criticising it as a whole, without identifying the benefits, or otherwise, of the individual ingredients. Take a strawberry pavlova for instance. If you are into that kind of thing, it is a delightful culmination of the culinary arts designed to send your taste buds into raptures. Unless, that is, you don’t like cream, or you don’t like strawberries, or can’t abide meringue, in which case the whole thing seems a little pointless.

What has this got to do with Linked Open Data (LOD), I hear you ask.  Well, I am increasingly seeing LOD being presented as the goal for those wishing to publish their data on line.  My position is that the eventual goal, from which will spring a Semantic Web, is a global web of linked and open data. However, there are many steps from where we are now to achieving that goal.  Within audiences that I present to, and/or sit amongst, I see people who for whatever reasons do not ‘get’ one or more of the components of LOD – they cannot envisage opening up any of their data, or think that using a web address for an identifier is over complex, or have a religious aversion to RDF.  As a result they dismiss the whole recipe as not for them, or worse still, as something impractical that will become nothing more than the plaything of a few passionate enthusiasts.

When someone who is still struggling with the concept of opening up their organisation’s data, or with why RDF might be a more useful format than csv, is shown the ubiquitous Linked Open Data cloud diagram with encouragement to join in – it is hardly surprising they remain a little unconvinced. This isn’t a criticism of presenters either. In only 20 minutes on a stage, it is difficult to go into underlying detail.

Let me try in a few paragraphs to break the LOD pavlova into its ingredients:

  • Data – In the context of this post, by data I mean machine readable information, produced in a format that can be consumed and processed by other machines. Inevitably, this means file formats such as csv, XML, RDF, etc., but not something like pdf, html, or Word, which, although transferable, are designed for human consumption, not machine analysis.

    For some, just this step from their current human targeted format, to a machine readable one, is a significant one.

  • Open Data  – Data (see above) which is accessible for all to download, view, and consume in a way that is not encumbered by licensing that restricts its use.  For example, the licensing used by data.gov.uk data.  By definition data which is restricted for certain uses is not fully open.  

    In our internet based world, openness can also be defined in terms of technical accessibility.  If it is only available after a login process, or it is only available to users behind a firewall, it couldn’t be considered as open. 

  • Linked Data – Data (see above) which contains URIs as identifiers for concepts described in the data and URIs to identify the relationships between those concepts.  The four Linked Data Principles, as published as a design note by Tim Berners-Lee, provide a bit more detail on this.

    I am in danger of stirring the embers of a religious fire fight here, between those that believe that Linked Data must be described in RDF and contain URIs as identifiers, and those that maintain that you can have data linked across the web without those constraints.  All I am going to say on that at this time, is that the Linked Open Data cloud of data sets has been successful, based on the first of those two views. (if you want to follow that particular debate in more detail, Paul Miller’s post and associated comments would be a good starting point)

So, how can data be open, but not linked? – by publishing it in a non-Linked Data form such as a text file or a html page or a pdf. Where would you find this? – all over the web. As encouraged by Sir Tim’s call to give us your raw data now, and as I detailed in my previous “data publishing three-step” post, this is often the first element of getting your data out there for others to consume.

How can data be Linked but not open? – by publishing it in accordance with the principles, in RDF, with URIs, but restricting access either by imposing restrictive licensing conditions or restricting access to the data.  Where would you find this? – again all over the web, but often hiding behind restrictive licensing terms such as “non-commercial use only”.  Also to be found inside organisational firewalls.  For example, commercial organisations can realise the benefits of  using Linked Data techniques with their internal private data.  Potentially linking it to publicly visible concepts across the web to add even more value for their employees.

Data that is Linked and Open, like that strawberry pavlova, has the power to deliver value beyond the sum of its individual ingredients. Providing data in a form that is linked to other data, and easy for others to link to, without restrictions on who does that linking or how, provides the foundation for a web of linked data built on the same principles that fostered the growth of the web of documents that has so changed our world over the last decade and a half.
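For readers who think more easily in code, the distinctions drawn above can be written down as a rough Python sketch. It is only one reading of the definitions in this post: the `Dataset` fields, the set of machine-readable formats and the function names are all assumptions chosen for illustration, and real licensing and access questions are rarely this binary.

```python
from dataclasses import dataclass

# Formats the post treats as machine readable (an illustrative, not exhaustive, set).
MACHINE_READABLE = {"csv", "xml", "rdf"}


@dataclass
class Dataset:
    file_format: str                # e.g. "csv", "rdf", "pdf"
    restrictive_licence: bool       # e.g. "non-commercial use only"
    behind_login_or_firewall: bool  # technical barrier to access
    uses_uris: bool                 # URIs identify concepts and relationships


def is_open(ds: Dataset) -> bool:
    """Open: no restrictive licence and no technical access barrier."""
    return not ds.restrictive_licence and not ds.behind_login_or_firewall


def is_linked(ds: Dataset) -> bool:
    """Linked: machine-readable data that uses URIs as identifiers."""
    return ds.file_format.lower() in MACHINE_READABLE and ds.uses_uris


def describe(ds: Dataset) -> str:
    linked, open_ = is_linked(ds), is_open(ds)
    if linked and open_:
        return "linked and open"
    if open_:
        return "open, not linked"
    if linked:
        return "linked, not open"
    return "neither linked nor open"


print(describe(Dataset("csv", False, False, False)))
print(describe(Dataset("rdf", True, False, True)))
print(describe(Dataset("rdf", False, False, True)))
```

The three examples correspond to a freely downloadable csv file, RDF published under restrictive terms, and RDF published without restrictions.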

The ingredients that formed that World Wide Web of documents – html, http, open publishing of web sites without restrictions on other’s abilities to consume and/or link to them – individually  were important developments.  However, when those elements were blended together their effects were multiplied many fold and resulted in the web we experience today. 

So [as I stretch my culinary metaphor to its limits] if you are hoping to take people with you in building a Linked Open Data future, you not only have to show them a picture of the final dish, you need to describe the individual ingredients and their relevance to the eventual result.

Pictures from Flickr by PhOtOnQuAnTiQuE and avixyz

Wikileaks and the Guardian

I spoke with the Guardian’s Simon Rogers, editor of the Data Blog, about their decision to publish thousands of facts from the Wikileaks Afghan War Diary. In this podcast, Simon introduces Wikileaks and its use by journalists, and reiterates the Guardian’s strategy of publishing raw data alongside stories and comment. During the conversation, Simon explained his perspective on publishing these leaked data and what people can do with them, pointing out that the Guardian doesn’t put any restrictions on reuse of the facts.

One of the major applications of these raw data, especially anything containing geographical information, is the ability to visualise them. One of the first things the Guardian produced from the leaked data was an interactive map of Improvised Explosive Device incidents affecting troops and civilians.

The opening up of the data behind such applications could prove to be a powerful catalyst for wider visualisation and applications built around the presence of authoritative journalistic facts. Putting the raw data in the hands of the web’s hackers has been a bold move from the Guardian, and I hope to see new and better stories come from the tools made possible by a supply of useful information.

Sharing Data on the Web

This article will appear in Nodalities Magazine, Issue 9.

by Kaitlin Thaney
Program Manager of Science Commons, Creative Commons


In the emerging data web, there have been multiple efforts working towards the same broad goal of data sharing (e.g., the NeuroCommons, Linked Open Data, efforts of the World Wide Web Consortium), but they are still unevenly distributed. Our understanding of the legal, social and technical issues is increasing, but is still at a very early stage.

This past fall at the International Semantic Web Conference in Chantilly, VA, USA, I joined three other leading minds to lead a tutorial examining some of the legal and social frameworks for sharing data in the emerging data web, focusing on an overview of the need for access, the social issues of applying Free-Libre/Open Source (FLOSS) licenses to data, and the approach we advocate at Creative Commons to help navigate this complex space — converging on the public domain.

Lessons Learned

Creative Commons as an organisation works to make knowledge sharing easy, legal and scalable – with applications in the culture space (music, text, film, art), education (open educational resources, virtual textbooks), and science (biological materials transfer, data sharing, Open Access, semantic web, patents). We maintain an integrated approach, and craft policy and legal tools to lower the barriers to knowledge sharing.

When it comes to data sharing, first and foremost, the information needs to be legally and technically accessible. The Open Access movement has increased awareness of this, using the Creative Commons licensing suite to unlock content, and has seen its share of qualified success. But what to do when the information you want to share and reuse falls outside the protections of copyright?

In short, it’s complicated.

This is where the discussion of legal protections for data gets murky. Knowledge is not always copyrightable – it may be easy to discern the rights associated with journal articles, but what about data, ontologies, annotations, or research statements described in triples?

The emergence, adoption, and use of the free-libre/open licensing regimes has allowed for remix and reuse of software code, music, film, educational resources and scientific research in a way that otherwise would be difficult to achieve.

The success of these licensing approaches has caused a change in the social ethos of licensing: the traditional “all rights reserved” model is instead used to make something more free, rather than less.

But our research suggests this approach is not ideal for data. The trend towards applying licenses, click-wrap agreements and other sorts of restrictions to scientific data is increasing, with the undesired consequence of limiting the downstream use of this information, and at times even blocking interoperability. The costs are high, the terms are not always clear, and the protections are not always legally sound, making it very difficult to scale for scientific uses. The result is a high barrier to entry for meaningful analysis, annotation, search, etc. on a mass of available data that continues to grow exponentially, and for integrating that data with the available literature.

We advocate an approach of converging on the public domain, and requesting behaviours often found in the various flavours of free and open licensing through norms – not a legal construct. But first, let’s take a look at some of the issues to be aware of and their social implications to furthering the goal of linked open data.

Attribution v. Citation

Under US copyright law, “Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.” Since facts are not covered by copyright, attribution – a license obligation – doesn’t seem to apply to ideas or facts either, since those rights are conditional on compliance with terms of the license.

Socially, the scholarly concept of citation is fairly well understood – credit where credit is due. It has long been viewed as an entrenched norm of good scientific practice.

But when it comes to the legalities of both terms and how to enact this behaviour, the devil is in the details, and the two are actually rather different when it comes to enforceability and applications / ramifications in the digital world.

In a copyright license, the word “attribution” is a legal requirement, whereas citation evokes more of a club mentality and social practice. Citation on its own is not assured or enforceable in the same way, but that’s not necessarily a downside. Ask yourself this: which one is more important – legal enforcement or credit enforced through professional reputation? Attribution – a relatively narrow legal term that can affect interoperability while at the same time possibly failing to provide what you really want? Or citation – an entrenched scientific norm that asks for credit where credit is due?

Implications of FLOSS toggles and directives on data sharing

These issues emerge when, instead of focusing on maximizing interoperability of resources, one applies a property metaphor to data. In the digital world, that tendency can have quite limiting ramifications for future use of the information, as technology continues to outpace the social components of data sharing.

Misunderstanding the legalities can lead to category errors on the social level, including unintentional infringement or, on the other side of the spectrum, choosing not to use the resource for fear of infringement. The intentions are often good – believing that applying a less-restrictive copyright license ensures the data can be freely shared, reused, and built upon. But without existing precedent or the involvement of a legal team, these issues make for a problematic area to navigate, creating additional confusion and burdens for users as well as data providers.

Let’s look at a few examples to gain a better understanding.

Non-Commercial – When used in the context of data, what is a commercial use of the data web? Is it the extraction of a subset, a query that may touch on the data set, hyperlinking?

Attribution – As detailed above, the definitions of attribution and citation are often conflated. Attribution speaks to the legal requirement triggered by the use of the work. But in the case of linked open data, if one were to run a query involving 30,000 data sources (something that is happening every day at an ever decreasing cost), would they then be required to attribute the contributors for all 30,000 databases? You can see how this unintended consequence of attribution stacking could impose a very daunting task for the researcher.

Share-Alike – This toggle specifies that any derivative product be relicensed under the same terms. In the example above of running a large query, all it would take would be one database licensed with a share-alike provision for the entire derivative work to then fall under the same terms and no other license. This leads to compatibility issues.
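The stacking effect described in the three examples above can be sketched in a few lines of Python. This is only a toy model: the Source class, its fields and the database names are hypothetical, and it assumes that obligations simply accumulate across every source a query touches, which glosses over how real licenses interact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A hypothetical data source and the license toggles it carries."""
    name: str
    attribution: bool = False      # credit is a legal requirement
    share_alike: bool = False      # derivatives must use the same terms
    non_commercial: bool = False   # commercial use is restricted


def combined_obligations(sources):
    """Collect the obligations a result inherits from all of its sources.

    Assumption: obligations accumulate (a simple union across sources).
    """
    sources = list(sources)
    return {
        "must_attribute": [s.name for s in sources if s.attribution],
        "share_alike_from": [s.name for s in sources if s.share_alike],
        "non_commercial": any(s.non_commercial for s in sources),
    }


# A query touching 30,000 hypothetical databases, all requiring attribution,
# exactly one of which also carries a share-alike provision.
sources = [Source(f"db{i}", attribution=True) for i in range(30_000)]
sources[17] = Source("db17", attribution=True, share_alike=True)

result = combined_obligations(sources)
print(len(result["must_attribute"]))  # number of contributors to credit
print(result["share_alike_from"])     # sources that force their terms on the result
```

Even in this simplified form, the researcher ends up with 30,000 names to credit, and a single share-alike source is enough to dictate the terms of the whole result.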

There are other external mechanisms and limitations imposed by various jurisdictions and countries that can have a profound effect on data-sharing, especially in terms of international data sharing efforts. These include the sui generis database directive in the European Union, Crown Copyright, “sweat of the brow” and “industrious collection” limitations, trade secrets and unfair competition laws, adding another dimension of complexity to an already complex arena.

After convening a series of meetings, roundtables and other discussions with members of the scientific community, the need emerged for a legally accurate and simple solution, that reduced and/or eliminated the need for one to make the distinction of what’s protected. The conflict between understanding the legal issues and complexities can best be resolved by a two-fold approach: (1) a reconstruction of the public domain and (2) the use of scientific norms to request behaviour through a non-license means.

Converging on the Public Domain (+ Norms)

We believe that the public domain is the best means to achieve maximum interoperability of data with the lowest imposed burdens on the user. This can be achieved through the use of a legal tool – either the Creative Commons CC0 Waiver or the Public Domain Dedication and License (PDDL) – waiving all intellectual property rights and asserting that the provider makes no claims on the data. These tools put the work as close to the public domain as possible.

This approach calls for data providers to waive all rights necessary for data extraction and re-use (i.e., copyright, sui generis database rights, claims of unfair competition, implied contracts). It also requires that the provider place no additional obligations, such as copyleft or share-alike, on the information, which could limit downstream use, as discussed above.

Science Commons also crafted the Protocol for Implementing Open Access Data – a protocol for evaluating database terms of use, in hopes of providing a unified framework for users to evaluate if any given database may be integrated with any other database.

The Protocol recommends that one request behaviour, such as citation, through norms and terms of use rather than as a legal requirement based on copyright or contracts.

We are aware that different disciplines and jurisdictions call for different approaches, and this is not always a one-size-fits-all solution. By requesting behaviour through norms and terms of use rather than a legal construct, various scientific disciplines have the ability to develop their own norms for citation, allowing for legal certainty without constraining one community to the norms of another.

Final Thoughts

In the early days of the World Wide Web, there weren’t many free-libre licenses available, and after a debate over using GPL for the original web code, CERN chose to put it into the public domain. Getting the law out of the way was key to allowing network effects, and to the success of the Web.

Converge on the public domain and ensure the freedom to integrate. It’s the most scalable solution.

This work is licensed under a Creative Commons Attribution 3.0 License.

Resources

Philip (Flip) Kromer talks about InfoChimps and building a data marketplace

In my latest podcast I talk with Flip Kromer, co-founder of InfoChimps.

We explore the background to InfoChimps, and discuss their aspiration to build a marketplace in which people can contribute and find data – both freely available and commercial.

data.gov.uk and the Talis Platform

Earlier this year Gordon Brown appointed Tim Berners-Lee as an advisor to the Cabinet Office to help the government begin the process of opening up its data. This was one part of the initiation of a project to begin opening up UK government data in a similar style to the US. A key part of Berners-Lee’s vision for putting government data online has been Linked Data which promises to provide a much richer way for citizens to begin accessing, browsing, and using government data.

Several other governments have begun opening up data assets including Australia and New Zealand. These approaches mirror that of the US data.gov site, providing a browsable directory of datasets and links to raw data downloads in a range of different formats. The preview launch of data.gov.uk which was announced at the end of September also includes a directory of datasets which is powered by the software underlying the Comprehensive Knowledge Archive Network. But the site also aims to fulfill Berners-Lee’s vision and in addition provide access to some datasets as Linked Data through SPARQL endpoints.

We’re very pleased to report that the Talis Platform is currently underpinning the delivery of all of the Linked Data and SPARQL endpoints for the data.gov.uk site.

We’ve been quietly supporting the effort for several months now, helping out with data management, modelling discussions, and with training on the core technology. There seems to be a very definite appetite in government to not only open the raw data but to also explore the potential for Linked Data. It’s clear from today’s announcement about opening up additional aspects of the Ordnance Survey data that there’s a real focus on delivering on the open data promise. While there are certainly some high-profile datasets like the Ordnance Survey or postcode data that may require legislative changes to become open, one of the biggest implementation challenges facing government is pulling together an overall directory of datasets and spreadsheets that are already scattered across multiple departmental websites.

Creating a dataset directory provides the required basic level of infrastructure to allow reuse, by enabling developers to find what they need; publishing Linked Data, SPARQL endpoints, and potentially extra APIs provides an additional set of options for ways to access the data. By letting datasets be browsable by anyone, not just developers, Linked Data offers the potential for anyone to find, discover and reuse interesting datasets. As I illustrated in a recent talk, these approaches are not mutually exclusive and the goal should be maximum utility.

Over on the Talis Platform developer blog we’ve begun showing some ways that the initial datasets, covering UK schools and traffic measurements, can be queried in interesting ways. It’s been exciting to see people begin to pick up the technology and create reporting tools to explore the data, but also fantastic to be able to easily view data using only a browser.
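For readers who have not queried a SPARQL endpoint before, here is a minimal sketch of what such a request looks like over HTTP, following the standard SPARQL protocol of passing the query text in a "query" parameter. The endpoint address below is a placeholder rather than a real data.gov.uk URL, and the query is deliberately generic because it assumes nothing about the vocabularies used in the schools or traffic datasets.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the address of a real SPARQL endpoint.
ENDPOINT = "https://example.org/sparql"

# A generic query that returns any ten triples from the store.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Build a GET URL carrying the query text in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be pasted straight into a browser, which is one reason it is possible to explore the data with nothing more than that.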

There’s clearly still a great deal of work ahead, but the groundwork has now been completed: there’s infrastructure in place to support data publishing; official guidelines on creating public sector URIs; and some agreement on best practices for modelling statistical data. The next challenge is to start ramping up the conversion of currently open data into RDF, in order to begin expanding the coverage of the Linked Data.

This is a very exciting project and here at Talis it’s something in which we’re very proud to be playing a role.