Nodalities

From Semantic Web to Web of Data

License

Creative Commons License

Talis Sponsor Pan-European Open Data Challenge

We are proud to be a Lead Sponsor for the Open Data Challenge being coordinated by Jonathan Gray from the Open Knowledge Foundation and Paul Meller from the Open Forum Academy, under the auspices of the Share PSI initiative.

This is a significant competition, with prizes totalling €20,000 for ideas, applications, visualisations and datasets – individual prizes are up to €5,000!

As you would expect from a Talis Sponsored competition, Linked Data features in the line up of attributes that entrants should be considering.  Following the 5 Star Data principles espoused by Sir Tim Berners-Lee, the more machine readable, non-proprietary formatted, and linked that Open Data can be, the lower the barrier to its innovative use.  This is especially true in the area of Public Sector Information, where similar or associated data is being published by several organisations or governments.  In recognition of this we are, as part of our sponsorship, backing the Talis Award for Linked Data – €1,000 presented for the best use of Linked Data in any of the competition categories.

The competition will run for 60 days, so get your ideas flowing, and developers’ fingers rattling over those keys.

Watch out for a later post, when I will identify some Linked Open Data that is already available that you could use to build an entry.

Linked Data and Libraries 2011 – July 14th

After the great success of Linked Data and Libraries 2010 we are doing it again!

Linked Data and Libraries 2011 will be held at The British Library in London on Thursday July 14th.  Again it will be a free event, with limited spaces allocated, so register early.

The agenda is yet to be finalised, but as per 2010 it will be a mixture of general Linked Data overviews & experience, and library Linked Data speakers.  We hope to hear from the British Library, W3C Library Linked Data Incubator Group, LOD-LAM Summit, and others. We are also hoping to find time for the 10 minute lightning talks slot that worked so well last time.

Register early, and if you would like to propose a topic or speaker, email me – richard.wallis@talis.com.

Image from a photo on Flickr by Fuzzyyol

Are We Getting A Right to Data?

Friday night – nothing on the TV – I know! I’ll browse through the Protection of Freedoms Bill, currently passing through the UK Parliament. Sad I know, but interesting.

Let’s scroll back in time a bit to November 19th 2010 and a government press conference introduced by a video from Prime Minister David Cameron.  The headline story was about the publishing of government spending and contract data, but towards the end of this 109 second short he said the following:

… the most exciting is a new right to data. Which will let people request streams of government information and use it for social or commercial purposes.  Take all this together and we really can make this one of the most open, accountable and transparent governments there is.  Let me end by saying this. You are going to have so much information about what we do, how much of your money we spend doing it, and what the outcome is.  So use it, exploit it, hold us to account.  Together we can set a great example of what a modern democracy ought to look like. (my emphasis)

Obviously to realise this Right to Data there needs to be some legislation, which brings me to the Protection of Freedoms Bill.  This is one of those bills which covers all sorts of issues, from rules for destruction of fingerprints and DNA profiles, CCTV camera regulations, detention of terrorist suspects, to freedom of information and data protection.  Zooming in on the bits on the topic of the release and publication of datasets held by public authorities, we find a set of clauses that amend the Freedom of Information Act 2000.

Re-use

After some amendments which allow for datasets and provision in electronic form we get this: “the public authority must, so far as reasonably practicable, provide the information to the applicant in an electronic form which is capable of re-use.”  Unfortunately there is no definition of the term re-use.  It could be argued that a pdf of some tables in a MS Word document could be re-used, whereas I believe the spirit of the legislation should be made more explicit by identifying non-proprietary data formats.  I know this would be a tricky job for the parliamentary draftsmen, as we would not want to restrict it to things, such as XML and csv, that could age and be replaced by something better which then could not be used as it had not been mentioned in the legislation, but I believe that just using the term ‘re-use’ is far too woolly and open to [mis]interpretation.

What is [not] a dataset

This is one of the areas that raises most concern for me. Check out this wording from the Bill: [image: extract from the Bill]

I am OK with (a) – data collected as part of an authority doing its job – and (c) – don’t change the data you have collected – publishing that raw data is important.  However (b) specifically excludes data that is the product of analysis.  Presumably analysis of collected data is one significant way that an authority measures the outcomes of its efforts.  Understanding that analysis will help understand the subsequent decisions and actions they make and take.  I assume that there may be some specific reasons that underpin this blanket exclusion of analysis data.  If there are, they should be identified, instead of generally throttling the output of useful data that will go a long way to helping with Mr Cameron’s stated ambition for us to be able to see “what the outcome is” of the spending of public money.

Release of datasets for re-use

This is a whole new section (11A) to be added to the 2000 Act to cover the release of datasets. It covers ownership, copyright, and/or database right of the information to be published and states that it should be published under “the licence specified by the Secretary of State in a code of practice issued under section 45”. Section 45 basically puts into the hands of the Secretary of State the definition of the licence(s) data should be published under.  As of today the Open Government Licence for public sector information is what is wanted to keep the publishing of information open.  However, what is there to stop a future Secretary of State with a less open outlook from replacing it with far more restrictive licences?  Do we not need some form of presumption of openness to be attached to the Secretary of State’s powers as part of this change in legislation?

On the topic of presumptions of openness, the wording of this bill contains phrases such as “unless the authority is satisfied that it is not appropriate for the dataset to be published” and “where reasonably practicable”.  It is clear that many in the public sector are not as enthusiastic about publishing data as the current government is, and such vague phrases as these may well be unreasonably used by some in justifying a throttling of the stream of information.  They could easily be used to build in a bureaucratic decision hurdle for each dataset to have to jump, proving its appropriateness and practicality, before publication.  I am sure that it would not be beyond a parliamentary draftsman’s skill to produce wording that means that all will be published, unless a specific objection is raised for an individual dataset, for reasons of excessive effort or data protection.

Up-dated data

Data published by an authority should be published under a scheme; the following applies here: [image: extract from the Protection of Freedoms Bill (HC Bill 146)]

How should we interpret “any up-dated version held by the authority of such a dataset”? My interpretation is that once a dataset has been published it shall continue to be published as it changes.  The precedent for this is spending data – having published authority spending for January 2011, authorities should be automatically publishing it for February and following months.  But what if, in response to a request, an authority publishes the contents of a spreadsheet used to track the amount of salt applied to roads in its area during winter 2010-11 and then uses a different spreadsheet for the following winter?  Does the output of that new spreadsheet constitute a new dataset, or an up-date to its predecessor?  From the wording in the Bill it is not clear.

Who does it cover?

I probably need a bit of help here from those that understand the public sector better than I do, but I am suspicious that references to the organisations listed in Schedule 1 and “the wider public sector”, do not take the net wide enough to cover some of the data that is relevant to our daily lives but is delivered on behalf of some authorities by third parties.  For example I am aware that recently a large city was not able to inform citizens of their rubbish collection schedules because that data was considered as commercially restricted by their service provider.

So in summary, I welcome the commitment to a right to data being realised by streams of government information about what we do, how much of our money is spent doing it, and what the outcomes are.  However, I am sceptical as to how effective the measures in the current Protection of Freedoms Bill will be in delivering them.  Especially in the light of very recent comments made by the Prime Minister highlighting the "enemies of enterprise" in Whitehall and town halls across the country, attacking what he called the "mad" bureaucracy that holds back entrepreneurs.  Those enemies are just the people who might take the wording of this bill as ammunition in their cause.

Whilst being concerned about this topic, I have been wondering why few are commenting on it.  Are the majority just taking the press conference statements by David Cameron, and his fellow Ministers, as indications of a battle won, or am I missing something?  I promote Sir Tim Berners-Lee’s 5 Star Data as the steps towards a Web of Linked Data – if we don’t get the publishing of public sector data to at least 3 star standard (Available as machine-readable structured data – in non-proprietary format), many of the current ambitions may remain just that, ambitions.  That would be a massive missed opportunity.

So are we getting a right to data? – or just some provisions to extend the Freedom of Information Act a bit further in the dataset direction?  I’m not sure.

Personal note: As you may tell from the above, I am no expert on the interpretation of parliamentary legislation, and I have left several unanswered questions hanging in this post.  Any help in clarifying my thinking, confirming or disproving my assumptions, or answering some of those questions, will be gratefully received in comments to this post or your own posted thoughts.

Linked Spending Data – How and Why Bother Pt3

As often is the way, events have conspired to prevent me from producing this third and final part in this How & Why of Local Government Spending Data as soon as I wanted.  So my apologies to those eagerly awaiting this latest.

To quickly recap, in Part 1 I addressed issues around why pick on spending data as a start point for Linked Data in Local Government, and indeed why go for Linked Data at all.  In Part 2, I used some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area, as examples to demonstrate how you can publish spending data as Linked Data, for both human and programmatic consumption.

I am presuming that you are still with me on my basic assumptions “…publishing this [local government spending] data is a good thing” and “Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing”, plus the technique of using URIs to name things in a globally unique way (that also provides a link to more information) is not providing you with mental indigestion.  So, I now want to move on to some of the issues that are causing debate in the community, which come under the headings of ontologies and identifiers.

Ontologies

An ontology, according to Wikipedia, is a formal representation of knowledge as a set of concepts within a domain – an ontology provides a shared vocabulary, which can be used to model a domain – that is, the type of objects and/or concepts that exist, and their properties and relations.  So in our quest to publish spending data what ontology should we use?  The Payments Ontology, with the accompanying guide to its application, is what is needed.  Using it, it becomes possible to describe individual payments, or expenditure lines, and their relationships to the authority (payment:payer), the supplier (payment:payee), the category (payment:expenditureCategory) etc.  The next question is how do you identify the things that you are relating together using this ontology.

Let’s take this one step at a time:

  1. Give the expenditure line, or individual payment, an identifier possibly generated by our accounts system. eg. 8605670.
  2. Make that identifier unique to our local authority by prefixing it with our internet domain name. eg. http://spending.lichfielddc.gov.uk/spend/8605670 – note the prefix of ‘http://’.  This enables anyone wanting detail about this item to follow the link to our site to get the information.
  3. Associate a payer with the payment with an RDF statement (or triple) using the Payments Ontology:
    http://spending.lichfielddc.gov.uk/spend/8605670 
    payment:payer
    http://statistics.data.gov.uk/id/local-authority/41UD .

    Note I am using an identifier for the payer that is published by statistics.data.gov.uk.  That is so that everyone else will unambiguously understand which authority is the one responsible for the payment.

  4. Follow the same approach for associating the payee:
    http://spending.lichfielddc.gov.uk/spend/8605670 
    payment:payee
    http://spending.lichfielddc.gov.uk/supplier/bristow-sutor .
  5. And then repeat the process for categorisation, payment value etc.

This immediately throws up a couple of questions, such as why use a locally defined identifier for the payee – surely there is an identifier I can use that others will recognise, such as company or VAT number!  There are, but as of the moment there are no established sets of URI identifiers for these.  OpenCorporates.com are doing some excellent work in this area, but Companies House, the logical choice for publishing such identifiers, have yet to do so.  Pragmatically it is probably a good idea to have a local identifier anyway and then associate it with another publicly recognised identifier:
http://spending.lichfielddc.gov.uk/supplier/bristow-sutor
owl:sameAs
http://opencorporates.com/companies/uk/01431688 .
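
For readers who prefer to see the statements above as data, here is a minimal sketch that writes the three triples out in N-Triples form using nothing but plain Python. The owl namespace is the standard one; the Payments Ontology namespace used here is an assumption and should be checked against the published ontology before relying on it.

```python
# Namespaces. OWL is the standard W3C namespace; the Payments Ontology
# namespace is an assumption made for this sketch.
PAYMENT = "http://reference.data.gov.uk/def/payment#"
OWL = "http://www.w3.org/2002/07/owl#"

SPEND = "http://spending.lichfielddc.gov.uk/spend/8605670"
SUPPLIER = "http://spending.lichfielddc.gov.uk/supplier/bristow-sutor"

# Each statement is a (subject, predicate, object) tuple of URIs.
triples = [
    (SPEND, PAYMENT + "payer",
     "http://statistics.data.gov.uk/id/local-authority/41UD"),
    (SPEND, PAYMENT + "payee", SUPPLIER),
    (SUPPLIER, OWL + "sameAs",
     "http://opencorporates.com/companies/uk/01431688"),
]


def to_ntriples(statements):
    """Serialise URI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in statements)


print(to_ntriples(triples))
```

In practice an RDF library would handle literals, prefixes and other serialisations; the point here is only that each statement is three URIs and a full stop.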

Identifiers

Because this is all very new and still emerging, we now find ourselves in a bit of a chicken-or-egg situation.  I presume that most authorities have not built a mini spending website, like Lichfield District Council has, to serve up details when someone follows a link like this: http://spending.lichfielddc.gov.uk/spend/8605670 

You could still use such an identifier using your authority domain, and plan to back it up later with a web service to provide more information.  Or you could let someone else, who takes a copy of your raw data, do it for you as OpenlyLocal might: http://openlylocal.com/financial_transactions/135/2010/33854 or maybe how the project we are working on with LGID might: http://id.spending.esd.org.uk/Payment/36UF/ds00024616.  In the open, flexible world of Linked Data it doesn’t matter too much which domain an identifier is published from, or for that matter how many [related] identifiers are used for the same thing.

It does matter however, for those looking to the identifying URI for some idea of authority.  As I say above, technically it doesn’t matter whose domain the identifier comes from, but I believe it would be better overall if it came from the authority whose payment it is identifying.  Which puts us back in the chicken-or-egg situation as to resolving the URI to serve up more information.  The joy of Linked Data is that, provided aggregators consider the possibility of being able to identify source authorities’ data accurately when they encode it, it should be possible to automatically retrofit links between URIs at a later date.

In summary, over this series of posts we are seeing a technology which, although it has obvious benefits, is still early on the development curve, being applied to a process which is also new and scary for many.  An ideal breeding ground for cries of pain, assertions of ‘it doesn’t work’ or ‘not worth bothering’, yet with the potential to provide a powerful foundation for a future data-rich environment that is open, accessible, and beneficial to authorities, government, citizens, and UK Plc.  Yes it is worth bothering, just don’t expect benefits on day, or even month, one.

Linked Spending Data – How and Why Bother Pt2

I started the previous post in this mini-series with an assumption – ..working on the assumption that publishing this [local government spending] data is a good thing. That post attracted several comments, fortunately none challenging the assumption.  So learning from that experience I am going to start with another assumption in this post.  Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing.  Those new to this mini-series, check back to the previous post for my reasoning behind the assertion.

In this post I am going to be concentrating more on the How than the Why Bother.

To help with this I am going to use some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area as examples.  Take a look at the spending data part of their site: spending.lichfielddc.gov.uk/.  On the surface navigating your way around the site looking at council spend by type, subject, month, and supplier is the kind of experience a user would expect. Great for a website displaying information about a single council.

However, it is more than a web site.  Inspection of the Download data tab shows that you can get your hands on the source data in csv format.  Here is one line, representing a line of expenditure, from that data:

"http://statistics.data.gov.uk/id/local-authority/41UD","Lichfield District Council","2010-04-06","7747","http://spending.lichfielddc.gov.uk/spend/8605670","120.00","BRISTOW & SUTOR","401","Revenue Collection","Supplies & Services","Bailiff Fees",""

… which represents the data displayed on this human readable page:

Lichfield District Council Spending Data - Details of payment number 8605670
Looking through the csv, you can pick out the strings of characters for information such as the date, supplier name, department name etc.  In addition you can pick out a couple of URIs: http://statistics.data.gov.uk/id/local-authority/41UD and http://spending.lichfielddc.gov.uk/spend/8605670

In the context of csv, that’s all these URIs are, identifiers.  However because they are http URIs you can click through to the address to get more information.  If you do that with your web browser you get a human readable representation of the data.  These sites also provide access to the same data, formatted in RDF, for use by developers.
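
As a small illustration of picking those URIs out programmatically, the sketch below parses the csv line quoted earlier with Python’s standard csv module. It deliberately does not name the columns, because the download’s header row is not reproduced in this post; it simply separates the fields that look like http URIs from the rest.

```python
import csv
import io

# The expenditure line quoted above, exactly as it appears in the download.
line = ('"http://statistics.data.gov.uk/id/local-authority/41UD",'
        '"Lichfield District Council","2010-04-06","7747",'
        '"http://spending.lichfielddc.gov.uk/spend/8605670","120.00",'
        '"BRISTOW & SUTOR","401","Revenue Collection","Supplies & Services",'
        '"Bailiff Fees",""')

# csv.reader copes with the quoting and the embedded ampersands for us.
fields = next(csv.reader(io.StringIO(line)))

# Fields that are http URIs are identifiers we can follow; the rest are
# plain strings (labels, dates, amounts).
uris = [f for f in fields if f.startswith("http://")]
plain = [f for f in fields if f and not f.startswith("http://")]

print(uris)
print(plain)
```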

You can see that data by adding ‘.rdf’ to the end of the address, thus: http://spending.lichfielddc.gov.uk/spend/8605670.rdf and then selecting the ‘view source’ option of your browser for the page of gobbledegook that you get back.

Inspecting the RDF, you will see that most things, except descriptive labels and financial values, are now identified as URIs such as http://spending.lichfielddc.gov.uk/subjective/bailiff-fees and http://spending.lichfielddc.gov.uk/invoice/7747.  Again if you follow those links, you will get a human readable representation of that resource, and the RDF behind it by adding a ‘.rdf’ suffix.
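
A developer would do the same thing in code instead of a browser. The following is a rough sketch, using only the Python standard library, of the ‘.rdf’ suffix convention described above. The function names are my own for illustration, the suffix convention is specific to this site, and the fetch assumes the site is reachable and returns UTF-8 text.

```python
from urllib.request import urlopen


def rdf_url(resource_uri):
    """Build the address of the RDF document for a resource by adding '.rdf'."""
    return resource_uri.rstrip("/") + ".rdf"


def fetch_rdf(resource_uri, timeout=10):
    """Download the RDF behind a resource URI (assumes the site is reachable
    and serves UTF-8)."""
    with urlopen(rdf_url(resource_uri), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the address is built here; call fetch_rdf() to retrieve the document.
print(rdf_url("http://spending.lichfielddc.gov.uk/invoice/7747"))
```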

The eagle-eyed, inspecting the RDF-XML for Lichfield payment number 8605670, will have noticed a couple of things.  Firstly, a liberal sprinkling of elements with names like payment:expenditureCategory or payment:payment. These come from the Payments Ontology as published on data.gov.uk as the recommended way of encoding spending, and other payment associated data, in RDF.

Secondly, you may have spotted that there is no date, or supplier name or identifier.  That is because those pieces of information are attributes associated with a payment – invoice number 7747 in this case.

Zooming out from the data for a moment, and looking at the human readable form, you will see that most things, like spend type, invoice number, supplier name, are clickable links, which take you through to relevant information about those things – address details & payments for a supplier, all payments for a category etc.  This intuitive natural navigation style often comes as a positive consequence of thinking about data as a set of linked resources instead of the traditional rows & columns that we are used to.  Another great example of this effect can be found on a site such as the BBC Wildlife Finder.  That is not to say that you could not have created such a site without even considering Linked Data, of course you could.  However, data modelled as a set of linked resources almost self-describes the ideal navigation paths for a user interface to display it to a human.

The Linked Data practice of modelling data, such as spending data, as a set of linked resources and identifying those resources with URIs [which if looked up will provide information about that resource] is equally applicable to those outside of an individual authority.  By being able to consume that data, whilst understanding the relationships within it and having confidence in the authority and persistence of the identifiers within it, a developer can approach the task of aggregating, comparing, and using that data in their applications more easily.

So, how do I (as a local authority) get my data from its raw flat csv format, in to RDF with suitable URIs and produce a site like Lichfield’s?  The simple answer is that you may not have to – others may help you do some, if not all, of it.   With help from organisations such as esd-toolkit, OpenlyLocal, SpotlightOnSpend, and with projects such as the xSpend project we are working on with LGID, many of the conversion [from csv], data formatting processes, and aggregation are being addressed – maybe not as quickly or completely as we would like, but they are.  As to a human readable web view of your data, you may be able to copy Stuart by taking up the offer of a free Talis Platform Store and then running your own web server with his code that he hopes to share as open source.  Alternatively it might be worth waiting for others to aggregate your data and provide a way for your citizens to view your data.

As easy as that then! – Well not quite, there are some issues about URI naming and creation, and how you bring the data together that still do need addressing by those engaged in this.  But that is for Part 3….

Linked Spending Data – How and Why Bother Pt1

National Government instructing the 300+ UK Local Authorities to publish “New items of local government spending over £500 to be published on a council-by-council basis from January 2011” has had the proponents of both open, and closed, data excited over the last few months.  For this mini series of posts I am working on the assumption that publishing this data is a good thing, because I want to move on and assert that [when publishing] one format/method to make this data available should be Linked Data.

This immediately brings me to the Why Bother? bit. This itself breaks in to two connected questions – Why bother publishing any local authority data as Linked Data? and Why bother using the unexciting, simplistic spending data as a place to start?

I believe that spending data is a great place to start, both for publishing local government data and for making such data linked.  Someone at national level was quite astute choosing spending as a starting point.  To comply with the instruction all an authority has to do is produce a file containing five basic elements for each payment transaction: An Id, a date, a category,  a payee, and an amount.  At a very basic level it is very easy to measure if an authority has done that or not.
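
To show how mechanical that measurement can be, here is a minimal sketch that checks each row of a csv file for the five basic elements. The column names are hypothetical, since the instruction names the elements but not a header layout, and the second sample row is made up to show a failure.

```python
import csv
import io

# Hypothetical column names for the five basic elements.
REQUIRED = ("id", "date", "category", "payee", "amount")


def rows_missing_elements(csv_text):
    """Return (line_number, missing_columns) for every incomplete row."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        missing = [c for c in REQUIRED if not (row.get(c) or "").strip()]
        if missing:
            problems.append((line_number, missing))
    return problems


sample = (
    "id,date,category,payee,amount\n"
    "8605670,2010-04-06,Bailiff Fees,BRISTOW & SUTOR,120.00\n"
    "8605671,2010-04-06,,Example Supplier Ltd,\n"  # made-up incomplete row
)

print(rows_missing_elements(sample))  # [(3, ['category', 'amount'])]
```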

Guidance from data.gov.uk expands on this a little by mandating the following:

  Body: This should be the URI that represents (or more properly ‘identifies’ – see below) the local authority at statistics.data.gov.uk, eg. http://statistics.data.gov.uk/id/local-authority-district/00CN
  Date: Should ideally be the payment date as recorded in purchase or general ledger
  Transaction number: To identify within authority’s system, for future reference
  Amount: In Sterling recorded in finance system
  Supplier Details: Name and individual authority id for supplier plus where possible Companies House, Charity Registration, or other recognised identifier
  Expense Area: The part of the authority that spent the amount
  Service Categorization: Depending on the accounts system this may be easy or quite difficult. There are two candidates for categorization – CIPFA’s BVACOP classification and the Proclass procurement classification system.

… a little more onerous, possibly around the areas of identifying company numbers and Service Categorization, but not much room for discussion/interpretation.

As to the file formats to publish data, the same advice mandates: “The files are to be published in CSV file format”, supplemented by: “Authorities may wish to publish the data in additional formats as well as the CSV files (e.g. linked data, XML, or PDFs for casual browsers). There is no reason why they should not do this, but this is not a substitute for the CSV files.”

So fairly clear, and measurable, then. You either have published your required basic elements of data in a CSV format file, or you have not.  Couple this with the political ambitions and drive behind the Government’s Transparency Agenda, and local authorities will have difficulty in not delivering this.  Although some are being a bit tardy and others seem reluctant to publish in formats other than PDF.

OK so why bother with applying Linked Data techniques to this [boring] spending data?  Well, precisely because it is simple data, it is comparatively easy to do, and because everybody is publishing this data the benefits of linking should soon become apparent.  Linked Data is all about identifying things and concepts, giving them globally addressable identifiers (URIs) and then describing the relationships between them.

For those new to Linked Data, the use of URIs as identifiers often causes confusion.  A URI, such as http://statistics.data.gov.uk/id/local-authority-district/00CN, is a string of characters that is as much an identifier as the payroll number on your pay-check, or a barcode on a can of beans.  It has a couple of attributes that make it different from traditional identifiers.  Firstly, the first part of it is created from the Internet domain name of the organisation that publishes the identifier.  This means that it can be globally unique. Theoretically you could have the same payroll number as the barcode number on my can of beans – adding the domain avoids any possibility of confusion.  Secondly, because the domain is prefixed by http:// it gives the publisher the ability to provide information about the thing identified, using well established web technologies.  In this particular example, http://statistics.data.gov.uk/id/local-authority-district/00CN is the identifier for Birmingham City Council; if you click on it [using it as an internet address] data.gov.uk will supply you information about it – name, location, type of authority etc.
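
The two attributes can be seen by pulling a URI apart. This short sketch uses Python’s standard urllib.parse; the two example.org and example.com addresses are made up purely to show the payroll number and barcode clash disappearing once a domain is added.

```python
from urllib.parse import urlsplit

uri = "http://statistics.data.gov.uk/id/local-authority-district/00CN"
parts = urlsplit(uri)

publisher_domain = parts.netloc           # who published the identifier
local_id = parts.path.rsplit("/", 1)[-1]  # the publisher's own code

print(parts.scheme, publisher_domain, local_id)
# http statistics.data.gov.uk 00CN

# Made-up illustration: the same local number under two different domains
# gives two distinct, globally unique identifiers.
payroll = "http://payroll.example.org/id/12345"
barcode = "http://beans.example.com/id/12345"
same_local_number = payroll.rsplit("/", 1)[-1] == barcode.rsplit("/", 1)[-1]
print(same_local_number, payroll == barcode)  # True False
```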

Following this approach, creating URI identifiers for suppliers, categories, and individual payments and defining the relationships between them using the Payments Ontology (more on this when I come on to the How)  leads to a Linked Data representation of the data.  In technical terms a comparatively easy step using scripts etc.

By publishing Linked Spending Data and loading it in to a Linked Data store, as Lichfield DC have done, it becomes possible to query it, to identify things like all payments for a supplier; or suppliers for a category, etc.
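
A real Linked Data store would be queried with SPARQL. As a stand-in, the sketch below holds a handful of triples in a Python list and answers the same two questions by pattern matching. The prefixed names such as payment:payee are used as plain strings for brevity, and the ‘example-’ resources are made up for illustration.

```python
BASE = "http://spending.lichfielddc.gov.uk"

triples = [
    (f"{BASE}/spend/8605670", "payment:payee", f"{BASE}/supplier/bristow-sutor"),
    (f"{BASE}/spend/8605670", "payment:expenditureCategory",
     f"{BASE}/subjective/bailiff-fees"),
    # Made-up expenditure lines, purely for illustration.
    (f"{BASE}/spend/example-1", "payment:payee", f"{BASE}/supplier/bristow-sutor"),
    (f"{BASE}/spend/example-1", "payment:expenditureCategory",
     f"{BASE}/subjective/bailiff-fees"),
    (f"{BASE}/spend/example-2", "payment:payee", f"{BASE}/supplier/example-supplier"),
    (f"{BASE}/spend/example-2", "payment:expenditureCategory",
     f"{BASE}/subjective/example-category"),
]


def match(statements, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [t for t in statements
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def payments_for_supplier(statements, supplier):
    """All expenditure lines whose payee is the given supplier."""
    return sorted(s for s, _, _ in match(statements, p="payment:payee", o=supplier))


def suppliers_for_category(statements, category):
    """All suppliers paid under the given expenditure category."""
    lines = {s for s, _, _ in
             match(statements, p="payment:expenditureCategory", o=category)}
    return sorted({o for s, _, o in match(statements, p="payment:payee")
                   if s in lines})


print(payments_for_supplier(triples, f"{BASE}/supplier/bristow-sutor"))
print(suppliers_for_category(triples, f"{BASE}/subjective/bailiff-fees"))
```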

If you then load data for several authorities in to an aggregate store, as we are doing in partnership with LGID, those queries can identify patterns or comparisons across authorities.  Which brings me to ….

Why bother publishing any local authority data as Linked Data?  Publishing as Linked Data enables an authority’s data to be meshed with data from other authorities and other sources such as national government.  For example, the data held at statistics.data.gov.uk includes which county an authority is located within.  By using that data as part of a query, it would for instance be possible to identify the total spend, by category, for all authorities in a county such as the West Midlands.

As more authority data sets are published, sharing the same identifiers for authority, category, etc., they will naturally link together, enabling the natural navigation of the information between council departments, services, costs, suppliers, etc.  Once this step has been taken and the dust settles a bit, this foundation of linked data should become an open data platform for innovative development and the publishing of other data that will link in with this basic but important financial data.
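
As a rough sketch of that county-level query, the Python below joins payments to a lookup of authority to county and totals the spend by category. Everything here is illustrative: the lookup stands in for the data held at statistics.data.gov.uk, the example.org authorities are hypothetical, and all the amounts are made up.

```python
from collections import defaultdict
from decimal import Decimal

# Stand-in for the authority-to-county data held at statistics.data.gov.uk.
AUTHORITY_COUNTY = {
    "http://statistics.data.gov.uk/id/local-authority-district/00CN": "West Midlands",
    "http://example.org/id/authority/A": "West Midlands",  # hypothetical
    "http://example.org/id/authority/B": "Elsewhere",      # hypothetical
}

# (payer URI, category label, amount): made-up aggregated payments.
PAYMENTS = [
    ("http://statistics.data.gov.uk/id/local-authority-district/00CN",
     "Bailiff Fees", "120.00"),
    ("http://example.org/id/authority/A", "Bailiff Fees", "80.00"),
    ("http://example.org/id/authority/A", "Road Salt", "50.50"),
    ("http://example.org/id/authority/B", "Bailiff Fees", "999.00"),
]


def spend_by_category(payments, authority_county, county):
    """Total spend per category for every authority located in the county."""
    totals = defaultdict(Decimal)
    for payer, category, amount in payments:
        if authority_county.get(payer) == county:
            totals[category] += Decimal(amount)
    return dict(totals)


print(spend_by_category(PAYMENTS, AUTHORITY_COUNTY, "West Midlands"))
```

The join only works because every payment names its payer with the same shared URI that the county lookup uses, which is the whole argument for common identifiers.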

There are however some more technical issues, URI naming, aggregation, etc.,  to be overcome or at least addressed in the short term to get us to that foundation.  I will cover these in part 2 of this series.

Thanksgiving for Open Government

On the eve of the American Thanksgiving holiday, millions of people travel to spend time with friends and family.  Before I share a meal with relatives, I contemplate the connection between the first thanksgiving and the emerging Open Government movement.

The “First Thanksgiving” celebration in the US was a feast shared by 53 starving pilgrims who survived a brutal winter in New England, and 90 Native Americans.  The Native Americans knew how to manage their land and waters to provide sufficient fish, meat, vegetables and fruit.

The connection between the first American Thanksgiving and Open Government has to do with adapting to a new world by sharing information.  Four hundred years ago,  the Native Americans shared information on seeds, crops and planting conditions, helping the pilgrims survive.  Today, sharing information via the Web is helping us to better understand climate conditions, our health care options and issues impacting our local community.

Last week I joined about 250 people at the first International Open Government Conference, hosted by the US Department of Commerce in Washington DC.  Approximately half the conference delegates were from government, the balance from academia and the private sector.  The speakers discussed Open Government projects underway in the US, UK, Australia, New Zealand and Brazil. Speakers shared success stories and areas for future development.  The common theme: democratizing public sector data and driving innovation.  Jonas Rabinovitch from the United Nations Department of Economic and Social Affairs highlighted several eGov strategies in developing nations.  Mr. Rabinovitch noted that all but three UN member nations have a basic Web presence, many offer online forms and some provide the ability to perform transactions via the Web.

Given the conference was hosted in the US Department of Commerce, data.gov featured prominently.  “The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.”  Seven countries have stood up Open Government sites in the last 18 months, including UK, US, Australia, New Zealand, Canada and Finland.  Government administrators are seeking to restore public trust and establish an environment of transparency, participation and collaboration with the public.

The US Administration launched its Open Government Initiative in April 2009.  In the last two years, I’ve watched the US Executive Branch begin to move from a “need to know” to a “need to share” culture.  This cultural transition, and thus this Open Government Conference, was truly historic.  The conference underscored to me that we all, regardless of our political views and affiliation, live in a highly interconnected global economy, underpinned by the World Wide Web.

Respected advisors on Open Government initiatives including Professor Jim Hendler of Rensselaer Polytechnic Institute and Sir Tim Berners-Lee, Director of the World Wide Web Consortium, agreed that public participation and collaboration will be key to the success of Open Government initiatives.  I believe that more conferences like this one and the Open Government Data Camp 2010 held in London last week, drawing delegates from a variety of disciplines, from several countries, will do a great deal to reinvigorate civic engagement and economic growth from the ground up.

Government employees are responding to mandates to publish content to Open Government websites.  Data.gov was launched in April 2009 with 47 data sets.  Vivek Kundra, U.S. Chief Information Officer, stated that data.gov has in excess of 300,000 data sets as of November 2010.  A large portion of the data.gov data sets are geospatial information, which is an opportunity for scientists and entrepreneurs to build tools for analysis and visualization of this valuable data.  The UK Government has published over 4,600 data sets, including many from Great Britain’s national mapping agency, Ordnance Survey, providing the most accurate and up-to-date geographic data for the UK.

“The stakes are high for our interlinked global economy.”  Dr. Robert Schaefer, Deputy Project Scientist from Johns Hopkins University Applied Physics Lab, gave a compelling presentation on the need for mechanisms to make sense of published data as Linked Open Data.  Publishing the content in RDF is not sufficient; rather, providing context on what the data implies is necessary.  Better tools for analysts and scientists to extract meaning from Linked Open Data will allow critical information on climate change and space weather, for example, to be more readily understood by policy makers.  Professor Schaefer stated the implications for climate change are serious, wide ranging & urgent.  Current CO2 emissions are higher than the Intergovernmental Panel on Climate Change “worst case” scenario.  Billions of people may experience serious consequences from climate change.  Professor Schaefer reiterated the need to get started as soon as possible.  “When the water from the sea rises, millions of people will have to move.”  This international conference will hopefully stimulate cooperation between the public and private sectors.  It is a critical step in making data accessible and providing decision support tools for space weather and climate change.

Mr. Kundra acknowledged we have much more to do to improve the quality of published data sets.  He said, “when I’m able to perform analytics on the fly, grounded on quality data, we will have achieved success.”  Delegates were encouraged by Mr. Kundra and other speakers to build out communities of interest, led by individuals rather than government agencies.  The US Government is regularly launching challenges, see http://www.challenge.gov, with modest cash prizes targeting citizens to gain insights on how we, the people, not government, can solve problems ranging from education on childhood obesity to sustainable urban housing that respects the environment.

Beth Simone Noveck, United States Deputy Chief Technology Officer for Open Government, leads President Obama’s Open Government Initiative.  Based at the White House Office of Science and Technology Policy, she is an expert on technology and institutional innovation. Ms. Noveck stated that “the Open Government Initiative is not transparency for transparency’s sake.  It is through participation and collaboration with academia and the public sector that there is value.”  Creating partnerships to use Open Government Data for important and unforeseen uses is empowering individuals with the ability to make better decisions and affect our quality of life.

We are in the very early stages of making Open Government data available as Linked Data.  However, there are many good reasons to support Open Government initiatives, including accountability in spending, improved health care provision, and addressing climate change and space weather, which affect the world’s population.  The international data exchange standards are now in place.  While experts will continue to refine the technical underpinnings and best practices will evolve, the citizen-led movement, assisted by government, is truly underway.

Bright young geeks are increasingly involved in American civic life through non-profit organizations like Code for America.  Passionate entrepreneurs like Dan Melton show that being super bright and engaged at a grassroots level in government is both hip and necessary.  Code for America recruited twenty “fellows” from 362 applicants to get involved in city projects in 2011.  One example discussed was the Boston Project, whose idea is to bring info on students together & create interesting applications leveraging federal census content, student data, transit info, city and state data.

Each month new mobile applications and social networking solutions are made available.  These are not expensive, top-down government initiatives; rather, they are coming from the ground up, from military personnel, students, local government officials, publishers, scientists and citizens who value transparent government.  An interesting mobile app for Android, iPhone and the iPad was unveiled for the New York Senate.  It is a real-time constituent mobile dashboard to the legislative process, allowing citizens to connect with Senators, find and comment on bills, and review votes and transcripts.

Academics are doing innovative research.  Grad students and post-docs are rapidly prototyping what the new world of open data will look like.  An increasing number of software companies, including my employer Talis, are producing lightweight platforms and cloud computing solutions.  Thousands of smart people have been creating the foundation of the Linked Data “ecosystem” in the form of International Data Standards and best practices over the last fifteen years, largely through the important work of the World Wide Web Consortium (W3C).

The availability of improved development tools is seen as a requirement for widespread proliferation of Semantically enabled applications; however, people are already leveraging international standards such as RDF for Linked Data, content sharing models, well-documented licensing models, and existing best practices.  Fully 25% of the applications shipped on a new Apple iPhone use government produced content.

I believe there are significant opportunities for commercial software firms to produce services and products to visualize data sets, find related data sets and, most importantly, provide mechanisms as easy to use as the early Web to publish machine and human readable data as Linked Data.  There is a burgeoning information economy rapidly forming around provision of public and private data mixed together in novel ways.  I believe that in 2011, truly useful tools for Web developers to create compelling Linked Data applications will be available for use with Open Government data.

We should all acknowledge that data will never be 100% perfect.  Real data is dirty, face it.  Yes, concerns will linger about misinterpretation and inappropriate mashups until people gain experience in making informed decisions based on the data presented.  Be patient and don’t expect it to be perfect on day one or even year one.  Allow best practices to emerge from the ground up, by communities of interest.  Issues of data quality, provenance, context and important elements such as units of measure will all be addressed as Linked Data becomes more mainstream.  Harvard Business School published a blueprint for use of open government data.  The W3C provides lots of useful guidance on eGovernment and Linked Data activities.

The early American pilgrims experienced miscalculations in weather and agriculture, but they eventually figured out how to plant seeds correctly and increase their potential for a bountiful harvest.  Through information sharing and discussion by informed citizens, the US evolved a free and democratic form of government that is admired by millions of people around the world.

I’m optimistic that the citizens of the world will leverage Open Government initiatives for positive outcomes.  The more our governments support openness and transparency through Open Government initiatives, the more we, the people, can solve issues that matter at the community-level or on a global level.  The stakes are high and we should be grateful and cooperate to harness the power of Open Government data and the Web.  We are defining our history, as well as our future, today.

Working on Plings

It’s always good to work on projects that aim to make a difference and to contribute something: you could say we look for projects with some substance to them. So, it’s been fun to work with social research company Substance on their Plings project. If you’ll forgive the opening pun, I’ll explain a bit.

The Plings project aims to gather together the best available information about “positive activities” for young people: PLaces to go + thINGS to do = PLINGS. Substance describes Plings as: a search and discovery tool that helps people to find accurate and trusted sources of information about positive activities for teenagers. So, I can look for Plings around Talis’ Birmingham offices, and find out about football coaching, cafes, dance and musical projects: all happening within a set radius of my postcode. It’s a versatile tool, letting the searcher facet their results and customise the display, and it also ties in with social networks (check out the fantastically-named “boredometer”, for example) and devices.

Feeding the Plings site is a dataset comprising two main parts:

  • Data on the actual activities: places to go, things to do
  • Data on feedback relating to the activities: “Plingback”

Substance uses various methods to collect the first dataset, routing it through their own API. This lets them use data from many different formats and shapes: from local authorities, third sector and community groups and the private sector. For Plingbacks, though, Talis has been working with Substance to create an infrastructure that can be used to generate data in RDF, which Talis hosts through its Managed Service. There is more detail about the Plingbacks app on appspot, too.

In short, the Linked Data approach enables Substance to have multiple Plingback widgets that can be presented through multiple channels. Because they all share the same API and data structure, they can use the Talis Platform to query and visualise the data dynamically.

Substance’s Steven Flower also told me a bit about a related project building on the back of Plingbacks and the Talis Platform called Plingalytics: a sort of dashboard enabling local authorities and stakeholders to get a very useful view of the Plings datasets. It will let them answer questions like: “How many Plings do we have on a Friday night?” or simply: “What’s hot? What’s not?”

This ties in with another side of Plings, which works with local authorities to “fulfil their statutory duty to publicise and keep up to date comprehensive and accurate information on positive activities for young people and to make it accessible,” according to Substance’s site.

It’s an exciting project to be working on, and I’m very interested in the way it ties in local government, young people, and activities through a very positive use of the Web. The fact that they’re using Linked Data to back the interrelated data makes a lot of sense, and we’ve been working together for a long time pulling together Linked Data opportunities and matching them with solutions. Alongside looking after the Plingback dataset on the Platform, our consulting team has worked with Substance to model and convert their data to RDF. In addition, and because of the open nature of the data Substance is producing, Plings is able to make use of Talis’ Connected Commons scheme for some of its data: meaning that not only can this information be managed free of charge, but it’s available on an open data licence.

Steven Flower said: “We are very excited about this. From a technical point of view, the opportunity to build this upon Linked Data sets is also interesting. Hence, we have chosen to work with Talis for the infrastructure, knowledge, support and enthusiasm that they bring.

We have had the support of Talis since early days of Plings, so it’s good to continue.”

More information on the Plings project from Substance can be found on their Plings info page.

“Linked Data” at the Guardian

Nodalities Magazine article by Martin Belam.

During October at Guardian News & Media we announced a change in our Open Platform Content API. For the first time, developers and users could query our database of over 1 million content items by using the common external identifiers of a MusicBrainz ID or an ISBN number. It is our first step into the world of ‘Linked Data’.

The Open Platform Content API was launched as a beta in 2009, and earlier this year was launched as a commercial product, allowing partners to re-use Guardian & Observer content in a variety of different ways. There is, for example, a WordPress plugin that easily allows you to include Guardian content in your blog, and developers have built applications like a bespoke recipe search on top of the data. It is a unique proposition amongst news organisations on the web, and as well as the Content API itself, the Open Platform also includes publishing the source data behind Guardian journalism on the Data Store, and providing a search engine for Government datasets from around the world.

Why linked data at Guardian News & Media?

The addition of linked data to the API is the culmination of a great deal of work behind the scenes to get the data prepared, and to work out the right way to make it available. Personally, I had been struck the first time I saw the linked open data cloud diagram that none of the bubbles represented any of the UK’s traditional print news organisations. With our combined centuries of experience sifting, collating, organising and publishing information, it seemed to me that they should in fact be occupying a central position on that map. The principles of linked open data also chime with the over-riding principles we have about our web presence at Guardian.co.uk. We strive to be ‘of the web’, not just on the web. That means reaching out and embracing external services and data, and our intention is to have permanent, predictable URLs for all of our content.

The first challenge to implementing this was to pick stable and reliable external datasets that would form a permanent and meaningful relationship with our content. We decided that a focus on distinct cultural entities would work, and avoided the messiness of trying to decide whether a story was ‘about’ something, or whether it just ‘mentioned’ something. MusicBrainz IDs and ISBN numbers seemed like datatypes we could work with.

The domain model of our content already had a concept of an ‘external reference’ that can be added to a tag or a factbox or an article. We have previously used that to link articles to a page about a specific film, or to link a sports match report to game statistics provided by a third party like Opta. The obvious route was therefore to expose these ‘external references’ in our API.

MusicBrainz IDs

With MusicBrainz IDs, we did not attempt to tag all of our music story archive. There are around 42,000 music content items currently on our site, and to accurately add MusicBrainz IDs to them would be an arduous task. Fortunately, because of our domain model, we had a shortcut to tagging this content. All of the items in our database are given tags. These indicate the type of content (e.g. article, audio, video), the tone of content (e.g. news, comment, review, obituary), the contributor who produced the content, and keywords representing the subject the content is about. In the Music section, we have around 600 of the artists we write about most frequently who exist as keyword tags. The quickest route to adding MusicBrainz data was to add it to these artist keyword tags. The actual job of tagging was achieved via the rather dull mechanism of filling in a Google Docs spreadsheet, although developer Daithi Ó Crualaoich built a tool to help us. He came up with a quick browser-based hack that simultaneously put the same search string across our music tags and across MusicBrainz, and matched the outcome. A script then uploaded this to our database.
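The matching idea can be sketched in a few lines of Python. This is not the tool described above, only an illustration under stated assumptions: the function names, the normalisation rule and the sample rows are all hypothetical, and the only real identifier used is the Shirley Bassey MusicBrainz ID quoted later in this article. The sketch accepts a match only when exactly one candidate name agrees with the tag, and leaves everything else for a person to check.

```python
def normalise(name):
    """Lower-case a name and collapse runs of whitespace (an assumed rule)."""
    return " ".join(name.lower().split())


def match_tags(tag_names, candidates):
    """Pair each tag with an ID only when exactly one candidate name matches."""
    by_name = {}
    for name, mbid in candidates:
        by_name.setdefault(normalise(name), []).append(mbid)

    matched, unresolved = {}, []
    for tag in tag_names:
        ids = by_name.get(normalise(tag), [])
        if len(ids) == 1:
            matched[tag] = ids[0]
        else:
            # No match, or several artists share the name: needs a human.
            unresolved.append(tag)
    return matched, unresolved


# Hypothetical sample rows; the placeholder IDs are not real MusicBrainz IDs.
tags = ["Shirley Bassey", "Example Artist"]
candidates = [
    ("shirley  bassey", "05ec70a5-3858-4346-a649-fda0a297b8c1"),
    ("Example Artist", "placeholder-id-1"),
    ("example artist", "placeholder-id-2"),
]

matched, unresolved = match_tags(tags, candidates)
print(matched)     # {'Shirley Bassey': '05ec70a5-3858-4346-a649-fda0a297b8c1'}
print(unresolved)  # ['Example Artist']
```

Refusing to guess on ambiguous names keeps wrong IDs out of the database, at the cost of a little more manual work in the spreadsheet.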

ISBN numbers

ISBN numbers were another obvious choice for us. The majority of our book reviews on the web feature a ‘fact box’, giving details of the publication and a corresponding link through to our book store to make a purchase. This ‘fact box’ frequently includes the ISBN number of the publication, and so exposing it as a search criterion was not a massive undertaking. Nevertheless, as with our music content, we do not have universal coverage. At the time of launch around 2,500 reviews out of a possible total of 17,000 had ISBNs attached to them. This is part of the production process now, and so all reviews going forward should have the ISBN added.

API query types

The Open Platform supports a range of ways to query this data, and you can find a guide at: http://gu.com/p/2k6ay. Obviously you can query the API looking for a specific reference, so a query for reference=musicbrainz/05ec70a5-3858-4346-a649-fda0a297b8c1 will return content about Shirley Bassey. Additionally, you can get a list of content which has a MusicBrainz or ISBN attached to it, so reference-type=musicbrainz|isbn will give you content from the API which has a MusicBrainz OR an ISBN added to it. Adding the ‘show-references’ parameter will return a block in your API responses that includes MusicBrainz IDs or ISBN numbers for any item within the list. If you’ve not used the Guardian’s API before, you can get a feel for how it works by using our browser based API explorer.
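As a rough illustration, the query types above could be assembled into URLs like this in Python. The parameter names reference, reference-type and show-references come from the text; the base URL, the helper name build_query and the show-references value "all" are assumptions made for the sketch, and a real request may also need an API key, so check the guide before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed search endpoint; confirm against the Open Platform guide.
BASE_URL = "http://content.guardianapis.com/search"


def build_query(**params):
    """Return a search URL; underscores in keyword names become hyphens."""
    query = {name.replace("_", "-"): value for name, value in params.items()}
    # Leave "/" and "|" unescaped so the URL reads as it does in the article.
    return BASE_URL + "?" + urlencode(query, safe="/|")


# Content about one artist, with the reference block included in the response.
# The value "all" for show-references is an assumption.
by_reference = build_query(
    reference="musicbrainz/05ec70a5-3858-4346-a649-fda0a297b8c1",
    show_references="all",
)

# Any content carrying a MusicBrainz ID OR an ISBN.
by_type = build_query(reference_type="musicbrainz|isbn")

print(by_reference)
print(by_type)
```

The sketch only builds the URLs; fetching and parsing the XML or JSON response is left out.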

‘Linked data’ formats

It does seem that as soon as you put the words ‘linked’, ‘open’ and ‘data’ into the same sentence, you automatically invoke a debate about what formats are appropriate to use. At the present time we are making these persistent external IDs available alongside our content items in both XML and JSON formats. And yes, that does mean that we have steered away from RDFa and SPARQL.

From our point of view there is a clear reasoning behind this. We try to work in a lightweight and agile way, and providing the data in this format was the simplest way to meet our immediate requirements. We are trying to concentrate on making more metadata available. If we were to decide to invest in triple-stores and implement a SPARQL endpoint first, then I’d wager that we would still be waiting to dip our toe into the water.

Moreover, it would be wrong to commit our editorial production colleagues to tagging up all our content with this extra layer of semantic data, if we can’t show the benefits. It is my hope that by incrementally releasing extra layers of linked data through our API, in a simple way, we can see what works and what doesn’t, and what types of data interest people and inspire them to develop applications using the data.

As I’ve personally argued before, particularly in response to Tom Coates’ recent call for “Death to the Semantic Web”, I’m entirely agnostic about formats myself. What I think is most important is that we provide consistent, RESTful, predictable, persistent hooks into Guardian.co.uk content, in as many ways as possible, with the right licence for re-use.

What next?

We are now evaluating where else we can add value to our API with joins to external datasets. Again we will aim to be pragmatic: tagging the greatest amount of data with the least amount of effort. And we also want to listen to the linked data community: what are the data joins that would be most useful to external developers?

Martin Belam is an information architect at the Guardian newspaper.

Linked Data – Coming Together

To quote John ‘Hannibal’ Smith, from that wonderful bit of 1980s TV, “I love it when a plan comes together!”.   Of course aficionados of the A-Team will probably remember ‘the plan’ was often only apparent in retrospect, although its general intention was clear from the start.

The adoption of Linked Data, and the realisation of all that potential benefit, is looking a bit like an A-Team episode – the eventual outcome being clear from the start, but with many setbacks, skirmishes to fight, partners to woo, nerves to calm, and teams to lead on the way.

To break the metaphor at this point, I see Linked Data as more of a shared vision than a plan laid out before us.  Nevertheless, I think we are beginning to see elements of it ‘starting to come together’.

One very obvious example is what Ordnance Survey is doing by continuing to open up their location data.  Now that OS have defined a URI for every UK postcode unit [eg. ‘SO16 4GU’ = http://data.ordnancesurvey.co.uk/id/postcodeunit/SO164GU], why would anyone [re-]publishing data in the future not use these identifiers to reference their postcode information?  By that simple step they will be linked in with a wealth of ancillary information about the location – easting/northing, ward, district, county, country, etc.
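To show how small that step is, here is a minimal Python sketch that turns a postcode into an identifier of the form given in the example above. The function name is hypothetical, and the rule (strip whitespace, upper-case, append to the base URI) is inferred from that single example rather than taken from OS documentation.

```python
# Base URI taken from the 'SO16 4GU' example in the text.
OS_POSTCODE_BASE = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"


def postcode_uri(postcode: str) -> str:
    """Build a postcode-unit URI by removing whitespace and upper-casing."""
    return OS_POSTCODE_BASE + "".join(postcode.split()).upper()


print(postcode_uri("SO16 4GU"))
# http://data.ordnancesurvey.co.uk/id/postcodeunit/SO164GU
```

A publisher could run something like this over an existing postcode column and emit the URI alongside, or instead of, the plain string.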

Great, I hear you say, but show me an example of what that could lead to!  Being lazy, I’ll let the inimitable John Goodwin of the OS do it for me.  In his recent, appropriately named “So what can I do with the new Ordnance Survey Linked Data?” post, he shows how, by merging in data from a previous Talis project produced for the Department for Business, Innovation and Skills (BIS), he can deliver a very different way of accessing the same data.

The BIS Research Funding Explorer project took data about UK Government research funding, from several research councils and the Intellectual Property Office, and brought it together in a Linked Data driven application to display UK centres of research excellence.

John explains how, by mixing Linked Data published for that project with OS Linked Data, he has been able to develop a different way of accessing the data.  In his prototype application you are presented with a map of the UK showing the regions as defined by the European Union.  By clicking on one of the EU regions you are presented with a list of the projects from within that area.  He has also added the ability to access by county or District/Unitary Authority.  A simple, but effective, way of demonstrating that data, in Linked Data form, from one source can be easily combined with data from another source to deliver benefit.
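The principle behind that kind of mash-up can be sketched in Python: two datasets that share nothing except postcode URIs are joined on those URIs. This is not John's application or either real dataset. Every record, region name and function name below is a hypothetical placeholder, and a real implementation would more likely query the published Linked Data than use in-memory dictionaries.

```python
from collections import defaultdict

PC = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"

# Hypothetical funding records, each pointing at a postcode URI.
projects = [
    {"title": "Project A", "postcode": PC + "SO164GU"},
    {"title": "Project B", "postcode": PC + "EXAMPLE1"},
    {"title": "Project C", "postcode": PC + "SO164GU"},
]

# Hypothetical location data keyed by the same URIs.
locations = {
    PC + "SO164GU": {"region": "Region X"},
    PC + "EXAMPLE1": {"region": "Region Y"},
}


def projects_by_region(projects, locations):
    """Group project titles by region, joining on the shared postcode URI."""
    grouped = defaultdict(list)
    for project in projects:
        place = locations.get(project["postcode"])
        if place is not None:  # skip projects with no known location
            grouped[place["region"]].append(project["title"])
    return dict(grouped)


print(projects_by_region(projects, locations))
# {'Region X': ['Project A', 'Project C'], 'Region Y': ['Project B']}
```

Neither dataset needed to know about the other in advance; agreeing on the identifier was enough to make the join trivial.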

Of course even with this example we are seeing the effect of joining just a couple of jigsaw pieces together.  With Linked Data, such as this from OS, being published at an ever increasing rate, it will not be long before a bigger picture starts to form as more and more data pieces are linked together.

I love it when you can see a plan coming together!