Nodalities

From Semantic Web to Web of Data
License

Creative Commons License

Linked Data and Libraries 2011 – July 14th

After the great success of Linked Data and Libraries 2010 we are doing it again!

Linked Data and Libraries 2011 will be held at The British Library in London on Thursday July 14th.  Again it will be a free event, with limited spaces allocated, so register early.

The agenda is yet to be finalised, but as per 2010 it will be a mixture of general Linked Data overviews & experience, and library Linked Data speakers.  We hope to hear from the British Library, W3C Library Linked Data Incubator Group, LOD-LAM Summit, and others. We are also hoping to find time for the 10-minute lightning talks slot that worked so well last time.

Register early and/or if you would like to propose a topic or speaker, email me – richard.wallis@talis.com.

Image from a photo on Flickr by Fuzzyyol

Linked Spending Data – How and Why Bother Pt3

As is often the way, events have conspired to prevent me from producing this third and final part in this How & Why of Local Government Spending Data as soon as I wanted.  So my apologies to those eagerly awaiting this latest instalment.

To quickly recap, in Part 1 I addressed issues around why pick on spending data as a start point for Linked Data in Local Government, and indeed why go for Linked Data at all.  In Part 2, I used some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area, as examples to demonstrate how you can publish spending data as Linked Data, for both human and programmatic consumption.

I am presuming that you are still with me on my basic assumptions “…publishing this [local government spending] data is a good thing” and “Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing”, and that the technique of using URIs to name things in a globally unique way (that also provides a link to more information) is not giving you mental indigestion.  So, I now want to move on to some of the issues that are causing debate in the community, which come under the headings of ontologies and identifiers.

Ontologies

An ontology, according to Wikipedia, is a formal representation of knowledge as a set of concepts within a domain  -  an ontology provides a shared vocabulary, which can be used to model a domain – that is, the type of objects and/or concepts that exist, and their properties and relations.  So in our quest to publish spending data what ontology should we use?  The Payments Ontology, with the accompanying guide to its application, is what is needed.  Using it, it becomes possible to describe individual payments, or expenditure lines, and their relationships to the authority (payment:payer), the supplier (payment:payee), the category (payment:expenditureCategory), etc.  The next question is how do you identify the things that you are relating together using this ontology?

Let’s take this one step at a time:

  1. Give the expenditure line, or individual payment, an identifier possibly generated by our accounts system. eg. 8605670.
  2. Make that identifier unique to our local authority by prefixing it with our internet domain name. eg. http://spending.lichfielddc.gov.uk/spend/8605670 – note the prefix of ‘http://’.  This enables anyone wanting detail about this item to follow the link to our site to get the information.
  3. Associate a payer with the payment with an RDF statement (or triple) using the Payments Ontology:
    http://spending.lichfielddc.gov.uk/spend/8605670 
    payment:payer
    http://statistics.data.gov.uk/id/local-authority/41UD .

    Note I am using an identifier for the payer that is published by statistics.data.gov.uk.  That is so that everyone else will unambiguously understand which authority is the one responsible for the payment.

  4. Follow the same approach for associating the payee:
    http://spending.lichfielddc.gov.uk/spend/8605670 
    payment:payee
    http://spending.lichfielddc.gov.uk/supplier/bristow-sutor .
  5. And then repeat the process for categorisation, payment value etc.
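
As a rough illustration, the steps above could be scripted along the following lines. This is only a sketch, not Lichfield's actual code: the namespace URI used for the `payment:` prefix and the helper names (`triple`, `describe_expenditure`) are assumptions, and the output is plain N-Triples text.

```python
# Assumed namespace for the Payments Ontology; check the published ontology before relying on it.
PAYMENT = "http://reference.data.gov.uk/def/payment#"

# The base URIs come from the examples above; the constant names are hypothetical.
AUTHORITY_BASE = "http://spending.lichfielddc.gov.uk"
PAYER = "http://statistics.data.gov.uk/id/local-authority/41UD"


def triple(subject, predicate, obj):
    """Format one statement whose three parts are all URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe_expenditure(line_id, supplier_slug):
    """Steps 1 to 4: mint a URI for the line, then relate it to its payer and payee."""
    line_uri = f"{AUTHORITY_BASE}/spend/{line_id}"
    payee_uri = f"{AUTHORITY_BASE}/supplier/{supplier_slug}"
    return [
        triple(line_uri, PAYMENT + "payer", PAYER),
        triple(line_uri, PAYMENT + "payee", payee_uri),
    ]


for statement in describe_expenditure("8605670", "bristow-sutor"):
    print(statement)
```

Step 5 would add further `triple(...)` calls for the category and value, although a literal amount needs a different serialisation from a URI.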

This immediately throws up a couple of questions, such as why use a locally defined identifier for the payee.  Surely there is an identifier I can use that others will recognise, such as a company or VAT number?  There are, but at the moment there are no established sets of URI identifiers for these.  OpenCorporates.com are doing some excellent work in this area, but Companies House, the logical choice for publishing such identifiers, have yet to do so.  Pragmatically it is probably a good idea to have a local identifier anyway and then associate it with another publicly recognised identifier:
http://spending.lichfielddc.gov.uk/supplier/bristow-sutor
owl:sameAs
http://opencorporates.com/companies/uk/01431688 .

Identifiers

Because this is all very new and still emerging, we now find ourselves in a bit of a chicken-or-egg situation.   I presume that most authorities have not built a mini spending website, like Lichfield District Council has, to serve up details when someone follows a link like this: http://spending.lichfielddc.gov.uk/spend/8605670 

You could still use such an identifier using your authority domain, and plan to back it up later with a web service that provides more information.  Or you could let someone else, who takes a copy of your raw data, do it for you as OpenlyLocal might: http://openlylocal.com/financial_transactions/135/2010/33854 or maybe how the project we are working on with LGID might: http://id.spending.esd.org.uk/Payment/36UF/ds00024616.  In the open, flexible world of Linked Data it doesn’t matter too much which domain an identifier is published from, or for that matter how many [related] identifiers are used for the same thing.

It does matter, however, for those looking to the identifying URI for some idea of authority.  As I say above, technically it doesn’t matter whose domain the identifier comes from, but I believe it would be better overall if it came from the authority whose payment it is identifying.  Which puts us back in the chicken-or-egg situation as to resolving the URI to serve up more information.   The joy of Linked Data is that, provided aggregators consider the possibility of being able to identify source authorities’ data accurately when they encode it, it should be possible to automatically retrofit links between URIs at a later date.
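
To make the retrofitting idea a little more concrete, here is a small sketch. Everything in it is illustrative: the aggregator URIs on example.org, the record layout, the authority code 99ZZ and the function name are invented, and it assumes the aggregator kept the source authority code and the authority's own transaction number when it encoded the data.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical aggregator records: each keeps the source authority code and
# the authority's own transaction identifier alongside the aggregator's URI.
aggregated = [
    {"uri": "http://aggregator.example.org/payment/1", "authority": "41UD", "source_id": "8605670"},
    {"uri": "http://aggregator.example.org/payment/2", "authority": "99ZZ", "source_id": "123"},
]

# URI patterns for authorities that have since started hosting their own identifiers.
authority_patterns = {"41UD": "http://spending.lichfielddc.gov.uk/spend/{source_id}"}


def retrofit_links(records, patterns):
    """Emit owl:sameAs statements for records whose authority now has its own URIs."""
    links = []
    for record in records:
        pattern = patterns.get(record["authority"])
        if pattern is None:
            continue  # no authority-hosted URI yet, so nothing to link to
        target = pattern.format(source_id=record["source_id"])
        links.append(f"<{record['uri']}> <{OWL_SAME_AS}> <{target}> .")
    return links


print("\n".join(retrofit_links(aggregated, authority_patterns)))
```

The second record is simply skipped until its authority publishes identifiers of its own, at which point rerunning the function picks it up.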

In summary, over this series of posts we are seeing a technology which, although it has obvious benefits, is still early on the development curve, being applied to a process which is also new and scary for many.  An ideal breeding ground for cries of pain, assertions of ‘it doesn’t work’ or ‘not worth bothering’, yet with the potential to provide a powerful foundation for a future data-rich environment that is open, accessible, and beneficial to authorities, government, citizens, and UK Plc.  Yes it is worth bothering, just don’t expect benefits on day, or even month, one.

Linked Spending Data – How and Why Bother Pt2

I started the previous post in this mini-series with an assumption: “…working on the assumption that publishing this [local government spending] data is a good thing”. That post attracted several comments, fortunately none challenging the assumption.   So learning from that experience I am going to start with another assumption in this post.  Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing.  Those new to this mini-series, check back to the previous post for my reasoning behind the assertion.

In this post I am going to be concentrating more on the How than the Why Bother.

To help with this I am going to use some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area as examples.  Take a look at the spending data part of their site: spending.lichfielddc.gov.uk/.   On the surface, navigating your way around the site looking at council spend by type, subject, month, and supplier is the kind of experience a user would expect. Great for a website displaying information about a single council. 

However, it is more than a web site.  Inspection of the Download data tab shows that you can get your hands on the source data in csv format.  Here is one line, representing a line of expenditure, from that data:

"http://statistics.data.gov.uk/id/local-authority/41UD","Lichfield District Council","2010-04-06","7747","http://spending.lichfielddc.gov.uk/spend/8605670","120.00","BRISTOW & SUTOR","401","Revenue Collection","Supplies & Services","Bailiff Fees",""

… which represents the data displayed on this human readable page:

Lichfield District Council Spending Data - Details of payment number 8605670
Looking through the csv, you can pick out the strings of characters for information such as the date, supplier name, department name etc.  In addition you can pick out a couple of URIs:

In the context of csv, that’s all these URIs are, identifiers.  However because they are http URIs you can click through to the address to get more information.  If you do that with your web browser you get a human readable representation of the data.  These sites also provide access to the same data, formatted in RDF, for use by developers.
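
For developers, picking those values out of the csv takes only a few lines. The sketch below parses the sample line shown earlier; the column names are my guesses for illustration only (the download's real header row may differ), and the meaning of the "401" field in particular is an assumption.

```python
import csv
import io

# The sample line from the download. Column names below are guesses, not the site's header row.
SAMPLE = (
    '"http://statistics.data.gov.uk/id/local-authority/41UD","Lichfield District Council",'
    '"2010-04-06","7747","http://spending.lichfielddc.gov.uk/spend/8605670","120.00",'
    '"BRISTOW & SUTOR","401","Revenue Collection","Supplies & Services","Bailiff Fees",""'
)

COLUMNS = ["body", "body_name", "date", "invoice", "line_uri", "amount",
           "supplier", "code", "department", "spend_type", "subjective", "extra"]

row = next(csv.reader(io.StringIO(SAMPLE)))
record = dict(zip(COLUMNS, row))

# Any value that starts with http:// is an identifier that can also be looked up.
uris = [value for value in row if value.startswith("http://")]

print(record["date"], record["supplier"], record["amount"])
print(uris)
```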

You can see that data by adding ‘.rdf’ to the end of the address, thus: http://spending.lichfielddc.gov.uk/spend/8605670.rdf and then selecting the ‘view source’ option of your browser for the page of gobbledegook that you get back.  

Inspecting the RDF, you will see that most things, except descriptive labels and financial values, are now identified as URIs such as http://spending.lichfielddc.gov.uk/subjective/bailiff-fees and http://spending.lichfielddc.gov.uk/invoice/7747.  Again if you follow those links, you will get a human readable representation of that resource, and the RDF behind it by adding a ‘.rdf’ suffix.
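
A developer would do the same thing in code rather than through ‘view source’. The following sketch uses only the Python standard library; the function names are mine, and the fetch relies on the ‘.rdf’ suffix convention described above and on the site being reachable, so only the address construction is actually run here.

```python
import urllib.request


def rdf_address(uri):
    """Return the address of the RDF document for a resource URI, per the '.rdf' convention."""
    return uri.rstrip("/") + ".rdf"


def fetch_rdf(uri, timeout=10):
    """Download the RDF/XML text; needs network access and the site to still be online."""
    with urllib.request.urlopen(rdf_address(uri), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(rdf_address("http://spending.lichfielddc.gov.uk/spend/8605670"))
```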

The eagle-eyed, inspecting the RDF-XML for Lichfield payment number 8605670, will have noticed a couple of things.  Firstly, a liberal sprinkling of elements with names like payment:expenditureCategory or payment:payment. These come from the Payments Ontology as published on data.gov.uk as the recommended way of encoding spending, and other payment associated data, in RDF.

Secondly, you may have spotted that there is no date, or supplier name or identifier.  That is because those pieces of information are attributes associated with a payment – invoice number 7747 in this case.

Zooming out from the data for a moment, and looking at the human readable form, you will see that most things, like spend type, invoice number, supplier name, are clickable links, which take you through to relevant information about those things – address details & payments for a supplier, all payments for a category etc.  This intuitive natural navigation style often comes as a positive consequence of thinking about data as a set of linked resources instead of the traditional rows & columns that we are used to.  Another great example of this effect can be found on a site such as the BBC Wildlife Finder.  That is not to say that you could not have created such a site without even considering Linked Data, of course you could.  However, data modelled as a set of linked resources almost self-describes the ideal navigation paths for a user interface to display it to a human.

The Linked Data practice of modelling data, such as spending data, as a set of linked resources and identifying those resources with URIs [which if looked up will provide information about that resource] is equally applicable to those outside of an individual authority.  By being able to consume that data, whilst understanding the relationships within it and having confidence in the authority and persistence of the identifiers within it, a developer can approach the task of aggregating, comparing, and using that data in their applications more easily.

So, how do I (as a local authority) get my data from its raw flat csv format into RDF with suitable URIs, and produce a site like Lichfield’s?  The simple answer is that you may not have to – others may help you do some, if not all, of it.   With help from organisations such as esd-toolkit, OpenlyLocal, SpotlightOnSpend, and with projects such as the xSpend project we are working on with LGID, many of the conversion [from csv], data formatting, and aggregation processes are being addressed – maybe not as quickly or completely as we would like, but they are.  As to a human readable web view of your data, you may be able to copy Stuart by taking up the offer of a free Talis Platform Store and then running your own web server with his code, which he hopes to share as open source.  Alternatively it might be worth waiting for others to aggregate your data and provide a way for your citizens to view your data.

As easy as that then! – Well not quite, there are some issues about URI naming and creation, and how you bring the data together that still do need addressing by those engaged in this.  But that is for Part 3….

Linked Spending Data – How and Why Bother Pt1

National Government instructing the 300+ UK Local Authorities to publish “New items of local government spending over £500 to be published on a council-by-council basis from January 2011” has had the proponents of both open, and closed, data excited over the last few months.  For this mini series of posts I am working on the assumption that publishing this data is a good thing, because I want to move on and assert that [when publishing] one format/method to make this data available should be Linked Data.

This immediately brings me to the Why Bother? bit. This itself breaks into two connected questions – Why bother publishing any local authority data as Linked Data? and Why bother using the unexciting, simplistic spending data as a place to start? 

I believe that spending data is a great place to start, both for publishing local government data and for making such data linked.  Someone at national level was quite astute choosing spending as a starting point.  To comply with the instruction all an authority has to do is produce a file containing five basic elements for each payment transaction: An Id, a date, a category,  a payee, and an amount.  At a very basic level it is very easy to measure if an authority has done that or not.
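
Measuring that really can be close to mechanical. As a sketch only, the check below flags rows that lack one of the five basic elements; the column names and the sample rows (including the supplier ACME LTD and the second transaction id) are invented for illustration, since real files use their own headers.

```python
import csv
import io

# Hypothetical column names for the five basic elements; real files will differ.
REQUIRED = ["id", "date", "category", "payee", "amount"]


def missing_elements(csv_text):
    """Return (row_number, missing_columns) for every row lacking a basic element."""
    problems = []
    for number, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        missing = [name for name in REQUIRED if not (row.get(name) or "").strip()]
        if missing:
            problems.append((number, missing))
    return problems


sample = (
    "id,date,category,payee,amount\n"
    "8605670,2010-04-06,Bailiff Fees,BRISTOW & SUTOR,120.00\n"
    "8605671,2010-04-06,,ACME LTD,\n"
)
print(missing_elements(sample))
```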

Guidance from data.gov.uk expands on this a little by mandating the following:

  Body: This should be the URI that represents (or more properly ‘identifies’ – see below) the local authority at statistics.data.gov.uk, eg. http://statistics.data.gov.uk/id/local-authority-district/00CN
  Date: Should ideally be the payment date as recorded in purchase or general ledger
  Transaction number: To identify within authority’s system, for future reference
  Amount: In Sterling recorded in finance system
  Supplier Details: Name and individual authority id for supplier plus where possible Companies House, Charity Registration, or other recognised identifier
  Expense Area: The part of the authority that spent the amount
  Service Categorization: Depending on the accounts system this may be easy or quite difficult. There are two candidates for categorization – CIPFA’s BVACOP classification and the Proclass procurement classification system.

… a little more onerous, possibly around the areas of identifying company numbers and Service Categorization, but not much room for discussion/interpretation.

As to the file formats to publish data, the same advice mandates: “The files are to be published in CSV file format”, supplemented by: “Authorities may wish to publish the data in additional formats as well as the CSV files (e.g. linked data, XML, or PDFs for casual browsers). There is no reason why they should not do this, but this is not a substitute for the CSV files.”

So fairly clear, and measurable, then. You either have published your required basic elements of data in a CSV format file, or you have not.  Couple this with the political ambitions and drive behind the Government’s Transparency Agenda, and local authorities will have difficulty in not delivering this.  Although some are being a bit tardy and others seem reluctant to publish in formats other than pdf.

OK so why bother with applying Linked Data techniques to this [boring] spending data?  Well, precisely because it is simple data, it is comparatively easy to do, and because everybody is publishing this data the benefits of linking should soon become apparent.   Linked Data is all about identifying things and concepts, giving them globally addressable identifiers (URIs) and then describing the relationships between them.  

For those new to Linked Data, the use of URIs as identifiers often causes confusion.   A URI, such as  http://statistics.data.gov.uk/id/local-authority-district/00CN, is a string of characters that is as much an identifier as the payroll number on your pay cheque, or a barcode on a can of beans.  It has a couple of attributes that make it different from traditional identifiers.  Firstly, the first part of it is created from the Internet domain name of the organisation that publishes the identifier.  This means that it can be globally unique. Theoretically you could have the same payroll number as the barcode number on my can of beans – adding the domain avoids any possibility of confusion.  Secondly, because the domain is prefixed by http:// it gives the publisher the ability to provide information about the thing identified, using well established web technologies.  In this particular example, http://statistics.data.gov.uk/id/local-authority-district/00CN is the identifier for Birmingham City Council; if you click on it [using it as an internet address] data.gov.uk will supply you with information about it – name, location, type of authority etc.
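
The two attributes are easy to see if you pull the example URI apart. This small sketch uses Python's standard `urllib.parse`; the variable names are just for illustration.

```python
from urllib.parse import urlparse

uri = "http://statistics.data.gov.uk/id/local-authority-district/00CN"
parts = urlparse(uri)

scheme = parts.scheme                      # the http prefix that makes the identifier resolvable
domain = parts.netloc                      # the publisher's domain, which makes it globally unique
local_id = parts.path.rsplit("/", 1)[-1]   # the local part, comparable to a payroll number

print(scheme, domain, local_id)
```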

Following this approach, creating URI identifiers for suppliers, categories, and individual payments and defining the relationships between them using the Payments Ontology (more on this when I come on to the How)  leads to a Linked Data representation of the data.  In technical terms a comparatively easy step using scripts etc.

By publishing Linked Spending Data and loading it into a Linked Data store, as Lichfield DC have done, it becomes possible to query it, to identify things like all payments for a supplier, or suppliers for a category, etc.
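
A real store would be queried with SPARQL, but the shape of such a query can be sketched with a toy in-memory list of triples. In this sketch the `payment:` namespace URI is an assumption, the `match` helper is hypothetical, and only the first expenditure line and the supplier come from the Lichfield examples; the other rows are invented.

```python
PAYMENT = "http://reference.data.gov.uk/def/payment#"  # assumed namespace
BASE = "http://spending.lichfielddc.gov.uk"

# A toy stand-in for a Linked Data store: (subject, predicate, object) tuples.
store = [
    (BASE + "/spend/8605670", PAYMENT + "payee", BASE + "/supplier/bristow-sutor"),
    (BASE + "/spend/0000001", PAYMENT + "payee", BASE + "/supplier/example-ltd"),
    (BASE + "/spend/0000002", PAYMENT + "payee", BASE + "/supplier/bristow-sutor"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard, like a SPARQL variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o)
    ]


supplier = BASE + "/supplier/bristow-sutor"
payments = [subject for subject, _, _ in match(store, p=PAYMENT + "payee", o=supplier)]
print(payments)
```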

If you then load data for several authorities in to an aggregate store, as we are doing in partnership with LGID, those queries can identify patterns or comparisons across authorities.  Which brings me to ….

Why bother publishing any local authority data as Linked Data?  Publishing as Linked Data enables an authority’s data to be meshed with data from other authorities and other sources such as national government.  For example, the data held at statistics.data.gov.uk includes which county an authority is located within.  By using that data as part of a query, it would for instance be possible to identify the total spend, by category, for all authorities in a county such as the West Midlands.  
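
The following sketch shows the shape of that kind of cross-authority query, using invented data throughout. The authority and county names, categories and amounts are placeholders; in practice the authority-to-county mapping would come from statistics.data.gov.uk and the payments from each authority's published data, all joined on shared URIs.

```python
from collections import defaultdict

# Invented sample data standing in for two linked sources.
county_of = {"authority-a": "county-x", "authority-b": "county-x", "authority-c": "county-y"}

payments = [
    ("authority-a", "Bailiff Fees", 120.00),
    ("authority-b", "Bailiff Fees", 80.00),
    ("authority-b", "Stationery", 15.50),
    ("authority-c", "Bailiff Fees", 999.00),
]


def spend_by_category(county):
    """Total spend per category across every authority located in the given county."""
    totals = defaultdict(float)
    for authority, category, amount in payments:
        if county_of.get(authority) == county:
            totals[category] += amount
    return dict(totals)


print(spend_by_category("county-x"))
```

The join only works because both data sets refer to each authority by the same identifier, which is the point of sharing URIs.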

As more authority data sets are published, sharing the same identifiers for authority, category, etc., they will naturally link together, enabling the natural navigation of the information between council departments, services, costs, suppliers, etc.  Once this step has been taken and the dust settles a bit, this foundation of linked data should become an open data platform for innovative development and the publishing of other data that will link in with this basic but important financial data.

There are however some more technical issues, URI naming, aggregation, etc.,  to be overcome or at least addressed in the short term to get us to that foundation.  I will cover these in part 2 of this series.

Challenges and Opportunities for Linked Data

Yesterday I gave a short talk at Online Information 2010 titled “Challenges and Opportunities for Linked Data” (abstract). The presentation highlighted what I saw as the main challenges that face us as we grow the web of data, and highlighted some opportunities for organisations that want to get involved.

I believe there will be video from the various presentations online at some point, but wanted to post a transcript of what I said (or had planned to say!). The slides are up on slideshare if you’re interested, although they’re largely just transitions to highlight my main themes.

Introduction

2010 has certainly been the year of Linked Data. I’ve been working with RDF and Semantic web technologies for about 10 years now, and it’s clear that the last 12 months have been one of the critical growth points for Linked Data and the semantic web as a whole. There has been more debate, engagement, and publication of data than ever before.

This is in no small part due to the fantastic work that has taken place at data.gov.uk. The project has not only championed the approach but also led the way as an exemplar for how to do this stuff really well. The adoption of RDFa by Facebook, Google and others has also created a much needed feedback loop that is driving the publication of more structured data.

But as the technology grows we’re starting to experience growing pains which are presenting challenges for further growth and adoption. I think we’re also getting a sense of the opportunities that may arise from the web of data. I picked out three key challenges to review in the presentation.

Craft

The first of these relates to what I’d call “the craft” of Linked Data. To date the growth of the Linked Data cloud has largely been driven by skilled artisans — from academia and a small number of commercial organisations — who know how to work with the technology, how to use and manipulate the data that is already available, and how to get things online and linked together in a way that achieves the 5 star approach.

To scale beyond the initial Linked Data community we need to move from an artisan-led approach and enable “journeyman” developers to achieve the same things. There are several facets to this skills transfer.

Tooling is clearly one important area. It’s a truism that Linked Data tools aren’t as polished as they might be. After all it’s still a relatively new technology area. The majority of Linked Data artisans have been happy enough either to make their own tools or to work with a disparate selection of tools to get the job done. But there is still a lot more work to do in creating a more integrated toolkit that journeyman developers can reach into to help them quickly and easily publish data.

To be fair though, I think we’ve needed these past few years of publishing and experimentation to really highlight what those basic tools might be.

The other aspect of craft is education and training. There’s still a relatively small community with deep skills in this area, so thought has to be given to how those skills can be passed on more widely. Having helped train and advise a number of teams and organisations over the past few years, most recently as part of our consulting work at Talis, it’s clear that there’s a journey or apprenticeship that many teams and organisations undertake as they begin to experiment and gain experience with the technology.

Within the Linked Data community we need to prioritise the work on these tools and services to make it easier for others. We also need to devote additional work to help nurture or define more standard vocabularies for publishing specific types of data. In my opinion this is the really challenging work: it’s not as fun or exciting as publishing the next new dataset or exemplar, but it’s absolutely necessary to push things to the next level. It’s going to take real commitment from all of us.

In my mind there is no better way to help pass on the skills of the initial artisan community than by encoding that knowledge in the form of tools, vocabularies, best practices and design patterns.

Fuelling Applications

Linked Data isn’t being used as much as it could or should be. Why is this?

I think there are two reasons. The first relates to my previous point about enabling the “journeyman” developer. Right now it takes a certain amount of skill to get the most from Linked Data and SPARQL. This presents a road-block for developers who may be interested in using some of the available data. It may even stop them looking at all.

To solve this we must be ready to meet people half-way. Publish simple JSON formats alongside the RDF. Use the Linked Data API created for data.gov.uk to provide simple RESTful APIs into your RDF data. Choice opens up more integration opportunities as well as encouraging engagement. The power of SPARQL and other tools is fantastic, but that power is not needed by every developer in every application. Be inclusive when opening up data.
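
As a rough idea of what meeting people half-way might look like, here is a sketch that flattens the statements about one resource into a plain JSON object. It is not the Linked Data API mentioned above, just an illustration: the `payment:` namespace URI is an assumption and `to_simple_json` is a hypothetical helper.

```python
import json

PAYMENT = "http://reference.data.gov.uk/def/payment#"  # assumed namespace
line = "http://spending.lichfielddc.gov.uk/spend/8605670"

triples = [
    (line, PAYMENT + "payer", "http://statistics.data.gov.uk/id/local-authority/41UD"),
    (line, PAYMENT + "payee", "http://spending.lichfielddc.gov.uk/supplier/bristow-sutor"),
]


def to_simple_json(subject, statements):
    """Flatten one resource's statements into a plain object keyed by short property names."""
    doc = {"id": subject}
    for s, p, o in statements:
        if s == subject:
            doc[p.rsplit("#", 1)[-1]] = o
    return json.dumps(doc, indent=2)


print(to_simple_json(line, triples))
```

A developer who has never seen RDF can consume the result with any JSON library, while the URIs inside it still lead back to the full linked data.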

A potentially larger issue is that much of the data available as Linked Data is either static, irregularly updated, or already available in other more accessible formats and APIs. This isn’t true across the cloud as a whole, but timeliness is an issue in many areas. It’s a consequence of the early boot-strapping process which emphasised conversions of available data dumps, and the wrapping of existing APIs and services. As a boot-strapping process that has been fantastic. But it’s not driving engagement: why use data if you can get it somewhere else easier, and in a more up to date form, using tools that you’re already familiar with?

I also think that this contributes to the reason why it has been difficult to show the power of Linked Data: many of the demonstration apps could easily have been built with other APIs. I think this could be on the cusp of changing as there is now a critical mass of information available to do some powerful queries, and an increasing amount of data is now becoming primarily available as Linked Data.

The challenge we face is changing the nature of the Linked Data cloud from what is a largely static and slow moving environment to one that is much more lively and real-time.

Sustainability

The third challenge I highlighted was sustainability. It’s easy to look at the Linked Data diagram and think: “Well, those bits are done, all we need to do is look how to grow the diagram. We just need to add more data”. I think that’s a natural but unfortunately misleading viewpoint: we need to look carefully at our foundations.

Not all of these sources are on infrastructure that could support real, high volume usage. And few of the datasets are clearly licensed. I’ve personally encountered a number of occasions where some significant datasets are offline or unavailable. So we need to be realistic about whether people can build a stable, commercial application against the web of data as it exists today.

Again to solve this, we need an increasing number of primary sources, making high quality data available on a regular and timely basis, backed by the ability or commitment to deliver those services at the scale we will all eventually require.

In reality this challenge isn’t unique to Linked Data. It’s largely true of the web as a whole; after all not every web site or application is intended to scale to high volume usage. But we’re now talking about a potentially much deeper integration between different applications. We can see the same issues occurring around APIs and data access in general. In recent months there have been a number of stories of developers scrabbling to adapt as APIs get changed, taken down, restricted or re-licensed, leaving them high and dry.

To me the beauty of Linked Data, and RDF specifically, in this regard is that it is so much more portable than any other format. This means that we can easily replicate data to share the load of providing access. With Linked Data we have the option of federating or sharing data across the web. (One of the reasons we started the Talis Connected Commons scheme was to help create sustainability around Public Domain datasets.)

The portability of RDF also makes it easier for a range of organisations to offer scaleable value-added services over the same datasets. For the first time we can decouple the curation of data from the delivery of services over that data.

So those are my three challenges. I think these are largely point in time issues, but we’re going to have to work at them to move forward.

What about the opportunities?

Become a Hub

One of the interesting properties of the Linked Data cloud diagram is how it clearly illustrates the emergence of a number of hubs, like DBpedia, that form the focal points for links from a number of different datasets. If you look closely you can also see that there are emerging hubs within specific subject domains.

I wonder whether the hubs that we see today will continue to play such a key role as the web of data evolves? My feeling is that in a few years time the picture and connectivity is going to be quite different. Particularly if we continue to see engagement from government and other sectors.

There is clearly an opportunity here for organisations who are already key enablers within a particular sector to become a linking hub on the web of data.

If you poke around in any industry, it’s not hard to find organisations who act as the “switchboard” for that particular sector, either because they manage some key identifiers for the sector as a whole, or because their identifiers and systems have become de facto standards for achieving interoperability. It would be a natural step for those organisations to carry that role forward to the web of data, retaining that key position.

Clearly not everyone can be a significant hub like DBpedia. But every organisation can act as a hub for its community of customers, partners and users.

The reasons and benefits for doing this are well documented: opening up data can drive new business, innovation, and traffic. Success on the web involves giving your organisation the greatest possible surface area and points of attachment. Linked Data is an excellent way to achieve this as it emphasises the right forms of web integration.

Turn Identifiers into Channels

Linked Data requires you to assign URLs to identify things: people, places, events, whatever. Generally we tend to focus on how that is an important step to publishing data: concentrating on the mechanics of what makes a good, stable identifier and highlighting how this becomes a key way for other people to find your data.

What this misses is that those identifiers can also become channels, or hooks, for your organisation to find other people’s data. Once you have published Linked Data and it becomes linked to by other datasets all of that external data annotates and enriches your own, providing valuable and useful context. Linking data creates network effects, and everyone in the network benefits. That includes you.

The external data is easily accessible through link discovery so it becomes much easier to find, aggregate and analyse it for a variety of purposes. That might be to drive new product features, or to simply power business intelligence and analysis within the enterprise.

I tend to think of it as being able to fish the web of data for useful context. Your URIs are the hooks. Your data is the bait.
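
In code terms, the "fishing" amounts to looking for other people's statements that point at your URIs. The sketch below is purely illustrative: the third-party statements on example.org are invented, the Lichfield domain merely stands in for "your" URI space, and `inbound` is a hypothetical helper.

```python
MY_PREFIX = "http://spending.lichfielddc.gov.uk/"  # stands in for "your" URI space
REFERENCES = "http://purl.org/dc/terms/references"  # Dublin Core terms property

# Invented third-party statements harvested from elsewhere on the web of data.
external = [
    ("http://other.example.org/report/7", REFERENCES, MY_PREFIX + "supplier/bristow-sutor"),
    ("http://other.example.org/report/8", REFERENCES, "http://unrelated.example.com/thing"),
]


def inbound(statements, prefix):
    """Keep statements whose object is one of our URIs: other people's data about our things."""
    return [t for t in statements if t[2].startswith(prefix)]


for s, p, o in inbound(external, MY_PREFIX):
    print(f"{s} says something about {o}")
```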

I stopped to draw a parallel here with some comments made by Dion Hinchcliffe in his opening keynote. Hinchcliffe pointed to the rise of a number of startups and tools supporting analysis of data collected from the open web, perhaps mixed with data from internal enterprise systems. The end result of that analysis is new data and insights that will need to be integrated into an organisation’s core systems, especially if the intent is to drive more than just management reports.

My prediction was that over the next 12-24 months we’ll begin seeing these third-party organisations not just offering SaaS access to analysis systems, but direct insights that are already integrated into a customer’s data via the public identifiers it’s sharing as Linked Data. This has huge potential value and can completely change the costs and approach to data integration.

The time scales may be completely off. But there’s a real opportunity there in my opinion, particularly for organisations that do market and social media analysis.

Data as a Service

It’s been said before but it’s worth repeating: Linked Data isn’t necessarily Open Data. The technology is not at odds with exploring business models around data services or access.

The “Data as a Service” (DaaS) idea is gaining momentum in a number of different areas with an increasing number of commercial APIs coming online. We should also soon be seeing commercially available services directly powered by open data sources or through mining those sources.

There are a number of different business models that can be wrapped around data access, ranging from charging for the data itself, through cost recovery for service provision — something that may be relevant for long term usage of government sources — or just charging for delivering reliable, high performance services over open data. There are good reasons why developers may want to pay for reliable services.

Clearly open, sponsored access to data and services will remain an important part of the ecosystem. In fact some level of open data is required to drive the network effects we are seeing around Linked Data: the identifiers and some key metadata need to be open and remain open; but additional “depth” could be available at a premium.

Summing up

I had no big conclusions to draw from my talk as my goal was to highlight the challenges and opportunities ahead. Clearly I could have chosen a different mix but drawing on my recent experiences engaging with a wide range of different organisations these are the issues and opportunities I’ve most commonly encountered and discussed.

Do you have a different perspective? Perhaps some ideas about how to face these challenges, or a different view of the immediate opportunities? If so, I’d love to hear from you.

Linked Open Data and Pavlova

If Sir Tim Berners-Lee can equate Linked Data with a packet of crisps/potato chips, I thought I would take a stab at another food metaphor for this post.

Linked Open Data (LOD) is a concept that many believe they understand.  Take yourself to most any conference that has a connection with data, or the web, or the Internet at the moment, and it will not be long before you see a slide of the Linked Open Data cloud diagram, or of Sir Tim imploring us to give him our raw data now, or if you are very lucky a shot of him doing his imploring whilst stood in front of a shot of the LOD cloud.  Simple really: just publish your data as Linked Open Data and all will be wonderful as we move towards the sunlit Semantic Web uplands.  Unfortunately life is never that simple – LOD is not a single identifiable thing.  As Paul Walk eloquently puts it:

  1. data can be open, while not being linked
  2. data can be linked, while not being open
  3. data which is both open and linked is increasingly viable
  4. the Semantic Web can only function with data which is both open and linked

As with any recipe for success, the majority concentrate on the final result, praising or criticising it as a whole, without identifying the benefits, or otherwise, of the individual ingredients.  Take a strawberry pavlova for instance.  If you are into that kind of thing, it is a delightful culmination of the culinary arts designed to send your taste buds into raptures.  Unless, that is, you don’t like cream, or you don’t like strawberries, or can’t abide meringue, in which case the whole thing seems a little pointless.

What has this got to do with Linked Open Data (LOD), I hear you ask.  Well, I am increasingly seeing LOD being presented as the goal for those wishing to publish their data online.  My position is that the eventual goal, from which will spring a Semantic Web, is a global web of linked and open data. However, there are many steps from where we are now to achieving that goal.  Within audiences that I present to, and/or sit amongst, I see people who for whatever reasons do not ‘get’ one or more of the components of LOD – they cannot envisage opening up any of their data, or think that using a web address for an identifier is over complex, or have a religious aversion to RDF.  As a result they dismiss the whole recipe as not for them, or worse still, as something impractical that will become nothing more than the plaything of a few passionate enthusiasts.

When someone who is still struggling with the concept of opening up their organisation’s data, or with why RDF might be a more useful format than csv, is shown the ubiquitous Linked Open Data cloud diagram with encouragement to join in, it is hardly surprising they remain a little unconvinced.  This isn’t a criticism of presenters either.  In only 20 minutes on a stage, it is difficult to go into underlying detail.

Let me try in a few paragraphs to break the LOD pavlova into its ingredients:

  • Data – In the context of this post, by data I mean machine readable information, produced in a format that can be consumed and processed by other machines.  Inevitably, this means file formats such as csv, XML, RDF, etc., but not something like pdf, html, or word, which, although transferable, are designed for human consumption, not machine analysis.

    For some, just this step from their current human targeted format, to a machine readable one, is a significant one.

  • Open Data  – Data (see above) which is accessible for all to download, view, and consume in a way that is not encumbered by licensing that restricts its use.  For example, the licensing used by data.gov.uk data.  By definition data which is restricted for certain uses is not fully open.  

    In our internet based world, openness can also be defined in terms of technical accessibility.  If it is only available after a login process, or it is only available to users behind a firewall, it couldn’t be considered as open. 

  • Linked Data – Data (see above) which contains URIs as identifiers for concepts described in the data and URIs to identify the relationships between those concepts.  The four Linked Data Principles, as published as a design note by Tim Berners-Lee, provide a bit more detail on this.

    I am in danger of stirring the embers of a religious fire fight here, between those that believe that Linked Data must be described in RDF and contain URIs as identifiers, and those that maintain that you can have data linked across the web without those constraints.  All I am going to say on that at this time is that the Linked Open Data cloud of data sets has been successful, based on the first of those two views. (If you want to follow that particular debate in more detail, Paul Miller’s post and associated comments would be a good starting point.)
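As a rough illustration of the Linked Data ingredient, the sketch below represents two statements as (subject, relationship, object) triples and writes them out as N-Triples, one of the simplest RDF serialisations. The `data.example.org` and `other.example.com` URIs are hypothetical placeholders; the FOAF `name` and OWL `sameAs` URIs are real vocabulary terms. Treating any object that starts with `http://` or `https://` as a URI is a simplifying assumption made for this sketch, and literal escaping is kept to the bare minimum.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical identifier for a thing we are describing.
council = "http://data.example.org/id/council/lichfield"

triples = [
    # A URI names the thing, a URI names the relationship, a literal is the value.
    (council, FOAF_NAME, "Lichfield District Council"),
    # A link out to somebody else's (hypothetical) identifier for the same thing.
    (council, OWL_SAME_AS, "http://other.example.com/id/org/lichfield-dc"),
]


def to_ntriples(statements):
    """Serialise (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for s, p, o in statements:
        if o.startswith(("http://", "https://")):
            obj = f"<{o}>"  # simplification: http(s) strings are treated as URIs
        else:
            # Minimal escaping of backslashes and double quotes in literals.
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


ntriples = to_ntriples(triples)
print(ntriples)
```

The second statement is what makes the data linked rather than merely structured: it points from your identifier to one that lives in another dataset.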

So, how can data be open, but not linked? – by publishing it in a non-Linked Data form such as a text file or an html page or a pdf.  Where would you find this? – all over the web. As encouraged by Sir Tim to give us your raw data now, and as I detailed in my previous “data publishing three-step” post, this is often the first element of getting your data out there for others to consume.

How can data be Linked but not open? – by publishing it in accordance with the principles, in RDF, with URIs, but restricting access either by imposing restrictive licensing conditions or restricting access to the data.  Where would you find this? – again all over the web, but often hiding behind restrictive licensing terms such as “non-commercial use only”.  Also to be found inside organisational firewalls.  For example, commercial organisations can realise the benefits of using Linked Data techniques with their internal private data, potentially linking it to publicly visible concepts across the web to add even more value for their employees.

Data that is Linked and Open, like that strawberry pavlova, has the power to deliver value beyond the sum of its individual ingredients.  Providing data in a form that is linked to other data, and easy for others to link to, without restrictions on who does that linking or how it takes place, provides the foundation for a web of linked data built on the same principles that fostered the growth of the web of documents that has so changed our world over the last decade and a half.

The ingredients that formed that World Wide Web of documents – html, http, open publishing of web sites without restrictions on others’ abilities to consume and/or link to them – individually were important developments.  However, when those elements were blended together their effects were multiplied many fold and resulted in the web we experience today.

So [as I stretch my culinary metaphor to its limits] if you are hoping to take people with you in building a Linked Open Data future, you not only have to show them a picture of the final dish, you need to describe the individual ingredients and their relevance to the eventual result.

Pictures from Flickr by PhOtOnQuAnTiQuE and avixyz

Linked Data Meetup

On Wednesday, I had the privilege to attend the first Linked Data Meetup down in Hammersmith. The day was a storming success, with talks and presentations from all over the Linked Data community: from academia to startups. I think the organisers were slightly overwhelmed, because in the end there were nearly 200 people there, making use of the Talis-sponsored bar well into the evening. Apart from being a good opportunity to catch up with people, this meetup had the feeling of a guild-meet of Linked Data professionals—with lots of different perspectives over similar problems.

The two panel discussions gave the opportunity for quite a range of different views and topics to be covered, and seemed to go well. The first was about Government Data and was chaired by Carol Tullo from the Office of Public Sector Information (OPSI) and included Sir Tim Berners-Lee on a panel of five. The topics covered a swathe of issues with public data, licensing, rights and infrastructure. This panel had a certain gravitas I wasn’t expecting from a semi-formal “meetup”, probably because it was representing the UK’s actual public sector data workers. After much discussion about what it means to “link data” and what counts as “Linked Data”, I was left with the important point from the discussion: there are important and well-placed people currently working to make public data public, and I look forward to the potential benefits this will have.

The second panel covered a topic which has become very important to me, and which is strongly tied up with the first: the Future of Journalism. Although I was unable to hear much of this discussion (there were a fair few of us in that hall!), I certainly found the questions asked of the panel particularly acute. There was a particular emphasis on advertising and the future of revenue for news media in an online world. From this panel, I took the view that Journalists report on the public happenings of their nations and worlds, and often what they’re working with is made available by the very institutions “making the news”. So, the work on public data has a strong bearing on journalism and on citizens’ collective knowledge of what’s going on in their worlds. Paul Bradshaw, who chaired this panel, published his notes from the session, which will give a good overview of the topics there!

I won’t report on every talk that happened here, though the programme is still available on the Meetup site, and if anyone has any links to slides or photos they’d like to share, just ping them in the comments. I had a great time, and I left feeling hugely excited by many of the projects and trends discussed there.