The Data Publishing Three-Step
In a conversation with data owners about how they should be publishing their data, it is usually not long before the following question turns up: “So, what do I actually have to do to publish my data?” Often the conversation then wanders off into a game of buzzword bingo–RDF, RDFa, SPARQL, dereferenceable URIs, triples, content negotiation, open data, Linked Data, end-points, etc.–to be followed by a blank look and the unuttered question: “Yes, but what do I actually have to do to publish my data?”
In an attempt to simplify the answer to that oft unuttered question, I break things down into three steps.
Step 1 Get your Data Out – for others to consume
Sounds simple. Just take the spreadsheet (or similar file) that you use to track information, post it on your web site and link to it from a description posted in an accompanying web page. It can be that simple, but there are things to consider:
- Licensing – will potential consumers of the data be confident in their ability to use and/or reuse it? (The UK Government are very clear on this.)
- Is it open but opaque? – The terms, codes, identifiers etc. you use may be meaningless, or worse still ambiguous, to those outside your organisation, or even your department.
- Could your data be made more consistent with other data that you, or similar organisations, already publish?
All things to be considered, but not to be put up as excuses for not publishing.
Step 2 Get your Data In – to an open linkable standard format
This is the most powerful step. It consists of identifying the elements in your data (organisations, locations, things, projects, types, etc.), giving them unique identifiers, and then making these identifiers web links. Fortunately this may not be as onerous as it sounds. There are many publicly visible/usable identifiers that you can use for your data – for example:
For this step to be effective you really need to be modelling your data: your [first class] data elements, the relationships between them, and possibly relationships with external entities. The output of this step will be an RDF representation of your data that follows the Linked Data principles. You should also identify the process or rules to get from your source data into this new form, enabling you to repeat it for later versions of your data.
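As a rough illustration of what such a repeatable process might look like, here is a minimal Python sketch that turns rows of a spreadsheet export into RDF statements in the N-Triples format. Everything specific in it is an assumption made for the example: the `http://example.org/` base URI and vocabulary, the `project` and `organisation` column names, the `fundedBy` relationship and the sample row. Real data would need its own model, and ideally would reuse existing public identifiers rather than minting new ones.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI and vocabulary, used only for this sketch.
BASE = "http://example.org/id/"
VOCAB = "http://example.org/def/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def slug(text):
    """Turn a free-text name into a URI-safe path segment."""
    return quote(text.strip().lower().replace(" ", "-"))


def literal(text):
    """Escape a string as an N-Triples literal."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def rows_to_ntriples(csv_text):
    """Apply the same mapping rules to every row, so the conversion
    can be re-run whenever a new version of the source file appears."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Each element in the data gets an identifier that is a web link.
        project = f"<{BASE}project/{slug(row['project'])}>"
        org = f"<{BASE}organisation/{slug(row['organisation'])}>"
        lines.append(f"{project} <{RDF_TYPE}> <{VOCAB}Project> .")
        lines.append(f"{project} <{RDFS_LABEL}> {literal(row['project'])} .")
        # The relationship between the two elements (assumed for the example).
        lines.append(f"{project} <{VOCAB}fundedBy> {org} .")
        lines.append(f"{org} <{RDFS_LABEL}> {literal(row['organisation'])} .")
    return lines


# Made-up sample data standing in for a published spreadsheet.
SAMPLE = "project,organisation\nSolar Survey,Energy Board\n"

for line in rows_to_ntriples(SAMPLE):
    print(line)
```

The point of keeping the mapping in one small function is that the rules are written down once; next month's spreadsheet goes through exactly the same transformation.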
Having said all that, it is not necessarily only you that will/can do step 2. It is perfectly possible for a third party, or a central organisation such as data.gov.uk, or even an enthusiast, to carry out this data modelling and transformation step with data that you have openly published.
Next you need to publish your data so that it can become part of the Web of Linked Data, which brings me, with apologies to fans of the traditional party song, to…
Step 3 Link it all about
Going through step 2 and not making your data available, or providing useful information at the end of the links you embed in your data, would be a bit of a pointless exercise. How to publish this data is the next question, to which there are at least three equally valid answers.
- Using an encoding technique called RDFa, you can embed the RDF data within the HTML coding of a web page, so that software can obtain a more structured representation from the page than a human viewing it in a browser would.
- You could just publish the RDF in RDF files on your web server. A good example of this is the way the BBC publish the RDF for many of their pages, such as those for their Wild Life: the Lion web page has a matching RDF for Lion (depending on your browser, you may need to use its view page source option to see the actual RDF encoded in XML).
- You could store the individual RDF statements (triples) in a triple store with a SPARQL end-point. This not only publishes the RDF, but also enables the data and the relationships within the data to be queried. This is how data.gov.uk publishes RDF, from Talis Platform Stores. This interface might look a bit cryptic – the results, formatted in XML in the top box, from running the SPARQL query shown in the bottom box – but this is a developer’s interface demonstrating the code and results an application might use, so you wouldn’t expect much different.
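To give a feel for what "querying the relationships within the data" means in the third option, here is a small Python sketch of triple pattern matching, the basic idea underneath a SPARQL query. It is not a SPARQL engine and says nothing about how any real triple store works; the `ex:` identifiers and the sample triples are invented for the illustration.

```python
# Triples held as (subject, predicate, object) tuples.
# All identifiers here are hypothetical.
TRIPLES = [
    ("ex:solar-survey", "rdf:type", "ex:Project"),
    ("ex:solar-survey", "ex:fundedBy", "ex:energy-board"),
    ("ex:tidal-study", "rdf:type", "ex:Project"),
    ("ex:tidal-study", "ex:fundedBy", "ex:energy-board"),
    ("ex:energy-board", "rdfs:label", "Energy Board"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern.

    None acts as a wildcard, playing the role a variable
    such as ?project plays in a SPARQL query.
    """
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Roughly: SELECT ?project WHERE { ?project ex:fundedBy ex:energy-board }
funded = [s for s, _, _ in match(TRIPLES, p="ex:fundedBy", o="ex:energy-board")]
print(funded)
```

Because every statement has the same three-part shape, a question such as "which projects does this organisation fund?" becomes a pattern with one blank to fill in, whichever dataset the triples originally came from.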
I’ve decided to go through these steps, can you remind me again why? - So that your data can be linked with other data to add value to the experience of consumers of your data and services, as well as others using your data to add value elsewhere. A good example of this in action is the BIS Research Funding Explorer.




July 2nd, 2010 at 2:54 pm
Richard, I hate to point it out, but your simplified answers to the questions surrounding Linked Data “wander[ed] off into a game of buzzword bingo–RDF, RDFa, SPARQL” somewhere around Step 3.
July 2nd, 2010 at 3:38 pm
Fair comment Lee – at least the bingo did not start until step 3!
What I would say though, is that by taking these separate logical steps towards publishing as Linked Data the inevitable buzz-words will become clear as the process rolls through Step 2.
I note that @gavinwray tweeted that he is “still terrified by step 2.” Hopefully that won’t stop him taking step 1, and getting his data out there in the first place. Then someone else may find it useful and take the 2nd step for him. There again, he could always try one of the free Talis Open Days, which would give him a non-terrifying introduction to the Linked Data world, or look for some more in-depth (non-scary) training.
July 2nd, 2010 at 6:03 pm
In all seriousness, I do agree & appreciate the work you and everyone else at Talis are doing to encourage open data. I’d like to think that tools like the ones we’re building at Cambridge Semantics should help ease the terror of Step 2 by making it much easier to get the data into Linked Data friendly formats… the tricky step (and one whose value is not always apparent) is how much serious modeling do you need to do to make the open data useful?
July 3rd, 2010 at 9:06 am
An interesting small post. We, a consortium of universities, companies and French institutions, are launching the project Datalift on this exact topic: data publishing, providing tools for facilitating the publication process. We call this process data elevation: the way to get to data paradise.
http://datalift.org
July 6th, 2010 at 11:22 pm
Speaking from a biological scientist, not data scientist, perspective, you’ve lost 90% of people by step two. What’s modeling data? What’s a “first class” data element?
As things stand now, there’s a class of people who’d be happy to put data up on the web, but it’s an almost entirely separate class of people who’d come up with the data model, URIs, and RDF.
The group of people who could be expected to do all three “simple” steps above (that is, generate useful scientific data AND model it) is almost vanishingly small.
July 6th, 2010 at 11:43 pm
Not a surprising comment; it is often a separate group of people that take on the individual steps. Not everyone is a data modeller, and they shouldn’t have to be. Get your data out there [preferably with an eye on the possibility that just publishing it might not be the end-game] so that others can use and build upon it.
So if just Step 1 is something you can add to your workflow, fine. There is an increasing number of people out there more than ready to identify useful and interesting data and move it on to be used in often unexpected ways.
July 18th, 2010 at 10:27 am
@Mr Gunn: I would not worry so much about the language in Steps 2 and 3; the problem we face in science is getting researchers level with Step 1. Some do, even in a non-semantic way:
http://usefulchem.blogspot.com/2010/07/methanol-solubility-prediction-model-4.html#links
But with communication, we can get these things semantic, as several have done for this (CC0-licensed) solubility data set:
http://chem-bla-ics.blogspot.com/2009/11/linking-two-virtuoso-instances-to-one.html (sorry, no Talis, but Virtuoso)
http://friendfeed.com/egonw/176bfcf5/critical-mass-for-open-notebook-science-wikis