Nodalities

From Semantic Web to Web of Data

Author Archive

A Year of Open Government Data: Transparency, but also Innovation

Towards the end of 2010, Wikileaks generated many headlines as it published information on the web, causing controversy and leading to talk about politicians hiding information from the public. Reporters and commentators expressed shock or admiration when telling the story of a rogue organisation making governmental information public. What has not been as widely reported is that for the past year or more, governments around the world have been doing something very similar themselves: publishing information online.

Big names like President Obama, Sir Tim Berners-Lee and the headliners at big events like the International Open Government Data Conference favour publishing public data for transparency and benefits to society. This all finally began to take off in 2010. Governments from around the world have been developing their public information strategies, with the launches of data.gov, data.gov.uk and data.govt.nz.

This is all taking place at a time of economic restraint. Dr Martin Read from the UK Cabinet Office’s Efficiency Reform Board explained in a recent interview: “If you are going to improve the efficiency of something, making that change involves risk and innovation  … If they get it wrong, they’re hauled up in front of a committee for interrogation.” (moderngov, November 2010) It may seem tricky to justify the expense of big projects like data.gov.uk, and there certainly seems to be a huge amount of pressure.

Nevertheless, governments are proving themselves committed to prioritising data publishing. Towards the end of last year, the UK Prime Minister announced that every item of governmental spending over £25,000 will be published online, and updated monthly. He emphasised the importance of this publication in terms of transparency, inviting the public to scrutinise the data. Interestingly, he also said: “This scrutiny will act as a powerful straightjacket on spending, saving us a lot of money.” So, not only is data publishing seen as a benefit to democracy, but also as a useful way to “flag up waste”.

While that press conference was taking place, developers and civil servants were gathered together elsewhere at the Open Government Data Camp (disclosure, Talis was a sponsor). At the event, much was made of the modelling and tools which have been developed with open data in mind: particularly the Linked Data API, which allows developers from just about any web background to work with data.gov.uk’s data very quickly. Visualisations demonstrated what can be done with well-structured data.

One of the things this high-level data publishing has done is raise the standard for what can be published and developed. Last year, we built a proof-of-concept app for the Department for Business, Innovation and Skills (BIS) to illustrate the potential of applications of this data. A few minutes spent on DEFRA’s UK Climate Projections site shows what can happen when raw data is matched with a plan, and is designed with a citizen in mind. Anyone can check the primary source for their government’s climate policy, and it doesn’t take a climatologist to understand it. A little further development allows fully-fledged applications to be built that are instantly useful: one available on the front page of data.gov.uk lets me download an app that helps me plan my cycle route!

Open government data is probably good for transparency. But it’s also got plenty of potential to seed ideas that add value to this information. Innovators know that there are more people with better ideas outside our organisations than could possibly be in them, so sharing means that those ideas can be developed into products and services that benefit everyone. The web industry routinely works with open-source software that’s been at least partly built by others, and this open-source mentality might just be an incredibly useful piece in the public-sector machinery. Open business models work very well with ideas.

2011 promises to be the year when all this data gets put to use. I was recently invited to a press conference at which the Deputy Prime Minister confirmed the UK’s commitment to published data as a priority and even a recognised civil liberty. The story will shift to more local applications of big public data tools. January will see the publication of local authorities’ spending data, and public bodies will be looking to add value to this data, bringing the headlines of open data to life in the places we live.

With a bit of thought into how data is published in the first place, and a plan for encouraging people with good ideas to work with this information, this investment in data publishing could be more than just a tick-box exercise for a political transparency agenda. I hope that this year, it won’t be Wikileaks-level events that get people talking about open data publishing. We should notice it improving services we use, and see whole new applications for the bits and pieces of information that make up our public lives.

Information as a Civil Liberty

“Free citizens must be able to hold big institutions and powerful individuals to account.”

I attended a speech at the Institute for Government by UK Deputy Prime Minister Nick Clegg at which he outlined the government’s stance on civil liberties. This topic is one I am particularly passionate about as a citizen of two democracies, and as a lover of history and human communication, but what was there to interest a software evangelist?

Mr Clegg’s speech is available as a transcript from his party’s site, so you can have a look at the same words I heard. If you read through a lot of the political positioning (references to “Labour”, for non-UK readers, refer to the majority party of the previous government), you get to the bit that interests me as a Talisian as well as a human.

The final point talks about citizens having the right to public information, and the right to speak out about what government (and, notably, publicly-subsidised industry) is doing. The freedom of information and freedom of speech are under the same heading. As Clegg put it:

“It is a modern right to information combined with traditional freedom of expression.”

Examples are given of current transparency measures, including the publishing of particular datasets that are already being used in innovative ways and to hold the government accountable. It’s clear from the speech that transparency is a priority, and that publishing data is seen as fundamental to this.

The theme of balancing security and freedom is repeated throughout the talk, alluding to the fact that some information in any government is clearly going to be kept secret. But the emphasis is on publishing wherever possible, and it was interesting that this felt like the most specific theme of an otherwise very high-level speech. This is an area of public policy that has been changing through the launch of data.gov.uk and the continued efforts of two successive governments (and, interestingly, all three major UK parties) to put public data online. The idea that these datasets will be used, reused and mashed up, and will seed innovation, is at the forefront of these talks. This isn’t just data that can be seen, it’s data that can be used.

So, this government seems committed to continuing the trend for transparency through public information, and for their data to be made available online and in useful ways. The emphasis in this speech, however, adds a new dimension to the commitment, at least the way I understand it. It’s not just that data is a right of any free citizen—the Prime Minister said as much before he was PM—but that this right goes hand-in-hand with the citizen’s right to free speech.

Government publishing its data online, free to reuse and feed applications that make it easier to interact with the information, has been a huge step. Alongside this is the area of libel reform, which is a topic too big to get into here but involves the scrutiny of scientific and journalistic investigation without the fear of prosecution. (Guardian journalist Simon Singh discusses libel reform here.)

Although Mr Clegg’s talk is mostly general, discussing big ideas and leaving out specifics, I think the principles discussed were hugely important, and it is good to see a further commitment to public data. As a Talisian, it’s great because we work a lot with this kind of data, and it means we get to do more interesting things with it. As a citizen, I think it’s important that we can see more of what’s going on within government, and it encourages me that this is being considered fundamental enough to mention alongside freedom of speech and libel reform.

What I’d like to see this year is the specifics, now. What specific things will make publishing public data easier and more thorough?

Working on Plings

It’s always good to work on projects that aim to make a difference and to contribute something: you could say we look for projects with some substance to them. So, it’s been fun to work with social research company Substance on their Plings project. If you’ll forgive the opening pun, I’ll explain a bit.

The Plings project aims to gather together the best available information about “positive activities” for young people: PLaces to go + thINGS to do = PLINGS. Substance describes Plings as: a search and discovery tool that helps people to find accurate and trusted sources of information about positive activities for teenagers. So, I can look for Plings around Talis’ Birmingham offices, and find out about football coaching, cafes, dance and musical projects: all happening within a set radius of my postcode. It’s a versatile tool, letting the searcher facet their results and customise the display, and it also ties in with social networks (check out the fantastically-named “boredometer”, for example) and devices.

Feeding the Plings site is a dataset comprising two main parts:

  • Data on the actual activities: places to go, things to do
  • Data on feedback relating to the activities: “Plingback”

Substance uses various methods to collect the first dataset, routing it through their own API. This lets them use data in many different formats and shapes: from local authorities, third sector and community groups and the private sector. For Plingbacks, though, Talis has been working with Substance to create an infrastructure that can be used to generate data in RDF, which Talis hosts through its Managed Service. There is more detail about the Plingbacks app on appspot, too.

In short, the Linked Data approach enables Substance to have multiple Plingback widgets that can be presented through multiple channels. Because they all share the same API and data structure, they can use the Talis Platform to query and visualise the data dynamically.

Substance’s Steven Flower also told me a bit about a related project building on the back of Plingbacks and the Talis Platform called Plingalytics: a sort of dashboard enabling local authorities and stakeholders to get a very useful view of the Plings datasets. It will let them answer questions like: “How many Plings do we have on a Friday night?” or simply: “What’s hot? What’s not?”

This ties in with another side of Plings, which works with local authorities to “fulfil their statutory duty to publicise and keep up to date comprehensive and accurate information on positive activities for young people and to make it accessible,” according to Substance’s site.

It’s an exciting project to be working on, and I’m very interested in the way it ties in local government, young people, and activities through a very positive use of the Web. The fact that they’re using Linked Data to back the interrelated data makes a lot of sense, and we’ve been working together for a long time pulling together Linked Data opportunities and matching them with solutions. Alongside looking after the Plingback dataset on the Platform, our consulting team has worked with Substance to model and convert their data to RDF. In addition, and because of the open nature of the data Substance is producing, Plings is able to make use of Talis’ Connected Commons scheme for some of its data: meaning that not only can this information be managed free of charge, but it’s available on an open data licence.

Steven Flower said: “We are very excited about this. From a technical point of view, the opportunity to build this upon Linked Data sets is also interesting. Hence, we have chosen to work with Talis for the infrastructure, knowledge, support and enthusiasm that they bring.

We have had the support of Talis since early days of Plings, so it’s good to continue.”

More information on the Plings project from Substance can be found on their Plings info page.

“Linked Data” at the Guardian

Nodalities Magazine article by Martin Belam.

During October at Guardian News & Media we announced a change in our Open Platform Content API. For the first time, developers and users could query our database of over 1 million content items by using the common external identifiers of a MusicBrainz ID or an ISBN number. It is our first step into the world of ‘Linked Data’.

The Open Platform Content API was launched as a beta in 2009, and earlier this year was launched as a commercial product, allowing partners to re-use Guardian & Observer content in a variety of different ways. There is, for example, a WordPress plugin that easily allows you to include Guardian content in your blog, and developers have built applications like a bespoke recipe search on top of the data. It is a unique proposition amongst news organisations on the web, and as well as the Content API itself, the Open Platform also includes publishing the source data behind Guardian journalism on the Data Store, and providing a search engine for Government datasets from around the world.

Why linked data at Guardian News & Media?

The addition of linked data to the API is the culmination of a great deal of work behind the scenes to get the data prepared, and to work out the right way to make it available. Personally, I had been struck the first time I saw the linked open data cloud diagram that none of the bubbles represented any of the UK’s traditional print news organisations. With our combined centuries of experience sifting, collating, organising and publishing information, it seemed to me that they should in fact be occupying a central position on that map. The principles of linked open data also chime with the over-riding principles we have about our web presence at Guardian.co.uk. We strive to be ‘of the web’, not just on the web. That means reaching out and embracing external services and data, and our intention is to have permanent, predictable URLs for all of our content.

The first challenge to implementing this was to pick stable and reliable external datasets that would form a permanent and meaningful relationship with our content. We decided that a focus on distinct cultural entities would work, and avoided the messiness of trying to decide whether a story was ‘about’ something, or whether it just ‘mentioned’ something. MusicBrainz IDs and ISBN numbers seemed like datatypes we could work with.

The domain model of our content already had a concept of an ‘external reference’ that can be added to a tag or a factbox or an article. We have previously used that to link articles to a page about a specific film, or to link a sports match report to game statistics provided by a third party like Opta. The obvious route was therefore to expose these ‘external references’ in our API.

MusicBrainz IDs

With MusicBrainz IDs, we did not attempt to tag all of our music story archive. There are around 42,000 music content items currently on our site, and to accurately add MusicBrainz IDs to them would be an arduous task. Fortunately, because of our domain model, we had a shortcut to tagging this content. All of the items in our database are given tags. These indicate the type of content (e.g. article, audio, video), the tone of content (e.g. news, comment, review, obituary), the contributor who produced the content, and keywords representing the subject the content is about. In the Music section, we have around 600 of the artists we write about most frequently who exist as keyword tags. The quickest route to adding MusicBrainz data was to add it to these artist keyword tags. The actual job of tagging was achieved via the rather dull mechanism of filling in a Google Docs spreadsheet, although developer Daithi Ó Crualaoich built a tool to help us. He came up with a quick browser-based hack that simultaneously put the same search string across our music tags and across MusicBrainz, and matched the outcome. A script then uploaded this to our database.
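The matching step can be pictured roughly as follows. This is a hypothetical sketch rather than the actual tool: it compares normalised names instead of running live searches, and the function names, tag identifiers and the "Example Band" entries are invented for illustration. Only the Shirley Bassey MusicBrainz ID is taken from the API query example in this article.

```python
import unicodedata


def normalise(name):
    """Lower-case, strip accents and collapse whitespace so that
    near-identical spellings of a name compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def match_tags(guardian_tags, musicbrainz_artists):
    """Pair each keyword tag with a MusicBrainz ID when exactly one
    artist shares its normalised name."""
    ids_by_name = {}
    for artist in musicbrainz_artists:
        ids_by_name.setdefault(normalise(artist["name"]), []).append(artist["id"])

    matches = {}
    for tag in guardian_tags:
        ids = ids_by_name.get(normalise(tag["name"]), [])
        if len(ids) == 1:  # missing or ambiguous names are left for a human
            matches[tag["id"]] = ids[0]
    return matches


# Hypothetical input data: tag identifiers and placeholder IDs are made up.
guardian_tags = [
    {"id": "example-tag/shirley-bassey", "name": "Shirley Bassey"},
    {"id": "example-tag/example-band", "name": "Example Band"},
]
musicbrainz_artists = [
    {"id": "05ec70a5-3858-4346-a649-fda0a297b8c1", "name": "shirley  bassey"},
    {"id": "placeholder-id-1", "name": "Example Band"},
    {"id": "placeholder-id-2", "name": "Example Band"},
]

matches = match_tags(guardian_tags, musicbrainz_artists)
```

Leaving ambiguous names unmatched mirrors the spreadsheet step described above: anything the automatic pass cannot settle still needs a person to decide.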

ISBN numbers

ISBN numbers were another obvious choice for us. The majority of our book reviews on the web feature a ‘fact box’, giving details of the publication and a corresponding link through to our book store to make a purchase. This ‘fact box’ frequently includes the ISBN number of the publication, and so exposing it as a search criterion was not a massive undertaking. Nevertheless, as with our music content, we do not have universal coverage. At the time of launch around 2,500 reviews out of a possible total of 17,000 had ISBNs attached to them. This is part of the production process now, and so all reviews going forward should have the ISBN added.

API query types

The Open Platform supports a range of ways to query this data, and you can find a guide at: http://gu.com/p/2k6ay. Obviously you can query the API looking for a specific reference, so a query for reference=musicbrainz/05ec70a5-3858-4346-a649-fda0a297b8c1 will return content about Shirley Bassey. Additionally, you can get a list of content which has a MusicBrainz ID or ISBN attached to it, so reference-type=musicbrainz|isbn will give you content from the API which has a MusicBrainz ID OR an ISBN added to it. Adding the ‘show-references’ parameter will return a block in your API responses that includes MusicBrainz IDs or ISBN numbers for any item within the list. If you’ve not used the Guardian’s API before, you can get a feel for how it works by using our browser-based API explorer.
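As a rough illustration of the query styles described above, here is a minimal Python sketch that only builds request URLs. The base URL, the function name build_query and the value "all" for show-references are assumptions made for illustration, not details taken from this article; the guide linked above is the place to check the real endpoint and any API key requirements.

```python
from urllib.parse import urlencode

# Assumption: illustrative base URL, not stated in the article.
BASE_URL = "https://content.guardianapis.com/search"


def build_query(reference=None, reference_types=None, show_references=None):
    """Build a search URL using the reference parameters named in the text."""
    params = {}
    if reference:
        # A single external identifier, e.g. "musicbrainz/<id>"
        params["reference"] = reference
    if reference_types:
        # Several types joined with "|" are treated as OR
        params["reference-type"] = "|".join(reference_types)
    if show_references:
        # Assumption: the parameter takes a value such as "all"
        params["show-references"] = show_references
    # Keep "/" and "|" unescaped so the URL reads like the examples in the text
    return f"{BASE_URL}?{urlencode(params, safe='/|')}"


# Content about one artist, identified by the MusicBrainz ID from the text
artist_url = build_query(
    reference="musicbrainz/05ec70a5-3858-4346-a649-fda0a297b8c1"
)

# Content with a MusicBrainz ID OR an ISBN, with the references block included
listing_url = build_query(
    reference_types=["musicbrainz", "isbn"], show_references="all"
)
```

Fetching either URL would then be an ordinary HTTP GET; the sketch stops short of that so it makes no claims about the shape of the response.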

‘Linked data’ formats

It does seem that as soon as you put the words ‘linked’, ‘open’ and ‘data’ into the same sentence, you automatically invoke a debate about what formats are appropriate to use. At the present time we are making these persistent external IDs available alongside our content items in both XML and JSON formats. And yes, that does mean that we have steered away from RDFa and SPARQL.

From our point of view there is a clear reasoning behind this. We try to work in a lightweight and agile way, and providing the data in this format was the simplest way to meet our immediate requirements. We are trying to concentrate on making more metadata available. If we were to decide to invest in triple-stores and implement a SPARQL endpoint first, then I’d wager that we would still be waiting to dip our toe into the water.

Moreover, it would be wrong to commit our editorial production colleagues to tagging up all our content with this extra layer of semantic data, if we can’t show the benefits. It is my hope that by incrementally releasing extra layers of linked data through our API, in a simple way, we can see what works and what doesn’t, and what types of data interest people and inspire them to develop applications using the data.

As I’ve personally argued before, particularly in response to Tom Coates’ recent call for “Death to the Semantic Web”, I’m entirely agnostic about formats myself. What I think is most important is that we provide consistent, RESTful, predictable, persistent hooks into Guardian.co.uk content, in as many ways as possible, with the right licence for re-use.

What next?

We are now evaluating where else we can add value to our API with joins to external datasets. Again we will aim to be pragmatic—tagging the most amount of data with the least amount of effort. And we also want to listen to the linked data community—what are the data joins that would be most useful to external developers?

Martin Belam is an information architect at the Guardian newspaper.

Talis Inc

Talis Logo

Having moved over to the UK from the States quite a few years ago now, one of the things I noticed about company names was that they tend to use “LTD,” and for reasons unknown, I somehow always thought Talis Inc sounded better than Talis LTD.

Well, I’m very happy to be able to say that Talis Group LTD will now have a new subsidiary with the excellent name: Talis Inc. The Inc means, of course, that we’ll have a new member of the Talis Group bringing our Platform, managed services and expertise to the United States.

Based in Virginia, Talis Inc will be ably led by Bernadette Hyland, the new CEO of Talis Inc. She will be joined and supported by David Wood as VP Engineering. Together, Bernadette and David bring to Talis a huge amount of Semantic Web experience and a remarkable reputation: both entrepreneurs were founders of Tucana, one of the first commercial triple store vendors, and were most recently at the Semantic Web consultancy Zepheira.

Alongside a new subsidiary comes Talis’ first US customer: the US Government Printing Office (GPO). Talis will be running the GPO’s PURL infrastructure, which provides persistent Web addresses for critical government documents and is primarily used by the more than 1,200 Federal Depository Libraries. The PURL server uses the PURLz open source software, the development of which was led by David while at Zepheira, and complements the data hosting and search capabilities of the Talis Platform with identifier management functionality.

So, please join me in welcoming a stellar entrepreneurial team, our first US customer, and the addition of an Inc to the Talis family!

edit: under development. LOD cloud and Talis’ Datasets…

Tweet from Richard Cyganiak

Last week, Richard Cyganiak and Anja Jentzsch launched their latest version of the Linking Open Data cloud diagram. You will have seen this diagram, I’m sure, in its various iterations over the years, from the cover of early Nodalities Magazines to the slides of most any Linked Data presentation you care to recall. Richard and his team have done a fantastic job of creating a useful picture of the Linked Data cloud, and its evolution from a few circles and sticks to the complex and massive diagram you can see on Richard’s site.

Richard humorously tweeted the day it was launched: “Did you hear that? The sound of a hundred linked data advocates updating their slides ;-)” and he can’t have been far off. Fortunately, Richard and Anja have also made the LOD cloud available under a CC BY-SA license, meaning that not only can Linked Data folk pinch a copy of the LOD cloud for their slides, but they can update and modify it too.

My colleague Rob Styles put together a coloured version of the LOD cloud with a bit of a Talisian twist. Below is the current (as of Sept 2010) LOD map highlighting datasets Talis has been involved with. So, with each of the coloured circles, we’ve created the RDF itself, hosted a Linked Data version of an existing dataset, helped to model the data or provided data access tools for it (like a SPARQL endpoint). It’s very exciting to see, and also surprising seeing this picture in such clear context!

Edit:
It looks like the version I posted earlier was a draft, and the next version will be along shortly.

I should clarify that we are just highlighting where we have helped the Linking Open Data project by offering support, expertise and hosting. The LOD cloud is the collective effort of dozens of organisations and individuals who have worked tirelessly to promote the project. We are proud to be part of such an exciting and growing community.

We’ll put up a new version when it’s been developed a bit further.

Public-sector Pay and Panorama…

A couple of weeks ago, the BBC asked us to load a set of data into the Talis Platform to support an episode of Panorama.

The episode, which aired last night at 8:30pm, covered public-sector pay. It looked particularly into the topic of the highest-paid public sector jobs, especially the jobs of senior civil servants paid more than the UK Prime Minister.

So, we modelled the data the BBC supplied, converted it into Linked Data and loaded the lot into the Talis Platform. The BBC is pulling data from their Platform stores to power the Panorama exploration tool, which you can use here.

The exploration tool gives you an interactive view of where top public-sector salaries are going, sorting by sector and giving you a facetted picture. So, you can have a quick glance at the top 10 positions in Local Government, then filter down to find those of Wales, or even deeper and have a look at the district councils of, say, the Northwest of England.

The explorer is making use of the Linked Data API—the same thing that works with data.gov.uk—giving their developers the data formats such as JSON which are used in the application. So, whenever you click your way through the explorer, you are pulling at the end of an interesting string of data-driven wheels and cogs; the end of which is all linked up and SPARQLy.

The BBC have taken Linked Data very seriously, and it’s even something that’s influenced the way they’re thinking about information architecture more widely. They’ve built much of the framework behind projects like the Wildlife Finder and their World Cup site on Linked Data principles. For a peek at this world, a great place to start would be Silver Oliver’s recent post about the Semantic Web. And for more about the way this story unfolds, watch last night’s Panorama on BBC iPlayer if you’re in the UK.

Talis Training: Intro to the Web of Data

Intro to the Web of Data

21-22 September

76 Portland Place, London

26-27 October

Talis Offices, Birmingham

So, we’ve been running a series of Open Days which you can’t have failed to notice here on the Nodalities blog. We’ve covered very broad topics related to the Semantic Web and Linked Data, giving an overview of graph-thinking with data, URIs and some direction.

But the question keeps coming up: “How does my team actually use Linked Data?”

We’ve done quite a bit of training, both bespoke consulting and as a set course, and you can read a bit more about that over on our consulting page. We’re now hosting a series of open-registration training courses: a 2-day introduction to the Web of Data.

The course provides an in-depth introduction to all of the core technologies that a developer will encounter when working with and publishing Linked Data. It includes a thorough introduction to the RDF model; modelling of data using RDF Schema; publishing of data to the web as Linked Data, and querying RDF datasets using SPARQL.

We’re offering a discounted early-bird price for the first two courses of £1,000 per attendee (ex VAT) if booked before 1st October. We’ll be putting on lunch and our now-famous SPARQL blend coffee from Union Hand-Roasted Coffee, too! Places booked after 1st October cost £1,200.

The first course will be on 21 and 22 September at No 76 Portland Place, London. The second will be at our offices in Birmingham on 26 and 27 October.

Best Buy: Semantic Web and Retail

In this Nodalities Podcast, I speak with Jay Myers from Best Buy about how he and his team are working within the retail giant to better harness their data. Jay tells us about his use of blogs and RDFa to better manage “open-box” products returned to Best Buy’s many stores in an effort to surface deals to the public and make savings on otherwise costly problems.

Jay also explains how Best Buy are publishing the machine-readable data out on the public web and touches on the next steps Best Buy will be taking. He also calls on the Semantic Web community to take an active role in promoting work like this by voting for his panel at South by Southwest, which you can see here.

Jay Myers is a Lead Web Development Engineer for Best Buy, and is an active supporter of the GoodRelations vocabulary for ecommerce, utilizing it for modeling consumer products, stores, and services in both RDF/XML and RDFa. For more information, you can read his blog or catch him on Twitter.

Linked Data and Health: Speakers

We’ve had an overwhelming response to our Linked Data and Health open day, which will be running this Thursday in London—there are no places left!

As a quick intro to the day, I’ll post a bit of information about some of our guest speakers here, with a working title for their talks. (Please note that the titles may change.)

Alongside our guest speakers, several of us from Talis will be talking about the wider world of Linked Data, giving an overview, demos of LD applications in use, and doing our best to answer the seemingly simple question: “Why Linked Data for health?”

Dr. Nigam Shah

Dr. Shah’s research is focused on developing applications of bio-ontologies, specifically building ontology-based applications in the biomedical sciences and using Semantic Web technologies to improve search and integration of biomedical information. He teaches at Stanford on topics of how to make and use biomedical ontologies, current trends & future directions in biomedical ontologies and reasoning with biomedical data. He has co-chaired the Bio-Ontologies meeting at the ISMB conference since 2007.

Dr. Shah’s talk is: Opportunities for applying semantic technologies to health care data.

Dr. Michael Wilkinson

Dr Michael Wilkinson is the Business Development Manager for the NHS National Innovation Centre (NIC). Michael is currently leading on a programme of work to create a linked data platform to speed development of technological innovations likely to benefit the NHS. The NIC works across sectors and encourages collaboration between innovators from industry, academia, and NHS clinicians, scientists, and procurement officials. The NIC also works with other government departments and the EU to improve efficiency of innovation procurement. Prior to joining the NIC, Michael was an academic at the London School of Hygiene and Tropical Medicine. He has also held appointments at the Cabinet Office, Nesta, and hospitals in the USA.

Mark Birbeck

For a number of years Mark Birbeck has been involved in helping to bring about the Semantic Web, and has consulted, written and spoken widely on this and related topics. He is the originator of the W3C’s RDFa standard, and most recently he has been working on a number of semantic web projects for the UK government.

Mark will be speaking with Dr. Wilkinson, introducing the NHS clinical widget platform, which they jointly wrote about in Nodalities Magazine (pdf).

Dr. Jun Zhao

Dr Jun Zhao is an EPSRC Postdoctoral Fellow from the Department of Zoology at the University of Oxford. She has a computer science research background in various domains, including e-Science, provenance, Semantic Web and biological data integration. She has more than six years’ experience of applying Semantic Web research and technologies to bioinformatics and biological information representation and integration. Currently she is running her fellowship project, Open-BioMed, which investigates the use of Web of Data for publishing and integrating biomedical data resources and the role of provenance information for evaluating their trustworthiness. She is actively involved in both the W3C Health Care Life Science Interest Group and the W3C Provenance Incubator group.

Dr. Zhao’s talk will be: Linked Data for Biomedical Science: A Tale of Two Success Stories

Leigh Dodds

Leigh has significant experience of working with Semantic Web and Web technologies, both as an independent hacker and researcher and in production environments, in a number of roles including developer, software architect and product manager. He has written about, and spoken widely on, a range of semantic web topics including SPARQL, Linked Data, managing and aggregating data on the web, semantic web application development, and data licensing and management. Leigh is currently employed by Talis as the Programme Manager for the Talis Platform and is responsible for both product strategy and business development.

Leigh’s talk is: Why Linked Data for Health?