Challenges and Opportunities for Linked Data
Yesterday I gave a short talk at Online Information 2010 titled “Challenges and Opportunities for Linked Data” (abstract). The presentation highlighted what I saw as the main challenges that face us as we grow the web of data, and highlighted some opportunities for organisations that want to get involved.
I believe there will be video from the various presentations online at some point, but wanted to post a transcript of what I said (or had planned to say!). The slides are up on slideshare if you’re interested, although they’re largely just transitions to highlight my main themes.
Introduction
2010 has certainly been the year of Linked Data. I’ve been working with RDF and Semantic Web technologies for about 10 years now, and it’s clear that the last 12 months have been one of the critical growth points for Linked Data and the Semantic Web as a whole. There has been more debate, engagement, and publication of data than ever before.
This is in no small part due to the fantastic work that has taken place at data.gov.uk. The project has not only championed the approach but also led the way as an exemplar for how to do this stuff really well. The adoption of RDFa by Facebook, Google and others has also created a much needed feedback loop that is driving the publication of more structured data.
But as the technology grows we’re starting to experience growing pains which are presenting challenges for further growth and adoption. I think we’re also getting a sense of the opportunities that may arise from the web of data. I picked out three key challenges to review in the presentation.
Craft
The first of these relates to what I’d call “the craft” of Linked Data. To date the growth of the Linked Data cloud has largely been driven by skilled artisans — from academia and a small number of commercial organisations — who know how to work with the technology, how to use and manipulate the data that is already available, and how to get things online and linked together in a way that achieves the 5 star approach.
To scale beyond the initial Linked Data community we need to move from an artisan-led approach and enable “journeyman” developers to achieve the same things. There are several facets to this skills transfer.
Tooling is clearly one important area. It’s a truism that Linked Data tools aren’t as polished as they might be. After all it’s still a relatively new technology area. The majority of Linked Data artisans have been happy enough either to make their own tools or to work with a disparate selection of tools to get the job done. But there is still a lot more work to do in creating a more integrated toolkit that journeyman developers can reach into to help them quickly and easily publish data.
To be fair though, I think we’ve needed these past few years of publishing and experimentation to really highlight what those basic tools might be.
The other aspect of craft is education and training. There’s still a relatively small community with deep skills in this area, so thought has to be given to how those skills can be passed on more widely. Having helped train and advise a number of teams and organisations over the past few years, most recently as part of our consulting work at Talis, it’s clear that there’s a journey or apprenticeship that many teams and organisations undertake as they begin to experiment and gain experience with the technology.
Within the Linked Data community we need to prioritise the work on these tools and services to make it easier for others. We also need to devote additional work to help nurture or define more standard vocabularies for publishing specific types of data. In my opinion this is the really challenging work: it’s not as fun or exciting as publishing the next new dataset or exemplar, but it’s absolutely necessary to push things to the next level. It’s going to take real commitment from all of us.
In my mind there is no better way to help pass on the skills of the initial artisan community than by encoding that knowledge in the form of tools, vocabularies, best practices and design patterns.
Fuelling Applications
Linked Data isn’t being used as much as it could or should be. Why is this?
I think there are two reasons. The first relates to my previous point about enabling the “journeyman” developer. Right now it takes a certain amount of skill to get the most from Linked Data and SPARQL. This presents a road-block for developers who may be interested in using some of the available data. It may even stop them looking at all.
To solve this we must be ready to meet people half-way. Publish simple JSON formats alongside the RDF. Use the Linked Data API created for data.gov.uk to provide simple RESTful APIs into your RDF data. Choice opens up more integration opportunities as well as encouraging engagement. The power of SPARQL and other tools is fantastic, but that power is not needed by every developer in every application. Be inclusive when opening up data.
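As a rough illustration of what “simple JSON alongside the RDF” might look like, here is a minimal Python sketch. It is not the Linked Data API itself: the triples, the example.org URIs, the helper names and the “_about” key are all assumptions made for this example. It simply groups triples by subject and uses the last segment of each property URI as a plain JSON key.

```python
import json
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); the URIs are illustrative only.
TRIPLES = [
    ("http://example.org/id/school/1", "http://www.w3.org/2000/01/rdf-schema#label", "Example Primary School"),
    ("http://example.org/id/school/1", "http://example.org/def/capacity", "210"),
    ("http://example.org/id/school/1", "http://www.w3.org/2002/07/owl#sameAs", "http://example.com/other/school-1"),
    ("http://example.org/id/school/1", "http://www.w3.org/2002/07/owl#sameAs", "http://example.net/places/77"),
]


def local_name(uri):
    """Return the part of a URI after the last '#' or '/'."""
    return uri.replace("#", "/").rstrip("/").rsplit("/", 1)[-1]


def to_simple_json(triples):
    """Group triples by subject into plain dictionaries keyed by short property names."""
    resources = defaultdict(dict)
    for s, p, o in triples:
        key = local_name(p)
        record = resources[s]
        record.setdefault("_about", s)  # keep the full URI so the link back to the RDF survives
        if key not in record:
            record[key] = o
        elif isinstance(record[key], list):
            record[key].append(o)
        else:
            # A second value for the same property: switch to a list.
            record[key] = [record[key], o]
    return list(resources.values())


if __name__ == "__main__":
    print(json.dumps(to_simple_json(TRIPLES), indent=2))
```

A real service would need a deliberate mapping from properties to keys, since two different properties can share the same local name, but even something this simple gives a developer who doesn’t know SPARQL a way in.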
A potentially larger issue is that much of the data available as Linked Data is either static, irregularly updated, or already available in other more accessible formats and APIs. This isn’t true across the cloud as a whole, but timeliness is an issue in many areas. It’s a consequence of the early boot-strapping process which emphasised conversions of available data dumps, and the wrapping of existing APIs and services. As a boot-strapping process that has been fantastic. But it’s not driving engagement: why use data if you can get it somewhere else easier, and in a more up to date form, using tools that you’re already familiar with?
I also think that this contributes to the reason why it has been difficult to show the power of Linked Data: many of the demonstration apps could easily have been built with other APIs. I think this could be on the cusp of changing as there is now a critical mass of information available to do some powerful queries, and an increasing amount of data is now becoming primarily available as Linked Data.
The challenge we face is changing the nature of the Linked Data cloud from what is a largely static and slow moving environment to one that is much more lively and real-time.
Sustainability
The third challenge I highlighted was sustainability. It’s easy to look at the Linked Data diagram and think: “Well, those bits are done, all we need to do is look at how to grow the diagram. We just need to add more data”. I think that’s a natural but unfortunately misleading viewpoint: we need to look carefully at our foundations.
Not all of these sources are on infrastructure that could support real, high volume usage. And few of the datasets are clearly licensed. I’ve personally encountered a number of occasions where some significant datasets are offline or unavailable. So we need to be realistic about whether people can build a stable, commercial application against the web of data as it exists today.
Again to solve this, we need an increasing number of primary sources, making high quality data available on a regular and timely basis, backed by the ability or commitment to deliver those services at the scale we will all eventually require.
In reality this challenge isn’t unique to Linked Data. It’s largely true of the web as a whole; after all not every web site or application is intended to scale to high volume usage. But we’re now talking about a potentially much deeper integration between different applications. We can see the same issues occurring around APIs and data access in general. In recent months there have been a number of stories of developers scrabbling to adapt as APIs get changed, taken down, restricted or re-licensed, leaving them high and dry.
To me the beauty of Linked Data, and RDF specifically, in this regard is that it is so much more portable than any other format. This means that we can easily replicate data to share the load of providing access. With Linked Data we have the option of federating or sharing data across the web. (One of the reasons we started the Talis Connected Commons scheme was to help create sustainability around Public Domain datasets.)
The portability of RDF also makes it easier for a range of organisations to offer scaleable value-added services over the same datasets. For the first time we can decouple the curation of data from the delivery of services over that data.
So those are my three challenges. I think these are largely point in time issues, but we’re going to have to work at them to move forward.
What about the opportunities?
Become a Hub
One of the interesting properties of the Linked Data cloud diagram is how it clearly illustrates the emergence of a number of hubs, like DBpedia, that form the focal points for links from a number of different datasets. If you look closely you can also see that there are emerging hubs within specific subject domains.
I wonder whether the hubs that we see today will continue to play such a key role as the web of data evolves? My feeling is that in a few years’ time the picture and connectivity is going to be quite different, particularly if we continue to see engagement from government and other sectors.
There is clearly an opportunity here for organisations who are already key enablers within a particular sector to become a linking hub on the web of data.
If you poke around in any industry, it’s not hard to find organisations who act as the “switchboard” for that particular sector, either because they manage some key identifiers for the sector as a whole, or because their identifiers and systems have become de facto standards for achieving interoperability. It would be a natural step for those organisations to carry that role forward to the web of data, retaining that key position.
Clearly not everyone can be a significant hub like DBpedia. But every organisation can act as a hub for its community of customers, partners and users.
The reasons and benefits for doing this are well documented: opening up data can drive new business, innovation, and traffic. Success on the web involves giving your organisation the greatest possible surface area and points of attachment. Linked Data is an excellent way to achieve this as it emphasises the right forms of web integration.
Turn Identifiers into Channels
Linked Data requires you to assign URLs to identify things: people, places, events, whatever. Generally we tend to focus on how that is an important step to publishing data: concentrating on the mechanics of what makes a good, stable identifier and highlighting how this becomes a key way for other people to find your data.
What this misses is that those identifiers can also become channels, or hooks, for your organisation to find other people’s data. Once you have published Linked Data and it becomes linked to by other datasets all of that external data annotates and enriches your own, providing valuable and useful context. Linking data creates network effects, and everyone in the network benefits. That includes you.
The external data is easily accessible through link discovery so it becomes much easier to find, aggregate and analyse it for a variety of purposes. That might be to drive new product features, or to simply power business intelligence and analysis within the enterprise.
I tend to think of it as being able to fish the web of data for useful context. Your URIs are the hooks. Your data is the bait.
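To make the “hooks” idea a little more concrete, here is a small Python sketch of the fishing step. Everything in it is hypothetical: the URI prefix, the sample triples and the function name are assumptions for illustration, and in practice the external triples would come from crawling, link discovery services or data dumps rather than a hard-coded list. It collects the statements in other people’s data that point at your identifiers.

```python
from collections import defaultdict

# Hypothetical URI space for "our" published identifiers.
OUR_PREFIX = "http://data.example.org/id/"

# Hypothetical triples gathered from other people's datasets.
EXTERNAL_TRIPLES = [
    ("http://reviews.example.com/r/9", "http://example.com/def/reviews", "http://data.example.org/id/product/42"),
    ("http://other.example.net/thing/7", "http://www.w3.org/2002/07/owl#sameAs", "http://data.example.org/id/product/42"),
    ("http://other.example.net/thing/8", "http://www.w3.org/2000/01/rdf-schema#label", "Unrelated"),
]


def inbound_links(triples, prefix):
    """Map each of our URIs to the external (subject, predicate) pairs that point at it."""
    links = defaultdict(list)
    for s, p, o in triples:
        # Keep only statements made by others about our identifiers.
        if o.startswith(prefix) and not s.startswith(prefix):
            links[o].append((s, p))
    return dict(links)


if __name__ == "__main__":
    for uri, sources in inbound_links(EXTERNAL_TRIPLES, OUR_PREFIX).items():
        print(uri)
        for subject, predicate in sources:
            print("  linked from", subject, "via", predicate)
```

Each external subject found this way is a starting point for fetching more context about one of your own resources.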
I stopped to draw a parallel here with some comments made by Dion Hinchcliffe in his opening keynote. Hinchcliffe pointed to the rise of a number of startups and tools supporting analysis of data collected from the open web, perhaps mixed with data from internal enterprise systems. The end result of that analysis is new data and insights that will need to be integrated into an organisation’s core systems, especially if the intent is to drive more than just management reports.
My prediction was that over the next 12-24 months we’ll begin seeing these types of third-party organisations not just offering SaaS access to analysis systems, but direct insights that are already integrated into a customer’s data via the public identifiers it’s sharing as Linked Data. This has huge potential value and can completely change the costs and approach to data integration.
The time scales may be completely off. But there’s a real opportunity there in my opinion, particularly for organisations that do market and social media analysis.
Data as a Service
It’s been said before but it’s worth repeating: Linked Data isn’t necessarily Open Data. The technology is not at odds with exploring business models around data services or access.
The “Data as a Service” (DaaS) idea is gaining momentum in a number of different areas with an increasing number of commercial APIs coming online. We should also soon be seeing commercially available services directly powered by open data sources or through mining those sources.
There are a number of different business models that can be wrapped around data access, ranging from charging for the data itself, through cost recovery for service provision — something that may be relevant for long term usage of government sources — or just charging for delivering reliable, high performance services over open data. There are good reasons why developers may want to pay for reliable services.
Clearly open, sponsored access to data and services will remain an important part of the ecosystem. In fact some level of open data is required to drive the network effects we are seeing around Linked Data: the identifiers and some key metadata need to be open and remain open; but additional “depth” could be available at a premium.
Summing up
I had no big conclusions to draw from my talk as my goal was to highlight the challenges and opportunities ahead. Clearly I could have chosen a different mix but drawing on my recent experiences engaging with a wide range of different organisations these are the issues and opportunities I’ve most commonly encountered and discussed.
Do you have a different perspective? Perhaps some ideas about how to face these challenges, or a different view of the immediate opportunities? If so, I’d love to hear from you.



