Nodalities

From Semantic Web to Web of Data

Author Archive

Challenges and Opportunities for Linked Data

Yesterday I gave a short talk at Online Information 2010 titled “Challenges and Opportunities for Linked Data” (abstract). The presentation set out what I saw as the main challenges that face us as we grow the web of data, and highlighted some opportunities for organisations that want to get involved.

I believe there will be video from the various presentations online at some point, but wanted to post a transcript of what I said (or had planned to say!). The slides are up on slideshare if you’re interested, although they’re largely just transitions to highlight my main themes.

Introduction

2010 has certainly been the year of Linked Data. I’ve been working with RDF and Semantic Web technologies for about 10 years now, and it’s clear that the last 12 months have been one of the critical growth points for Linked Data and the Semantic Web as a whole. There has been more debate, engagement, and publication of data than ever before.

This is in no small part due to the fantastic work that has taken place at data.gov.uk. The project has not only championed the approach but also led the way as an exemplar for how to do this stuff really well. The adoption of RDFa by Facebook, Google and others has also created a much-needed feedback loop that is driving the publication of more structured data.

But as the technology grows we’re starting to experience growing pains which are presenting challenges for further growth and adoption. I think we’re also getting a sense of the opportunities that may arise from the web of data. I picked out three key challenges to review in the presentation.

Craft

The first of these relates to what I’d call “the craft” of Linked Data. To date the growth of the Linked Data cloud has largely been driven by skilled artisans — from academia and a small number of commercial organisations — who know how to work with the technology, how to use and manipulate the data that is already available, and how to get things online and linked together in a way that achieves the 5 star approach.

To scale beyond the initial Linked Data community we need to move from an artisan-led approach and enable “journeyman” developers to achieve the same things. There are several facets to this skills transfer.

Tooling is clearly one important area. It’s a truism that Linked Data tools aren’t as polished as they might be. After all it’s still a relatively new technology area. The majority of Linked Data artisans have been happy enough either to make their own tools or to work with a disparate selection of tools to get the job done. But there is still a lot more work to do in creating a more integrated toolkit that journeyman developers can reach into to help them quickly and easily publish data.

To be fair though, I think we’ve needed these past few years of publishing and experimentation to really highlight what those basic tools might be.

The other aspect of craft is education and training. There’s still a relatively small community with deep skills in this area, so thought has to be given to how those skills can be passed on to a wider audience. Having helped train and advise a number of teams and organisations over the past few years, most recently as part of our consulting work at Talis, it’s clear that there’s a journey or apprenticeship that many teams and organisations undertake as they begin to experiment and gain experience with the technology.

Within the Linked Data community we need to prioritise the work on these tools and services to make it easier for others. We also need to devote additional work to help nurture or define more standard vocabularies for publishing specific types of data. In my opinion this is the really challenging work: it’s not as fun or exciting as publishing the next new dataset or exemplar, but it’s absolutely necessary to push things to the next level. It’s going to take real commitment from all of us.

In my mind there is no better way to help pass on the skills of the initial artisan community than by encoding that knowledge in the form of tools, vocabularies, best practices and design patterns.

Fuelling Applications

Linked Data isn’t being used as much as it could or should be. Why is this?

I think there are two reasons. The first relates to my previous point about enabling the “journeyman” developer. Right now it takes a certain amount of skill to get the most from Linked Data and SPARQL. This presents a road-block for developers who may be interested in using some of the available data. It may even stop them looking at all.

To solve this we must be ready to meet people half-way. Publish simple JSON formats alongside the RDF. Use the Linked Data API created for data.gov.uk to provide simple RESTful APIs into your RDF data. Choice opens up more integration opportunities as well as encouraging engagement. The power of SPARQL and other tools is fantastic, but that power is not needed by every developer in every application. Be inclusive when opening up data.
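As a rough illustration of meeting developers half-way, the sketch below flattens a handful of triples into the kind of simple JSON a journeyman developer might expect. It is not the Linked Data API itself: the URIs, the `local_name` helper and the output shape are all assumptions made up for this example.

```python
import json
from collections import defaultdict

# Hypothetical triples describing one school; every URI here is invented.
TRIPLES = [
    ("http://example.org/school/1", "http://www.w3.org/2000/01/rdf-schema#label", "Example Primary School"),
    ("http://example.org/school/1", "http://example.org/def/capacity", "210"),
    ("http://example.org/school/1", "http://example.org/def/district", "http://example.org/district/9"),
]

def local_name(uri):
    """Return the part of a URI after the last '#' or '/'."""
    return uri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]

def to_simple_json(triples):
    """Group triples by subject and shorten predicate URIs to plain keys."""
    resources = defaultdict(dict)
    for subject, predicate, obj in triples:
        resources[subject].setdefault(local_name(predicate), []).append(obj)
    # Collapse single-valued properties so the JSON reads naturally.
    return [
        {"id": subject, **{k: v[0] if len(v) == 1 else v for k, v in props.items()}}
        for subject, props in resources.items()
    ]

print(json.dumps(to_simple_json(TRIPLES), indent=2))
```

The trade-off is deliberate: shortening predicates to local names loses the global identifiers, which is why a format like this works best alongside the RDF rather than instead of it.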

A potentially larger issue is that much of the data available as Linked Data is either static, irregularly updated, or already available in other more accessible formats and APIs. This isn’t true across the cloud as a whole, but timeliness is an issue in many areas. It’s a consequence of the early boot-strapping process which emphasised conversions of available data dumps, and the wrapping of existing APIs and services. As a boot-strapping process that has been fantastic. But it’s not driving engagement: why use data if you can get it somewhere else easier, and in a more up to date form, using tools that you’re already familiar with?

I also think that this contributes to the reason why it has been difficult to show the power of Linked Data: many of the demonstration apps could easily have been built with other APIs. I think this could be on the cusp of changing as there is now a critical mass of information available to do some powerful queries, and an increasing amount of data is now becoming primarily available as Linked Data.

The challenge we face is changing the nature of the Linked Data cloud from what is a largely static and slow moving environment to one that is much more lively and real-time.

Sustainability

The third challenge I highlighted was sustainability. It’s easy to look at the Linked Data diagram and think: “Well, those bits are done, all we need to do is look at how to grow the diagram. We just need to add more data”. I think that’s a natural but unfortunately misleading viewpoint: we need to look carefully at our foundations.

Not all of these sources are on infrastructure that could support real, high volume usage. And few of the datasets are clearly licensed. I’ve personally encountered a number of occasions where some significant datasets are offline or unavailable. So we need to be realistic about whether people can build a stable, commercial application against the web of data as it exists today.

Again to solve this, we need an increasing number of primary sources, making high quality data available on a regular and timely basis, backed by the ability or commitment to deliver those services at the scale we will all eventually require.

In reality this challenge isn’t unique to Linked Data. It’s largely true of the web as a whole; after all not every web site or application is intended to scale to high volume usage. But we’re now talking about a potentially much deeper integration between different applications. We can see the same issues occurring around APIs and data access in general. In recent months there have been a number of stories of developers scrabbling to adapt as APIs get changed, taken down, restricted or re-licensed, leaving them high and dry.

To me the beauty of Linked Data, and RDF specifically, in this regard is that it is so much more portable than any other format. This means that we can easily replicate data to share the load of providing access. With Linked Data we have the option of federating or sharing data across the web. (One of the reasons we started the Talis Connected Commons scheme was to help create sustainability around Public Domain datasets.)
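One way to see why RDF is so portable: two copies of a dataset can be combined with nothing more than a set union of their statements. The sketch below assumes simple N-Triples input with one statement per line; the data is invented, and it ignores blank nodes, which a real merge would have to keep distinct per source.

```python
def parse_ntriples(text):
    """Return the set of statements in a simple N-Triples document.

    Each non-empty, non-comment line is treated as one statement.
    Blank-node handling is deliberately left out of this sketch.
    """
    statements = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            statements.add(line)
    return statements

# Two hypothetical mirrors of the same dataset, each holding slightly different data.
COPY_A = """
<http://example.org/place/1> <http://www.w3.org/2000/01/rdf-schema#label> "Exampleton" .
<http://example.org/place/1> <http://example.org/def/region> <http://example.org/region/2> .
"""
COPY_B = """
<http://example.org/place/1> <http://www.w3.org/2000/01/rdf-schema#label> "Exampleton" .
<http://example.org/place/1> <http://example.org/def/country> <http://example.org/country/3> .
"""

# Merging is a set union: the shared statement appears once in the result.
merged = parse_ntriples(COPY_A) | parse_ntriples(COPY_B)
print(len(merged))  # 3
```

Because statements are self-describing and use global identifiers, no schema mapping is needed before the union, which is what makes replication and federation comparatively cheap.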

The portability of RDF also makes it easier for a range of organisations to offer scaleable value-added services over the same datasets. For the first time we can decouple the curation of data from the delivery of services over that data.

So those are my three challenges. I think these are largely point in time issues, but we’re going to have to work at them to move forward.

What about the opportunities?

Become a Hub

One of the interesting properties of the Linked Data cloud diagram is how it clearly illustrates the emergence of a number of hubs, such as DBpedia, that form the focal points for links from a number of different datasets. If you look closely you can also see that there are emerging hubs within specific subject domains.

I wonder whether the hubs that we see today will continue to play such a key role as the web of data evolves? My feeling is that in a few years’ time the picture and connectivity are going to be quite different, particularly if we continue to see engagement from government and other sectors.

There is clearly an opportunity here for organisations who are already key enablers within a particular sector to become a linking hub on the web of data.

If you poke around in any industry, it’s not hard to find organisations that act as the “switchboard” for that particular sector, either because they manage some key identifiers for the sector as a whole, or because their identifiers and systems have become de facto standards for achieving interoperability. It would be a natural step for those organisations to carry that role forward to the web of data, retaining that key position.

Clearly not everyone can be a significant hub like DBpedia. But every organisation can act as a hub for its community of customers, partners and users.

The reasons and benefits for doing this are well documented: opening up data can drive new business, innovation, and traffic. Success on the web involves giving your organisation the greatest possible surface area and points of attachment. Linked Data is an excellent way to achieve this as it emphasises the right forms of web integration.

Turn Identifiers into Channels

Linked Data requires you to assign URLs to identify things: people, places, events, whatever. Generally we tend to focus on how that is an important step to publishing data: concentrating on the mechanics of what makes a good, stable identifier and highlighting how this becomes a key way for other people to find your data.

What this misses is that those identifiers can also become channels, or hooks, for your organisation to find other people’s data. Once you have published Linked Data and it becomes linked to by other datasets all of that external data annotates and enriches your own, providing valuable and useful context. Linking data creates network effects, and everyone in the network benefits. That includes you.

The external data is easily accessible through link discovery so it becomes much easier to find, aggregate and analyse it for a variety of purposes. That might be to drive new product features, or to simply power business intelligence and analysis within the enterprise.

I tend to think of it as being able to fish the web of data for useful context. Your URIs are the hooks. Your data is the bait.
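To make the “hooks” idea concrete, here is a minimal sketch of collecting inbound links: given triples gathered from other people’s datasets, it picks out the statements whose object is one of your own identifiers. The namespace, datasets and property names are all hypothetical, and how the external triples are gathered in the first place is left out.

```python
OUR_PREFIX = "http://data.example.com/id/"  # hypothetical namespace for our identifiers

# Invented triples, standing in for data harvested from third-party datasets.
EXTERNAL_TRIPLES = [
    ("http://reviews.example.net/r/17", "http://example.net/def/about", "http://data.example.com/id/product/42"),
    ("http://reviews.example.net/r/17", "http://example.net/def/rating", "4"),
    ("http://stats.example.org/obs/3", "http://example.org/def/refersTo", "http://data.example.com/id/product/42"),
    ("http://stats.example.org/obs/3", "http://example.org/def/refersTo", "http://other.example.com/id/x"),
]

def inbound_links(triples, prefix):
    """Map each of our URIs to the external (subject, predicate) pairs pointing at it."""
    links = {}
    for subject, predicate, obj in triples:
        # Keep only statements made by others about our identifiers.
        if obj.startswith(prefix) and not subject.startswith(prefix):
            links.setdefault(obj, []).append((subject, predicate))
    return links

for uri, sources in inbound_links(EXTERNAL_TRIPLES, OUR_PREFIX).items():
    print(uri, "is referenced by", len(sources), "external resources")
```

Each external subject found this way can then be looked up in turn to pull in the surrounding context, such as the rating attached to the review above.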

I stopped to draw a parallel here with some comments made by Dion Hinchcliffe in his opening keynote. Hinchcliffe pointed to the rise of a number of startups and tools supporting analysis of data collected from the open web, perhaps mixed with data from internal enterprise systems. The end result of that analysis is new data and insights that will need to be integrated into an organisation’s core systems, especially if the intent is to drive more than just management reports.

My prediction was that over the next 12-24 months we’ll begin seeing these types of third-party organisations not just offering SaaS access to analysis systems, but direct insights that are already integrated into a customer’s data via the public identifiers it’s sharing as Linked Data. This has huge potential value and can completely change the costs and approach to data integration.

The time scales may be completely off. But there’s a real opportunity there in my opinion, particularly for organisations that do market and social media analysis.

Data as a Service

It’s been said before but it’s worth repeating: Linked Data isn’t necessarily Open Data. The technology is not at odds with exploring business models around data services or access.

The “Data as a Service” (DaaS) idea is gaining momentum in a number of different areas with an increasing number of commercial APIs coming online. We should also soon be seeing commercially available services directly powered by open data sources or through mining those sources.

There are a number of different business models that can be wrapped around data access, ranging from charging for the data itself, through cost recovery for service provision — something that may be relevant for long term usage of government sources — or just charging for delivering reliable, high performance services over open data. There are good reasons why developers may want to pay for reliable services.

Clearly open, sponsored access to data and services will remain an important part of the ecosystem. In fact some level of open data is required to drive the network effects we are seeing around Linked Data: the identifiers and some key metadata need to be open and remain open; but additional “depth” could be available at a premium.

Summing up

I had no big conclusions to draw from my talk as my goal was to highlight the challenges and opportunities ahead. Clearly I could have chosen a different mix but drawing on my recent experiences engaging with a wide range of different organisations these are the issues and opportunities I’ve most commonly encountered and discussed.

Do you have a different perspective? Perhaps some ideas about how to face these challenges, or a different view of the immediate opportunities? If so, I’d love to hear from you.

data.gov.uk and the Talis Platform

Earlier this year Gordon Brown appointed Tim Berners-Lee as an advisor to the Cabinet Office to help the government begin the process of opening up its data. This was one part of the initiation of a project to begin opening up UK government data in a similar style to the US. A key part of Berners-Lee’s vision for putting government data online has been Linked Data which promises to provide a much richer way for citizens to begin accessing, browsing, and using government data.

Several other governments have begun opening up data assets including Australia and New Zealand. These approaches mirror that of the US data.gov site, providing a browsable directory of datasets and links to raw data downloads in a range of different formats. The preview launch of data.gov.uk which was announced at the end of September also includes a directory of datasets which is powered by the software underlying the Comprehensive Knowledge Archive Network. But the site also aims to fulfill Berners-Lee’s vision and in addition provide access to some datasets as Linked Data through SPARQL endpoints.

We’re very pleased to report that the Talis Platform is currently underpinning the delivery of all of the Linked Data and SPARQL endpoints for the data.gov.uk site.

We’ve been quietly supporting the effort for several months now, helping out with data management, modelling discussions, and with training on the core technology. There seems to be a very definite appetite in government to not only open the raw data but to also explore the potential for Linked Data. It’s clear from today’s announcement about opening up additional aspects of the Ordnance Survey data that there’s a real focus on delivering on the open data promise. While there are certainly some high-profile datasets like the Ordnance Survey or postcode data that may require legislative changes to become open, one of the biggest implementation challenges facing government is pulling together an overall directory of datasets and spreadsheets that are already scattered across multiple departmental websites.

Creating a dataset directory provides the required basic level of infrastructure to allow reuse, by enabling developers to find what they need; publishing Linked Data, SPARQL endpoints, and potentially extra APIs provides an additional set of options for ways to access the data. By letting datasets be browsable by anyone, not just developers, Linked Data offers the potential for anyone to find, discover and reuse interesting datasets. As I illustrated in a recent talk, these approaches are not mutually exclusive and the goal should be maximum utility.

Over on the Talis Platform developer blog we’ve begun showing some ways that the initial datasets, covering UK schools and traffic measurements, can be queried in interesting ways. It’s been exciting to see people begin to pick up the technology and create reporting tools to explore the data, but also fantastic to be able to easily view data using only a browser.

There’s clearly still a great deal of work ahead, but the ground work has now been completed: there’s infrastructure in place to support data publishing; official guidelines on creating public sector URIs; and some agreement on best practices for modelling statistical data. The next challenge is to start ramping up the conversion of currently open data into RDF, in order to begin expanding the coverage of the Linked Data.

This is a very exciting project and here at Talis it’s something in which we’re very proud to be playing a role.

Linked Data and News Innovation

Whilst attending the recent NewsInnovation event I gave a lightning talk about Linked Data. The talk was preceded by an introduction to the Guardian Open Platform which reviewed their content and data publishing system, and some of their plans for future development. This set the scene really well as I argued that Linked Data was a natural extension of what the Guardian are doing, and in my half of the session gave a quick overview of Linked Data and its relevance for driving innovation around news reporting. The session was really successful: we had a 25 minute slot and ended up having an interesting discussion about Linked Data, trust, provenance and related issues that ran on for a whole hour. I’m really pleased with how well it went, especially as I only put the slides together on the way to the event!

My short deck of slides is now up on SlideShare, and in the rest of this blog post I’ll briefly summarise the talk.

I opened by speaking about the fundamental idea behind Linked Data: that data be put online, in a very fine-grained way. This takes us beyond having stable links for datasets or just articles, and yields web identifiers for the Who, Why, What, Where and When of the content: every person; place; category; and event can each be identified, annotated and ultimately linked together into a navigable whole. RDF, as the core technology for Linked Data, is very simple to get to grips with, with the notion of resources and their connections being something that anyone can intuitively grasp in a few minutes.

Readers of this blog will already be aware of the success of the Linked Data movement, and a large and growing amount of data is available for people to use and re-use in their applications. Quality varies considerably across the Linked Data web, but ultimately this is the nature of any web based system. With the growing engagement from organizations like the BBC, Library of Congress, and the New York Times, the availability of good quality data is only going to increase.

So in what way is Linked Data useful for driving increasing innovation and change in the way that news is created, reported and accessed?

Well there are some obvious answers around providing new ways to search and discover relevant content, e.g. everything about a specific individual or place. But there are two specific areas where I think Linked Data is important to driving innovation around news. The first is context, the second provenance.

Using Linked Data we can take a mesh of inter-related facts and figures and wrap it in a narrative that can help others understand that information and its relationships. Trends can be observed and reported on; data can be summarized along with a particular perspective. What’s important about Linked Data is that this contextualisation can happen without losing the association between the narrative and the underlying resources: the Who, What, Why, Where and When. Because those links are preserved, the reader has the ability to drill down into the underlying data in order to inspect that data for themselves. The reader can also find other narratives that draw on the same set of data, discovering extra context and alternate viewpoints much more easily. This creates a rich fabric for navigation between stories and their referents.

The other aspect is Provenance, or more simply: the ability to back-track to the source of some content. If the news were presented as Linked Data then we would be able to explore not just relationships between the content, but also journalists and their affiliations. As readers we’ll be able to gain context not just on the stories, but also on the people that are producing them. Through the ability to drill down into the underlying data, we are presented with the opportunity to confirm conclusions; we can fact check stories for ourselves. The ability to identify and ignore questionable sources, or identify stories that are drawn from inaccurate data or analyses, is something that has previously been very hard to do.

Issues like context, provenance, and trust are all areas that the Linked Data and semantic web community are actively exploring and have been for some time. I don’t see any other approaches that are really addressing that space. There is clearly lots of interesting work happening around helping people tell stories with data, and understand the context of news stories (e.g. journalisted), but these are largely disconnected efforts: Linked Data should provide a framework for connecting all that together. IMO, this is an area where Linked Data can add real value in a number of different ways.

Open For Business

Here at Talis, we’re passionate about the web, open standards and Linked Data, to name but a few. This passion has manifested itself as evangelism at conferences and through our own blogging and publications. Everything you’d expect from a company that seriously believes in shared innovation.

Where we’ve been a little more cautious is around the promotion of the Talis Platform, our own semantic web offering. The Platform has been in service for several years now supporting our own product development and innovation but, while we’ve been happy to discuss its features, and to provide developers with access to the Platform for research and prototyping, we’ve been holding off from discussing the commercial aspects of the Platform. There have been several reasons for this.

Firstly there’s the usual story around engineering and product development. This is a continual process, but there were milestones that we felt we had to achieve. These vary from the small (improvements to the general consistency of the API, for example) through to some larger architectural changes. Yesterday, for instance, saw our 23rd monthly Platform release which provided us with a much faster and more scaleable SPARQL endpoint. Existing stores will be gradually and seamlessly transitioned to the new infrastructure and we’ll be capitalizing on these changes over the next few releases to make some additional improvements.

Secondly, and perhaps more fundamentally, we’ve been taking our time over developing our own understanding of how the Linked Data ecosystem is evolving, the infrastructure required to support that ecosystem, and where we can best provide support to help promote shared innovation around the emerging linked data web.

Lastly, we’ve also been deciding how much we can give away for free! Last month we announced the Talis Connected Commons scheme which provides completely free Platform hosting for public domain datasets. This is a fantastic offer that provides for unlimited use of the Platform for datasets up to 50 million triples. That’s a lot of space to capture some really interesting data. Aside from that we’re also committed to ensuring that some aspects of the Platform API, such as the free text search, augmentation, and linked data access will also remain freely available, regardless of the terms of the data hosting.

We refreshed the Platform website yesterday to provide more information on this and other aspects of the service. For example there’s now a high level overview for developers, some discussion of how the Platform can be put to good use, and a short FAQ that addresses some common questions.

As the licensing page notes, we’re now offering to host data in the Platform on a commercial basis. The pricing is based on a utility model, so you pay for the amount of storage and service usage you make. We’re currently working on finalising the terms and conditions and service level agreement around the Platform so that we can share these publicly too: this is another area where we want to take our time and make sure we get things right.

Over the next week or so I hope to post some more information about how the Platform is already being used, as well as discussing some of the exciting commercial projects we have underway.

In the meantime, if you’re interested in signing up for the Connected Commons, want developer access, or want to discuss commercial uses of the Platform further, then please get in touch.

Announcing the Talis Connected Commons

Here at Talis we’re very pleased and excited to be announcing a new scheme that we’re calling the Talis Connected Commons.

We’ve invested a lot of time and energy over the last few years in evangelising the importance of linked open data. Along the way we’ve funded development of open data licenses to help provide the legal framework to support open data projects, and have followed our own advice and shared data with the communities surrounding our own products. And throughout this time we’ve been hard at work not only building the Talis Platform, but also using its flexibility to re-develop our own products.

We felt it was time to start bringing those two strands together and allow other people to really start using the Platform. For a while now we’ve let a number of developers have access to the Platform for the purposes of prototyping and experimentation, but we recognise that for the Platform to become a serious component in the semantic web infrastructure it needs to be offered on a more formal basis. The Talis Connected Commons scheme is the first step towards achieving this, and we think it’s a big one; not only for us, but also for the open data community in general.

True to our desire to see a truly open web of data, under the terms of the Connected Commons scheme Talis is offering free access to the Platform for the purposes of hosting public domain data. And the offer isn’t just limited to free hosting: the data access services, including access to a public SPARQL endpoint, are also freely available.

The terms of the offer are as follows: if you own, or are creating, a public domain dataset then you can store that data in the Platform as RDF, for free. We’re setting an initial cap of 50 million triples on each dataset, but that should be plenty of space in which to collect some really interesting data. To qualify for the scheme, you need to be using either the Open Data Commons Public Domain Dedication and License or the recently launched Creative Commons CC0 license to publish your data. Anyone will then be able to freely access the stored data using the Platform services, without API keys and without usage limits. This means that your data will be wrapped in a ready made API right from the start.

The Platform API covers basic data management facilities, through to a configurable search engine and a fully compliant SPARQL endpoint. And with data being delivered in a range of formats including RDF/XML and JSON, there should be something there for everyone to get their teeth into no matter what kind of application you’re building or environment you’re working in.
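For readers who haven’t queried a SPARQL endpoint before, the sketch below shows roughly what a request looks like under the standard SPARQL protocol, where the query travels in a `query` parameter over HTTP GET. The endpoint URL is a placeholder rather than a real Platform address, and the JSON results media type is an assumption about what a given endpoint supports. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://sparql.example.org/query"  # placeholder, not a real service

# A generic query: the first ten resources that have an rdfs:label.
QUERY = """
SELECT ?s ?label
WHERE { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label }
LIMIT 10
"""

def build_sparql_request(endpoint, query):
    """Build a GET request that carries the query in the `query` parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for JSON results; an endpoint may offer other formats instead.
    return Request(url, headers={"Accept": "application/sparql-results+json"})

request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would execute it against a live endpoint; since no API key is involved, the same URL can also be pasted straight into a browser.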

For more information on the details of the offer visit the Connected Commons homepage. We’ve prepared a lengthy set of frequently asked questions that should hopefully clarify any other questions you might have. If not, then feel free to send in a comment and we’ll try and address your questions.

Enabling the Linked Data Ecosystem

This post will feature in Nodalities Magazine, Issue 5

The Linked Data web might usefully be viewed as an incremental evolution beyond Web 2.0. Instead of disconnected silos of data accessible only through disconnected custom APIs, we have datasets that are deeply connected to one another using simple web links, allowing applications to “follow their nose” to find additional relevant data about a specific resource. Custom protocols and data formats are the realm of the early web; the future of the web is in an increased emphasis on standards like HTTP, URIs and RDF that ironically have been in use for many years.
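The “follow their nose” behaviour can be sketched as a simple traversal: look up a URI, read the statements that come back, and then look up any URIs those statements mention. In the sketch below an in-memory dictionary stands in for the web, and the `dereference` function, datasets and property names are all invented for illustration; a real client would fetch RDF over HTTP instead.

```python
from collections import deque

# Stand-in for the web: each URI "dereferences" to the triples published about it.
WEB = {
    "http://a.example/id/book1": [
        ("http://a.example/id/book1", "http://a.example/def/author", "http://b.example/id/person7"),
    ],
    "http://b.example/id/person7": [
        ("http://b.example/id/person7", "http://b.example/def/bornIn", "http://c.example/id/place3"),
    ],
    "http://c.example/id/place3": [
        ("http://c.example/id/place3", "http://www.w3.org/2000/01/rdf-schema#label", "Exampleton"),
    ],
}

def dereference(uri):
    """Hypothetical lookup; a real client would issue an HTTP GET for RDF."""
    return WEB.get(uri, [])

def follow_your_nose(start, max_fetches=10):
    """Breadth-first traversal that follows URI-valued objects from `start`."""
    seen, queue, collected = {start}, deque([start]), []
    fetched = 0
    while queue and fetched < max_fetches:
        uri = queue.popleft()
        fetched += 1
        for triple in dereference(uri):
            collected.append(triple)
            obj = triple[2]
            # Only objects that look like web links can be followed further.
            if obj.startswith("http://") and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return collected

for triple in follow_your_nose("http://a.example/id/book1"):
    print(triple)
```

Starting from a single book, the traversal crosses three separate datasets without any dataset-specific API, which is the contrast with disconnected silos drawn above.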

Describing this as a “back to basics” approach wouldn’t be far wrong. Many might argue that RDF is far from simple, but this overlooks the elegance of its core model. Working within the constraints of standard technologies and the web architecture allows for a greater focus on the real drivers behind data publishing: what information do we want to share, and how is it modelled?

Answering those questions should be relatively easy for any organisation. All businesses have useful datasets that their customers and business partners might usefully access; and they have the domain expertise required to structure that data for online reuse. And, should any organisation want some additional creative input, the Linked Data community has also put together a shopping list [1] to highlight some specific datasets of interest. This list is worth reviewing alongside the Linked Data graph [2], to explore both the current state of the Linked Data web and the directions in which it is potentially going to grow.

Beyond the first questions of what and how to share data, there are other issues that need to be considered. These range from internal issues that organisations face in attempting to justify the sharing of data online, through to larger concerns that may impact the Linked Data ecosystem. For the purposes of this article, this ecosystem can be divided up into two main categories: data publishers, who publish and share information online; and data consumers, who make use of these rich datasets.

There is obvious overlap between these two categories: many organisations will fall into both camps, as do we all through our personal contributions to the web. However, for this article I want to focus primarily on business and organisational participants, and attempt to illustrate the different issues that are relevant to these roles.

Data Publishers’ Perspective

The first issue facing any organisation is how to justify both the initial and ongoing effort required to support the publishing of Linked Data. Depending on existing infrastructure this may range from a relatively small effort to a major engineering task—particularly true if content has to be converted from other formats or new workflows introduced. In “A Call to Arms” in the last issue of Nodalities [3], John Sheridan and Jeni Tennison provided some insight into how to address the technology hurdle by using technologies like RDFa.

But can this effort be made sustainable? Can the initial investment and ongoing costs be recouped? And, if a dataset becomes popular and grows to become very heavily used, can the infrastructure supporting the data publishing scale to match?

The general aim in enabling access to data is that it will foster network effects, and drive increasing traffic and usage towards existing products and services. There are success stories aplenty (Amazon, eBay, Salesforce, etc.) that illustrate that the potential is real rather than imagined.

But this justification overlooks some important distinctions. Firstly, for some organisations, e.g. charities and non-governmental organisations, information dissemination is part of their mission and there may not be other chargeable services to which additional traffic may be driven. In this scenario everything must be sustainable from the outset. Secondly, it overlooks the fact that the data being shared may itself be an asset that can be commoditised. The value of access to raw data, stripped of any bundling application, has never been clearer, nor has it been easier to achieve. New business models are likely to arise around direct access to quality data sources. Simple usage-based models are already prevalent on a number of Web 2.0 services and APIs: free basic access fosters network effects, while tiered pricing provides more reliable revenue for the data publisher.

Software as a service and cloud computing models undoubtedly have a role to play in addressing the sustainability and scaling issues, allowing data publishers to build out a publishing infrastructure that will support these operations without significant capital investments. But few of the existing services are really firmly targeted at this particular niche: while computing power and storage are increasingly readily available, support for Linked Data publishing or metered access to resources is not yet commonplace.

This is where Talis and the Talis Platform have a distinct offering: by supporting organisations in their initial exploration of Linked Data publishing, with a minimum of initial investment and a scalable, standards-based infrastructure, it becomes much easier to justify dipping a toe into the “Blue Ocean” (see Nodalities issue 2 [5]).

Data Consumers’ Perspective

Let’s turn now to another aspect of the Linked Data ecosystem, and consider the data consumers’ perspective.

One issue that quickly becomes apparent when integrating an application with a web service or Linked Dataset is the need to move beyond simple “on the fly” data requests,  e.g. to compose (“mash-up”) and view data sources in the browser, towards polling and harvesting increasingly large chunks of a Linked Dataset.

What drives this requirement? In part it is a natural consequence and benefit of the close linking of resources: links can be mined to find additional relevant metadata that can be used to enrich an application. The way that the data is exposed, e.g. as inter-related resources, is unlikely to always match the needs of the application developer, who must harvest the data in order to index, process and analyse it so that it best fits the use cases of her application.

Creating an efficient web-crawling infrastructure is not an easy task, particularly as the growth of the Linked Data web continues and the pool of available data grows. Technologies like SPARQL do go some way towards mitigating these issues, as a query language allows for more flexibility in extracting data. However, provision of a stable SPARQL endpoint may be beyond the reach of smaller data publishers, particularly those who are adopting the RDFa approach of instrumenting existing applications with embedded data. SPARQL also doesn’t help address the need to analyse datasets, e.g. to mine the graph in order to generate recommendations, analyse social networks, etc.
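The harvesting pattern described above, following links outward from a starting resource to gather related data, can be sketched roughly as follows. This is a minimal illustration only: the `harvest` function, the `fetch` callable and the example.org URIs are all hypothetical, and an in-memory dictionary stands in for real HTTP dereferencing and RDF parsing.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def harvest(start: str,
            fetch: Callable[[str], Iterable[Triple]],
            max_depth: int = 2) -> List[Triple]:
    """Breadth-first harvest of triples reachable from `start`.

    `fetch` is an assumed callable returning the triples that describe
    a resource; a real crawler would dereference the URI over HTTP.
    """
    seen: Set[str] = {start}
    queue = deque([(start, 0)])
    collected: List[Triple] = []
    while queue:
        uri, depth = queue.popleft()
        for s, p, o in fetch(uri):
            collected.append((s, p, o))
            # Follow only objects that look like URIs, within the depth limit.
            if o.startswith("http") and o not in seen and depth < max_depth:
                seen.add(o)
                queue.append((o, depth + 1))
    return collected


# Hypothetical toy dataset standing in for the Linked Data web.
TOY_WEB: Dict[str, List[Triple]] = {
    "http://example.org/book/1": [
        ("http://example.org/book/1", "title", "An Example Book"),
        ("http://example.org/book/1", "author", "http://example.org/person/7"),
    ],
    "http://example.org/person/7": [
        ("http://example.org/person/7", "name", "A. Writer"),
        ("http://example.org/person/7", "knows", "http://example.org/person/9"),
    ],
    "http://example.org/person/9": [
        ("http://example.org/person/9", "name", "B. Reader"),
    ],
}


def toy_fetch(uri: str) -> List[Triple]:
    """Look up a resource in the toy dataset instead of fetching it."""
    return TOY_WEB.get(uri, [])


triples = harvest("http://example.org/book/1", toy_fetch, max_depth=1)
```

The depth limit is the important design choice here: without one, a harvester that follows every link will eventually try to pull in the whole web of data, which is exactly the cost that pushes consumers towards shared harvesting services.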

Just as few applications carry out large scale crawling of the web, instead relying on services from a small number of large search engines, it seems reasonable to assume that the Linked Data web will similarly organise around some “true” semantic web search engines that provide data harvesting and acquisition services to machines rather than human users. Issues of trust will also need to be addressed within this community as the Linked Data web matures and becomes an increasing target for spam and other malicious uses. Inaccuracies and inconsistencies are already showing up.

The Talis Platform aims to address these issues by ultimately providing application developers with ready access to Linked Datasets, avoiding the need for individual users and organisations to repeatedly crawl the web. Value-added services can then be offered across these data sources, allowing features such as graph analysis (e.g. recommendations) to become commodity services available to all. The intention is not to try to mirror or aggregate the whole Linked Data web, which would be unfeasible, but rather to collate those datasets that are of most value and use to the community, as well as shepherding the publishing of new datasets by working closely with data publishers.

As an intermediary, the Talis Platform can also address another issue: that of scaling service infrastructure to meet the requirements of data consumers without requiring data publishers to do likewise. It seems likely that data publishers may ultimately choose to “multi-home” their datasets, e.g. publishing directly onto the Linked Data web and also within environments such as the Talis Platform in order to allow consumers more choice in the method of data access.

Conclusions

The bootstrapping phase of the Linked Data web is now behind us. As a community, we need to begin considering the next steps, especially as the available data continues to grow. This article has attempted to illustrate a few of the wide range of issues that we face. While technology development, particularly around key standards like SPARQL, rules and inferencing, and the creation of core vocabularies, will always underpin the growth of the semantic web, increasingly it will be issues such as serviceable infrastructure and sustainable business models that will come to the fore.

At Talis we are thinking carefully about the role we might play in addressing those issues and playing our part in enabling the Linked Data ecosystem to flourish.

[1]. http://community.linkeddata.org/MediaWiki/index.php?ShoppingList
[2]. http://richard.cyganiak.de/2007/10/lod/
[3]. http://www.talis.com/nodalities/pdf/nodalities_issue4.pdf
[4]. http://labs.google.com/papers/bigtable.html
[5]. http://www.talis.com/nodalities/pdf/nodalities_issue2.pdf

Next Generation Business Intelligence at ISWC 2008

The second pre-conference session I attended at ISWC 2008 was a tutorial session on “Knowledge Representation and Extraction for Business Intelligence”.

I attended the session as I was curious to learn about more applied uses of Semantic Web technology, particularly in the financial and business context. In terms of content the tutorial veered wildly from overview material through to some quite detailed looks at linguistic and semantic analysis to extract information from business reports. For that reason I’m not going to attempt to summarize the full content of the tutorial but will pick out a few areas of interest.

Some time was spent looking at XBRL, the standard business reporting language which is becoming increasingly adopted around the world as a standard means to publish and share business reports. The initiative, which began in 1999, was extended this year to include a European XBRL consortium. The broad goal of the project is to standardize the means and structure of publishing business financial reports, making it easy to compare and collate reports for regulatory and other purposes. The current financial crisis was referenced as an illustration of the need for greater transparency in business reporting and is an obvious driver for adoption of the technology.

XBRL draws on many of the same concepts as the Semantic Web, in particular the use of “taxonomies” that can be customized by specific businesses, sectors and regulatory areas, but uses XML technologies like XML Schema rather than RDF. There is growing interest in being able to capture this information using RDF and in mapping XBRL taxonomies into Semantic Web ontologies. For example there has been some early work on an XBRL ontology, as well as some independent exploration and signs that a W3C incubator or interest group might be formed. The speaker at the tutorial also suggested that before long some standard GRDDL connectors would be available to automate the transformation of XBRL documents into RDF.

Much of the tutorial was a discussion of applied uses of RDF data and ontologies within the context of the Musing Project, an EU-funded project exploring “next-generation business intelligence” in the areas of financial risk management, internationalisation and IT operational risk. Some of the applications that have been explored are: collecting company info from a range of multilingual sources; attempting to assess the chances of success of a business in a specific region; semi-automated form filling, e.g. for returns; identifying appropriate business partners; and reputation tracking and opinion mining.

Many of the issues faced in the Musing project deal with how to assemble this data with a historical context: while XBRL data may be present for current or recent years, text mining is required to extract this data from historical reports. The last part of the tutorial was a general introduction to Information Extraction using the Gate toolkit (this starts from around Slide 75 in the PowerPoint slides). This was a good overview of the capabilities of the toolkit and showed some nice use cases. OpenCalais certainly isn’t the only game in town and, while Gate requires more effort to set up, it looks like it could provide a great deal more customisation options for businesses that really need the extra power.

One of the telling things about the overall process was the need to collate useful data from a number of different sources in order to drive the information extraction process. In order to do Named Entity Extraction a good set of reference material is required, e.g. gazetteers for place names, or lists of people’s names. While much of this data is already available (in Musing they drew on Wikipedia and the CIA World Factbook, for example), a lot more information was either available only by crawling the web or from commercial resources. This suggests to me that there’s still some groundwork to be done in unlocking more data sets that can help drive the business intelligence use cases. There’s essentially a domino effect here: exposing often small, focused datasets can end up unlocking huge potential value further down the line.
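To make the dependence on reference data concrete, here is a deliberately naive sketch of gazetteer-based entity spotting. It is not the Gate API, and the `GAZETTEER` entries and `find_entities` function are invented for illustration; the point is simply that the extractor can only recognise what its reference lists contain.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical gazetteer: lower-cased surface form -> entity type.
GAZETTEER: Dict[str, str] = {
    "karlsruhe": "Location",
    "germany": "Location",
    "talis": "Organisation",
}


def find_entities(text: str, gazetteer: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (token, type) pairs for single-word gazetteer hits, in text order."""
    matches: List[Tuple[str, str]] = []
    for token in re.findall(r"[A-Za-z]+", text):
        entity_type = gazetteer.get(token.lower())
        if entity_type is not None:
            matches.append((token, entity_type))
    return matches


found = find_entities("Talis staff attended ISWC in Karlsruhe, Germany.", GAZETTEER)
```

Here “ISWC” goes unrecognised because it is missing from the list, which is the domino effect in miniature: each newly published dataset of names, places or organisations widens what can be extracted downstream.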

Jim Hendler at the INSEMTIVE 2008 Workshop

Along with a number of my colleagues, I’m currently attending the ISWC 2008 conference in Karlsruhe, Germany. Yesterday I attended the INSEMTIVE workshop (“Incentives for the Semantic Web”) which aimed to explore incentives for the creation of semantic web content, i.e. to encourage the creation of more structured metadata. The workshop papers are available to browse online or you can download the complete proceedings. There was a real mix of papers, covering specific issues such as extraction of semantics from tagging, and identifying information needs of a community by analysing search patterns, through to position papers that attempted to highlight shortcomings in current semantic web applications that deter people from creating metadata.

I found the position papers most interesting if only because they provided confirmation of something that I’ve been thinking for a while now: that people will (and do) create metadata when there are obvious and immediate benefits in them doing so. No-one really consciously sits down to share or create metadata: they sit down to do a specific task and metadata drops out as a side-effect. For me this makes much of the problem highlighted by the workshop one of interaction design: how do we build good task-oriented user interfaces that encourage the creation of semantic web metadata, and how can we illustrate the benefits of semantic web technologies in an incremental fashion? In my opinion solving this will require close collaboration between semantic web researchers and developers, and interaction designers.

The end of the workshop was a discussion session chaired by Jim Hendler. Hendler chose to do a retrospective of some older presentations to explore how thinking has evolved (or not!) with respect to drivers towards the development of the semantic web.

Starting in 1999, Hendler showed some slides from DAML strategy talks that emphasised the need for a number of different areas to align before a real marketplace can be created for semantic web content and applications. These areas were tools, users, and languages (e.g. OWL). Hendler noted that the Semantic Web community had mistakenly focused too heavily on languages and not enough on the other areas. He also thought that “Web 2.0” had focused primarily on the users, to a lesser extent on the tools, and very little on the language aspects. Hendler thought that this alignment was now taking place.

Moving forward in time to show some slides from 2001-2002, Hendler introduced the idea that the development of the web itself will “force” the evolution of the semantic web, i.e. that internal pressures, such as the need to better manage and extract value from the massive amounts of online information, will require the semantic web to solve specific problems. Hendler observed that the web has demonstrated that people will do more work to share information with others than they will do to help themselves; i.e. people are lazy. When people want to, need to, or are rewarded for sharing information and content then they will work much harder than they would do to manage and organize information purely for their own uses. Hendler noted that there is a tendency to say “we’ll solve the data creation problem at the individual level, as solving it at a group level is harder to manage”, but a look at web history illustrates that the opposite is in fact the case.

Hendler also shared what he thought was the best piece of advice he’d been given by Tim Berners-Lee: start small but viral and you can change many things. Hendler’s slides characterized this as: “My friend sees it, wants one; My competitor sees it, needs one”.

Looking at slides from 2002, Hendler introduced the “Value proposition” supporting the creation of semantic web data & content, i.e. that there has to be some immediate return on the investment in creating metadata.

Hendler finished his retrospective with a slide from a 2008 talk that showed the range of commercial companies, government projects and vertical sectors that were now heavily engaged in the Semantic Web (I was happy to see Talis mentioned in the list!). In Hendler’s opinion there is a growing excitement that the “next big thing” is going to come from the Semantic Web; not a “Google Killer”, but the next big revolutionary idea or service. The incentive here is the obvious one: money.

Hendler noted that there is a huge amount of data out there and that finding anything in the mess can be a win. So even a little semantics can make a difference here and could provide some competitive advantages. We don’t need perfect answers or solutions, just incremental improvements on what we have now.

I was also happy to see Hendler encourage researchers to “compete in the real world”. He noted that they have to work within the context of a real world that is moving very fast, and that they can’t really compete with the resources of commercial firms in creating semantic web applications and demonstrators; they should instead try to work within that context to demonstrate real value from the technology. Hendler encouraged them to focus on issues of scalability. Does the fundamental technology scale? Do the concepts and ideas scale to a real user base? As an illustration Hendler noted that he was working with a number of companies that were using some simple OWL constructs in order to add semantics to applications, but that none of them were using a formal reasoner, just “little pieces of procedural code that scale really well”.
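Hendler didn’t show any of that code, so the following is only a guess at its flavour: a single pass over a set of triples that materialises statements implied by two simple OWL constructs, owl:inverseOf and owl:SymmetricProperty. The `apply_simple_rules` function and the `ex:` property names are hypothetical, and this is in no way a substitute for a real reasoner.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def apply_simple_rules(triples: Set[Triple],
                       inverse_of: Dict[str, str],
                       symmetric: Set[str]) -> Set[Triple]:
    """Materialise inverse and symmetric statements in one linear pass."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in inverse_of:
            # owl:inverseOf: (s p o) implies (o q s) where q is the inverse of p.
            inferred.add((o, inverse_of[p], s))
        if p in symmetric:
            # owl:SymmetricProperty: (s p o) implies (o p s).
            inferred.add((o, p, s))
    return inferred


# Hypothetical example data and property declarations.
data: Set[Triple] = {
    ("ex:alice", "ex:wrote", "ex:report"),
    ("ex:alice", "ex:collaboratesWith", "ex:bob"),
}
result = apply_simple_rules(
    data,
    inverse_of={"ex:wrote": "ex:writtenBy"},
    symmetric={"ex:collaboratesWith"},
)
```

The cost grows linearly with the number of triples and there is no fixpoint computation, which is presumably the kind of trade-off behind “scale really well”: less inference, but predictable performance.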

Overall, an interesting workshop!

Paul Miller did a podcast with Jim Hendler back in March if you want to hear more about his thoughts on the Semantic Web.

The Web’s Rich Tapestry

We’ve all read books that linger in our memories. And there are any number of reasons why they might do so: a stirring tale or thought-provoking argument, for example. One book that has stayed with me over the years is The House of Leaves by Mark Danielewski. It’s been described as “the Blair Witch” of haunted house tales, being the story of a house, the people who live there, and those who attempt to document the strange events and structure of the building. The book is quite a challenging read as it is made up of overlapping narratives, documentary evidence from the investigators, etc. As a reader you’re assembling a narrative out of the interlocking pieces of text that the author presents you with.

But, while the tale is one of those slow-burning horror stories that does linger at the back of the mind, that’s not the primary reason why the book has stayed with me. It was the actual structure of the text that was so intriguing: the author has played with the printed form, including the basic layout of the print on the page, in an attempt to further promote the mythology of the story and to help convey the labyrinthine nature of the house. For example a typical page might contain several different blocks of text, and much of the story is told through footnotes, footnotes to footnotes, and footnotes to those footnotes. Certain words are coloured differently throughout the text. There are even blocks of text embedded in the page which you have to read downwards through several pages before returning to your starting point. As a reader you’re physically exploring the text much like the characters are exploring the house.

The book is basically a hypertext novel and while certainly not the first to play with the printed form in this way, it was the first that I’d personally encountered. As a hypertext the book appeals to the technologist in me: I’ve given a number of talks over the past few years and in many of these I’ve explored the evolution of hypertext systems. But I’ve also attempted to challenge people’s pre-conceptions about the medium of the web, just as the House of Leaves challenged my conceptions about the printed medium.

My most recent talk was last week at the ALPSP International Conference 2008 in Old Windsor. The talk, titled “The Web’s Rich Tapestry”, discussed the link as the basic medium of the web and reviewed how the blurring of boundaries between websites, services and data (aka “Web 2.0”) is enabled by increasingly rich linking between resources. This is part of a move from old broadcast models of information publishing to a more web-like network of interconnected peers, each contributing to a dense information medium. The ultimate endpoint of this is inherent in the vision of the Semantic Web, and will complete the change from a document-centric to a data-centric world. The Semantic Web, which is just a layer on top of the existing web, is still based on linking, albeit linking of a more fine-grained and meaningful nature.

The Semantic Web, just like the existing Web, will arrive through the actions of individuals, organizations and businesses, each contributing to the whole by sharing linked data sets; this process is already happening. And, like the Web, the more data is available, the more value there will be for everyone involved. I urged society publishers to begin more openly sharing their metadata and exploring the potential inherent in the Web of Data. I also attempted to do more than just evangelize the potential benefits of the Semantic Web, and tried to provide a few pointers towards where those benefits might be realized.

One obvious benefit relates to the generation of more traffic to content and services. For many publishers a sizeable proportion, if not the majority, of their website traffic is driven by Google referrals. This is an inherently fragile situation, but one that I believe is ultimately temporary. The scale of this traffic generation is obviously due in major part to the popularity of the Google search engine, but it is enabled by their ability to quickly and efficiently crawl websites in order to index content. This provides a large “surface area” to which Google can generate links. By publishing open data, information providers will be able to grow this surface area by at least an order of magnitude due to the more fine-grained data publishing that the Semantic Web entails. All of this data can potentially generate new, highly relevant traffic to content and services.

The other area in which the Semantic Web will pay off is in enabling much more sophisticated research and analysis tools, not just for academic researchers and students, but also for all of us in our everyday consumption of information. In my view there is too much of a focus on search and not enough on information visualisation and analysis tools. I pointed towards some very recent experiments which I think illustrate some of this potential, including Ubiquity and Freebase Parallax. Talis’s own Project Xiphos is also exploring the innovation that can follow from re-purposing publishing metadata, a topic that was particularly relevant to the ALPSP audience. In my new role as Programme Manager for the Talis Platform, I’m excited to begin exploring how we can start helping businesses to begin drawing value from the rapidly growing Web of Data.