Nodalities

From Semantic Web to Web of Data

Linked Spending Data – How and Why Bother Pt3

As is often the way, events have conspired to prevent me from producing this third and final part of this How & Why of Local Government Spending Data as soon as I wanted.  So my apologies to those eagerly awaiting this latest instalment.

To quickly recap, in Part 1 I addressed issues around why pick on spending data as a start point for Linked Data in Local Government, and indeed why go for Linked Data at all.  In Part 2, I used some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area, as examples to demonstrate how you can publish spending data as Linked Data, for both human and programmatic consumption.

I am presuming that you are still with me on my basic assumptions “…publishing this [local government spending] data is a good thing” and “Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing”, and that the technique of using URIs to name things in a globally unique way (that also provides a link to more information) is not giving you mental indigestion.  So, I now want to move on to some of the issues that are causing debate in the community, which come under the headings of ontologies and identifiers.

Ontologies

An ontology, according to Wikipedia, is a formal representation of knowledge as a set of concepts within a domain.  An ontology provides a shared vocabulary, which can be used to model a domain – that is, the type of objects and/or concepts that exist, and their properties and relations.  So in our quest to publish spending data what ontology should we use?  The Payments Ontology, with the accompanying guide to its application, is what is needed.  Using it, it becomes possible to describe individual payments, or expenditure lines, and their relationships to the authority (payment:payer), the supplier (payment:payee), the category (payment:expenditureCategory) etc.  The next question is how you identify the things that you are relating together using this ontology.

Let’s take this one step at a time:

  1. Give the expenditure line, or individual payment, an identifier possibly generated by our accounts system. eg. 8605670.
  2. Make that identifier unique to our local authority by prefixing it with our internet domain name. eg. http://spending.lichfielddc.gov.uk/spend/8605670 – note the prefix of ‘http://’.  This enables anyone wanting detail about this item to follow the link to our site to get the information.
  3. Associate a payer with the payment with an RDF statement (or triple) using the Payments Ontology:
    http://spending.lichfielddc.gov.uk/spend/8605670 
    payment:payer
    http://statistics.data.gov.uk/id/local-authority/41UD .

    Note I am using an identifier for the payer that is published by statistics.data.gov.uk.  That is so that everyone else will unambiguously understand which authority is the one responsible for the payment.

  4. Follow the same approach for associating the payee:
    http://spending.lichfielddc.gov.uk/spend/8605670 
    payment:payee
    http://spending.lichfielddc.gov.uk/supplier/bristow-sutor .
  5. And then repeat the process for categorisation, payment value etc.
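
The steps above can be sketched in a few lines of Python. This is only an illustration, not how Lichfield generate their data: the function names are made up, and the full namespace behind the payment: prefix is an assumption that should be checked against the published Payments Ontology.

```python
# Namespace assumed for the "payment:" prefix; verify against the published ontology.
PAYMENT = "http://reference.data.gov.uk/def/payment#"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one RDF statement as an N-Triples line (all three terms are URIs)."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe_payment(domain: str, line_id: str, payer_uri: str, payee_slug: str) -> list[str]:
    """Mint a URI for the expenditure line, then link it to its payer and payee."""
    line_uri = f"http://{domain}/spend/{line_id}"         # steps 1 and 2
    payee_uri = f"http://{domain}/supplier/{payee_slug}"  # locally minted supplier URI
    return [
        triple(line_uri, PAYMENT + "payer", payer_uri),   # step 3
        triple(line_uri, PAYMENT + "payee", payee_uri),   # step 4
    ]


statements = describe_payment(
    "spending.lichfielddc.gov.uk",
    "8605670",
    "http://statistics.data.gov.uk/id/local-authority/41UD",
    "bristow-sutor",
)
print("\n".join(statements))
```

Each printed line is one subject, predicate, object statement, so step 5 (category, value and so on) is just more calls to the same helper; literal values such as amounts would need a slightly different formatter.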

This immediately throws up a couple of questions, such as why use a locally defined identifier for the payee – surely there is an identifier I can use that others will recognise, such as a company or VAT number?  There are, but at the moment there are no established sets of URI identifiers for these.  OpenCorporates.com are doing some excellent work in this area, but Companies House, the logical choice for publishing such identifiers, have yet to do so.  Pragmatically it is probably a good idea to have a local identifier anyway and then associate it with another publicly recognised identifier:
http://spending.lichfielddc.gov.uk/supplier/bristow-sutor
owl:sameAs
http://opencorporates.com/companies/uk/01431688 .

Identifiers

Because this is all very new and still emerging, we now find ourselves in a bit of a chicken-or-egg situation.   I presume that most authorities have not built a mini spending website, like Lichfield District Council has, to serve up details when someone follows a link like this: http://spending.lichfielddc.gov.uk/spend/8605670 

You could still use such an identifier using your authority domain, and plan to back it up later with a web service to provide more information.  Or you could let someone else, who takes a copy of your raw data, do it for you as OpenlyLocal might: http://openlylocal.com/financial_transactions/135/2010/33854 or maybe as the project we are working on with LGID might: http://id.spending.esd.org.uk/Payment/36UF/ds00024616.  In the open, flexible world of Linked Data it doesn’t matter too much which domain an identifier is published from, or for that matter how many [related] identifiers are used for the same thing.

It does matter, however, to those looking to the identifying URI for some idea of authority.  As I say above, technically it doesn’t matter whose domain the identifier comes from, but I believe it would be better overall if it came from the authority whose payment it is identifying.  Which puts us back in the chicken-or-egg situation as to resolving the URI to serve up more information.   The joy of Linked Data is that, provided aggregators take care to identify the source authority’s data accurately when they encode it, it should be possible to automatically retrofit links between URIs at a later date.
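
To make the retrofitting idea a little more concrete, here is a minimal sketch of how a consumer might group identifiers once owl:sameAs links exist. It treats sameAs as symmetric and transitive and uses a small union-find structure. The two aggregator URIs on example domains are hypothetical; only the Lichfield and OpenCorporates URIs come from the text above.

```python
from collections import defaultdict


def same_as_groups(pairs):
    """Group URIs that are connected, directly or indirectly, by owl:sameAs pairs."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps lookups short
            uri = parent[uri]
        return uri

    for left, right in pairs:
        parent[find(left)] = find(right)

    groups = defaultdict(set)
    for uri in list(parent):
        groups[find(uri)].add(uri)
    return list(groups.values())


pairs = [
    # Hypothetical aggregator URI linked back to the authority's own payment URI.
    ("http://aggregator.example/payment/41UD-8605670",
     "http://spending.lichfielddc.gov.uk/spend/8605670"),
    # Local supplier URI linked to a publicly recognised one.
    ("http://spending.lichfielddc.gov.uk/supplier/bristow-sutor",
     "http://opencorporates.com/companies/uk/01431688"),
    # A second, hypothetical aggregator pointing at the same public supplier URI.
    ("http://another-aggregator.example/supplier/123",
     "http://opencorporates.com/companies/uk/01431688"),
]
groups = same_as_groups(pairs)
for group in groups:
    print(sorted(group))
```

The point of the sketch is that nobody has to agree on a single identifier up front: as long as the links are published, the groups can be computed later.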

In summary, over this series of posts we have seen a technology which, although it has obvious benefits, is still early on the development curve, being applied to a process which is also new and scary for many.  That is an ideal breeding ground for cries of pain and assertions of ‘it doesn’t work’ or ‘not worth bothering’, yet it has the potential to provide a powerful foundation for a future data-rich environment that is open, accessible, and beneficial to authorities, government, citizens, and UK Plc.  Yes it is worth bothering, just don’t expect benefits on day, or even month, one.

 

 

 

Linked Spending Data – How and Why Bother Pt2

I started the previous post in this mini-series with an assumption: “..working on the assumption that publishing this [local government spending] data is a good thing”. That post attracted several comments, fortunately none challenging the assumption.   So, learning from that experience, I am going to start with another assumption in this post: publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing.  Those new to this mini-series should check back to the previous post for my reasoning behind the assertion.

In this post I am going to be concentrating more on the How than the Why Bother.

To help with this I am going to use some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area as examples.  Take a look at the spending data part of their site: spending.lichfielddc.gov.uk/.   On the surface, navigating your way around the site looking at council spend by type, subject, month, and supplier is the kind of experience a user would expect. Great for a website displaying information about a single council. 

However, it is more than a web site.  Inspection of the Download data tab shows that you can get your hands on the source data in csv format.  Here is one line, representing a line of expenditure, from that data:

"http://statistics.data.gov.uk/id/local-authority/41UD","Lichfield District Council","2010-04-06","7747","http://spending.lichfielddc.gov.uk/spend/8605670","120.00","BRISTOW & SUTOR","401","Revenue Collection","Supplies & Services","Bailiff Fees",""

… which represents the data displayed on this human readable page:

Lichfield District Council Spending Data - Details of payment number 8605670
Looking through the csv, you can pick out the strings of characters for information such as the date, supplier name, department name etc.  In addition you can pick out a couple of URIs:

In the context of csv, that’s all these URIs are, identifiers.  However because they are http URIs you can click through to the address to get more information.  If you do that with your web browser you get a human readable representation of the data.  These sites also provide access to the same data, formatted in RDF, for use by developers.
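
For developers, the csv line quoted earlier can be pulled apart with a few lines of Python. This is a rough sketch: the column names are my own guesses at what each position means, not headings taken from the published file.

```python
import csv
import io

# Column names are assumptions for this sketch; the published file may label them differently.
COLUMNS = [
    "authority_uri", "authority_name", "date", "invoice_number", "payment_uri",
    "amount", "supplier_name", "department_code", "department", "spend_type",
    "subjective", "notes",
]

LINE = ('"http://statistics.data.gov.uk/id/local-authority/41UD","Lichfield District Council",'
        '"2010-04-06","7747","http://spending.lichfielddc.gov.uk/spend/8605670","120.00",'
        '"BRISTOW & SUTOR","401","Revenue Collection","Supplies & Services","Bailiff Fees",""')

# Parse the single line into a dictionary keyed by the assumed column names.
row = next(csv.DictReader(io.StringIO(LINE), fieldnames=COLUMNS))

# The URIs are simply the values that start with the http scheme.
uris = [value for value in row.values() if value.startswith("http://")]

print(row["date"], row["supplier_name"], row["amount"])
for uri in uris:
    print(uri)
```

Everything else in the row is a plain string, which is exactly the limitation the RDF version addresses.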

You can see that data by adding ‘.rdf’ to the end of the address, thus: http://spending.lichfielddc.gov.uk/spend/8605670.rdf and then selecting the ‘view source’ option of your browser for the page of gobbledegook that you get back.  

Inspecting the RDF, you will see that most things, except descriptive labels and financial values, are now identified by URIs such as http://spending.lichfielddc.gov.uk/subjective/bailiff-fees and http://spending.lichfielddc.gov.uk/invoice/7747.  Again if you follow those links, you will get a human readable representation of that resource, and the RDF behind it by adding a ‘.rdf’ suffix.
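
A program can follow the same ‘.rdf’ suffix convention. The sketch below only prepares the request and does not send it, since the site may not always be available. The function names are hypothetical, and the Accept header is an extra courtesy I have added rather than something the site is known to require.

```python
import urllib.request


def rdf_url(resource_uri: str) -> str:
    """Return the address of the RDF document for a resource, using the '.rdf' suffix."""
    return resource_uri.rstrip("/") + ".rdf"


def build_rdf_request(resource_uri: str) -> urllib.request.Request:
    """Prepare (but do not send) a request for the RDF/XML behind a resource URI."""
    return urllib.request.Request(
        rdf_url(resource_uri),
        headers={"Accept": "application/rdf+xml"},  # assumption: not required by the site
    )


request = build_rdf_request("http://spending.lichfielddc.gov.uk/spend/8605670")
print(request.full_url)
# To fetch for real: urllib.request.urlopen(request).read()
```

The same two functions work for the invoice and subjective URIs, because every resource follows the same pattern.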

The eagle-eyed, inspecting the RDF-XML for Lichfield payment number 8605670, will have noticed a couple of things.  Firstly, a liberal sprinkling of elements with names like payment:expenditureCategory or payment:payment. These come from the Payments Ontology as published on data.gov.uk as the recommended way of encoding spending, and other payment associated data, in RDF.

Secondly, you may have spotted that there is no date, or supplier name or identifier.  That is because those pieces of information are attributes associated with a payment – invoice number 7747 in this case.

Zooming out from the data for a moment, and looking at the human readable form, you will see that most things, like spend type, invoice number, supplier name, are clickable links, which take you through to relevant information about those things – address details & payments for a supplier, all payments for a category etc.  This intuitive natural navigation style often comes as a positive consequence of thinking about data as a set of linked resources instead of the traditional rows & columns that we are used to.  Another great example of this effect can be found on a site such as the BBC Wildlife Finder.  That is not to say that you could not have created such a site without even considering Linked Data, of course you could.  However, data modelled as a set of linked resources almost self-describes the ideal navigation paths for a user interface to display it to a human.

The Linked Data practice of modelling data, such as spending data, as a set of linked resources and identifying those resources with URIs [which if looked up will provide information about that resource] is equally applicable to those outside of an individual authority.  By being able to consume that data, whilst understanding the relationships within it and having confidence in the authority and persistence of the identifiers within it, a developer can approach the task of aggregating, comparing, and using that data in their applications more easily.

So, how do I (as a local authority) get my data from its raw flat csv format into RDF with suitable URIs and produce a site like Lichfield’s?  The simple answer is that you may not have to – others may help you do some, if not all, of it.   With help from organisations such as esd-toolkit, OpenlyLocal, SpotlightOnSpend, and with projects such as the xSpend project we are working on with LGID, many of the conversion [from csv], data formatting, and aggregation processes are being addressed – maybe not as quickly or completely as we would like, but they are.  As to a human readable web view of your data, you may be able to copy Stuart by taking up the offer of a free Talis Platform Store and then running your own web server with his code, which he hopes to share as open source.  Alternatively it might be worth waiting for others to aggregate your data and provide a way for your citizens to view it.

As easy as that then! – Well not quite, there are some issues about URI naming and creation, and how you bring the data together that still do need addressing by those engaged in this.  But that is for Part 3….

Challenges and Opportunities for Linked Data

Yesterday I gave a short talk at Online Information 2010 titled “Challenges and Opportunities for Linked Data” (abstract). The presentation highlighted what I saw as the main challenges that face us as we grow the web of data, and highlighted some opportunities for organisations that want to get involved.

I believe there will be video from the various presentations online at some point, but wanted to post a transcript of what I said (or had planned to say!). The slides are up on slideshare if you’re interested, although they’re largely just transitions to highlight my main themes.

Introduction

2010 has certainly been the year of Linked Data. I’ve been working with RDF and Semantic web technologies for about 10 years now, and it’s clear that the last 12 months have been one of the critical growth points for Linked Data and the semantic web as a whole. There has been more debate, engagement, and publication of data than ever before.

This is in no small part due to the fantastic work that has taken place at data.gov.uk. The project has not only championed the approach but also led the way as an exemplar for how to do this stuff really well. The adoption of RDFa by Facebook, Google and others has also created a much needed feedback loop that is driving the publication of more structured data.

But as the technology grows we’re starting to experience growing pains which are presenting challenges for further growth and adoption. I think we’re also getting a sense of the opportunities that may arise from the web of data. I picked out three key challenges to review in the presentation.

Craft

The first of these relates to what I’d call “the craft” of Linked Data. To date the growth of the Linked Data cloud has largely been driven by skilled artisans — from academia and a small number of commercial organisations — who know how to work with the technology, how to use and manipulate the data that is already available, and how to get things online and linked together in a way that achieves the 5 star approach.

To scale beyond the initial Linked Data community we need to move from an artisan-led approach and enable “journeyman” developers to achieve the same things. There are several facets to this skills transfer.

Tooling is clearly one important area. It’s a truism that Linked Data tools aren’t as polished as they might be. After all it’s still a relatively new technology area. The majority of Linked Data artisans have been happy enough either to make their own tools or to work with a disparate selection of tools to get the job done. But there is still a lot more work to do in creating a more integrated toolkit that journeyman developers can reach into to help them quickly and easily publish data.

To be fair though, I think we’ve needed these past few years of publishing and experimentation to really highlight what those basic tools might be.

The other aspect of craft is education and training. There’s still a relatively small community with deep skills in this area, so thought has to be given to how those skills can be passed on more widely. Having helped train and advise a number of teams and organisations over the past few years, most recently as part of our consulting work at Talis, it’s clear that there’s a journey or apprenticeship that many teams and organisations undertake as they begin to experiment and gain experience with the technology.

Within the Linked Data community we need to prioritise the work on these tools and services to make it easier for others. We also need to devote additional work to help nurture or define more standard vocabularies for publishing specific types of data. In my opinion this is the really challenging work: it’s not as fun or exciting as publishing the next new dataset or exemplar, but it’s absolutely necessary to push things to the next level. It’s going to take real commitment from all of us.

In my mind there is no better way to help pass on the skills of the initial artisan community than by encoding that knowledge in the form of tools, vocabularies, best practices and design patterns.

Fuelling Applications

Linked Data isn’t being used as much as it could or should be. Why is this?

I think there are two reasons. The first relates to my previous point about enabling the “journeyman” developer. Right now it takes a certain amount of skill to get the most from Linked Data and SPARQL. This presents a road-block for developers who may be interested in using some of the available data. It may even stop them looking at all.

To solve this we must be ready to meet people half-way. Publish simple JSON formats alongside the RDF. Use the Linked Data API created for data.gov.uk to provide simple RESTful APIs into your RDF data. Choice opens up more integration opportunities as well as encouraging engagement. The power of SPARQL and other tools is fantastic, but that power is not needed by every developer in every application. Be inclusive when opening up data.
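
As a rough illustration of meeting developers half-way, the sketch below flattens the triples about one resource into a plain JSON object. This is not the Linked Data API mentioned above, just a hand-rolled example; the function name, the sample triples and the payment namespace are assumptions made for the purpose of the sketch.

```python
import json
from collections import defaultdict

PAYMENT = "http://reference.data.gov.uk/def/payment#"  # assumed namespace
RESOURCE = "http://spending.lichfielddc.gov.uk/spend/8605670"

# Sample triples as (subject, predicate, object) tuples, used only for illustration.
triples = [
    (RESOURCE, PAYMENT + "payer", "http://statistics.data.gov.uk/id/local-authority/41UD"),
    (RESOURCE, PAYMENT + "payee", "http://spending.lichfielddc.gov.uk/supplier/bristow-sutor"),
]


def triples_to_simple_json(triples, subject):
    """Flatten the triples about one subject into a plain JSON object.

    Property URIs are shortened to their last fragment or path segment, which
    loses information but is easy for a non-RDF developer to consume.
    """
    doc = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            key = p.rsplit("#", 1)[-1].rsplit("/", 1)[-1]
            doc[key].append(o)
    # Single values are unwrapped; repeated properties stay as lists.
    flat = {key: values[0] if len(values) == 1 else values for key, values in doc.items()}
    flat["id"] = subject
    return json.dumps(flat, indent=2, sort_keys=True)


simple = triples_to_simple_json(triples, RESOURCE)
print(simple)
```

The trade-off is deliberate: the JSON is lossy compared with the RDF, but it lets a developer get started without learning SPARQL first.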

A potentially larger issue is that much of the data available as Linked Data is either static, irregularly updated, or already available in other more accessible formats and APIs. This isn’t true across the cloud as a whole, but timeliness is an issue in many areas. It’s a consequence of the early boot-strapping process which emphasised conversions of available data dumps, and the wrapping of existing APIs and services. As a boot-strapping process that has been fantastic. But it’s not driving engagement: why use data if you can get it somewhere else easier, and in a more up to date form, using tools that you’re already familiar with?

I also think that this contributes to the reason why it has been difficult to show the power of Linked Data: many of the demonstration apps could easily have been built with other APIs. I think this could be on the cusp of changing as there is now a critical mass of information available to do some powerful queries, and an increasing amount of data is now becoming primarily available as Linked Data.

The challenge we face is changing the nature of the Linked Data cloud from what is a largely static and slow moving environment to one that is much more lively and real-time.

Sustainability

The third challenge I highlighted was sustainability. It’s easy to look at the Linked Data diagram and think: “Well, those bits are done, all we need to do is look how to grow the diagram. We just need to add more data”. I think that’s a natural but unfortunately misleading viewpoint: we need to look carefully at our foundations.

Not all of these sources are on infrastructure that could support real, high volume usage. And few of the datasets are clearly licensed. I’ve personally encountered a number of occasions where some significant datasets are offline or unavailable. So we need to be realistic about whether people can build a stable, commercial application against the web of data as it exists today.

Again to solve this, we need an increasing number of primary sources, making high quality data available on a regular and timely basis, backed by the ability or commitment to deliver those services at the scale we will all eventually require.

In reality this challenge isn’t unique to Linked Data. It’s largely true of the web as a whole; after all not every web site or application is intended to scale to high volume usage. But we’re now talking about a potentially much deeper integration between different applications. We can see the same issues occurring around APIs and data access in general. In recent months there have been a number of stories of developers scrabbling to adapt as APIs get changed, taken down, restricted or re-licensed, leaving them high and dry.

To me the beauty of Linked Data, and RDF specifically, in this regard is that it is so much more portable than any other format. This means that we can easily replicate data to share the load of providing access. With Linked Data we have the option of federating or sharing data across the web. (One of the reasons we started the Talis Connected Commons scheme was to help create sustainability around Public Domain datasets.)

The portability of RDF also makes it easier for a range of organisations to offer scaleable value-added services over the same datasets. For the first time we can decouple the curation of data from the delivery of services over that data.

So those are my three challenges. I think these are largely point in time issues, but we’re going to have to work at them to move forward.

What about the opportunities?

Become a Hub

One of the interesting properties of the Linked Data cloud diagram is how it clearly illustrates the emergence of a number of hubs, like DBpedia, that form the focal points for links from a number of different datasets. If you look closely you can also see that there are emerging hubs within specific subject domains.

I wonder whether the hubs that we see today will continue to play such a key role as the web of data evolves? My feeling is that in a few years time the picture and connectivity is going to be quite different. Particularly if we continue to see engagement from government and other sectors.

There is clearly an opportunity here for organisations who are already key enablers within a particular sector to become a linking hub on the web of data.

If you poke around in any industry, it’s not hard to find organisations who act as the “switchboard” for that particular sector, either because they manage some key identifiers for the sector as a whole, or because their identifiers and systems have become de facto standards for achieving interoperability. It would be a natural step for those organisations to carry that role forward to the web of data, retaining that key position.

Clearly not everyone can be a significant hub like DBpedia. But every organisation can act as a hub for its community of customers, partners and users.

The reasons and benefits for doing this are well documented: opening up data can drive new business, innovation, and traffic. Success on the web involves giving your organisation the greatest possible surface area and points of attachment. Linked Data is an excellent way to achieve this as it emphasises the right forms of web integration.

Turn Identifiers into Channels

Linked Data requires you to assign URLs to identify things: people, places, events, whatever. Generally we tend to focus on how that is an important step to publishing data: concentrating on the mechanics of what makes a good, stable identifier and highlighting how this becomes a key way for other people to find your data.

What this misses is that those identifiers can also become channels, or hooks, for your organisation to find other people’s data. Once you have published Linked Data and it becomes linked to by other datasets all of that external data annotates and enriches your own, providing valuable and useful context. Linking data creates network effects, and everyone in the network benefits. That includes you.

The external data is easily accessible through link discovery so it becomes much easier to find, aggregate and analyse it for a variety of purposes. That might be to drive new product features, or to simply power business intelligence and analysis within the enterprise.

I tend to think of it as being able to fish the web of data for useful context. Your URIs are the hooks. Your data is the bait.

I stopped to draw a parallel here with some comments made by Dion Hinchcliffe in his opening keynote. Hinchcliffe pointed to the rise of a number of startups and tools supporting analysis of data collected from the open web, perhaps mixed with data from internal enterprise systems. The end result of that analysis is new data and insights that will need to be integrated into an organisation’s core systems, especially if the intent is to drive more than just management reports.

My prediction was that over the next 12-24 months we’ll begin seeing this type of third-party organisation not just offering SaaS access to analysis systems, but direct insights that are already integrated into a customer’s data via the public identifiers it’s sharing as Linked Data. This has huge potential value and can completely change the costs and approach to data integration.

The time scales may be completely off. But there’s a real opportunity there in my opinion, particularly for organisations that do market and social media analysis.

Data as a Service

It’s been said before but it’s worth repeating: Linked Data isn’t necessarily Open Data. The technology is not at odds with exploring business models around data services or access.

The “Data as a Service” (DaaS) idea is gaining momentum in a number of different areas with an increasing number of commercial APIs coming online. We should also soon be seeing commercially available services directly powered by open data sources or through mining those sources.

There are a number of different business models that can be wrapped around data access, ranging from charging for the data itself, through cost recovery for service provision — something that may be relevant for long term usage of government sources — or just charging for delivering reliable, high performance services over open data. There are good reasons why developers may want to pay for reliable services.

Clearly open, sponsored access to data and services will remain an important part of the ecosystem. In fact some level of open data is required to drive the network effects we are seeing around Linked Data: the identifiers and some key metadata need to be open and remain open; but additional “depth” could be available at a premium.

Summing up

I had no big conclusions to draw from my talk as my goal was to highlight the challenges and opportunities ahead. Clearly I could have chosen a different mix but drawing on my recent experiences engaging with a wide range of different organisations these are the issues and opportunities I’ve most commonly encountered and discussed.

Do you have a different perspective? Perhaps some ideas about how to face these challenges, or a different view of the immediate opportunities? If so, I’d love to hear from you.

Building A Civic Semantic Web

By Joshua Tauberer
This article features in Nodalities Magazine, Issue 7

Technology is a new key player in government accountability and transparency. It’s our own defense against the threat of government information overload. Take the U.S. Congress: More than 10,000 bills are on the table for discussion at any given time, and Members of Congress are taking campaign contributions from thousands of sources. How can a representative be accountable if his legislative actions are too numerous to track? How can financial disclosure root out conflicts of interest if the interesting ones are buried deep within piles and piles of records? The threat to transparency isn’t sheer volume, however. It’s the complex network of relationships that makes up the U.S. Congress, and that makes it an interesting case for applying Semantic Web technology.

What the Semantic Web addresses is data isolation, and this is a problem for understanding Congress. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily meshable that MAPLight is possible. The Semantic Web makes this process cheaper by addressing meshability at the core. The more government data that is meshable, the easier it is to investigate connections across independent data sets, research the dynamics of the system, or teach others how Congress works.

Innovating the public’s engagement with Congress by applying technology has been the motivation behind my site www.GovTrack.us, a free congress-tracking tool that I built and have been running since 2004. GovTrack amasses a large XML database of congressional information, including the status of legislation, voting records, and other bits, by screen scraping official government websites that have the data online already but in a less useful form.

If “metadata” is tabular, isolated, and about web resources, the Semantic Web goes far beyond that. It helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in Congress with Members of Congress, what districts they represent, their population demographics, etc. We establish relations like sponsorship, represents, voted, and population across entities of many types. A web lets us ask new questions, and from there transform their answers into visualizations. And because the Semantic Web is a generic platform for all data, I actually think it has the potential to radically and fundamentally transform the way we learn, share information, and live, but that’s still a bit far off.

So for the purposes of my tinkering with the Semantic Web, GovTrack creates an RDF dump of its database (13 million triples) covering bills, politicians, votes and more using a mix of existing schemas and some new ones that I created. I chose URIs for entities in the Linked Open Data tradition: HTTP-dereferenceable URIs that resolve to self-describing RDF/XML about the entity. Two good examples are for Senator John McCain and for H.R. 1, the economic recovery bill passed earlier this year. The HTML pages on GovTrack itself tie in to the RDF world through <link> tags: bill pages include the URI I coined for the bill, for instance.

I also have a sometimes-working-sometimes-not SPARQL endpoint set up, SPARQL being the de facto query language for RDF. SPARQL lets us ask questions of the data, such as how did politicians vote on bills (see example 1). The SPARQL endpoint runs off of a “triple store”, the equivalent of a relational database for the semantic web, which underneath is a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. (It uses my own C#/.NET RDF library: http://razor.occams.info/code/semweb.) The RDF/XML returned by dereferencing the URIs is actually auto-generated by redirecting the user to a SPARQL DESCRIBE query (i.e. http://www.rdfabout.com/sparql?query=DESCRIBE+%3Chttp://www.rdfabout.com/rdf/usgov/congress/111/bills/h1%3E) using URL rewriting in Apache (for a robust solution, see my explanation at the end of http://rdfabout.com/demo/census/). For more about GovTrack’s RDF data, see http://www.govtrack.us/developers/rdf.xpd.
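
To give a feel for both ideas, here is a small Python sketch. It is not GovTrack’s code, which is C#/.NET on MySQL: sqlite3 stands in for the database, the `ex:` identifiers and predicate names are made up, and `describe_url` is a hypothetical helper that simply reproduces the shape of the DESCRIBE URL quoted above.

```python
import sqlite3
from urllib.parse import quote_plus


def describe_url(endpoint: str, entity_uri: str) -> str:
    """Build the URL of a SPARQL DESCRIBE query for one entity."""
    query = f"DESCRIBE <{entity_uri}>"
    # Keep ':' and '/' readable, as in the example URL above; '<', '>' and spaces are encoded.
    return f"{endpoint}?query={quote_plus(query, safe=':/')}"


# A toy triple store: one table whose columns are subject, predicate, object.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
db.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:vote1", "ex:voter", "ex:senatorA"),
        ("ex:vote1", "ex:option", "Aye"),
        ("ex:vote2", "ex:voter", "ex:senatorB"),
        ("ex:vote2", "ex:option", "No"),
    ],
)

# Each triple pattern in a query becomes one more self-join on the same table.
rows = db.execute(
    """
    SELECT who.object, how.object
    FROM triples AS who
    JOIN triples AS how ON how.subject = who.subject
    WHERE who.predicate = 'ex:voter' AND how.predicate = 'ex:option'
    ORDER BY who.object
    """
).fetchall()
print(rows)

url = describe_url(
    "http://www.rdfabout.com/sparql",
    "http://www.rdfabout.com/rdf/usgov/congress/111/bills/h1",
)
print(url)
```

A real triple store adds indexes and usually maps URIs to integer keys, but the single three-column table is the essential idea.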

When data gets big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my area of the Semantic Web as several clouds. One cloud is the data I generate from GovTrack. Another cloud is data I separately generate about campaign contributions from data files from the government’s Federal Election Commission (FEC): 10 million triples. This cloud relates politicians to election campaigns and elections, campaign donors with zipcodes, and contribution amounts. A third data set is based on the 2000 U.S. Census, 1 billion triples. The census data has population demographics for many geographic levels, including states, congressional districts, and postal zipcodes (actually “ZCTA”s but we can put that aside). (For more, see http://rdfabout.com. Through the Census cloud the data is linked to Geonames and the rest of the Linked Open Data community.)

I’ve related the clouds together so we can take interesting slices through them. The GovTrack data connects to the FEC data through politicians. The Census data connects to the GovTrack data through states and congressional districts (the regions represented by senators and representatives) and to the FEC data through zipcodes. That means we can ask questions that go beyond one data set, such as: what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode, etc.? Once the Semantic Web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the heavy work of meshing two data sets for each new question once the data is already in RDF with connected URIs.
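The point about connected URIs can be sketched in a few lines of Python. The two "clouds" below are tiny invented stand-ins (every `ex:` name and every number is a placeholder, not real GovTrack or Census data); what matters is that the same state URI appears in both, so a question can hop from one data set to the other without any extra meshing step.

```python
# Placeholder triples: one list per "cloud". All names and values are invented.
GOVTRACK = [
    ("ex:senatorA", "ex:represents", "ex:stateX"),
    ("ex:senatorB", "ex:represents", "ex:stateY"),
]
CENSUS = [
    ("ex:stateX", "ex:medianIncome", 41000),
    ("ex:stateY", "ex:medianIncome", 56000),
]


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def income_of_represented_state(politician):
    """Follow politician -> state in one cloud, then state -> income in the other."""
    results = []
    for state in objects(GOVTRACK, politician, "ex:represents"):
        for income in objects(CENSUS, state, "ex:medianIncome"):
            results.append((state, income))
    return results


if __name__ == "__main__":
    print(income_of_represented_state("ex:senatorA"))
```

A SPARQL engine does the same join declaratively; the sketch only shows why shared identifiers make the join free.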

Figure 1

My dream is to be able to plug in SPARQL queries into visualization websites like Many Eyes, Swivel, and mapping tools and instantly get an answer to my question in a compelling form. For now, some copy-paste is necessary. Let’s take an example. Did a state’s median income predict the votes of senators on H.R. 1, the economic recovery bill? Perhaps the senators from the poorest states, likely the most affected by the economic trouble, were more likely to want economic stimulus. This query takes a path through two of my clouds, depicted in Figure 1. The SPARQL query mimics the picture: each edge corresponds to a statement in the query. Except the real query is more complicated (it’s given at http://www.govtrack.us/developers/rdf.xpd). It is complicated not because RDF or SPARQL are inherently complicated, but because the data model that I chose to represent the information is complicated. That is, I made my data set very detailed and precise, and it takes a precise query to access it properly. If you run it on the SPARQL form on that page, get the results in CSV format, copy them into Excel, and run a correlation test, you’d indeed find a moderate correlation between median income and vote, but in the direction opposite to what we expected. (I know why, but I’ll let you think about it.)
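For readers who would rather not copy into Excel, the correlation step could be sketched in Python along these lines. The column names, the coding of a vote as 1 or 0, and every value in the sample CSV are assumptions made up for this sketch; they are not the output of the real query and say nothing about the actual result.

```python
import csv
import io
import math

# Invented placeholder rows standing in for the CSV returned by the SPARQL form.
SAMPLE_CSV = """state,median_income,vote
ex:stateA,40000,1
ex:stateB,45000,0
ex:stateC,50000,1
ex:stateD,55000,0
"""


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_from_csv(text):
    """Correlate the (assumed) median_income and vote columns of a CSV string."""
    rows = list(csv.DictReader(io.StringIO(text)))
    incomes = [float(row["median_income"]) for row in rows]
    votes = [float(row["vote"]) for row in rows]
    return pearson(incomes, votes)


if __name__ == "__main__":
    print(correlation_from_csv(SAMPLE_CSV))
```

The function assumes both columns vary; a column with a single repeated value would make the denominator zero.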

Figure 2

Another interesting case is whether campaign contributions to congressmen mostly come from their district, or if they get contributions from sources far away. The SPARQL query listed in example 2 extracts the relevant numbers for Rep. Steve Israel from New York: for each zipcode, the total amount of campaign contributions he received from individuals with addresses in that zipcode in the last election. Figure 2 puts these values on a map, with congressional districts overlaid as well. A form where you can submit a SPARQL query like these examples and see the results instantly on a map would be incredible for data investigation.

So what is government transparency, practically speaking? It’s more than just information disclosure. Transparency means the public can get answers to their burning questions. The more questions they can answer from a dataset, the more transparency it provides. We can have more transparency without necessarily more disclosure but instead with the ability to apply better tools. Meshing and querying government datasets with RDF and SPARQL could be a new way to reach new heights of civic engagement and public oversight.

Example 1

Get a table of how senators voted on all of the Senate bills in 2009-2010:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bill: <http://www.rdfabout.com/rdf/schema/usbill/>
PREFIX vote: <http://www.rdfabout.com/rdf/schema/vote/>

SELECT ?bill ?voter ?option WHERE {
    ?bill a bill:SenateBill .
    ?bill bill:congress "111" ;
        bill:hadAction [
            a bill:VoteAction ;
            bill:vote [
                vote:hasOption [
                    vote:votedBy ?voter ;
                    rdfs:label ?option ;
                ]
            ] ;
        ] .
}

Example 2

Get total campaign contributions to Rep. Steve Israel by zipcode:

PREFIX fec: <http://www.rdfabout.com/rdf/schema/usfec/>

SELECT ?zipcode ?value WHERE {
    ?campaign fec:candidate .
    ?campaign fec:cycle 2008 .
    ?zipcode fec:zipAggregatedContribution [
        fec:toCampaign ?campaign;
        fec:amount ?value
    ] .
    ?zipcode fec:zcta ?uri .
}


Jim Hendler and Li Ding talk about work to convert Data.Gov resources to RDF

In my latest podcast I talk with Jim Hendler and Li Ding of the Tetherless World Constellation at Rensselaer Polytechnic Institute in Troy, New York.

We discuss work that they and colleagues have been undertaking to convert chunks of the US Federal Government data released via the data.gov portal to RDF.

During the conversation, we refer to the following resources;

This conversation was recorded on Friday 7 August, 2009.

For other Talis podcasts in this Nodalities series, see here

Garlik releases open source triple store, 4Store

Garlik CEO Tom Ilube is increasingly coming to represent a voice of reason in the UK’s ongoing angst about Identity, with many a hysterically gibbering Home Office official put in their place by Tom’s more reasoned words in debates on the Today programme and across the UK’s mainstream media.

As the company’s press materials note,

“Garlik, the online identity expert, was founded by Mike Harris, founding CEO of Egg plc, former Egg CIO Tom Ilube and former British Computer Society president Professor Nigel Shadbolt. As the first company to develop a web-scale commercial application of semantic technology, Garlik enables consumers to protect themselves against identity theft and financial fraud.”

According to Wikipedia, ‘Egg… is now the world’s largest internet bank,’ so effective management of identity information is clearly nothing new to Ilube and his team.

Founded in 2005, Garlik has secured some £4.5 million from 3i, Doughty Hanson and Noble Venture Finance to offer products such as their DataPatrol solution for tracking sensitive personal information online, and the less ‘serious’ measure of online status, QDOS.

Behind the scenes, data is aggregated from across the open Web and various proprietary databases, and stored in Garlik’s own RDF triple store.

Now the company is releasing their triple store, 4store, under the GNU GPL license and making it available for download. Capable of scaling to handle as many as 60 billion triples (perhaps at least three times more than their closest competitors), 4store has the potential to address many concerns about the scalability of triple store technology.

I took the opportunity to talk with Garlik’s Tom Ilube and 4store’s designer, Steve Harris, before the launch and the result has just been released as a podcast.

This conversation was recorded on Tuesday 14 July, 2009.

Talis’ Tour

It’s been a busy couple of months for the Semantic Web research community. At the very end of May the European Semantic Web Conference returned to Crete, where the series began in 2004. Now in its sixth year, the conference reflected the vibrancy of the research community in this area, the progress made to date, and the increased emphasis on deployment and uptake of Semantic Web technologies. The latter aspect was noticeable in many parts of the conference, not least in the Semantic Web In Use track, a new addition to the ESWC series, co-chaired by Talis Researcher Tom Heath.

With adoption of Semantic Web technologies and Linked Data principles increasing rapidly, many members of the research community met in late June at Schloss Dagstuhl in Germany for a seminar titled “Semantic Web: Reflections and Future Directions”. Almost ten years since the first Dagstuhl seminar on the Semantic Web, the goal of this event was to learn lessons from the past and map out the research agenda for the next ten years of the field. Again acknowledging the practical aspects of the field, there were lengthy and productive discussions on the topics of hosting and persistence of RDF vocabularies, and the urgent need to examine how Linked Data and the Semantic Web can enhance Human-Computer Interaction; both of which are topics close to our hearts at Talis.

The natural question that arises from exploring the next ten years of research in any field is “who’s going to do all the work?” Fortunately in early July the Seventh Summer School on Ontological Engineering and the Semantic Web took place in Cercedilla, Spain, part-sponsored by Talis. This annual event, directed by Enrico Motta (The Open University) and Asun Gomez Perez (Univ. Politécnica de Madrid), provides over 50 students from Europe and beyond with lectures, invited talks and group projects in cutting edge areas of the Semantic Web field, supported by a team of leading researchers. In addition to the knowledge gained from this intense week of study, students of the summer school get to network with their peers and build the very community that will drive forward the Semantic Web research agenda over the next ten years.

Linking Data and Semantics at O’Reilly

By Gavin Carothers and Charles Greer

This article features in Nodalities Magazine, Issue 6

O’Reilly Media lives on the cutting edge. We coined terms such as Web 2.0, created the first commercial website in 1993, and exist to “spread the knowledge of innovators.” With our evangelists, conference presenters, authors, and bloggers all communicating and catalyzing new ideas, many believe that O’Reilly must be just as technologically innovative in our own operations. However, O’Reilly employs about 200 people but only half a dozen developers, so naturally ideas are thrown at our developers faster than it is possible to implement them. We’ve been known to refer to this tension between our public position on the cutting edge and internal expectation to live up to what we preach as “gaping wound tech.” Any time someone had a new idea or a new product to launch that didn’t quite fit into existing systems, we found some way to shoehorn it in, with a quick Perl script or some clever custom SQL. As we did this, more and more of our work became preventing our systems from collapsing under the weight of those one-off ETLs and scripts. The cost of simply keeping track of which scripts were using what bit of transformed data and where that data came from had become so high as to be unsustainable. We’d accrued so much design debt that only the most radical of approaches could save us from being crushed by the weight of our inherited code.

Of course, we didn’t really know that at the time. Today we have a Linked Data, Semantic, RESTful, URI-based, highly buzz-wordy solution mostly by accident and through ruthless pragmatism. Instead of embracing the ideas of the Semantic Web at the outset, we arrived at the Semantic Web because it was the only solution. We thought we were traveling down two completely unrelated roads. We started down the first while trying to replace a Java Bean Shell script that copied book content to a few different places. The other road began when we wanted to know what color to make the border of a PDF. The first would lead to an Atom Publishing Protocol server and clients, the second to our modeling all product metadata in RDF and opening that to the public.

As it turns out, the two roads weren’t so unrelated after all. RDF is designed to handle modeling information in a distributed manner and provides the underpinnings for the actual metadata we store, aggregate, and use. AtomPub’s RESTful interface is ideally designed for managing individual chunks of all this distributed data over time and provides programs and people a simple, standard interface for publishing, accessing, and updating it. As we progressed down each path, we were making (often unknowingly) major progress in generating linked data and semantics, the two pillars of the Semantic Web.

The RESTful Road

In 2005, soon after O’Reilly launched a custom book publishing platform, we discovered that we’d deferred a hard question. We didn’t know how to make sure that we could easily add new books as they came down the production pipeline. The canonical representation of nearly all O’Reilly titles is DocBook files. Historically, these DocBook files were scattered across many filesystems, transformed by people using one-off scripts, and arbitrarily transmitted using FTP to other filesystems. We simply didn’t have a way of addressing fundamental questions like “Where is the latest, cleanest copy of a book’s markup?” Tracking down the best representation of a book’s content was a laborious, error-prone task.
Around the same time we ran into this, we noticed Tim Bray’s superb presentations about the then-draft Atom Publishing Protocol. The architecture proposed by RESTful advocates like Bray and embodied by what would become RFC 5023 gave us the ability to store an atomic chunk of data, assign it a URI and access and update it through a standard interface.

  • A book’s “source code”, the DocBook markup
  • The print book, as an ISBN
  • The table of contents
  • An HTML, PDF or other representation generated from the source
  • Whatever Tim O’Reilly or the business folks asked for next

O’Reilly’s SafariU was a business venture that implemented these kinds of transformations of content, but didn’t expose anything but its own web browser interface. When considering how to leverage SafariU’s technologies in the business as a whole, we arrived at this:

This atom:entry is the “latest, cleanest copy of a book’s markup” and its URI is the canonical location for this content. Additionally, the entry provides different views of the content using 17 distinct <link/> elements. We had embraced the linked data idea Noun = URI. Around the same time, we realized that while we needed a way to address various available formats of content, we also required a place to store and maintain our digital assets. By implementing the Atom Publishing Protocol we established a generic way to maintain our assets, as Nouns, over time. Now that systems could reliably find and update our content using URIs, it became painfully apparent that we still had a major uphill battle: how to do the same thing for product metadata?
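As a rough picture of what such an entry looks like, the Python sketch below builds an atom:entry whose link elements point at alternate views of one book. It is not the original O’Reilly entry: the example.org URIs, the title and the choice of three links are hypothetical, and a complete Atom entry would also carry elements such as atom:updated and atom:author.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_entry(entry_uri, title, views):
    """Build an atom:entry with one <link/> per (rel, media type, href) view."""
    ET.register_namespace("", ATOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_uri
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    for rel, media_type, href in views:
        ET.SubElement(
            entry,
            f"{{{ATOM_NS}}}link",
            {"rel": rel, "type": media_type, "href": href},
        )
    return entry


if __name__ == "__main__":
    # Hypothetical URIs, invented for this sketch.
    book = build_entry(
        "http://example.org/books/example-book",
        "Example Book",
        [
            ("edit", "application/atom+xml;type=entry", "http://example.org/books/example-book"),
            ("alternate", "text/html", "http://example.org/books/example-book.html"),
            ("alternate", "application/pdf", "http://example.org/books/example-book.pdf"),
        ],
    )
    print(ET.tostring(book, encoding="unicode"))
```

Each view gets its own URI, and the entry is the one place a client needs to look to discover all of them.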

A similar problem existed when dealing with metadata. Distinct applications were completely unintegrated and focused only on the browser and human users. They provided no visibility into their data for other systems.

rdf:isNeat

“Can our PDFs have the same branding and colors as the printed books?” —Marketing Person
“Sure! How hard can it be?” —Innocent Developer

At this point in our journey we have more than 900 titles in the AtomPub repository and addressable by URI. We’ve (unknowingly) hit a significant Linked Data milestone and everything is progressing well. Dynamically creating a PDF from these entries is as easy as running our DocBook-XSL customization for the correct series to produce XSL-FO and then rendering that XSL-FO into PDF. The only problem was discovering which series (In a Nutshell, Animal Guide, Missing Manual) the content fell under. At that point all progress stopped.
Our definitive source of book and product information is the Product Database (67,000+ lines of Perl, C++, SQL, and a dozen other languages). The database and web application has its own home-rolled “XML Format,” as I’m sure many other companies have had. Based directly on the column names from the SQL database, our Book XML was a quick and very dirty way of getting our centralized relational data out into the world as XML. A host of new client applications grew around this new access to product data, but we quickly saw the problems of reusing an ad hoc, undefined, schema-generated format. The XML service was also incredibly slow.

<IPFamily>
  <Book>
    <product_id>5549</product_id>
    <parent_product_id>6380</parent_product_id>
    <imprint_id>1</imprint_id>
    <product_status_id>5</product_status_id>
    <product_type_id>10</product_type_id>
    <isbn>0596515618</isbn>
    <isbn13>9780596515614</isbn13>
    <final_date>2003-07-02</final_date> <!-- Actually the day the last QC phase ended -->
    ...


As you can see from the snippet above, clients had to deal with knowing exactly what imprint 1 (O’Reilly Media, Inc.) and product type 10 (PDF) meant. Each client kept mappings of these magic values in order to make the data understandable. Those mappings broke, of course, whenever new product types and imprints were added. Even more dangerously, because the semantics of the XML were totally unspecified, element names were opaque and sometimes actively misleading. We might have redesigned the format to include more data and added more and more fields to it, but this wasn’t an explicitly designed schema, just something generated from the SQL. On the road to exposing this data more cleanly we tried everything. Remodeling the SQL to be more relational didn’t offer much benefit and we still couldn’t tell what the column names meant. Sitting down and trying to write up a data dictionary was a great exercise, but it became out of date almost immediately. We experimented with JSON-based CouchDB prototypes, but those had the same issue as the SQL with missing meaning. Our Subversion repository is littered with Relax-NG, XML Schema, and Schematron documents intended to create a new XML-based format. Somehow they never got finished as we discovered we either had to define everything or try to design for extensibility. We knew we didn’t have the time to create our own Book Metadata Standard. We wanted defined semantics.
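The fragility of those client-side mappings is easy to reproduce. The sketch below is a hypothetical client, not one of ours: it parses a cut-down, closed version of the Book XML and decodes the two magic values mentioned above (imprint 1 and product type 10). Any identifier the client has not been told about simply fails.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed version of the snippet, for illustration only.
SNIPPET = """<Book>
  <product_id>5549</product_id>
  <imprint_id>1</imprint_id>
  <product_type_id>10</product_type_id>
  <isbn13>9780596515614</isbn13>
</Book>"""

# The only two mappings stated in the article; a real client needed many more.
IMPRINTS = {"1": "O'Reilly Media, Inc."}
PRODUCT_TYPES = {"10": "PDF"}


def decode(book_xml):
    """Translate magic ids into names; raises KeyError for any unknown id."""
    book = ET.fromstring(book_xml)
    return {
        "isbn13": book.findtext("isbn13"),
        "imprint": IMPRINTS[book.findtext("imprint_id")],
        "product_type": PRODUCT_TYPES[book.findtext("product_type_id")],
    }


if __name__ == "__main__":
    print(decode(SNIPPET))
    try:
        # A product type added after this client was written (id invented here).
        decode(SNIPPET.replace(">10<", ">11<"))
    except KeyError as missing:
        print("unknown magic value:", missing)
```

Every client carrying its own copy of `IMPRINTS` and `PRODUCT_TYPES` is exactly the duplication that defined semantics are meant to remove.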
There is at least one obvious XML vocabulary for a publisher looking to capture book metadata: ONIX. Unfortunately, the ONIX standard is archaic, with obscure element names like b004 (ISBN) and g343 (PrizeJury, obviously). (Footnote: Yes, these are the short versions and a longer set of names is also allowed. However, many of the most important vendors only support the short versions.) We did consider ONIX for a time, but then we noticed that every vendor we sent ONIX to treated the fields a bit differently. Even with pages and pages of specification there wasn’t any agreement on what elements were important or what they meant. Using ONIX as a format would not solve our semantic deficiency; we still wouldn’t know what the “columns” meant.
In the process of trying to create an XML format we asked a number of people in the company how to find the Publication Date for a book. The answer was surprisingly complex. The value was computed independently by each of the ETL hydras, with subtly different implementations that had evolved with particular client needs. O’Reilly isn’t a huge company with layer upon layer of bureaucracy; most questions can be quickly answered with a chat at a desk or an email to the other coast. Imagine our surprise, then, at the results of the Publication Date poll. Most people were confident that one of five dates was the right date, but disagreed on which of the five it was. Retail Availability Date, Actual In Stock Date, Estimated In Stock Date, etc. each had its backers. What we had really discovered was the subtly different needs of each business unit. The strategy we could most easily support? Consensus on a public standard. As we’ve learned so many times, we needed to go outside the company to find the correct solution. Public standards, specifications, and ontologies could save us from ourselves.
Enter: Dublin Core. We couldn’t define our own format or use the industry standard (ONIX), nor could we agree on what a publication date was. Our only choice was to go borrow/steal some other group’s ideas. It turns out that our problems had already been solved by the library community. The Dublin Core Metadata Initiative created standards, guidelines, and examples for storing and sharing basic, essential metadata. We had a way out: here was a group of people who’d already done a great deal of thinking for us.
Of course, they hadn’t done all our thinking for us. Mapping all of our old data into well-designed and well-documented Dublin Core, MARC Relators, FOAF, or any other ontology was going to be hard. So we didn’t do it. Instead we mapped the whole of our old, horrible, ugly mess into an undefined ontology called the “Product Database Legacy Ontology.” We then moved some of the more obvious items like title and author into Dublin Core and waited. Only once we had a proven need for a new data point in a real application would we go through the process of researching, defining, cleaning, and moving it into a modern, public ontology. For those following along closely: no, trim color isn’t yet in the public or internal metadata. As it turns out, no one really wanted it. At least, not yet.
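The "map everything to a legacy ontology, promote only what is proven" approach can be sketched as follows. The legacy namespace URI, the column names and the choice of dcterms:creator for authors are assumptions made for this illustration; only the Dublin Core terms namespace itself is a real, public identifier.

```python
DCTERMS = "http://purl.org/dc/terms/"
# Hypothetical namespace standing in for the "Product Database Legacy Ontology".
LEGACY = "http://example.org/product-database-legacy#"

# Columns that have been researched and promoted to a public ontology so far.
PROMOTED = {
    "title": DCTERMS + "title",
    "author": DCTERMS + "creator",  # assumed mapping, for illustration
}


def to_triples(product_uri, row):
    """Turn one legacy database row into triples, promoting known columns."""
    triples = []
    for column, value in row.items():
        predicate = PROMOTED.get(column, LEGACY + column)
        triples.append((product_uri, predicate, value))
    return triples


if __name__ == "__main__":
    row = {"title": "Example Title", "final_date": "2003-07-02"}
    for triple in to_triples("http://example.org/product/1", row):
        print(triple)
```

Promoting another column later means adding one entry to `PROMOTED`; nothing else in the pipeline has to change, which is what makes waiting for a proven need cheap.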

All Together Now

Since Gavin’s first frenzied port of product metadata to an RDF model, we’ve been able to negotiate changing requirements, establish data validation and control rules, and bring on new applications with little time spent on data modeling. In other words, meeting our immediate need of a centralized, validated data store of high agility and performance has paid off several times over in deploying new software systems for the rapidly changing company.
One example of the intertwining of Linked Data and Semantics is our Electronic Media distribution system, which lets customers download ebooks, PDFs, videos and the like. Book descriptions, titles, authors’ names, cover images, even the help text provided on the Electronic Media page are simply linked data, built from RDF relationships. When we want to change the help text or a category label, we change it in one document, and everything else in the RDF graph referencing it changes within moments as well. Just following links pays off.
Previously, the buttons that let a customer add a book to our shopping cart were generated by a system that used nightly ETLs nicknamed “the sync”. So new products would have to be prepped for release the night before. We gave special care to their timely appearance in the morning. Alas, they frequently did not appear as hoped, as the ETLs that made up “the sync” had to run in a very precise nightly schedule or we had to take manual corrective action. Now, a reasonably simple HTML template bound to the RDF for a book generates “Buy Buttons” in near realtime without an ETL in sight.
The greatest challenge of updating our legacy IT infrastructure hasn’t been replacing the ETLs or synchronization. It’s been achieving consensus on the meaning of data elements. In the past, data maintainers might adjust the title of a book to change how retailers present it. Then our website’s title would change (the next day), and we would have to bring resources to bear on reconciling the meaning of “title.” By using a Dublin Core term for our title element, we’ve established what to expect from those who change the value. It’s simpler to make sure people enter particular kinds of data, and then ask for help to extend or change requirements for downstream apps. The publicly available ontologies, we hope, will help everyone communicate more effectively about business needs and shared data points. So far the results are encouraging.

In the Public Eye

Having built several of our own applications using our new RDF metadata and our initial linked data APIs, we thought it might be a good idea to let someone else have a crack at it too and see what they made of it. It took us two weeks to develop the O’Reilly Product Metadata Interface, a simple layer on top of the Deli. A caching proxy preserves the reliability needed by our own applications, while a predicate filter prevents private information from leaking to the public. A bit more about how you can access it can be found at http://labs.oreilly.com/opmi.html or you can just dive right in by giving it an ISBN, e.g. http://opmi.oreilly.com/product/9780596529260.
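A predicate filter of the kind mentioned above could be as simple as the following sketch. This is not the OPMI implementation: the allow-list approach, the `ex:` predicate names and the sample values are all assumptions made for illustration.

```python
# Hypothetical allow-list: only these predicates may leave the building.
PUBLIC_PREDICATES = frozenset({
    "http://purl.org/dc/terms/title",
    "http://purl.org/dc/terms/creator",
})


def public_view(triples, allowed=PUBLIC_PREDICATES):
    """Keep only triples whose predicate is explicitly marked as public."""
    return [triple for triple in triples if triple[1] in allowed]


if __name__ == "__main__":
    internal = [
        ("ex:product1", "http://purl.org/dc/terms/title", "Example Title"),
        ("ex:product1", "ex:unitCost", "4.20"),  # invented private predicate
    ]
    print(public_view(internal))
```

An allow-list fails safe: a newly added internal predicate stays private until someone decides otherwise, whereas a deny-list would leak it by default.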
Sharing our work with the public forced us to be much more deliberate and rigorous about our data, but also exposed some simple blunders. On the day we launched the service we waited for the praise to come in and finally saw a tweet! Someone is using… Oh wait:

“OPMI’s book identifiers aren’t resolvable. Sigh.” —Jeni Tennison

“Of course they’re resolvable,” we thought. “You just have to parse the URN and understand how to pass the URN to… oh, yeah good point.” In the process of implementation, we’d forgotten Tim Berners-Lee’s second rule of Linked Data:

2. Use HTTP URIs so that people can look up those names.

At the start of the process we’d talked about using some sort of identifier for our products. But that conversation had taken place before we really had all the RDF and Linked Data applications working, so at the time there wasn’t any point nor could anyone see the need for a resolvable identifier. Within a few hours of making the data public, the need became blindingly apparent. Part of embracing “anyone can say anything about anything” is that anyone needs to be able to find the anything they want to talk about. And when you’ve got a statement to make, it’s remarkably handy to be able to quickly find out what else has been said. “I loved urn:x-domain:oreilly.com:product:9780596529260.BOOK” is a bit hard to figure out. “I hated http://purl.oreilly.com/product/9780596529260.BOOK” is a lot better.
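The translation between the two identifier forms quoted above is mechanical, which is part of why the omission stung. A minimal sketch in Python, assuming every product URN shares the prefix shown in the example (the function name is ours for this sketch, not part of OPMI):

```python
URN_PREFIX = "urn:x-domain:oreilly.com:product:"
HTTP_PREFIX = "http://purl.oreilly.com/product/"


def urn_to_http(urn):
    """Rewrite a product URN into the equivalent resolvable HTTP URI."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a product URN: {urn}")
    return HTTP_PREFIX + urn[len(URN_PREFIX):]


if __name__ == "__main__":
    print(urn_to_http("urn:x-domain:oreilly.com:product:9780596529260.BOOK"))
```

The difference for a consumer is that the HTTP form needs no such function at all: any web client can look it up directly.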

Growing the Web of Data with Data Incubator

At Talis we’re huge fans of Linked Data, especially when it’s freely available for reuse too. However, we also realise that not everyone has been smitten by the Linked Data bug yet so we’re always thinking about new ways to help others use, publish and discover the benefits of connecting their data together.

Recently we were wondering how we could help organise the skill and expertise of people who love Linked Data to show data publishers how their data could be even more useful and effective. As the Linking Open Data project has shown, actions speak louder than words so we wanted to do something with practical and visible results.

One problem we face is that until it is available in open and reusable formats it’s not possible to show data owners the power locked up in their own data. Conversely it is hard for the data owner to justify investment in opening up their data without concrete demonstrations of that power. A classic deadlock situation! The goal of our new project is to break this deadlock. We plan to do this by organising people around popular datasets to create mappings to RDF, write conversion code and openly publish the resulting data. The result will be a huge reduction in the investment needed by the data owner: they can simply adapt the work and emit the Linked Data themselves.

We call our new project the Data Incubator and if you love Linked Data then we encourage you to join in and help grow the web of data. Although this project is entirely independent of Talis, we are supporting it through the Talis Connected Commons scheme, providing free hosting and services for public domain data.

Already we have started projects to convert the Open Library dataset including much-loved books such as The Hobbit and to convert journal metadata provided by CrossRef, Highwire and the National Library of Medicine. Many more projects are being incubated and we are discussing how we create a repeatable process for contacting and encouraging data owners to take part.

Join the Data Incubator mailing list and get involved.

Streams, Pools and Reservoirs

by Leigh Dodds
This article features in Nodalities Magazine, issue 6

As we start to move past the current boot-strapping phase of the semantic web in which we are constructing the web of linked data, it’s useful to begin discussing what other features and infrastructure we need in order to support sustainable usage of this huge and growing data set: what services can be offered over linked data? Do we need to consider how to provide quality of service, stability and longevity to the data, or does the sheer scale of the web make these moot points?

In order to answer this question it’s useful to compare the ongoing development of the linked data web with that of the web itself.

A Brief History Lesson

There have been several phases of activity in the development of the web. While in truth these phases were of different duration, overlapped with one another, and have happened at different rates within different communities, essentially we have gone through the following basic steps.

Firstly we concentrated on just getting stuff online. The early web was a new medium for document and data exchange and so was at its core a simple publishing device used as a collaborative space between small communities. But as the amount of content and the size and breadth of those communities grew, the emphasis shifted towards linking: tying content together to create indexes of the web (initially hand-crafted) and knit the available content into a greater whole.

The second, manual linking phase was quickly supplemented by a third phase of automated linking between content: search engines. A search engine is simply a way to quickly create a link-base based on some search criteria. The crawling and indexing of the document web by web crawlers allows users to quickly construct links to content of potential interest.


The fourth phase of the web’s development has been triggered by the commoditisation of search and the need for search engines to differentiate themselves and offer additional value-added services. Search engines now tailor features towards particular uses or types of content (Google Image Search; Google Scholar); offer value-added features that capitalise on their ability to analyse the structure and traffic flows across the web (PageRank and similar indexing improvements; Google Trends); expand the audience for content (Google Translate); and enable community-driven customisation of the search experience (Google Custom Search; Yahoo Search Monkey, etc).

No doubt there will be subsequent phases of development, and the perspective of history will let us tease out common strands of development some of which will already be happening. But if we look at the recent, rapid development of the linked data cloud, we can already see that the same pattern is being repeated.

History Recapitulated

There has been RDF data available on the web for many years, used by a limited community of researchers. This slow accumulation of content – echoing the first phase of content publishing on the document web – has been replaced by a rapid increase in data publishing encouraged through the Linking Open Data (LOD) project. By providing clear pragmatic guidance and instructions on how to publish data for the semantic web, that project has enabled us to accelerate our transition through that first content publishing phase. But it has also, crucially, encouraged the linking together of data sets (Phase 2).

This linking has to a great extent been manual. Not in the sense that members of the LOD community are manually entering data to link datasets together, but rather at the level of looking for opportunities to link together datasets, encouraging data publishers to co-ordinate and inter-relate their data, and by attempting to organically grow the link data web by targeting datasets that would usefully annotate or extend the current Linked Data Cloud.

The rapid growth of the Linked Data Cloud means that this “manual” phase will soon be over: there will be sufficient momentum behind the semantic web that increasing amounts of data will become available and no single community will be able (or need) to shepherd its development. The focus will shift towards the subject specific communities who will instead co-ordinate at a more local level. Semantic web search engines will also become a reality.

Semantic Web search engines need to be distinguished from semantically enabled search engines. The latter use techniques like natural language parsing and improved understanding of document semantics in order to provide an improved search experience for humans. A Semantic Web search engine should offer infrastructure for machines. This Third Phase is also beginning to take place. Simple semantic web search engines like Swoogle and Sindice provide a way for machines to construct link bases, based on some simple expressions of what data is of relevance, in order to find data that is of interest to a particular user, community, or within the context of a particular application. And crucially this can be done without having to always crawl or navigate over the entire linked data web. This process can be commoditised just as it has with the web of documents.

Co-Evolution of the Web Infrastructure

Given the strong concordance between the phases of development of the document web and the linked data web, it is reasonable to make some predictions about how semantic web search engines, and additional supporting infrastructure, are likely to evolve by comparing them with the development of human search engines. For each of the specialisations and value-added features listed earlier it's possible to see an equivalent for the machine-readable web:

Document Web | Semantic Web Infrastructure | Description
Google Image Search | Type Searching | Ability to discover resources of a particular type, e.g. Person, Review, Book
Google Translate | Vocabulary Normalisation | Application of simple inferencing to expose data in more vocabularies than those made available by the publisher
Google Custom Search | Community Constructed Data Sets and Indexes | Ability to create and manipulate custom subsets of the linked data cloud
Google Trends | Linked Data Analysis & Publishing Trends | Identifying new data sources; new vocabularies; clusters of data; data analysis
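To make the first two rows of the table more concrete, here is a minimal Python sketch of type searching and vocabulary normalisation over a handful of triples. It is only an illustration: the `ex:` URIs, the sample data and the `EQUIVALENT_PROPERTIES` mapping are assumptions made up for this example, and a real search engine would work over an index rather than an in-memory set.

```python
# Triples are (subject, predicate, object) tuples; all data here is illustrative.
RDF_TYPE = "rdf:type"

triples = {
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:book1", RDF_TYPE, "ex:Book"),
    ("ex:book1", "dc:title", "Linked Data Basics"),
}

# Hypothetical mapping: each listed property can be restated as another one.
EQUIVALENT_PROPERTIES = {
    "dc:title": "rdfs:label",
    "foaf:name": "rdfs:label",
}


def find_by_type(data, type_uri):
    """Type searching: return the subjects declared to be of the given type."""
    return sorted(s for (s, p, o) in data if p == RDF_TYPE and o == type_uri)


def normalise_vocabulary(data, mapping):
    """Vocabulary normalisation: add a restated triple for each mapped property."""
    inferred = {(s, mapping[p], o) for (s, p, o) in data if p in mapping}
    return data | inferred


people = find_by_type(triples, "foaf:Person")
normalised = normalise_vocabulary(triples, EQUIVALENT_PROPERTIES)
```

After normalisation a consumer that only understands `rdfs:label` can read both the person's name and the book's title, even though neither publisher used that property.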

The last two rows of the table are particularly interesting, as they suggest the need to be able to easily aggregate, combine and analyse aspects of the linked data cloud. This infrastructure will need to support the community in working with data in a variety of ways, allowing data to flow and be collected where it is needed. A metaphor may help to highlight how this works and what its consequences are.

Flowing Data

Data is like water and flows of data are like streams. These streams of data can arise from any number of different sources: from a person entering data into a system; from a click stream generated as a side-effect of web browsing; application events; or generated from real-world sensor measurements. There are already many ways that we can tap into these data streams, using web-based query APIs, messaging systems like XMPP, or syndication protocols like Atom and RSS.

While these streams of data are already supporting a huge range of different applications and use cases, they are inherently limited: a stream has no memory. If historical context is required, e.g. to support more complex querying and reporting, then each consuming application must collect and store the data. We can think of these collections of data as pools; each stream of data on the web may feed any number of different application-specific pools.
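The difference between a stream and a pool can be sketched in a few lines of Python. This is only an illustration of the metaphor: the `sensor_stream` function, the `Pool` class and the sample readings are hypothetical names and data invented for the example.

```python
def sensor_stream(readings):
    """A stream: yields each event once and keeps no history."""
    for reading in readings:
        yield reading


class Pool:
    """A pool: an application-specific store fed by one or more streams."""

    def __init__(self):
        self.events = []

    def collect(self, stream):
        # Store every event that flows past so it can be queried later.
        for event in stream:
            self.events.append(event)

    def query(self, predicate):
        return [e for e in self.events if predicate(e)]


stream = sensor_stream([("t1", 18.5), ("t2", 21.0), ("t3", 19.2)])
pool = Pool()
pool.collect(stream)

# The stream is now exhausted: it cannot answer questions about the past.
leftover = list(stream)

# The pool can, because it kept what flowed through it.
warm = pool.query(lambda e: e[1] > 19.0)
```

Every consuming application that wants history has to build and run something like `Pool` for itself, which is the duplicated cost the following paragraphs describe.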

A pool of data provides extra flexibility, but comes at the cost of requiring each consuming application to maintain its own infrastructure to hold copies of that data. Even if each source of data provides direct access to its own pool, e.g. by exposing a web-based query interface onto its database, or by exposing linked data, there are still unnecessary overheads. Each data provider must provide their own scalable infrastructure and support a rich set of data access options.

If we start building large pools of data, within a community-supported infrastructure, then we have a reservoir. A reservoir is a pool of data that is maintained by, and serves, a specific community. Reservoirs allow issues such as quality of service (reliable supply of water) and infrastructure costs (building of pipelines) to be solved at a community level.

It's possible to argue that the web already consists of streams, pools, and reservoirs, but there is a distinct difference between a Web based on semantic web technology and a Web constructed of a mixture of XML documents or similar formats: like water, at the molecular level, all RDF is the same; it's all triples. Unlike the alternatives, RDF data is more easily pooled and collected, and so is much more amenable to explorations of shared infrastructure. Like a relational database, an RDF triple-store can contain a huge variety of different kinds of data. But unlike a relational database, an RDF triple-store has the potential for the aggregate to be much more than the sum of its parts. The seeds of convergence are built in, through reliance at the most fundamental level on a global naming system (URIs) and standardised ways to state equivalence and relationships between resources.
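As a rough illustration of why triples pool so easily, the Python sketch below merges two small sets of triples with a plain set union and then uses an `owl:sameAs` statement to treat two URIs as the same resource. The URIs, the `ex:` properties and the figures are invented for the example, and the equivalence handling is deliberately simplified: it only merges subjects, whereas a real triple-store would apply the equivalence everywhere.

```python
SAME_AS = "owl:sameAs"

# Two independently published data sets (illustrative URIs and values).
source_a = {
    ("http://a.example/org/1", "ex:name", "Example Council"),
}
source_b = {
    ("http://b.example/council/x", "ex:spent", "1200.00"),
    ("http://b.example/council/x", SAME_AS, "http://a.example/org/1"),
}

# Pooling RDF needs no schema mapping: it is just a union of triples.
pooled = source_a | source_b


def canonical_names(data):
    """Return a function mapping each URI to one representative URI,
    grouping URIs connected by owl:sameAs (a small union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in data:
        if p == SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Keep the lexicographically smaller URI as the representative.
                parent[max(root_s, root_o)] = min(root_s, root_o)
    return find


def describe(data, uri):
    """Collect every property stated about a resource under any of its names."""
    find = canonical_names(data)
    target = find(uri)
    return sorted(
        (p, o) for (s, p, o) in data if p != SAME_AS and find(s) == target
    )


description = describe(pooled, "http://b.example/council/x")
```

Neither source on its own can say both what the council is called and how much it spent; the pooled data can, which is a small instance of the aggregate being more than the sum of its parts.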

In the real world, reservoirs do more than supply a community with water. The aggregate has its own uses: water skiing or hydro-electric power generation for example. And the same will be true of semantic web data reservoirs: large collections of data can be analysed and re-purposed in ways that are not possible – or at least not achievable without a great deal of repeated, redundant integration effort – using other techniques. The reservoir itself can be the source of new facts and new streams of data derived from analysis of its contents.

Flowing Data through the Talis Platform

The goal of the Talis Platform is to support the growth of the Linked Data ecosystem by providing the infrastructure to support the creation of pools of data. For additional background, see my article “Enabling the Linked Data Ecosystem” from Nodalities issue 5.

At present the Platform provides a range of services that allow data to be easily streamed into and out of Platform stores, allowing data to be easily pooled in order to benefit from greater context. Data can be pushed directly into the Platform and we are exploring methods of supporting other forms of data ingestion to make it easier and more natural to begin to accumulate data sets within the Platform.

The core search service, which produces its results in RSS, allows the creation of simple data streams, while the SPARQL interface supports more complex data extraction methods. The Augmentation service provides an interesting twist on these conventional approaches, providing a means for any RSS 1.0 feed to be automatically enriched with extra metadata by feeding it through a Platform data store. This means of interaction is like fishing for data: it is possible to serendipitously find and extract data, capturing it as extra context to items in an RSS feed, without having to deal with writing SPARQL queries or constructing a keyword search. Many more methods and modes of data extraction will be added to the Platform alongside these existing services; this is just the beginning.
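The idea behind augmentation can be sketched conceptually in Python. This is not the Platform's API or implementation: the `augment` function, the dictionary representation of feed items and the sample data are all assumptions made for illustration. The only point it shows is that, because feed items are identified by URIs, anything a store already knows about those URIs can be attached without the caller writing a query.

```python
# What the store already knows, as (subject, predicate, object) triples.
store = {
    ("http://example.org/posts/1", "dc:creator", "A. Author"),
    ("http://example.org/posts/1", "dc:subject", "Linked Data"),
    ("http://example.org/posts/9", "dc:creator", "Someone Else"),
}

# A feed reduced to the parts that matter here: each item has a link (its URI).
feed_items = [
    {"link": "http://example.org/posts/1", "title": "First post"},
    {"link": "http://example.org/posts/2", "title": "Second post"},
]


def augment(items, data):
    """Return copies of the feed items with any stored metadata attached."""
    enriched = []
    for item in items:
        extra = sorted((p, o) for (s, p, o) in data if s == item["link"])
        enriched.append({**item, "extra": extra})
    return enriched


augmented = augment(feed_items, store)
```

Items the store knows nothing about pass through unchanged apart from an empty `extra` list, so the feed is never made worse by being passed through the store.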

But the Talis Platform is intended to provide much more than just the ability to work with pools of data. The bigger vision is to support the creation of true data reservoirs, and enable many different ways of manipulating and analysing their contents in order to discover new facts and bring new context to that data. Creation of these larger pools of content will need to be made sustainable for the communities that are creating them, and deriving value from them. Sustainability covers a wide range of issues that go beyond just commercial issues: quality and range of services are additional factors, as are forms of governance, trust and quality that relate to the data sets themselves. The Platform is intended to address all of these issues.

To take a small example, the experimental “store groups” feature that was released at the end of last year provides a simple method for combining datasets, without requiring that data to be completely loaded or copied into a single database. The store groups feature will ultimately support a range of services over the constituent data sets, allowing each pool of data to remain intact whilst still contributing to the whole; this will be important to support the new forms of governance that are beginning to emerge around datasets on the Linked Data web.
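One way to picture combining datasets without copying them is a read-only view that passes each query to every member store and merges the answers. The Python sketch below illustrates that general idea only; the `Store` and `StoreGroup` classes and the sample triples are hypothetical and say nothing about how the store groups feature is actually built.

```python
from itertools import chain


class Store:
    """A single pool of triples, owned and maintained separately."""

    def __init__(self, name, triples):
        self.name = name
        self.triples = set(triples)

    def match(self, s=None, p=None, o=None):
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in self.triples:
            if all(
                want is None or want == got
                for want, got in zip((s, p, o), triple)
            ):
                yield triple


class StoreGroup:
    """A read-only view over several stores; no data is copied between them."""

    def __init__(self, stores):
        self.stores = list(stores)

    def match(self, s=None, p=None, o=None):
        # Ask every member store and merge the answers at query time.
        return set(chain.from_iterable(st.match(s, p, o) for st in self.stores))


spending = Store("spending", {("ex:payment1", "ex:payee", "ex:supplier7")})
suppliers = Store("suppliers", {("ex:supplier7", "ex:name", "Acme Ltd")})
group = StoreGroup([spending, suppliers])

about_supplier = group.match(s="ex:supplier7")
everything = group.match()
```

Each store keeps its own contents and its own owner, yet a query against the group sees both, which is the property that matters for the governance point made above.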