Nodalities

From Semantic Web to Web of Data

data.gov.uk and the Talis Platform

Earlier this year Gordon Brown appointed Tim Berners-Lee as an advisor to the Cabinet Office to help the government begin the process of opening up its data. This was one part of the initiation of a project to begin opening up UK government data in a similar style to the US. A key part of Berners-Lee’s vision for putting government data online has been Linked Data which promises to provide a much richer way for citizens to begin accessing, browsing, and using government data.

Several other governments have begun opening up data assets including Australia and New Zealand. These approaches mirror that of the US data.gov site, providing a browsable directory of datasets and links to raw data downloads in a range of different formats. The preview launch of data.gov.uk which was announced at the end of September also includes a directory of datasets which is powered by the software underlying the Comprehensive Knowledge Archive Network. But the site also aims to fulfill Berners-Lee’s vision and in addition provide access to some datasets as Linked Data through SPARQL endpoints.

We’re very pleased to report that the Talis Platform is currently underpinning the delivery of all of the Linked Data and SPARQL endpoints for the data.gov.uk site.

We’ve been quietly supporting the effort for several months now, helping out with data management, modelling discussions, and training on the core technology. There seems to be a very definite appetite in government not only to open the raw data but also to explore the potential for Linked Data. It’s clear from today’s announcement about opening up additional aspects of the Ordnance Survey data that there’s a real focus on delivering on the open data promise. While there are certainly some high-profile datasets, like the Ordnance Survey or postcode data, that may require legislative changes to become open, one of the biggest implementation challenges facing government is pulling together an overall directory of datasets and spreadsheets that are already scattered across multiple departmental websites.

Creating a dataset directory provides the required basic level of infrastructure to allow reuse, by enabling developers to find what they need; publishing Linked Data, SPARQL endpoints, and potentially extra APIs provides an additional set of options for ways to access the data. By letting datasets be browsable by anyone, not just developers, Linked Data offers the potential for anyone to find, discover and reuse interesting datasets. As I illustrated in a recent talk, these approaches are not mutually exclusive and the goal should be maximum utility.

Over on the Talis Platform developer blog we’ve begun showing some ways that the initial datasets, covering UK schools and traffic measurements, can be queried in interesting ways. It’s been exciting to see people begin to pick up the technology and create reporting tools to explore the data, and also fantastic to be able to easily view data using only a browser.

There’s clearly still a great deal of work ahead, but the groundwork has now been completed: there’s infrastructure in place to support data publishing; official guidelines on creating public sector URIs; and some agreement on best practices for modelling statistical data. The next challenge is to start ramping up the conversion of currently open data into RDF, in order to begin expanding the coverage of the Linked Data.

This is a very exciting project and here at Talis it’s something in which we’re very proud to be playing a role.

Building A Civic Semantic Web

By Joshua Tauberer
This article features in Nodalities Magazine, Issue 7

Technology is a new key player in government accountability and transparency. It’s our own defense against the threat of government information overload. Take the U.S. Congress: More than 10,000 bills are on the table for discussion at any given time, and Members of Congress are taking campaign contributions from thousands of sources. How can a representative be accountable if his legislative actions are too numerous to track? How can financial disclosure root out conflicts of interest if the interesting ones are buried deep within piles and piles of records? The threat to transparency isn’t sheer volume, however. It’s the complex network of relationships that makes up the U.S. Congress, and that makes it an interesting case for applying Semantic Web technology.

What the Semantic Web addresses is data isolation, and this is a problem for understanding Congress. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily meshable that MAPLight is possible. The Semantic Web makes this process cheaper by addressing meshability at the core. The more government data that is meshable, the easier it is to investigate connections across independent data sets, research the dynamics of the system, or teach others how Congress works.

Innovating the public’s engagement with Congress by applying technology has been the motivation behind my site www.GovTrack.us, a free congress-tracking tool that I built and have been running since 2004. GovTrack amasses a large XML database of congressional information, including the status of legislation, voting records, and other bits, by screen scraping official government websites that have the data online already but in a less useful form.

If “metadata” is tabular, isolated, and about web resources, the Semantic Web goes far beyond that. It helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in Congress with Members of Congress, what districts they represent, their population demographics, etc. We establish relations like sponsorship, represents, voted, and population across entities of many types. A web lets us ask new questions, and from there transform the answers into visualizations. And because the Semantic Web is a generic platform for all data, I actually think it has the potential to radically and fundamentally transform the way we learn, share information, and live, but that’s still a bit far off.

So for the purposes of my tinkering with the Semantic Web, GovTrack creates an RDF dump of its database (13 million triples) covering bills, politicians, votes and more using a mix of existing schemas and some new ones that I created. I chose URIs for entities in the Linked Open Data tradition: HTTP-dereferenceable URIs that resolve to self-describing RDF/XML about the entity. Two good examples are for Senator John McCain and for H.R. 1, the economic recovery bill passed earlier this year. The HTML pages on GovTrack itself tie in to the RDF world through tags: bill pages include the URI I coined for the bill, for instance.

I also have a sometimes-working-sometimes-not SPARQL endpoint set up, SPARQL being the de facto query language for RDF. SPARQL lets us ask questions of the data, such as how politicians voted on bills (see example 1). The SPARQL endpoint runs off of a “triple store”, the equivalent of a relational database for the semantic web, which underneath is a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. (It uses my own C#/.NET RDF library: http://razor.occams.info/code/semweb.) The RDF/XML returned by dereferencing the URIs is actually auto-generated by redirecting the user to a SPARQL DESCRIBE query (i.e. http://www.rdfabout.com/sparql?query=DESCRIBE+%3Chttp://www.rdfabout.com/rdf/usgov/congress/111/bills/h1%3E) using URL rewriting in Apache (for a robust solution, see my explanation at the end of http://rdfabout.com/demo/census/). For more about GovTrack’s RDF data, see http://www.govtrack.us/developers/rdf.xpd.
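To make the "table of triples" idea concrete, here is a minimal sketch in Python. It is not GovTrack's code: SQLite stands in for MySQL, the function names (`make_store`, `add`, `match`, `describe_url`) are invented for illustration, and the `ex:` predicates in the usage note are placeholders. Only the endpoint address and the bill URI are taken from the text above.

```python
import sqlite3
from urllib.parse import urlencode


def make_store():
    """Create an in-memory table of triples (SQLite standing in for MySQL)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    return db


def add(db, subject, predicate, obj):
    """Store one statement as a row."""
    db.execute("INSERT INTO triples VALUES (?, ?, ?)", (subject, predicate, obj))


def match(db, subject=None, predicate=None, obj=None):
    """Return rows matching a triple pattern; None acts as a wildcard."""
    clauses, args = [], []
    for column, value in (("subject", subject), ("predicate", predicate), ("object", obj)):
        if value is not None:
            clauses.append(f"{column} = ?")
            args.append(value)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    sql = "SELECT subject, predicate, object FROM triples" + where + " ORDER BY rowid"
    return db.execute(sql, args).fetchall()


def describe_url(endpoint, uri):
    """Build the URL of a SPARQL DESCRIBE request for one entity."""
    # safe=":/" keeps the inner URI readable, as in the example URL in the text.
    return endpoint + "?" + urlencode({"query": f"DESCRIBE <{uri}>"}, safe=":/")
```

For example, after `add(db, bill, "ex:label", "H.R. 1")`, calling `match(db, subject=bill)` returns that single row, and `describe_url("http://www.rdfabout.com/sparql", bill)` rebuilds the kind of redirect target quoted above. A real triple store adds indexes, datatypes, and a full SPARQL engine on top of this.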

When data gets big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my area of the Semantic Web as several clouds. One cloud is the data I generate from GovTrack. Another cloud is data I separately generate about campaign contributions from data files from the government’s Federal Election Commission (FEC): 10 million triples. This cloud relates politicians to election campaigns and elections, campaign donors with zipcodes, and contribution amounts. A third data set is based on the 2000 U.S. Census, 1 billion triples. The census data has population demographics for many geographic levels, including states, congressional districts, and postal zipcodes (actually “ZCTA”s but we can put that aside). (For more, see http://rdfabout.com. Through the Census cloud the data is linked to Geonames and the rest of the Linked Open Data community.)

I’ve related the clouds together so we can take interesting slices through them. The GovTrack data connects to the FEC data through politicians. The Census data connects to the GovTrack data through states and congressional districts (the regions represented by senators and representatives) and to the FEC data through zipcodes. That means we can ask questions that go beyond one data set, such as: what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode, etc.? Once the Semantic Web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the heavy work of meshing two data sets for each new question once the data is already in RDF with connected URIs.
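The point about connected URIs can be sketched in a few lines of Python. Everything here is a toy: the `ex:` identifiers, the predicate names, and the population figures are placeholders rather than values from the real GovTrack or Census data. The sketch only shows that once two data sets use the same URI for a district, a question spanning both is a simple hop.

```python
# Toy triples; every URI, predicate name, and number below is a made-up placeholder.
GOVTRACK = [
    ("ex:rep1", "ex:represents", "ex:district1"),
    ("ex:rep2", "ex:represents", "ex:district2"),
]
CENSUS = [
    ("ex:district1", "ex:population", 650000),
    ("ex:district2", "ex:population", 700000),
]


def objects(triples, subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def population_by_politician(govtrack, census):
    """Follow politician -> district in one cloud, then district -> population in the other."""
    result = {}
    for politician, predicate, district in govtrack:
        if predicate != "ex:represents":
            continue
        # The district URI is the shared key between the two data sets.
        for population in objects(census, district, "ex:population"):
            result[politician] = population
    return result
```

Calling `population_by_politician(GOVTRACK, CENSUS)` maps each placeholder politician to the population of the district they represent. No per-question meshing step is needed because both lists already agree on the district identifiers.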

Figure 1

My dream is to be able to plug SPARQL queries into visualization websites like Many Eyes, Swivel, and mapping tools and instantly get an answer to my question in a compelling form. For now, some copy-paste is necessary. Let’s take an example. Did a state’s median income predict the votes of senators on H.R. 1, the economic recovery bill? Perhaps the senators from the poorest states, likely the most affected by the economic trouble, were more likely to want economic stimulus. This query takes a path through two of my clouds, depicted in Figure 1. The SPARQL query mimics the picture: each edge corresponds to a statement in the query. The real query is more complicated, though (it’s given at http://www.govtrack.us/developers/rdf.xpd). It is complicated not because RDF or SPARQL is inherently complicated, but because the data model that I chose to represent the information is complicated. That is, I made my data set very detailed and precise, and it takes a precise query to access it properly. If you run it on the SPARQL form on that page, get the results in CSV format, copy them into Excel, and run a correlation test, you’d indeed find a moderate correlation between median income and vote, but in the direction opposite to what we expected. (I know why, but I’ll let you think about it.)
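The Excel step can also be done in a short script. The following Python sketch computes a Pearson correlation between an income column and a yes/no vote column in CSV text. The column names (`income`, `vote`) and the "Aye" label are assumptions about how the exported results might look, not the actual output format of the GovTrack endpoint, and the sketch does not reproduce the result described above.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def income_vote_correlation(csv_text, income_col="income", vote_col="vote", yes_label="Aye"):
    """Correlate income with a vote coded as 1.0 (yes_label) or 0.0 (anything else)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    incomes = [float(row[income_col]) for row in rows]
    votes = [1.0 if row[vote_col] == yes_label else 0.0 for row in rows]
    return pearson(incomes, votes)
```

A positive value means higher incomes go with "Aye" votes under this coding, a negative value the reverse. With a yes/no variable this is the point-biserial correlation, which is the same calculation as a correlation test run on a 0/1 column in a spreadsheet.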

Figure 2

Another interesting case is whether campaign contributions to congressmen mostly come from their district, or if they get contributions from sources far away. The SPARQL query listed in example 2 extracts the relevant numbers for Rep. Steve Israel from New York: for each zipcode, the total amount of campaign contributions he received from individuals with addresses in that zipcode in the last election. Figure 2 puts these values on a map, with congressional districts overlaid as well. A form where you can submit a SPARQL query like these examples and see the results instantly on a map would be incredible for data investigation.

So what is government transparency, practically speaking? It’s more than just information disclosure. Transparency means the public can get answers to their burning questions. The more questions they can answer from a dataset, the more transparency it provides. We can have more transparency without necessarily more disclosure but instead with the ability to apply better tools. Meshing and querying government datasets with RDF and SPARQL could be a new way to reach new heights of civic engagement and public oversight.

Example 1

Get a table of how senators voted on all of the Senate bills in 2009-2010:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bill: <http://www.rdfabout.com/rdf/schema/usbill/>
PREFIX vote: <http://www.rdfabout.com/rdf/schema/vote/>

SELECT ?bill ?voter ?option WHERE {
  ?bill a bill:SenateBill .
  ?bill bill:congress "111" ;
        bill:hadAction [
          a bill:VoteAction ;
          bill:vote [
            vote:hasOption [
              vote:votedBy ?voter ;
              rdfs:label ?option ;
            ]
          ] ;
        ] .
}

Example 2

Get total campaign contributions to Rep. Steve Israel by zipcode:

PREFIX fec: <http://www.rdfabout.com/rdf/schema/usfec/>

SELECT ?zipcode ?value WHERE {
  ?campaign fec:candidate .
  ?campaign fec:cycle 2008 .
  ?zipcode fec:zipAggregatedContribution [
    fec:toCampaign ?campaign ;
    fec:amount ?value
  ] .
  ?zipcode fec:zcta ?uri .
}


Interesting semantic web stuff

By Tom Scott
This guest post originally appeared on Tom Scott’s blog; republished under Creative Commons License, and with kind permission of the author.

It’s starting to feel like the world has suddenly woken up to the whole Linked Data thing, and that’s clearly a very, very good thing. Not only are Google (and Yahoo!) now using RDFa, but a whole bunch of other rather exciting things are going on. Below is a round-up of some of the best. But if you don’t know what I’m talking about you might like to start off with TimBL’s talk at TED.

TimBL is working with the UK Cabinet Office (as an advisor) to make our information more open and accessible on the web [cabinetoffice.gov.uk]
The blog states that he’s working on:

  • overseeing the creation of a single online point of access and work with departments to make this part of their routine operations.
  • helping to select and implement common standards for the release of public data
  • developing Crown Copyright and ‘Crown Commons’ licenses and extending these to the wider public sector
  • driving the use of the internet to improve consultation processes.
  • working with the Government to engage with the leading experts internationally working on public data and standards

The Guardian has an article on the appointment.

Closer to home there have been a few interesting developments

Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections [pdf]
Our paper at this year’s European Semantic Web Conference (ESWC2009) looks at how the BBC has adopted semantic web technologies, including DBpedia, to help provide a better, more coherent user experience. We won best paper of the in-use track for it; congratulations to Silver and Georgie.

The BBC has announced a couple of SPARQL endpoints, hosted by Talis and OpenLink [welcomebackstage.com]
Both platforms allow you to search and query the BBC data in a number of different ways, including SPARQL — the standard query language for semantic web data. If you’re not familiar with SPARQL, the Talis folk have published a tutorial that uses some NASA data.

A social semantic BBC? [slideshare]
Nice presentation from Simon and Ben on how social discovery of content could work… “show me the radio programmes my friends have listened to, show me the stuff my friends like that I’ve not seen” all built on people’s existing social graph. People meet content via activity.

PricewaterhouseCoopers’ spring technology forecast focuses on Linked Data [pwc.com]
“Linked Data is all about supply and demand. On the demand side, you gain access to the comprehensive data you need to make decisions. On the supply side, you share more of your internal data with partners, suppliers, and—yes—even the public in ways they can take the best advantage of. The Linked Data approach is about confronting your data silos and turning your information management efforts in a different direction for the sake of scalability. It is a component of the information mediation layer enterprises must create to bridge the gap between strategy and operations… The term “Semantic Web” says more about how the technology works than what it is. The goal is a data Web, a Web where not only documents but also individual data elements are linked.”
Including an interview with me!

You should also check out…

sameas.org a service to help link up equivalent URIs
It helps you to find co-references between different data sets. Interestingly it’s also licensed under CC0, which means all copyright and related or neighboring rights are waived.
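To illustrate what linking up equivalent URIs involves, here is a small Python sketch that groups URIs into co-reference bundles from pairwise "same as" statements. It is an assumption-laden toy and not a description of how sameas.org is implemented: the function name `coreference_bundles` is invented, and the technique shown is a plain union-find over whatever pairs you supply.

```python
def coreference_bundles(pairs):
    """Group URIs into bundles, given pairs asserted to name the same thing."""
    parent = {}

    def find(uri):
        # Register unseen URIs as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two bundles

    bundles = {}
    for uri in list(parent):
        bundles.setdefault(find(uri), set()).add(uri)
    # Sorted output keeps the result stable regardless of input order.
    return sorted(sorted(bundle) for bundle in bundles.values())
```

Given the pairs `("a", "b")`, `("b", "c")` and `("d", "e")`, the function returns two bundles: one holding `a`, `b`, `c` and one holding `d`, `e`. Equivalence is treated as symmetric and transitive here, which is a simplification; real co-reference data often needs more care about who asserted what.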


Image: “Semantic Web Rubik’s Cube” by dullhunk, CC License, via flickr

The BBC, the Graph, and Linked Data Stores

Over the past few weeks, Talis has been working with the BBC to crawl their programmes and music sites and pull a bunch of usable data into a Platform store. This store now contains information on over 360,000 programmes and more than 34,000 musicians. There is data about albums and reviews, and about programme series and even versions of episodes. This is an interesting dataset.

What’s more, the BBC have made this data available to you to mashup and make use of. They’ve discussed their SPARQL endpoint on their Backstage developers’ blog. We’ve got more details about the store, including information on how you can get a hold of the data over on our n2 developers’ blog.

Leigh, in the n2 post, listed several applications he could see for the data:

Programme Reviews. It’d be easy to build a mashup of the BBC programmes data and something like Revyu (which also has a SPARQL endpoint) to allow someone to review a programme that they watched last night. Note that, as our crawling will be lagging behind the live site until we’ve implemented real-time updates, there will be a lead time between something being aired and its appearing in the Platform for reviewing.

PVR Integration. There are a number of open source PVR solutions out there, could some of these be updated to automatically pull in additional data from the endpoint to improve electronic programme guides?

Geographic Overlays. The interconnections between radio programmes, artists and their locations offer an opportunity to build some mapping mashups, using either Google Maps or Earth. For example, it ought to be possible to lay out the geographic spread of artists played by different BBC radio programmes and stations. Interested in music from a particular country or region? (Maybe you’re planning a trip there and want to pick up on the local vibe.) Then use a map to home in on radio programmes that are most likely to play those artists.

Fan Widgets. The ability to extract data from the endpoint using SPARQL and JSON means that it’s really easy to create little widgets to include programme data on external web pages. Could something like the Doctor Who Tardis Index File be enriched by widgets that came straight from the BBC database? Throw in additional annotations from the community and you could make some really interesting embeddable gadgets. Of course there’s also the other direction: if fan communities start using BBC identifiers then the BBC may be able to feed this crowd-sourced data back into their site, just as they’re doing with Wikipedia (via DBpedia).

Under the Talis Connected Commons scheme anyone can have free hosting on the Platform for public domain data, so if a fan community wanted to organize itself around creating additional annotations for BBC programmes (how about character lists? mood assessment? scene breakdowns?) then these can be stored in the Platform for free, and then mashed up with the BBC data on the server-side using features like the Augmentation service, or on the client-side using SPARQL and JSON. Lots of potential there.
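As a rough idea of the client-side "SPARQL and JSON" route, the Python sketch below flattens a response in the standard SPARQL JSON results format into plain rows that a widget could render. The sample response, its variable names, and the `example.org` URI are invented for illustration; they are not taken from the BBC store, and no request is made to any endpoint.

```python
import json


def rows_from_sparql_json(text):
    """Turn a SPARQL JSON results document into a list of {variable: value} rows."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # A variable may be unbound in a given row (e.g. with OPTIONAL), so default to None.
        rows.append({v: binding[v]["value"] if v in binding else None for v in variables})
    return rows


# Hypothetical response body; a real one would come from an HTTP request to an endpoint.
SAMPLE = """
{
  "head": {"vars": ["programme", "title"]},
  "results": {"bindings": [
    {"programme": {"type": "uri", "value": "http://example.org/programmes/p1"},
     "title": {"type": "literal", "value": "An Example Programme"}},
    {"programme": {"type": "uri", "value": "http://example.org/programmes/p2"}}
  ]}
}
"""
```

`rows_from_sparql_json(SAMPLE)` yields two rows, the second with `title` set to `None`. From there a widget only has to loop over the rows and emit markup, which is why the format suits small embeddable gadgets.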

Ivan Herman talks about the Semantic Web and W3C

In my latest podcast I talk with Ivan Herman, Semantic Web Activity Lead at the World Wide Web Consortium (W3C).

We discuss W3C’s continued engagement with Semantic Web activity around the world, touch upon current activity to enhance existing specifications such as SPARQL, and consider the success of the Linked Data meme.

This conversation was recorded on Wednesday 8 April, 2009.

For other Talis podcasts in this Nodalities series, see here. To subscribe to updates from all of Talis’ podcast series, see here.

This Week’s Semantic Web

Selected links related to Semantic Web technologies for the week ending 2007-04-21, all weeks. Also available in RDF as linked data or via GRDDL.

How the Web Works

- source: danbri on Flickr, CC license

A fairly random mix this week. Above you can see Dan Brickley’s revision of the imagery that first saw light in Tim Berners-Lee’s slides in 1994. The original slides showed the connection between Web documents and things in the real world. Danbri’s added a nice twist in the thoughts of the people in the diagram: they don’t necessarily see the world in the same way. Fortunately Semantic Web technologies offer ways of saying things that allow for differing perspectives, and the means to make use of coincidences between things different people have said. In recent months DBpedia has played a key role in the Linking Open Data cloud by providing common reference points derived from Wikipedia (check the podcast with Richard Cyganiak). This week sees the announcement of a set of new services around UMBEL, an upper-ish ontology which aims to play a similar role.

There are a few more offerings for further up the Semantic Web stack from the W3C, and to demonstrate that such things aren’t necessarily incompatible with regular RDF development, Jim Hendler points to an OWL 2 profile that should be of interest to even the most web-fetishist audience: Towards RDFS 3.0 (or OWL 2 R Full).

A little trivia: subscribers to Planet RDF (one of the main sources for material here) may have noticed my personal blog hasn’t appeared recently; I’ve still got a few things to fix up after a bad server crash. This blog may not appear over there either for a little while as the Talis blogs are migrated over to being WordPress-based. So the only conduit from Talis to Planet RDF this week is Ian’s blog, latest offering: The Terrible And Tragic Tale Of Brian The Snail. I suppose there’s always the magazine.


Sources include Planet RDF, various other blogs, Semantic Web Interest Group IRC Chatlogs & Scratchpad, ESW Wiki, SemWebCentral, Sweet Tools, W3C Semantic Web Activity, mailing lists, personal emails etc etc. If you see anything suitable this coming week, please mail me or use the del.icio.us tag “twsw”. Thanks!