Nodalities

From Semantic Web to Web of Data

LOD Around-the-Clock (LATC)

Guest post by Lin Clark and Michael Hausenblas, DERI

In this, the Petabyte Age, technologists have a growing obsession with data: Big Data. But data isn’t just the province of trained specialists anymore. Data is changing the way scientists research and the way that journalists investigate; the way government officials report their progress and the way citizens participate in their own governance.

The challenge that all of these accidental technologists face is how to surface data and bring data together in meaningful ways. As Google’s chief economist Hal Varian has said, the scarce factor is no longer the data, which is essentially free and ubiquitous; now the “scarce factor is the ability to understand that data and extract value from it.”

The emerging Web of Linked Data is the largest source of this data (multi-domain, real-world and real-time data) that currently exists. As data integration and information quality assessment increasingly depend on the availability of large amounts of real-world data, these new technologists are going to need to find ways to connect to the Linked Open Data (LOD) cloud.

Although the LOD cloud has grown explosively, doubling in size every 10 months since 2007, utilising this global data space in a real-world setup has proved challenging: the number and quality of links between LOD sources remain low, and there is no well-documented, cohesive set of tools that enables individuals and organisations to easily produce and consume Linked Data.

A new project aims to change this, making it easier to connect to the LOD cloud by offering support to data owners, Web developers who build applications with Linked Data, and small and medium enterprises that want to benefit from the lightweight data integration possibilities of Linked Data.

LATC to the Rescue

The new LOD Around-the-Clock (LATC) project kicked off on September 13-14, 2010 at the Digital Enterprise Research Institute in Galway, Ireland. LATC brings together a team of Linked Data researchers and practitioners from DERI (National University of Ireland Galway), Vrije Universiteit Amsterdam, Freie Universität Berlin, Institut für Angewandte Informatik, and Talis.

This team will support the production and consumption of Linked Data by providing:

  1. A recommended tools library for publishing and consuming Linked Data, supplementing documentation for the tools, and free implementation support for large-scale data publishers and consumers. Tools include the D2R Server for publishing relational databases on the Semantic Web, the Drupal CMS and related publishing and consumption tools, and others.
  2. A 24/7 interlinking platform (see Fig. 1) that acquires new data and creates links between existing datasets in the LOD cloud.
  3. Publication of new large-scale LOD datasets with data from governmental departments and other organizations. The focus will be on EU level datasets such as CORDIS, the European Patent Office, and Eurostat.

Fig. 1: LATC structure diagram

Homepage: http://latc-project.eu/
Twitter: @latcproject
Duration: 09/2010 to 08/2012
Total cost: 1.19 M€
EU contribution: 1.06 M€
Further information: Dr. Michael Hausenblas, IDA Business Park, Galway, Ireland
Tel. +353 91 495730
michael.hausenblas@deri.org

In addition to the core team, a large Advisory Committee with more than 30 members will participate in the LATC activities and connect the Linked Data community to LATC’s recommended tools library and support services. Organizations on the Advisory Committee are entitled to support from the project and thus will be in a position to give feedback to improve the support services. The Advisory Committee includes governmental organisations such as the UK Office of Public Sector Information and the European Environment Agency; researchers and practitioners such as the University of Manchester, University of Economics Prague, Vulcan Inc., CTIC Technological Center, and the Open Knowledge Foundation; and standardisation bodies, including W3C (Tim Berners-Lee). The LATC partners will also liaise with other EC projects and related activities, including LOD2, PlanetData, SEALS, datalift.org, Semic.EU, OKFN, and the Pedantic Web group.

LATC organises and supports a number of community events, including tutorials at the International Semantic Web Conference 2010 in Shanghai, China, as well as the Open Government Data Camp, London.

LATC is a Support Action funded under the European Commission FP7 ICT Work Programme, within the Intelligent Information Management objective (ICT-2009.4.3).

Linked Open Data and Pavlova

If Sir Tim Berners-Lee can equate Linked Data with a packet of crisps/potato chips, I thought I would take a stab at another food metaphor for this post.

Linked Open Data (LOD) is a concept that many believe they understand. Take yourself to almost any conference that has a connection with data, or the web, or the Internet at the moment, and it will not be long before you see a slide of the Linked Open Data cloud diagram, or of Sir Tim imploring us to give him our raw data now, or, if you are very lucky, a shot of him doing his imploring whilst stood in front of a shot of the LOD cloud. Simple really: just publish your data as Linked Open Data and all will be wonderful as we move towards the sunlit Semantic Web uplands. Unfortunately life is never that simple – LOD is not a single identifiable thing. As Paul Walk eloquently puts it:

  1. data can be open, while not being linked
  2. data can be linked, while not being open
  3. data which is both open and linked is increasingly viable
  4. the Semantic Web can only function with data which is both open and linked

As with any recipe for success, the majority concentrate on the final result, praising or criticising it as a whole without identifying the benefits, or otherwise, of the individual ingredients. Take a strawberry pavlova for instance. If you are into that kind of thing, it is a delightful culmination of the culinary arts designed to send your taste buds into raptures. Unless, that is, you don’t like cream, or you don’t like strawberries, or can’t abide meringue, in which case the whole thing seems a little pointless.

What has this got to do with Linked Open Data (LOD), I hear you ask. Well, I am increasingly seeing LOD being presented as the goal for those wishing to publish their data online. My position is that the eventual goal, from which will spring a Semantic Web, is a global web of linked and open data. However, there are many steps from where we are now to achieving that goal. Within audiences that I present to, and/or sit amongst, I see people who for whatever reasons do not ‘get’ one or more of the components of LOD – they cannot envisage opening up any of their data, or think that using a web address for an identifier is overly complex, or have a religious aversion to RDF. As a result they dismiss the whole recipe as not for them, or worse still, as something impractical that will become nothing more than the plaything of a few passionate enthusiasts.

When someone who is still struggling with the concept of opening up their organisation’s data, or with why RDF might be a more useful format than csv, is shown the ubiquitous Linked Open Data cloud diagram with encouragement to join in, it is hardly surprising they remain a little unconvinced. This isn’t a criticism of presenters either. In only 20 minutes on a stage, it is difficult to go into underlying detail.

Let me try, in a few paragraphs, to break the LOD pavlova into its ingredients:

  • Data – In the context of this post, by data I mean machine readable information, produced in a format that can be consumed and processed by other machines. Inevitably, this means file formats such as csv, XML, RDF, etc., but not something like pdf, html, or word, which, although transferable, are designed for human consumption not machine analysis.

    For some, just this step from their current human targeted format, to a machine readable one, is a significant one.

  • Open Data  – Data (see above) which is accessible for all to download, view, and consume in a way that is not encumbered by licensing that restricts its use.  For example, the licensing used by data.gov.uk data.  By definition data which is restricted for certain uses is not fully open.  

    In our internet based world, openness can also be defined in terms of technical accessibility.  If it is only available after a login process, or it is only available to users behind a firewall, it couldn’t be considered as open. 

  • Linked Data – Data (see above) which contains URIs as identifiers for concepts described in the data and URIs to identify the relationships between those concepts.  The four Linked Data Principles, as published as a design note by Tim Berners-Lee, provide a bit more detail on this.

    I am in danger of stirring the embers of a religious fire fight here, between those that believe that Linked Data must be described in RDF and contain URIs as identifiers, and those that maintain that you can have data linked across the web without those constraints.  All I am going to say on that at this time, is that the Linked Open Data cloud of data sets has been successful, based on the first of those two views. (if you want to follow that particular debate in more detail, Paul Miller’s post and associated comments would be a good starting point)
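
To make the step from plain data to Linked Data a little more concrete, here is a minimal sketch in Python. It takes one csv record and restates it as subject, predicate, object statements in which the concepts and the relationships are identified by URIs. Everything under example.org, the column names, and the `row_to_triples` helper are invented for illustration; a real publisher would choose their own URIs and would normally reuse an established vocabulary.

```python
import csv
import io

# A plain csv record: machine readable, but the identifiers are local strings.
CSV_TEXT = "id,name,capital\nfr,France,Paris\n"

# Hypothetical base URIs, invented for this sketch.
BASE = "http://example.org/id/"
VOCAB = "http://example.org/vocab/"


def row_to_triples(row):
    """Turn one csv row into (subject, predicate, object) triples.

    Concepts and relationships are identified by URIs; plain values
    such as names stay as literal strings.
    """
    subject = BASE + "country/" + row["id"]
    return [
        (subject, VOCAB + "name", row["name"]),
        (subject, VOCAB + "capital", BASE + "city/" + row["capital"].lower()),
    ]


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
triples = [t for row in rows for t in row_to_triples(row)]
for s, p, o in triples:
    print(s, p, o)
```

The csv version is data; put on the web under an unrestrictive licence it would be open data; only the second form, with URIs that others can point at, starts to look like Linked Data.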

So, how can data be open, but not linked? – by publishing it in a non-Linked Data form such as a text file or an html page or a pdf. Where would you find this? – all over the web. As Sir Tim encourages with his call to give us your raw data now, and as I detailed in my previous “data publishing three-step” post, this is often the first element of getting your data out there for others to consume.

How can data be Linked but not open? – by publishing it in accordance with the principles, in RDF, with URIs, but limiting its use either by imposing restrictive licensing conditions or by restricting access to the data. Where would you find this? – again all over the web, but often hiding behind restrictive licensing terms such as “non-commercial use only”. It is also to be found inside organisational firewalls. For example, commercial organisations can realise the benefits of using Linked Data techniques with their internal private data, potentially linking it to publicly visible concepts across the web to add even more value for their employees.

Data that is Linked and Open, like that strawberry pavlova, has the power to deliver value beyond the sum of its individual ingredients. Providing data in a form that is linked to other data, and easy for others to link to, without restrictions on who does that linking or how, lays the foundation for a web of linked data built on the same principles that fostered the growth of the web of documents that has so changed our world over the last decade and a half.

The ingredients that formed that World Wide Web of documents – html, http, open publishing of web sites without restrictions on others’ ability to consume and/or link to them – were individually important developments. However, when those elements were blended together their effects were multiplied manyfold and resulted in the web we experience today.

So [as I stretch my culinary metaphor to its limits] if you are hoping to take people with you in building a Linked Open Data future, you not only have to show them a picture of the final dish, you need to describe the individual ingredients and their relevance to the eventual result.

Pictures from Flickr by PhOtOnQuAnTiQuE and avixyz

Best Buy: Semantic Web and Retail

In this Nodalities Podcast, I speak with Jay Myers from Best Buy about how he and his team are working within the retail giant to better harness their data. Jay tells us about his use of blogs and RDFa to better manage “open-box” products returned to Best Buy’s many stores in an effort to surface deals to the public and make savings on otherwise costly problems.

Jay also explains how Best Buy are publishing the machine-readable data out on the public web and touches on the next steps Best Buy will be taking. He also calls on the Semantic Web community to take an active role in promoting work like this by voting for his panel at South by Southwest, which you can see here.

Jay Myers is a Lead Web Development Engineer for Best Buy, and is an active supporter of the GoodRelations vocabulary for ecommerce, utilizing it for modeling consumer products, stores, and services in both RDF/XML and RDFa. For more information, you can read his blog or catch him on Twitter.

A conversation about The Interactive Knowledge Stack

My guests on this Talking with Talis podcast are Wernher Behrendt and John Pereira of Salzburg Research. They are part of the team behind IKS, the Interactive Knowledge Stack, an Integrating Project part-funded by the European Commission.

The four-year project started in January 2009 to provide an open source technology platform for semantically enhanced content management systems. The concept behind it is that, once developed, the stack can be bolted on to many different CMS products to add semantic, and semantic web, capabilities. Even though the project is open source, and the obvious use of it is with open source CMS tools, its use could be of equal value to commercial products.

Their target is to engage with 40 small to medium organisations for whom developing such capability would not be possible with their limited resources. They are already well on the way, with many joining in via the project Web site and participating at the first early adopters workshop in Salzburg in June.

Extending the Semantic Web (from Crete, with love)

This is my first year attending the ESWC (formerly the “European Semantic Web Conference”, now the “Extended Semantic Web Conference”; cleverly, the acronym still works) near Heraklion on Crete. It’s only a couple of days in, but I thought it’d be a good time to report back to the Nodalities readers. ESWC is a gathering of some of the world’s most influential Semantic Web thinkers, and for me it’s been a few days of meeting people in the flesh with whom I’ve been in touch online for years. As one bloke put it: “What’s kept you away?”

Well, I’m extremely glad I’ve not been kept away this year, and have been excited to see what’s been built recently. ESWC is a very academic conference; indeed I’m quietly auditing the PhD Symposium as I type this. There are papers, PhD symposia, demos and expositions on topics covering anything from ontology development to MapReduce processing of RDF triples. It seems a very fertile seedbed, with many of these ideas having the potential of growing into projects, startups, papers and possibly industries.

I’ve made a subtle and largely subconscious transition towards blogging mostly about projects that are up and running. This has been important because the Semantic Web world is no longer one of “someday,” but a world of current and continuous activity. So, I’ve talked about visualisations of data, products running on Linked Data, data.gov.uk et al.; and I’ve held back on discussing the purely possible. It’s been exciting and uplifting to see the conceptual evolve to the proven and working. But this is a reflection of progress, of moving from hypothesis to implementation. It doesn’t mean the concepts have stopped flowing. It’d be a very short story in the history of human communication if the Semantic Web had used up all of its possibilities in ten years!

ESWC is a little microcosm of the wider research going on in Linked Data and related fields. It seems to me that Big Ideas need the traditional frameworks of academic investigation. Questions need to be asked and answered and debated and tried and broken and rebuilt. Much of this science will not become technology, and this is wholly acceptable because it gives the Big Ideas a lot of scope to be refined.

ESWC is just such a place. PhD students and researchers fill the schedule with proposals and reports, and many possibilities are being constantly debated around coffee, beer, and the beach. It’s been a thoroughly fascinating few days, and I’m very much looking forward to more over the next few.

As a quick note, Talis sponsored the Scripting for the Semantic Web challenge for this, its final year. Alexandre Passant and Pablo Mendes won the prize with SPARQLpush.

The Greatest Challenge Facing IT

by Lee Feigenbaum and Mike Cataldo

This article features in Nodalities magazine, Issue 7

As the old adage goes: Time is money.

Ultimately, information systems are about saving time. One could argue that technology enables analysis that facilitates competitive differentiation or improved product quality, but the fact of the matter is that these things and others could all be done without computers; they would just take much, much longer.

A lot has been said and written about information overload. Ultimately, though, the issue with ever-expanding data is that the data we need becomes hidden in mountains of other data. Typically, these mountains take the form of relational databases where the data is neatly stored in rows and columns, and we find the data in one of two ways. Either we directly look up data by its “address” within the database, or else we use a simple text search. But if we don’t know what table or column the data resides in, we can’t look it up. And as the quantity of data grows, text searching the mountain of data itself yields a mountain of results. Combing through these results then compromises the real benefit of information technology: time savings.

This leads to the greatest challenge facing IT organisations across industries: how to provide users the data they need when they need it, visualised in a way that is understandable and useful. Or put more simply: get the right data, for the right people, at the right time. Traditionally, this is much easier said than done, as the data lives in multiple databases, exists in various formats, and no user interface exists to present the information in a way that is helpful to the user.

Typically, the approach to solving these problems involves some sort of data warehouse. Atop the warehouse, we’d probably deploy a business intelligence (BI) solution to surface the answers to common queries to the people who need them.

Another tactic might be to install a document management system that stores documents in a central repository, where employees can use search and basic metadata to better locate individual pieces of information.

Or we might build a portal to allow people to view the right data from multiple silos in a timely fashion. By defining a collection of portlets as views into specific sources of data, we can provide a one-stop location for people to view information from business-critical data sources.

Pursuing any of these typical solutions means spending 6-18 months at a time solving a single problem. And even worse, all of these approaches are doomed to obsolescence from the start. As requirements change, the fixed schemas and the complex ETL processes inherent to data warehouses must be recreated from scratch. The canned queries and views that define BI- and portal-based approaches must be constantly re-evaluated. And the limited search and query capabilities of a document management system mean that new requirements demand a new installation.

In short, traditional approaches all suffer from the dreaded Shampoo Syndrome: the only workable long-term solution is to constantly lather, rinse, and repeat. And when we do, we just create another mountain of data, another place where what we really need can hide.

The solution is to find data by its meaning rather than its location

The key to eliminating many of the inefficiencies of today’s information technology solutions is to access data by its meaning—what it is—rather than its location—where it is. With meaning, we can quickly find what we need simply by describing what it is. This enables information to be shared and consumed at the data level, a paradigm known as data collaboration.

With data collaboration, the data is much more granular, more accessible, and more consumable. In contrast, data warehouse, BI, and portal solutions, in addition to contact tracking (CRM), supply-chain management (SCM), employee management (HR), and all-in-one enterprise bundles (ERP), all fall into the category of data containment. While these applications (commonly known as data silos) excel in capturing extremely structured data, they make it almost impossible to get the data out to be re-used by other users and in other applications.

Document management systems, on the other hand, attempt to make information more shareable, but essentially end up creating many mini-silos in the form of Word documents, PDFs, Excel spreadsheets, or Web pages. This is the world of document collaboration, in which information is readily shared, but the data we need is locked within the mini-silo.

Data collaboration is the best of both worlds. By combining the ease of access to information that is the hallmark of document collaboration with the highly structured nature of data from data containment solutions, we can begin to answer the IT challenge. The key to success is to ensure that the meaning of every data element is surfaced so that it can be easily accessed by any person or application that needs it.

Data Collaboration and the Semantic Web

It’s no coincidence that the technology standards developed over the past ten years in support of Tim Berners-Lee’s vision of a Semantic Web are the key elements for building data collaboration solutions. For as with data collaboration, the Semantic Web relies on explicitly capturing the meaning of data. As such, the core Semantic Web standards pave the way for:

  • Flexible, define-as-it-arrives, data structures
  • Explicit relationships that travel with the data
  • Data that is accessed by its definition rather than its address
  • Distributed query

As with all standards, Semantic Web technologies lay the groundwork that makes improvement possible. It is up to application developers to build solutions that make the standards practical.

Practical Data Collaboration to Solve IT’s Challenge

Cambridge Semantics is one of the first companies to develop practical business solution enablers based on Semantic Web standards. In short, the Anzo products allow businesses to layer a semantic fabric over existing data that:

  1. Virtualizes the data so that it is accessible by its description regardless of location.
  2. Lets users create their own views of data.
  3. Fills in the views by traversing the fabric and picking out the relevant information.
  4. Keeps everything in synch by allowing updates that occur anywhere to update information everywhere.

The Right Data…

At the heart of the Anzo suite of products is the Anzo Data Collaboration Server. This acts as a central gateway that provides a consistent interface for applications to read, write, and query RDF data, regardless of the actual source of the data. While RDF provides the flexibility to incorporate new data as it is virtualised, it’s all for naught without the proper adaptors for existing data sources. To facilitate access to the right data, the Anzo Data Collaboration Server can connect to data sources including LDAP directories, HTTP-accessible Linked Data, and standard relational databases.

But perhaps one of the most useful connectors is Cambridge Semantics’ Anzo for Excel. With Anzo for Excel, data inside spreadsheets with arbitrary layouts can be linked into the Anzo Data Collaboration Server. By breaking down the walls of spreadsheet mini-silos, Anzo for Excel weaves information from thousands (or more) spreadsheets scattered across a business, dramatically increasing the availability of the right data.

…For The Right People

Getting the data in front of the right people relies on three things: context, security, and “reach”.

Context. It’s not enough simply to have the right data. People must have access to views of the data that depict exactly what they need to see, whether it be an executive dashboard, a regional summary map, or a customer-by-customer detailed report. Cambridge Semantics’ visualisation product, Anzo on the Web, allows the same information to be rendered in many different ways via semantic lenses. Lenses provide context-appropriate user interfaces to render a particular type of data, meaning that the right people see the right data in the right way.

Security. In many ways, security is the converse of context. While context ensures that the right data surfaces properly to the right people, robust security makes sure data does not surface to the wrong people. The Anzo Data Collaboration Server provides security by layering a role-based access control model atop the semantic fabric. All data access is gated through this security model, which defers to the permissions schemes of legacy data sources where appropriate. The result is that only the right people can ever see (or change) the right data.

Reach. The right data needs to be able to be brought to the right person, whether that person is a technical staff member, a line-of-business manager, a “power user,” or a senior executive. As such, the software must be within reach of all users, without the need to call on IT. Research analysts must be able to collect and share spreadsheet data themselves. Anzo for Excel reaches these users by allowing spreadsheets to be visually linked with just a few clicks. Supply-chain managers must be able to drill through data on warehouses, suppliers, and distributors on their own terms. Anzo on the Web reaches these users via a simple and customisable faceted browsing paradigm, whereby anyone can add their own filters, add their own lenses, query their data however they like, and save the results to re-run later or share with colleagues.

…At The Right Time

Finally, it’s not enough to just bring the right data to the right people. It also needs to be done in a timely fashion.

First, data access against existing data sources is accomplished via federated (distributed) query. SPARQL is explicitly designed to enable queries that access multiple data sources at once, and the Anzo Data Collaboration Server includes a SPARQL engine that does exactly that. By querying the source data directly, Anzo eliminates the cycle time typically associated with a data warehouse’s ETL processes.
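
The idea behind federated query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not Anzo's API and not a SPARQL engine, and the two sources, the example.org URIs, and the `match`, `federated_match`, and `orders_with_customer_names` helpers are all invented for the sketch. Each source is a list of subject, predicate, object triples, and one pattern is run against every source so the results can be joined without first copying the data into a warehouse.

```python
# Two independent "sources", each a list of (subject, predicate, object) triples.
# All URIs and values are made up for this sketch.
VOCAB = "http://example.org/vocab/"
CRM = [
    ("http://example.org/customer/42", VOCAB + "name", "Acme Ltd"),
    ("http://example.org/customer/42", VOCAB + "region", "EMEA"),
]
ORDERS = [
    ("http://example.org/order/7", VOCAB + "placedBy", "http://example.org/customer/42"),
    ("http://example.org/order/7", VOCAB + "total", 1200),
]


def match(source, s=None, p=None, o=None):
    """Yield triples from one source that fit the pattern; None is a wildcard."""
    for triple in source:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


def federated_match(sources, s=None, p=None, o=None):
    """Run the same pattern against every source and merge the results."""
    for source in sources:
        yield from match(source, s, p, o)


def orders_with_customer_names(sources):
    """Join across sources: order -> customer URI -> customer name."""
    results = []
    for order, _, customer in federated_match(sources, p=VOCAB + "placedBy"):
        for _, _, name in federated_match(sources, s=customer, p=VOCAB + "name"):
            results.append((order, name))
    return results


print(orders_with_customer_names([CRM, ORDERS]))
```

The join works only because both sources use the same URI for the customer; the caller never needs to know which source holds which fact.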

Second, data updates performed via the Anzo Server are broadcast out in real-time to anywhere the data resides. This means that if a value is changed in a spreadsheet cell, the value instantly updates anywhere else it appears, including Web pages or within a relational database. This is essential because, as semantic tools are made available to users across the business enterprise, many spreadsheets, Web pages, and databases will share the same piece of data, and they can only do so with confidence if every copy stays current.

Data Collaboration in the Days to Come

Imagine a world in which this challenge has been solved. End users—whether knowledge workers, line of business managers, or executives—can simply draw a picture of what they want to see and then choose the data that should fill in the picture. Within minutes rather than months the right data shows up on the right people’s screens. Now imagine that the data is live as well: you make a correction to the data and your changes are reflected in real-time in whatever legacy database or application the data comes from. You’ve managed to maintain a single source of truth for your key information assets, while still preserving existing investments in legacy systems and applications.

What sounds miraculous is possible today, in software such as Cambridge Semantics’ Anzo. By combining the revolutionary enabling capabilities of Semantic Web standards with solid, practical engineering, we open the door on a completely new paradigm for enterprise software: data collaboration.

Lee Feigenbaum is VP of Technology and Standards at Cambridge Semantics and co-chairs the W3C SPARQL Working Group.

Mike Cataldo is currently CEO of Cambridge Semantics and a veteran of multiple technology start-up companies.

Building A Civic Semantic Web

By Joshua Tauberer
This article features in Nodalities Magazine, Issue 7

Technology is a new key player in government accountability and transparency. It’s our own defense against the threat of government information overload. Take the U.S. Congress: More than 10,000 bills are on the table for discussion at any given time, and Members of Congress are taking campaign contributions from thousands of sources. How can a representative be accountable if his legislative actions are too numerous to track? How can financial disclosure root out conflicts of interest if the interesting ones are buried deep within piles and piles of records? The threat to transparency isn’t sheer volume, however. It’s the complex network of relationships that makes up the U.S. Congress, and that makes it an interesting case for applying Semantic Web technology.

What the Semantic Web addresses is data isolation, and this is a problem for understanding Congress. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily meshable that MAPLight is possible. The Semantic Web makes this process cheaper by addressing meshability at the core. The more government data that is meshable, the easier it is to investigate connections across independent data sets, research the dynamics of the system, or teach others how Congress works.

Innovating the public’s engagement with Congress by applying technology has been the motivation behind my site www.GovTrack.us, a free congress-tracking tool that I built and have been running since 2004. GovTrack amasses a large XML database of congressional information, including the status of legislation, voting records, and other bits, by screen scraping official government websites that have the data online already but in a less useful form.

If “metadata” is tabular, isolated, and about web resources, the Semantic Web goes far beyond that. It helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in Congress with Members of Congress, what districts they represent, their population demographics, etc. We establish relations like sponsorship, represents, voted, and population across entities of many types. A web lets us ask new questions and, from there, transform their answers into visualizations. And because the Semantic Web is a generic platform for all data, I actually think it has the potential to radically and fundamentally transform the way we learn, share information, and live, but that’s still a bit far off.

So for the purposes of my tinkering with the Semantic Web, GovTrack creates an RDF dump of its database (13 million triples) covering bills, politicians, votes and more using a mix of existing schemas and some new ones that I created. I chose URIs for entities in the Linked Open Data tradition: HTTP-dereferenceable URIs that resolve to self-describing RDF/XML about the entity. Two good examples are for Senator John McCain and for H.R. 1, the economic recovery bill passed earlier this year. The HTML pages on GovTrack itself tie in to the RDF world through tags: bill pages include the URI I coined for the bill, for instance.

I also have a sometimes-working-sometimes-not SPARQL endpoint set up, SPARQL being the de facto query language for RDF. SPARQL lets us ask questions of the data, such as how politicians voted on bills (see Example 1). The SPARQL endpoint runs off of a “triple store”, the equivalent of a relational database for the semantic web, which underneath is a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. (It uses my own C#/.NET RDF library: http://razor.occams.info/code/semweb.) The RDF/XML returned by dereferencing the URIs is actually auto-generated by redirecting the user to a SPARQL DESCRIBE query (e.g. http://www.rdfabout.com/sparql?query=DESCRIBE+%3Chttp://www.rdfabout.com/rdf/usgov/congress/111/bills/h1%3E) using URL rewriting in Apache (for a robust solution, see my explanation at the end of http://rdfabout.com/demo/census/). For more about GovTrack’s RDF data, see http://www.govtrack.us/developers/rdf.xpd.
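To make the two mechanisms in this paragraph concrete, here is a minimal sketch in Python. It is not GovTrack's code (which uses the C#/.NET library mentioned above): sqlite3 stands in for MySQL so that it runs without a server, and the table name, the function names `make_store`, `match` and `describe_url`, and the use of prefixed names instead of full predicate URIs are all assumptions made for illustration. The endpoint address and the bill URI are the ones quoted in the text.

```python
import sqlite3
from urllib.parse import quote_plus


def make_store(triples):
    """Create an in-memory table of (subject, predicate, object) rows.

    sqlite3 is a stand-in for MySQL; the table layout is an assumption.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    db.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    return db


def match(db, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    clauses, params = [], []
    for column, value in (("subject", subject),
                          ("predicate", predicate),
                          ("object", obj)):
        if value is not None:
            clauses.append(column + " = ?")
            params.append(value)
    sql = "SELECT subject, predicate, object FROM triples"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY subject, predicate, object"
    return db.execute(sql, params).fetchall()


def describe_url(entity_uri, endpoint="http://www.rdfabout.com/sparql"):
    """Build the DESCRIBE request a dereferenced URI could be redirected to."""
    query = "DESCRIBE <" + entity_uri + ">"
    # Leave ':' and '/' unescaped, as in the URL quoted in the text;
    # quote_plus turns the space into '+' and '<', '>' into %3C, %3E.
    return endpoint + "?query=" + quote_plus(query, safe=":/")


# Illustrative usage with the H.R. 1 URI from the text.
H1 = "http://www.rdfabout.com/rdf/usgov/congress/111/bills/h1"
store = make_store([(H1, "bill:congress", "111")])
print(match(store, subject=H1))
print(describe_url(H1))
```

A single three-column table keeps the store schema-free: adding a new kind of fact means adding rows, not columns, which is the property the rest of this post relies on.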

When data gets big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my area of the Semantic Web as several clouds. One cloud is the data I generate from GovTrack. Another cloud is data I separately generate about campaign contributions from data files from the government’s Federal Election Commission (FEC): 10 million triples. This cloud relates politicians to election campaigns and elections, campaign donors with zipcodes, and contribution amounts. A third data set is based on the 2000 U.S. Census, 1 billion triples. The census data has population demographics for many geographic levels, including states, congressional districts, and postal zipcodes (actually “ZCTA”s but we can put that aside). (For more, see http://rdfabout.com. Through the Census cloud the data is linked to Geonames and the rest of the Linked Open Data community.)

I’ve related the clouds together so we can take interesting slices through them. The GovTrack data connects to the FEC data through politicians. The Census data connects to the GovTrack data through states and congressional districts (the regions represented by senators and representatives) and to the FEC data through zipcodes. That means we can ask questions that go beyond one data set, such as: what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode, etc.? Once the Semantic Web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the heavy work of meshing two data sets for each new question once the data is already in RDF with connected URIs.
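The point about connected URIs can be illustrated with a toy sketch in Python. Everything in it is made up for the example: the `ex:` names, the predicate names, and the numbers are placeholders rather than GovTrack, FEC, or Census data, and `follow` is a hypothetical helper. What it shows is that once two data sets use the same identifier for the same state, merging them is a set union, and a question that spans both becomes a walk along the edges.

```python
# Three toy "clouds" as sets of (subject, predicate, object) triples.
# All names and values here are invented placeholders.
GOVTRACK = {
    ("ex:senatorA", "ex:represents", "ex:stateX"),
    ("ex:senatorB", "ex:represents", "ex:stateY"),
}
CENSUS = {
    ("ex:stateX", "ex:medianIncome", 40000),
    ("ex:stateY", "ex:medianIncome", 55000),
}
FEC = {
    ("ex:senatorA", "ex:receivedFromZip", "ex:zip00001"),
}


def follow(graph, start, path):
    """Follow a chain of predicates from start; return every node reached."""
    frontier = {start}
    for predicate in path:
        frontier = {o for (s, p, o) in graph
                    if s in frontier and p == predicate}
    return frontier


# Because the clouds share URIs, merging them is a plain set union.
merged = GOVTRACK | CENSUS | FEC

# A question spanning two clouds: the median income of the state
# that senator A represents.
print(follow(merged, "ex:senatorA", ["ex:represents", "ex:medianIncome"]))
```

Asking a different question only changes the `path` argument; no new data-meshing step is needed, which is the lower marginal cost described above.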

Figure 1

My dream is to be able to plug SPARQL queries into visualization websites like Many Eyes, Swivel, and mapping tools and instantly get an answer to my question in a compelling form. For now, some copy-paste is necessary. Let’s take an example. Did a state’s median income predict the votes of senators on H.R. 1, the economic recovery bill? Perhaps the senators from the poorest states, likely the most affected by the economic trouble, were more likely to want economic stimulus. This query takes a path through two of my clouds, depicted in Figure 1. The SPARQL query mimics the picture: each edge corresponds to a statement in the query. The real query is more complicated, though (it’s given at http://www.govtrack.us/developers/rdf.xpd). It is complicated not because RDF or SPARQL are inherently complicated, but because the data model that I chose to represent the information is complicated. That is, I made my data set very detailed and precise, and it takes a precise query to access it properly. If you run it on the SPARQL form on that page, get the results in CSV format, copy them into Excel, and run a correlation test, you’d indeed find a moderate correlation between median income and vote, but in the direction opposite to what we expected. (I know why, but I’ll let you think about it.)
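The Excel step could also be scripted. The sketch below, in Python, computes a Pearson correlation between income and a yes/no vote from CSV text. The column names `income` and `option` and the `"Yea"` label are assumptions about what the exported CSV might look like, not the actual output format of the SPARQL form, so they would need adjusting to the real headers.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def income_vote_correlation(csv_text, income_col="income",
                            vote_col="option", yes_label="Yea"):
    """Correlate income with a vote encoded as 1 (yes) or 0 (anything else).

    The column names and the yes label are hypothetical defaults.
    """
    incomes, votes = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        incomes.append(float(row[income_col]))
        votes.append(1.0 if row[vote_col] == yes_label else 0.0)
    return pearson(incomes, votes)
```

Encoding the vote as 0 or 1 makes this a point-biserial correlation, which is the same calculation a spreadsheet's correlation function would perform on a 0/1 column. A negative result would mean that higher incomes go with fewer yes votes.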

Figure 2

Another interesting case is whether campaign contributions to congressmen mostly come from their district, or if they get contributions from sources far away. The SPARQL query listed in Example 2 extracts the relevant numbers for Rep. Steve Israel from New York: for each zipcode, the total amount of campaign contributions he received from individuals with addresses in that zipcode in the last election. Figure 2 puts these values on a map, with congressional districts overlaid as well. A form where you can submit a SPARQL query like these examples and see the results instantly on a map would be incredible for data investigation.

So what is government transparency, practically speaking? It’s more than just information disclosure. Transparency means the public can get answers to their burning questions. The more questions they can answer from a dataset, the more transparency it provides. We can have more transparency without necessarily more disclosure but instead with the ability to apply better tools. Meshing and querying government datasets with RDF and SPARQL could be a new way to reach new heights of civic engagement and public oversight.

Example 1

Get a table of how senators voted on all of the Senate bills in 2009-2010:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bill: <http://www.rdfabout.com/rdf/schema/usbill/>
PREFIX vote: <http://www.rdfabout.com/rdf/schema/vote/>

SELECT ?bill ?voter ?option WHERE {
  ?bill a bill:SenateBill .
  ?bill bill:congress "111" ;
        bill:hadAction [
          a bill:VoteAction ;
          bill:vote [
            vote:hasOption [
              vote:votedBy ?voter ;
              rdfs:label ?option ;
            ]
          ] ;
        ] .
}

Example 2

Get total campaign contributions to Rep. Steve Israel by zipcode:

PREFIX fec: <http://www.rdfabout.com/rdf/schema/usfec/>

SELECT ?zipcode ?value WHERE {
  ?campaign fec:candidate .
  ?campaign fec:cycle 2008 .
  ?zipcode fec:zipAggregatedContribution [
    fec:toCampaign ?campaign ;
    fec:amount ?value
  ] .
  ?zipcode fec:zcta ?uri .
}


Jim Hendler and Li Ding talk about work to convert Data.Gov resources to RDF

In my latest podcast I talk with Jim Hendler and Li Ding of the Tetherless World Constellation at Rensselaer Polytechnic Institute in Troy, New York.

We discuss work that they and colleagues have been undertaking to convert chunks of the US Federal Government data released via the data.gov portal to RDF.

During the conversation, we refer to the following resources;

This conversation was recorded on Friday 7 August, 2009.

For other Talis podcasts in this Nodalities series, see here

Might Semantic Technologies permit meaningful Brand relationships?

This post will appear in Nodalities Magazine, Issue 7

by Paul Miller

Much has been written about growing Enterprise use of social media (usually Twitter, these days) to successfully track and mitigate customer complaints. Many have been quick to spot that the disproportionately high cost of satisfying (or, more cynically, silencing) these early adopters is unlikely to scale effectively as an increasingly large cohort of customers moves onto these services, and it must remain an open question as to whether ComcastCares and its peers can survive any move to the mainstream in recognisable form.

It appears, though, that Enterprise engagement in the social sphere changes the game far more significantly than merely enabling a select few twitterati to jump the Customer Support queue, and that this change is worth effort and investment in order to ensure that it does scale. What’s actually happening is that a relationship is being enabled between a brand and those that Seth Godin might recognise as its tribe; a relationship in which interactions are no longer driven predominantly by the desire to seek redress. Rather than only raising those issues serious enough for us to have written letters or endured telephone muzak in the past, we now comment on issues at the periphery of a brand.

Collectively, we’ve moved from simply complaining about the worst failures of companies, their products and their employees, toward emitting an impressive stream of FYIs. Individually insignificant, and possibly unimportant, together these light touches on and around a brand build into an ever-changing and valuable commentary that brands and the corporations they front would do well to take notice of. The minor niggles about an otherwise exemplary service, the human touches that made us smile, the odd inconsistencies in a polished persona; none are enough to make us pick up the phone, but we comment upon them endlessly in Twitter, Facebook, FriendFeed and elsewhere, and by tapping into this fundamentally honest stream of consciousness there is much for those about whom we comment to learn.

Good companies probably already know about fundamental failings in a product long before their customer support operation melts down under the weight of complaints or their quarterly sales targets are seriously under-achieved. Do they have as good a handle on the things we love? Do they have a clue about the minor gripes of customers outside their pre-launch polling groups? Do they know about the gut reaction to a colour, a touch, a smell, or a careless word that persuaded a likely prospect to buy a technically or aesthetically inferior product from the competition instead? All this and more is there for the taking in the stream of online chatter freely directed their way.

Semantic Technologies aren’t often directly associated with the worlds of Marketing and Commerce, yet individuals such as Eric Hillerbrand and Scott Brinker are hard at work to show just what might be possible when the experiences of the Semantic Web are applied to this space. Brands are no longer owned by the companies in whose name they were created. Increasingly, ownership of various forms is being asserted by the multitude of stakeholders with effort and attention invested in the brand. They care about it, they care about what it says about them, and they play a clear role in the brand’s evolution whether its managers want them to or not.

Brands need to engage in this conversation, as we are beginning to see them do, but they also need to discover the means to cost-effectively monitor and engage with a potential flood of third party reaction whilst using the Business Intelligence tools available to them in nimbly shaping public opinion to their advantage wherever possible.

I spoke with Scott Brinker last year, to explore his—then nascent—views on Semantic Marketing, and look forward to hearing his latest thoughts at the Semantic Technology Conference in San Jose in June.

More recently, Eric Hillerbrand talked about some of his ideas with respect to ‘Social Commerce,’ and the ways in which commercial organisations might seek to strengthen and exploit relationships with their customers, aided by a range of semantic technologies.

We’re just beginning to grasp the realities of a world in which tightly controlled and fiercely guarded brand attributes become increasingly permeable. For those companies with the confidence and foresight to loosen their grip, whilst simultaneously exploiting the wealth of data and new opportunities to engage, there is much to be gained. For the dinosaurs that hang on to ‘their’ brand in spite of the world around them, there is everything to lose.

Talking with John Sheridan about e-Government, Open Data and Linked Data

In my latest podcast I talk with John Sheridan, Head of e-Services at the UK Government’s Office of Public Sector Information (OPSI). John is also co-chair of the World Wide Web Consortium’s e-Government Interest Group, and we discuss both roles in the context of current enthusiasm for making Government data more readily available online.

During the conversation, we refer to the following resources;

This conversation was recorded on Wednesday 22 July, 2009.

For other Talis podcasts in this Nodalities series, see here. To subscribe to updates from all of Talis’ podcast series, see here.