Nodalities

From Semantic Web to Web of Data

Focus on Local Government Spending

The UK Government Transparency agenda is encouraging Local Government as well as National Government to publish its data as Open Data and Linked Data, reflecting the world-leading progress that data.gov.uk has made on these fronts over the last year and a bit.

I am sat in the opening session of the Socitm 2010 conference, in sunny Brighton, whilst writing this. Already it is clear that local government spending is a major issue for the sector. In its broad sense, of how much local authorities can [or cannot] spend, it is providing the background for the whole conference. Not doom and gloom here though. IT could be seen as a knight in shining armour to help the public sector deliver better services, was the encouraging thought proffered by Louisa Preston as she launched the day. In its more narrow sense, the requirement to publish data about all local government spending items over £500 from January 2011 onwards, it gives a focused example of the opportunity for a significant change in thinking and practice by the sector.

As Nodalities readers are well aware, Linked Data tools, techniques, and technologies have massive potential to simplify the publishing, linking, aggregating, and making data work across a web of data. It is no coincidence that data.gov.uk is making steady, valuable progress publishing key data sets in Linked Data form in the Talis Platform – it is an obvious step. For many in local government, Linked Data is something they have never met before. For them, the traditionally unnatural step of openly publishing what in the past would have been a private report out of the back of their finance system is a significant step in itself.

It is the responsibility of those of us who understand the benefits of taking the extra step, beyond just publishing a simple csv file, of publishing in Linked Data form to make it easy for all authorities to understand and take the combined step of publishing Linked Data from the start.
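To make that extra step a little more concrete, here is a minimal sketch of turning a spending csv into Linked Data style statements. Everything in it is an assumption for illustration: the column names, the `example.org` URIs, and the sample rows are invented, and the output uses plain literals in N-Triples syntax rather than any vocabulary an authority would actually choose.

```python
import csv
import io

# Hypothetical base URIs; a real authority would mint and publish its own.
BASE = "http://example.org/spending/"
VOCAB = "http://example.org/vocab/"

# Invented sample rows standing in for an export from a finance system.
SAMPLE_CSV = """id,supplier,amount_gbp,date
1001,Acme Paving Ltd,1250.00,2011-01-14
1002,Office Supplies Co,320.50,2011-01-15
1003,Brighton Catering,780.00,2011-01-17
"""


def csv_to_ntriples(text, threshold=500.0):
    """Turn spending rows above the threshold into N-Triples lines.

    Values are written as plain literals with no escaping, so this only
    suits simple values without quotes or backslashes.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        amount = float(row["amount_gbp"])
        if amount <= threshold:
            continue  # only items over the threshold are published
        # Each payment gets its own URI, so others can link to it.
        subject = f"<{BASE}payment/{row['id']}>"
        lines.append(f'{subject} <{VOCAB}supplier> "{row["supplier"]}" .')
        lines.append(f'{subject} <{VOCAB}amountGBP> "{amount:.2f}" .')
        lines.append(f'{subject} <{VOCAB}date> "{row["date"]}" .')
    return lines
```

Calling `csv_to_ntriples(SAMPLE_CSV)` yields three statements for each of the two payments over £500. The point of the sketch is only that each row gains a web address of its own; a production conversion would link suppliers and dates to shared identifiers too.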

To that end, we at Talis recently announced a free stores offer for all UK local authorities to publish their spending data as Linked Data.

Traditionally our approach would be to host a free open day to help those in local government understand Linked Data and the benefits to them. Recognising the broader economic climate, and its influence on local government spending in that broader sense, that doesn’t seem to be a good idea.

Many organisations in the sector, not least Socitm (there is a Linked Data session at the conference today) and the Local Government Group, are looking to promote this approach. We are therefore going to work with the sector to promote this message.

To that end we are to participate in the Open Data strand of the free Local by Social online conference, 3 – 9 November, being hosted by LGID.

As well as checking out what looks to be a quality online event, stay tuned to the Talis initiatives in this area.

Linked Open Data and Pavlova

If Sir Tim Berners-Lee can equate Linked Data with a packet of crisps/potato chips, I thought I would take a stab at another food metaphor for this post.

Linked Open Data (LOD) is a concept that many believe they understand. Take yourself to most any conference that has a connection with data, or the web, or the Internet at the moment, and it will not be long before you see a slide of the Linked Open Data cloud diagram, or of Sir Tim imploring us to give him our raw data now, or if you are very lucky a shot of him doing his imploring whilst stood in front of a shot of the LOD cloud. Simple really, just publish your data as Linked Open Data and all will be wonderful as we move towards the sunlit Semantic Web uplands. Unfortunately life is never that simple – LOD is not a single identifiable thing. As Paul Walk eloquently puts it:

  1. data can be open, while not being linked
  2. data can be linked, while not being open
  3. data which is both open and linked is increasingly viable
  4. the Semantic Web can only function with data which is both open and linked

As with any recipe for success, the majority concentrate on the final result, praising or criticising it as a whole, without identifying the benefits, or otherwise, of the individual ingredients. Take a strawberry pavlova for instance. If you are into that kind of thing, it is a delightful culmination of the culinary arts designed to send your taste buds into raptures. Unless, that is, you don’t like cream, or you don’t like strawberries, or can’t abide meringue, in which case the whole thing seems a little pointless.

What has this got to do with Linked Open Data (LOD), I hear you ask. Well, I am increasingly seeing LOD being presented as the goal for those wishing to publish their data online. My position is that the eventual goal, from which will spring a Semantic Web, is a global web of linked and open data. However, there are many steps from where we are now to achieving that goal. Within audiences that I present to, and/or sit amongst, I see people who for whatever reasons do not ‘get’ one or more of the components of LOD – they cannot envisage opening up any of their data, or think that using a web address for an identifier is over complex, or have a religious aversion to RDF. As a result they dismiss the whole recipe as not for them, or worse still, as something impractical that will become nothing more than the plaything of a few passionate enthusiasts.

When someone who is still struggling with the concept of opening up their organisation’s data, or with why RDF might be a more useful format than csv, is shown the ubiquitous Linked Open Data cloud diagram with encouragement to join in, it is hardly surprising they remain a little unconvinced. This isn’t a criticism of presenters either. In only 20 minutes on a stage, it is difficult to go into underlying detail.

Let me try in a few paragraphs to break the LOD pavlova into its ingredients:

  • Data – In the context of this post, by data I mean machine readable information, produced in a format that can be consumed and processed by other machines. Inevitably, this means file formats such as csv, XML, RDF, etc., but not something like pdf, html, or Word, which, although transferrable, are designed for human consumption, not machine analysis.

    For some, just this step from their current human targeted format, to a machine readable one, is a significant one.

  • Open Data  – Data (see above) which is accessible for all to download, view, and consume in a way that is not encumbered by licensing that restricts its use.  For example, the licensing used by data.gov.uk data.  By definition data which is restricted for certain uses is not fully open.  

    In our internet based world, openness can also be defined in terms of technical accessibility.  If it is only available after a login process, or it is only available to users behind a firewall, it couldn’t be considered as open. 

  • Linked Data – Data (see above) which contains URIs as identifiers for concepts described in the data and URIs to identify the relationships between those concepts.  The four Linked Data Principles, as published as a design note by Tim Berners-Lee, provide a bit more detail on this.

    I am in danger of stirring the embers of a religious fire fight here, between those that believe that Linked Data must be described in RDF and contain URIs as identifiers, and those that maintain that you can have data linked across the web without those constraints.  All I am going to say on that at this time, is that the Linked Open Data cloud of data sets has been successful, based on the first of those two views. (if you want to follow that particular debate in more detail, Paul Miller’s post and associated comments would be a good starting point)
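As a rough aid to keeping those three ingredients apart, the definitions above can be sketched as a small checklist in code. This is only my reading of the definitions in this post, not a formal test: the field names and the `describe` function are invented for illustration, and it deliberately takes the stricter, URI-based view of Linked Data.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset."""
    name: str
    machine_readable: bool     # csv, XML, RDF rather than pdf or Word
    restrictive_licence: bool  # e.g. "non-commercial use only"
    access_restricted: bool    # behind a login or inside a firewall
    uses_uris: bool            # URIs identify concepts and relationships


def describe(ds):
    """Return which of the labels 'data', 'open', 'linked' apply."""
    if not ds.machine_readable:
        return []  # by the definition above, not data at all
    labels = ["data"]
    if not ds.restrictive_licence and not ds.access_restricted:
        labels.append("open")
    if ds.uses_uris:
        labels.append("linked")
    return labels
```

For example, `describe(Dataset("spending.csv", True, False, False, False))` gives `["data", "open"]`, while an RDF store kept behind a firewall comes out as `["data", "linked"]`. Only a dataset that passes every check earns all three labels.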

So, how can data be open, but not linked? – by publishing it in a non-Linked Data form such as a text file or an html page or a pdf. Where would you find this? – all over the web. As encouraged by Sir Tim to give us your raw data now, and as I detailed in my previous “data publishing three-step” post, this is often the first element of getting your data out there for others to consume.

How can data be Linked but not open? – by publishing it in accordance with the principles, in RDF, with URIs, but restricting access either by imposing restrictive licensing conditions or restricting access to the data.  Where would you find this? – again all over the web, but often hiding behind restrictive licensing terms such as “non-commercial use only”.  Also to be found inside organisational firewalls.  For example, commercial organisations can realise the benefits of  using Linked Data techniques with their internal private data.  Potentially linking it to publicly visible concepts across the web to add even more value for their employees.

Data that is Linked and Open, like that strawberry pavlova, has the power to deliver value beyond the sum of its individual ingredients. Providing data in a form that is linked to other data, and easy for others to link to, without restrictions on who does that linking or how, provides the foundation for a web of linked data built on the same principles that fostered the growth of the web of documents that has so changed our world over the last decade and a half.

The ingredients that formed that World Wide Web of documents – html, http, open publishing of web sites without restrictions on others’ ability to consume and/or link to them – individually were important developments. However, when those elements were blended together their effects were multiplied many fold and resulted in the web we experience today.

So [as I stretch my culinary metaphor to its limits] if you are hoping to take people with you in building a Linked Open Data future, you not only have to show them a picture of the final dish, you need to describe the individual ingredients and their relevance to the eventual result.

Pictures from Flickr by PhOtOnQuAnTiQuE and avixyz

Facebook: David Recordon talks with Talis about the Social Graph

We’ve covered the launch of Facebook’s Open Graph protocol in Nodalities Magazine, discussing its potential impact on Linked Data. So, I invited David Recordon—Facebook’s Senior Open Programs Manager—to talk with Talis about Facebook and the Open Graph Protocol. We ended up talking all about the protocol, how developers can make use of it (and why), as well as touching on Facebook’s view of social networking as a graph.

The Open Graph Protocol page has information about the protocol itself. Facebook’s f8 developers’ conference site also has links with more information for developers.

Extending the Semantic Web (from Crete, with love)

This is my first year attending the ESWC (formerly “European Semantic Web Conference” now the “Extended Semantic Web Conference,” cleverly, the acronym still works) near Heraklion on Crete. It’s only a couple of days in, but I thought it’d be a good time to report back to the Nodalities readers. ESWC is a gathering of some of the world’s most influential Semantic Web thinkers, and for me it’s been a few days of meeting people in the flesh with whom I’ve been in touch online for years. As one bloke put it: “What’s kept you away?”

Well, I’m extremely glad I’ve not been kept away this year, and have been excited to see what’s been built recently. ESWC is a very academic conference; indeed I’m quietly auditing the PhD Symposium as I type this. There are papers, PhD symposia, demos and expositions on topics covering anything from ontology development to MapReduce processing of RDF triples. It seems a very fertile seedbed, with many of these ideas having the potential of growing into projects, startups, papers and possibly industries.

I’ve made a subtle and largely subconscious transition by blogging mostly about projects that are up and running. This has been important because the Semantic Web world is no longer one of “someday,” but a world of current and continuous activity. So, I’ve talked about visualisations of data, products running on Linked Data, data.gov.uk et al.; and I’ve held back on discussing the purely possible. It’s been exciting and uplifting to see the conceptual evolve to the proven and working. But this is a reflection of progress, of moving from hypothesis to implementation. It doesn’t mean the concepts have stopped flowing. It’d be a very short story in the history of human communication if the Semantic Web had used up all of its possibilities in ten years!

ESWC is a little microcosm of the wider research going on in Linked Data and related fields. It seems to me that Big Ideas need the traditional frameworks of academic investigation. Questions need to be asked and answered and debated and tried and broken and rebuilt. Much of this science will not become technology, and this is wholly acceptable because it gives the Big Ideas a lot of scope to be refined.

ESWC is just such a place. PhD students and researchers fill the schedule with proposals and reports, and many possibilities are being constantly debated around coffee, beer, and the beach. It’s been a thoroughly fascinating few days, and I’m very much looking forward to more over the next few.

As a quick note, Talis sponsored the Scripting for the Semantic Web challenge for this, its final year. Alexandre Passant and Pablo Mendes won the prize with SPARQLpush.

Open Day Roundups

Well, we’ve had scores of people attend Platform Open Days now. Some have come to the Talis Offices in Birmingham, and others have joined us in Manchester and London. We’ve had a lot of fun, and some fascinating discussions, and I’m very much looking forward to the next one (16th June, in London).

Many people have asked whether the full slides can be found anywhere, so I thought I’d do a quick round-up of the slides, and share them as images on flickr to make it even easier to follow along.

Just follow the links from the images below to a slideshow of the talk.

Here’s the Introduction to Linked Data, covering who Talis is, RDF, and how to Identify, Describe and Respond:

Here’s the Overview of the Talis Platform, explaining our RESTful API, data storage and SPARQL endpoints:

And here are the slides for our introduction to SPARQL—complete with spaceships:

Richard Wallis’ talk about Linked Data in Action can be seen over here, with more details and a dedicated Screencast.

Open Day… Manchester

So, the Platform Open Day Roadshow has now begun. We will be doing our first non-Birmingham day up in Manchester on 14th May. It’ll be at the University of Manchester Visitors’ Centre on the penultimate day of Future Everything.

We like to keep the Days Open, meaning we want you to take what you need from them, so make sure to leave us feedback on what you’d like to learn. As a rough overview, we’ll be covering Linked Data including what it means to make Data into “Linked Data”. There will be an overview of RDF, and a tutorial of SPARQL: the query language of the Semantic Web. We will also show you examples and demonstrations of Linked Data in action—apps, mashups and visualisations built on Linked Data.

The Open Days are free of charge, limited to 30 folk (discussion doesn’t seem to happen with larger groups), and we’re putting on lunch. So, make sure to reserve your place here.

data.gov.uk and the Talis Platform

Earlier this year Gordon Brown appointed Tim Berners-Lee as an advisor to the Cabinet Office to help the government begin the process of opening up its data. This was one part of the initiation of a project to begin opening up UK government data in a similar style to the US. A key part of Berners-Lee’s vision for putting government data online has been Linked Data which promises to provide a much richer way for citizens to begin accessing, browsing, and using government data.

Several other governments have begun opening up data assets including Australia and New Zealand. These approaches mirror that of the US data.gov site, providing a browsable directory of datasets and links to raw data downloads in a range of different formats. The preview launch of data.gov.uk which was announced at the end of September also includes a directory of datasets which is powered by the software underlying the Comprehensive Knowledge Archive Network. But the site also aims to fulfill Berners-Lee’s vision and in addition provide access to some datasets as Linked Data through SPARQL endpoints.

We’re very pleased to report that the Talis Platform is currently underpinning the delivery of all of the Linked Data and SPARQL endpoints for the data.gov.uk site.

We’ve been quietly supporting the effort for several months now, helping out with data management, modelling discussions, and with training on the core technology. There seems to be a very definite appetite in government to not only open the raw data but to also explore the potential for Linked Data. It’s clear from today’s announcement about opening up additional aspects of the Ordnance Survey data that there’s a real focus on delivering on the open data promise. While there are certainly some high-profile datasets like the Ordnance Survey or postcode data that may require legislative changes to become open, one of the biggest implementation challenges facing government is pulling together an overall directory of datasets and spreadsheets that are already scattered across multiple departmental websites.

Creating a dataset directory provides the required basic level of infrastructure to allow reuse, by enabling developers to find what they need; publishing Linked Data, SPARQL endpoints, and potentially extra APIs provides an additional set of options for ways to access the data. By letting datasets be browsable by anyone, not just developers, Linked Data offers the potential for anyone to find, discover and reuse interesting datasets. As I illustrated in a recent talk, these approaches are not mutually exclusive and the goal should be maximum utility.

Over on the Talis Platform developer blog we’ve begun showing some ways that the initial datasets, covering UK schools and traffic measurements, can be queried in interesting ways. It’s been exciting to see people begin to pick up the technology and create reporting tools to explore the data, but also fantastic to be able to easily view data using only a browser.
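For anyone who has not yet seen what querying such a dataset involves, here is a small sketch of how a SPARQL query is packaged as an ordinary web request, with the query text passed in a `query` parameter as the SPARQL Protocol describes. The endpoint address and the `ex:School` and `ex:name` terms are placeholders I have made up; they are not the real data.gov.uk endpoint or vocabulary, and the sketch builds the URL without sending anything.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"

# Placeholder class and property names, not a real schools vocabulary.
QUERY = """
PREFIX ex: <http://example.org/vocab/>
SELECT ?school ?name
WHERE {
  ?school a ex:School ;
          ex:name ?name .
}
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Encode a query as a SPARQL Protocol style GET request URL."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Because the result is just a URL, it can be pasted into a browser, which is part of what makes viewing this kind of data with nothing more than a browser possible.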

There’s clearly still a great deal of work ahead, but the ground work has now been completed: there’s infrastructure in place to support data publishing; official guidelines on creating public sector URIs; and some agreement on best practices for modelling statistical data. The next challenge is to start ramping up the conversion of currently open data into RDF, in order to begin expanding the coverage of the Linked Data.

This is a very exciting project and here at Talis it’s something in which we’re very proud to be playing a role.

Platform Webinars

On 24th September, Talis hosted its first Platform webinar. Ian Davis introduced ProductDB, a Linked-Data view on ProductWiki, and ran through ways to work with the dataset. The next webinar will feature Jeni Tennison who will demonstrate some of her public datasets and how to make use of them. It will take place on Thursday, 22nd October at 1:30PM (GMT) and you can register for free here.

David James talks about Government transparency and the work of Sunlight Labs

In my latest podcast I talk with David James of Sunlight Labs, part of the Sunlight Foundation in Washington, DC.

We discuss the Labs’ work to increase Government transparency by making public sector data such as that disseminated via Data.gov more useful.

During the conversation, we refer to the following resources;

This conversation was recorded on Friday 14 August, 2009.

For other Talis podcasts in this Nodalities series, see here

The Greatest Challenge Facing IT

by Lee Feigenbaum and Mike Cataldo

This article features in Nodalities magazine, Issue 7

As the old adage goes: Time is money.

Ultimately, information systems are about saving time. One could argue that technology enables analysis that facilitates competitive differentiation or improved product quality, but the fact of the matter is that these things and others could all be done without computers; they would just take much, much longer.

A lot has been said and written about information overload. Ultimately, though, the issue with ever-expanding data is that the data we need becomes hidden in mountains of other data. Typically, these mountains take the form of relational databases where the data is neatly stored in rows and columns, and we find the data in one of two ways. Either we directly look up data by its “address” within the database, or else we use a simple text search. But if we don’t know what table or column the data resides in, we can’t look it up. And as the quantity of data grows, text searching the mountain of data itself yields a mountain of results. Combing through these results then compromises the real benefit of information technology: time savings.

This leads to the greatest challenge facing IT organisations across industries: how to provide users the data they need when they need it, visualised in a way that is understandable and useful. Or put more simply: get the right data, for the right people, at the right time. Traditionally, this is much easier said than done, as the data lives in multiple databases, exists in various formats, and no user interface exists to present the information in a way that is helpful to the user.

Typically, the approach to solving these problems involves some sort of data warehouse. Atop the warehouse, we’d probably deploy a business intelligence (BI) solution to surface the answers to common queries to the people who need them.

Another tactic might be to install a document management system that stores documents in a central repository, where employees can use search and basic metadata to better locate individual pieces of information.

Or we might build a portal to allow people to view the right data from multiple silos in a timely fashion. By defining a collection of portlets as views into specific sources of data, we can provide a one-stop location for people to view information from business-critical data sources.

Pursuing any of these typical solutions means spending 6-18 months at a time solving a single problem. And even worse, all of these approaches are doomed to obsolescence from the start. As requirements change, the fixed schemas and the complex ETL processes inherent to data warehouses must be recreated from scratch. The canned queries and views that define BI- and portal-based approaches must be constantly re-evaluated. And the limited search and query capabilities of a document management system mean that new requirements demand a new installation.

In short, traditional approaches all suffer from the dreaded Shampoo Syndrome: the only workable long-term solution is to constantly lather, rinse, and repeat. And when we do, we just create another mountain of data, another place where what we really need can hide.

The solution is to find data by its meaning rather than its location

The key to eliminating many of the inefficiencies of today’s information technology solutions is to access data by its meaning—what it is—rather than its location—where it is. With meaning, we can quickly find what we need simply by describing what it is. This enables information to be shared and consumed at the data level, a paradigm known as data collaboration.
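A toy sketch may help show what finding data by meaning looks like in practice. Here every fact is stored as a (subject, predicate, object) triple, and a lookup describes the thing wanted instead of naming a table and column. The identifiers and the `match` helper are invented for illustration; real systems would use URIs and a query language such as SPARQL.

```python
# Each fact is a (subject, predicate, object) triple. The short names are
# made up for illustration; real data would use full URIs.
TRIPLES = {
    ("order:17", "type", "Order"),
    ("order:17", "customer", "cust:4"),
    ("order:17", "total", "250.00"),
    ("cust:4", "type", "Customer"),
    ("cust:4", "name", "Jane Smith"),
    ("cust:9", "type", "Customer"),
    ("cust:9", "name", "Ali Khan"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        (t_s, t_p, t_o)
        for t_s, t_p, t_o in triples
        if (s is None or t_s == s)
        and (p is None or t_p == p)
        and (o is None or t_o == o)
    )


# "Which things are customers?" asked by description, not by location.
customers = match(TRIPLES, p="type", o="Customer")
print(customers)
```

Nothing in the question says where the customers are stored, which is the contrast with looking data up by its address in a particular database.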

With data collaboration, the data is much more granular, more accessible, and more consumable. In contrast, data warehouse, BI, and portal solutions, in addition to contact tracking (CRM), supply-chain management (SCM), employee management (HR), and all-in-one enterprise bundles (ERP), all fall into the category of data containment. While these applications (commonly known as data silos) excel in capturing extremely structured data, they make it almost impossible to get the data out to be re-used by other users and in other applications.

Document management systems, on the other hand, attempt to make information more shareable, but essentially end up creating many mini-silos in the form of Word documents, PDFs, Excel spreadsheets, or Web pages. This is the world of document collaboration, in which information is readily shared, but the data we need is locked within the mini-silo.

Data collaboration is the best of both worlds. By combining the ease of access to information that is the hallmark of document collaboration with the highly structured nature of data from data containment solutions, we can begin to answer the IT challenge. The key to success is to ensure that the meaning of every data element is surfaced so that it can be easily accessed by any person or application that needs it.

Data Collaboration and the Semantic Web

It’s no coincidence that the technology standards developed over the past ten years in support of Tim Berners-Lee’s vision of a Semantic Web are the key elements for building data collaboration solutions. For as with data collaboration, the Semantic Web relies on explicitly capturing the meaning of data. As such, the core Semantic Web standards pave the way for:

  • Flexible, define-as-it-arrives, data structures
  • Explicit relationships that travel with the data
  • Data that is accessed by its definition rather than its address
  • Distributed query
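The first two points in the list above can be sketched in a few lines. This is an illustration under assumed names, not how any particular product stores its data: two made-up sources describe the same customer with different properties, and combining them needs no agreed table layout.

```python
# Two hypothetical sources describing the same customer differently.
crm = {
    ("cust:4", "name", "Jane Smith"),
    ("cust:4", "region", "North West"),
}
billing = {
    ("cust:4", "name", "Jane Smith"),
    ("cust:4", "creditLimit", "5000"),
}

# Merging is a set union: no shared schema has to be designed first,
# and the duplicated "name" fact collapses into a single statement.
merged = crm | billing

# A property nobody planned for can simply be added when it arrives.
merged.add(("cust:4", "preferredContact", "email"))

properties = sorted(p for s, p, o in merged if s == "cust:4")
print(properties)
# prints ['creditLimit', 'name', 'preferredContact', 'region']
```

Each statement carries its own relationship name, so the facts still make sense after they leave the system that produced them.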

As with all standards, Semantic Web technologies lay the groundwork that makes improvement possible. It is up to application developers to build solutions that make the standards practical.

Practical Data Collaboration to Solve IT’s Challenge

Cambridge Semantics is one of the first companies to develop practical business solution enablers based on Semantic Web standards. In short, the Anzo products allow businesses to layer a semantic fabric over existing data that:

  1. Virtualizes the data so that it is accessible by its description regardless of location.
  2. Lets users create their own views of data.
  3. Fills in the views by traversing the fabric and picking out the relevant information.
  4. Keeps everything in synch by allowing updates that occur anywhere to update information everywhere.

The Right Data…

At the heart of the Anzo suite of products is the Anzo Data Collaboration Server. This acts as a central gateway that provides a consistent interface for applications to read, write, and query RDF data, regardless of the actual source of the data. While RDF provides the flexibility to incorporate new data as it is virtualised, it’s all for naught without the proper adaptors for existing data sources. To facilitate access to the right data, the Anzo Data Collaboration Server can connect to data sources including LDAP directories, HTTP-accessible Linked Data, and standard relational databases.

But perhaps one of the most useful connectors is Cambridge Semantics’ Anzo for Excel. With Anzo for Excel, data inside spreadsheets with arbitrary layouts can be linked into the Anzo Data Collaboration Server. By breaking down the walls of spreadsheet mini-silos, Anzo for Excel weaves information from thousands (or more) spreadsheets scattered across a business, dramatically increasing the availability of the right data.

…For The Right People

Getting the data in front of the right people relies on three things: context, security, and “reach”.

Context. It’s not enough simply to have the right data. People must have access to views of the data that depict exactly what they need to see, whether it be an executive dashboard, a regional summary map, or a customer-by-customer detailed report. Cambridge Semantics’ visualisation product, Anzo on the Web, allows the same information to be rendered in many different ways via semantic lenses. Lenses provide context-appropriate user interfaces to render a particular type of data, meaning that the right people see the right data in the right way.

Security. In many ways, security is the converse of context. While context ensures that the right data surfaces properly to the right people, robust security makes sure data does not surface to the wrong people. The Anzo Data Collaboration Server provides security by layering a role-based access control model atop the semantic fabric. All data access is gated through this security model, which defers to the permissions schemes of legacy data sources where appropriate. The result is that only the right people can ever see (or change) the right data.

Reach. The right data needs to be able to be brought to the right person, whether that person is a technical staff member, a line-of-business manager, a “power user,” or a senior executive. As such, the software must be within reach of all users, without the need to call on IT. Research analysts must be able to collect and share spreadsheet data themselves. Anzo for Excel reaches these users by allowing spreadsheets to be visually linked with just a few clicks. Supply-chain managers must be able to drill through data on warehouses, suppliers, and distributors on their own terms. Anzo on the Web reaches these users via a simple and customisable faceted browsing paradigm, whereby anyone can add their own filters, add their own lenses, query their data however they like, and save the results to re-run later or share with colleagues.

…At The Right Time

Finally, it’s not enough to just bring the right data to the right people. It also needs to be done in a timely fashion.

First, data access against existing data sources is accomplished via federated (distributed) query. SPARQL is explicitly designed to enable queries that access multiple data sources at once, and the Anzo Data Collaboration Server includes a SPARQL engine that does exactly that. By querying the source data directly, Anzo eliminates the cycle time typically associated with a data warehouse’s ETL processes.
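The general idea of a federated query can be sketched without any particular product: the same question is put to each source at query time and the answers are combined, so no copy of the data has to be loaded into a warehouse first. The sketch below is an assumption-laden toy, not Anzo’s engine and not SPARQL itself; the source names, the triples, and the `federated_match` function are all invented.

```python
# Hypothetical stand-ins for separate systems, each holding its own facts.
SOURCES = {
    "ldap": {
        ("emp:1", "name", "Jane Smith"),
        ("emp:1", "dept", "Finance"),
    },
    "sales_db": {
        ("emp:1", "closedDeals", "12"),
        ("emp:2", "closedDeals", "7"),
    },
}


def federated_match(sources, s=None, p=None, o=None):
    """Ask every source the same pattern and combine the answers.

    None acts as a wildcard. Each result records which source it came
    from, since the data stays where it lives.
    """
    results = []
    for source_name, triples in sorted(sources.items()):
        for triple in sorted(triples):
            t_s, t_p, t_o = triple
            if ((s is None or t_s == s)
                    and (p is None or t_p == p)
                    and (o is None or t_o == o)):
                results.append((source_name, triple))
    return results


# Everything known about one employee, wherever it happens to be stored.
about_emp1 = federated_match(SOURCES, s="emp:1")
print(about_emp1)
```

In this toy the directory contributes the name and department and the sales database contributes the deal count, with no ETL step in between.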

Second, data updates performed via the Anzo Server are broadcast out in real-time to anywhere the data resides. This means that if a value is changed in a spreadsheet cell, the value instantly updates anywhere else it appears, including Web pages or within a relational database. This is essential if the many spreadsheets, Web pages, and databases that come to share the same piece of data are to do so with confidence as semantic tools are made available to users across the business enterprise.

Data Collaboration in the Days to Come

Imagine a world in which this challenge has been solved. End users—whether knowledge workers, line of business managers, or executives—can simply draw a picture of what they want to see and then choose the data that should fill in the picture. Within minutes rather than months the right data shows up on the right people’s screens. Now imagine that the data is live as well: you make a correction to the data and your changes are reflected in real-time in whatever legacy database or application the data comes from. You’ve managed to maintain a single source of truth for your key information assets, while still preserving existing investments in legacy systems and applications.

What sounds miraculous is possible today, in software such as Cambridge Semantics’ Anzo. By combining the revolutionary enabling capabilities of Semantic Web standards with solid, practical engineering, we open the door on a completely new paradigm for enterprise software: data collaboration.

Lee Feigenbaum is VP of Technology and Standards at Cambridge Semantics and co-chairs the W3C SPARQL Working Group.

Mike Cataldo is currently CEO of Cambridge Semantics and a veteran of multiple technology start-up companies.
