Nodalities

From Semantic Web to Web of Data

License

Creative Commons License

Doing more with your data

This post will feature in Nodalities Magazine, Issue 5
By Bill Roberts, co-founder of Swirrl

Helping people to do more with their data is what we are trying to achieve with Swirrl.  The ‘data silo’ problem is well-known and significant.  We think we can help to link the silos together.

I spent many years as an IT consultant and, through working with a large number of organisations, I saw how many inefficiencies there are in the sharing and re-use of data.  This wasn’t because they were bad companies: on the contrary, many of them were large, respected, successful and forward-looking.  But this is a hard problem, and the bigger the group and the longer its history, the harder it is.

Also there are trends in the way we work that are highlighting these difficulties: people are more mobile, working from home or travelling; companies change ownership frequently, meaning any large organisation is in a permanent state of trying to restructure and integrate with its latest acquisition;  through partnerships, outsourcing, contractors and consultants the boundaries of companies are more fluid, meaning that sharing information across the firewall is more important.

On the other hand, the world wide web shows the potential of connecting up large quantities  of information.  The “Enterprise 2.0” concept, of bringing ideas from web collaboration and social media into the world of work, has the potential to add enormous value and it’s a movement that seems to be building in momentum.

So what do people want to do with data, and how are we going to help them do it?  They want to use it to answer specific and sometimes complex questions.  Let’s list some of the requirements:

  1. You have to be able to find relevant data.
  2. You have to be able to understand it.
  3. You want to explore and analyse it.
  4. Once you’ve found your answer, it would be nice to make it easier for the next person with a similar question.

The first step to finding data is storing it somewhere with shared access.  Sounds simple: but a huge amount of information in most organisations doesn’t get over this first hurdle.  If you are working in a group where people are in different places or belong to different companies, then the problem is that much greater.  So our first task was to provide an easy-to-use shared environment for storing and accessing information.

There are lots of ways of doing this, but we are big fans of the wiki approach, for its simplicity and because it follows the web-based model that many people are now familiar with, allowing you to build up a network of links between related items. With a hosted wiki like Swirrl, the information is accessible anywhere with an internet connection and a web browser.  You don’t need to worry about installing client software or about client compatibility problems. (OK, as any web developer knows, there are still compatibility issues across different browsers, but it’s a problem we can live with.)

It has to be as easy to use as creating a spreadsheet on your own hard drive, because if it’s significantly more difficult than that, then people will fall back into bad old habits.

OK, so we’ve made it accessible, now what? Well, you need to be able to organise it and search it.  For organisation, we use tagging and of course hyperlinks, which let you create your own index pages for various purposes.

With a web based model of storage, you open up a lot of options for searching.  But, when you are talking about data and facts, rather than text based documents, then you want to be able to run more structured searches and that means having some kind of structure.  That brings us on to our next point: understanding the data.

In an ideal world, the person who created, or knows about or ‘curates’ the data - whom we’ll call the ‘data producer’ - gets together with the person who wants to use it, the ‘data consumer’, and they can talk about it: what the data means, where it came from, what its limitations are, how it relates to a particular business process or scientific activity.  At best, this is a time-consuming way of doing things, and usually it’s not possible at all, because the data producer and data consumer are separated in space or time.  Sometimes the data consumer doesn’t know who the data producer is, or the producer is not contactable because they are too busy to respond, have moved jobs, left the company, retired, died or otherwise made themselves difficult to get hold of.

So you need to write down what the data means.  What we would really like is to combine data from different sources with the minimum amount of effort: for that to happen those sources have to share some kind of overlapping data model.  You need to compare apples with apples, not apples with oranges.  If that data model can be stored in some kind of machine processable way, then that reduces the amount of human effort and opens up all kinds of new analysis possibilities.

This is where the semantic web comes in.  In our early prototypes of Swirrl, we tried various things: XML + XML Schema, and a kind of JSON-style serialised software objects approach, but we finally settled on RDF as the simplest and most flexible way to store data on any topic.  We needed a system of unique identifiers - so that if you and a colleague are talking about the same thing, you call it the same thing. In RDF every resource is identified by a URI (in practice usually a web address), so you have a system for making references to your data at the finest grained level.

We wanted a simple way to represent information: and although the RDF/XML format sometimes does its best to make it seem complicated, representing information as a collection of RDF ‘triples’ is about as simple as you can get, and fundamentally quite intuitive.  You describe something by listing its attributes and the values of those attributes. The graph, in the mathematical sense of nodes joined by edges, is a flexible way of showing how things are related, less restrictive than relational or hierarchical structures.  And RDFS and OWL provide a way of describing types and relationships between types, in a standardised language.
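To make the idea of triples concrete, here is a minimal Python sketch that holds a few statements as plain tuples and then lists the attributes of one resource. It is only an illustration, not how Swirrl stores data: the `EX` namespace, the `Triple` type, the `describe` helper and the sample facts are all made up for this example.

```python
# Sketch only: RDF-style triples as plain Python tuples.
# The namespace, names, and facts below are illustrative assumptions.
from typing import NamedTuple

EX = "http://example.org/wiki/"  # hypothetical namespace


class Triple(NamedTuple):
    subject: str    # the thing being described
    predicate: str  # the attribute (property)
    obj: str        # the value: a literal or another resource's identifier


triples = [
    Triple(EX + "Alice", EX + "worksFor", EX + "Acme"),
    Triple(EX + "Alice", EX + "jobTitle", "Analyst"),
    Triple(EX + "Acme", EX + "basedIn", "Manchester"),
]


def describe(resource, graph):
    """Return the (attribute, value) pairs recorded for one resource."""
    return [(t.predicate, t.obj) for t in graph if t.subject == resource]


alice = describe(EX + "Alice", triples)
```

Note that the value of `worksFor` is itself the identifier of another resource, which is what turns a flat list of statements into a graph.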

The difficult part has been how to make it easy and intuitive for people to enter data in this format. We started from the idea that people like their data in tables: spreadsheets, tables in text documents, or database tables.  And when you have lots of data following a similar pattern, a table is an effective way to present it.  We wanted as far as possible to ‘insulate’ users from the fact that we are using RDF behind the scenes, to make it simple to use the system with the minimum of specialist knowledge.  However, if users want to explore and browse through their ‘semantic graph’ we want to make that possible too.

So we organise our data in data sets.  A data set is a table of data embedded in a web page.  We make the simple link that each row of a table represents a ‘thing’, or ‘resource’ in RDF-speak and each of the rows in a data set represents a thing of the same type.  Each column represents a property of that resource.  Each cell in a table is therefore an RDF statement: the cell holds the value, the row is associated with the subject resource and the column is associated with the property.  The value of a cell can be either a literal value, or a reference to another resource.
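The row, column and cell mapping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Swirrl's actual code: the `NAMESPACE` value, the `table_to_triples` function, the column names and the sample rows are all hypothetical, and literals and resource references are both kept as plain strings for brevity.

```python
# Sketch only: turn table rows into (subject, property, value) statements.
# NAMESPACE, the column names, and the sample rows are illustrative assumptions.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NAMESPACE = "http://example.org/wiki/"  # hypothetical wiki namespace


def table_to_triples(rows, id_column, row_type, resource_columns=()):
    """Each row is one resource, each column one property, each cell one statement."""
    triples = []
    for row in rows:
        subject = NAMESPACE + row[id_column]
        # Every row in a data set describes a thing of the same type.
        triples.append((subject, RDF_TYPE, NAMESPACE + row_type))
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # the id names the row itself; empty cells say nothing
            if column in resource_columns:
                value = NAMESPACE + value  # the cell refers to another resource
            triples.append((subject, NAMESPACE + column, value))
    return triples


rows = [
    {"name": "Alice", "jobTitle": "Analyst", "worksFor": "Acme"},
    {"name": "Bob", "jobTitle": "", "worksFor": "Acme"},
]
triples = table_to_triples(rows, "name", "Person", resource_columns={"worksFor"})
```

Here the `jobTitle` cells become literal values, while the `worksFor` cells become references to another resource, matching the two kinds of cell value mentioned above.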

We also allow properties of linked resources to be included in the row, so a row can in fact be a linearised graph fragment representing multiple resources.  (See Figure 1 and Figure 2, illustrating the same data as a row in a Swirrl data set and as an RDF graph).  When you give a name to a thing or a property, we use that as a unique identifier in a namespace associated with your wiki.  So you can hold information about one thing in lots of different data sets.  To try to encourage a group of users to use consistent naming for things and properties, we use autocompletion when the user is typing such names.  Nonetheless it is inevitable that some items will end up with different names in different places, and we plan to add more tools for marking things as equivalent to each other, and other ways of ‘refactoring’ the data.
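As a rough illustration of naming within a wiki namespace and of autocompletion, the Python sketch below turns a typed name into an identifier and suggests existing names that start with what the user has typed so far. The `WIKI_NAMESPACE` value, the underscore-and-escape naming rule, and the `mint_uri` and `suggest` helpers are assumptions made for this example; the article does not say how Swirrl builds its identifiers.

```python
# Sketch only: derive an identifier from a name and suggest existing names.
# The namespace and the naming rule are illustrative assumptions.
from urllib.parse import quote

WIKI_NAMESPACE = "http://example.org/wiki/"  # hypothetical per-wiki namespace


def mint_uri(name):
    """Build an identifier from a name: trim, use underscores, escape the rest."""
    return WIKI_NAMESPACE + quote(name.strip().replace(" ", "_"))


def suggest(prefix, known_names, limit=5):
    """Case-insensitive prefix match, to nudge users towards existing names."""
    wanted = prefix.casefold()
    matches = sorted(n for n in known_names if n.casefold().startswith(wanted))
    return matches[:limit]


known = {"Acme Ltd", "Accounts", "Bristol"}
uri = mint_uri("Acme Ltd")
options = suggest("ac", known)
```

Because the same name always maps to the same identifier, statements about "Acme Ltd" made in different data sets end up attached to a single resource.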

Our idea is that this should allow the data model behind the data to be as simple or complex as is required and to grow and evolve as people add and work with their data.  The power of RDF comes with the ease with which you can extend the data model: if you need to model the entire world up front, then you’ll never start.

The end result of this should be a collection of data sets that are interlinked, allowing you to explore the data from different perspectives and to run queries across it.  It also provides a sound base for exporting the data to other systems or publishing it: we’re currently working on Linked Data compliant publishing of RDF data, with a more comprehensive API to follow.
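To show what a query across interlinked data sets might look like, here is a small Python sketch that pools the statements from two data sets and answers a question neither could answer alone. It is not Swirrl's query mechanism: the `match` helper, the `ex:` shorthand names and the sample facts are all invented for illustration.

```python
# Sketch only: pattern matching over statements pooled from two data sets.
# All names and facts are illustrative assumptions.


def match(graph, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


people = [
    ("ex:Alice", "ex:worksFor", "ex:Acme"),
    ("ex:Bob", "ex:worksFor", "ex:Globex"),
]
companies = [
    ("ex:Acme", "ex:basedIn", "Manchester"),
    ("ex:Globex", "ex:basedIn", "Leeds"),
]

# Shared identifiers let the two data sets be treated as one graph.
graph = people + companies

# Question: who works for a company based in Manchester?
local_companies = {s for s, _, _ in match(graph, p="ex:basedIn", o="Manchester")}
staff = sorted(s for s, _, o in match(graph, p="ex:worksFor") if o in local_companies)
```

The join works only because both data sets use the same identifier for each company, which is why consistent naming matters so much.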

We think the wiki approach has some big advantages for working with data: when it comes to explaining the meaning and context of your data, there’s always a place for good old-fashioned documentation.  We tackle this via regular text-based wiki pages, and by making it easy to link between text and data (in both directions).  This makes it easy for anyone in the group to add to and improve documentation as they work with the data and to make new links. One feature in the pipeline is to add blog-style comments to data sets, allowing a discussion about the data to take place and be recorded, providing additional context for the future.  These should help make a trail that future users can follow to help them understand and work with the data further.

We launched the first version of Swirrl in late September 2008, so we’re still in the very early days.  We’re gathering feedback to learn which aspects of the system are easy to work with and which are not, and to identify which new features are most important for us to add.  There’s a long way to go, but we think we’re heading in an interesting direction.
