Nodalities

From Semantic Web to Web of Data

Archive for the 'Nodalities Magazine' Category

Doing more with your data

This post will feature in Nodalities Magazine, Issue 5
By Bill Roberts, co-founder of Swirrl

Helping people to do more with their data is what we are trying to achieve with Swirrl.  The ‘data silo’ problem is well-known and significant.  We think we can help to link the silos together.

I spent many years as an IT consultant and through working with a large number of organisations I saw how many inefficiencies there are in the sharing and re-use of data. This wasn’t because they were bad companies: on the contrary, many of them were large, respected, successful and forward-looking. But this is a hard problem, and the bigger the group and the longer its history, the harder it is.

Also there are trends in the way we work that are highlighting these difficulties: people are more mobile, working from home or travelling; companies change ownership frequently, meaning any large organisation is in a permanent state of trying to restructure and integrate with its latest acquisition;  through partnerships, outsourcing, contractors and consultants the boundaries of companies are more fluid, meaning that sharing information across the firewall is more important.

On the other hand, the world wide web shows the potential of connecting up large quantities of information. The “Enterprise 2.0” concept, of bringing ideas from web collaboration and social media into the world of work, has the potential to add enormous value and it’s a movement that seems to be building in momentum.

So what do people want to do with data, and how are we going to help them do it?  They want to use it to answer specific and sometimes complex questions.  Let’s list some of the requirements:

  1. You have to be able to find relevant data.
  2. You have to be able to understand it.
  3. You want to explore and analyse it.
  4. Once you’ve found your answer, it would be nice to make it easier for the next person with a similar question.

The first step to finding data is storing it somewhere with shared access.  Sounds simple: but a huge amount of information in most organisations doesn’t get over this first hurdle.  If you are working in a group where people are in different places or belong to different companies, then the problem is that much greater.  So our first task was to provide an easy-to-use shared environment for storing and accessing information.

There are lots of ways of doing this: but we are big fans of the wiki approach, for its simplicity, and because it follows the web-based model that many people are now familiar with, allowing you to build up a network of links between related items. With a hosted wiki like Swirrl, the information is accessible anywhere with an internet connection and a web browser. You don’t need to worry about installing client software, or about client compatibility problems (OK, as any web developer knows, there are still compatibility issues across different browsers, but it’s a problem we can live with).

It has to be as easy to use as creating a spreadsheet on your own hard drive, because if it’s significantly more difficult than that, then people will fall back into bad old habits.

OK, so we’ve made it accessible, now what? Well, you need to be able to organise it and search it.  For organisation, we use tagging and of course hyperlinks, which let you create your own index pages for various purposes.

With a web based model of storage, you open up a lot of options for searching.  But, when you are talking about data and facts, rather than text based documents, then you want to be able to run more structured searches and that means having some kind of structure.  That brings us on to our next point: understanding the data.

In an ideal world, the person who created, or knows about or ‘curates’ the data – who we’ll call the ‘data producer’ – gets together with the person who wants to use it, the ‘data consumer’, and they can talk about it: what the data means, where it came from, what its limitations are, how it relates to a particular business process or scientific activity. At best, this is a time-consuming way of doing things and usually it’s not possible at all, because the data producer and data consumer are separated in space or time. Sometimes the data consumer doesn’t know who the data producer is: or they are not contactable because they are too busy to respond, have moved jobs, left the company, retired, died or otherwise made themselves difficult to get hold of.

So you need to write down what the data means.  What we would really like is to combine data from different sources with the minimum amount of effort: for that to happen those sources have to share some kind of overlapping data model.  You need to compare apples with apples, not apples with oranges.  If that data model can be stored in some kind of machine processable way, then that reduces the amount of human effort and opens up all kinds of new analysis possibilities.

This is where the semantic web comes in. In our early prototypes of Swirrl, we tried various things: XML + XML Schema, and a kind of JSON-style serialised software objects approach, but we finally settled on RDF as the simplest and most flexible way to store data on any topic. We needed a system of unique identifiers, so that if you and a colleague are talking about the same thing, you call it the same thing. In RDF every resource is identified by a URI (in practice, usually a URL), so you have a system for making references to your data at the finest-grained level.

We wanted a simple way to represent information: and although the RDF/XML format sometimes does its best to make it seem complicated, representing information as a collection of RDF ‘triples’ is about as simple as you can get, and fundamentally quite intuitive.  You describe something by listing its attributes and the values of those attributes. The graph, in the mathematical sense of nodes joined by edges, is a flexible way of showing how things are related, less restrictive than relational or hierarchical structures.  And RDFS and OWL provide a way of describing types and relationships between types, in a standardised language.
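To make the idea of triples concrete, here is a minimal sketch in plain Python. It is an illustration only, not Swirrl's storage format: the `http://example.org/wiki/` namespace, the resource names and the property names are all invented for the example.

```python
# Illustrative only: a tiny in-memory set of (subject, property, value) triples.
# The namespace and every name below are made up for this example.
EX = "http://example.org/wiki/"

TRIPLES = {
    (EX + "Widget_A", EX + "colour", "red"),            # literal value
    (EX + "Widget_A", EX + "weight_kg", 2),             # literal value
    (EX + "Widget_A", EX + "made_by", EX + "Acme"),     # reference to another resource
    (EX + "Acme", EX + "based_in", "Springfield"),
}


def describe(subject, triples):
    """List a resource's attributes and their values as {property: [values]}."""
    description = {}
    for s, p, o in sorted(triples, key=str):
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


def linked_resources(subject, triples):
    """Follow the graph's edges: values of `subject` that are themselves described."""
    subjects = {s for s, _, _ in triples}
    return {o for s, _, o in triples if s == subject and o in subjects}


widget = describe(EX + "Widget_A", TRIPLES)
neighbours = linked_resources(EX + "Widget_A", TRIPLES)
print(widget)
print(neighbours)
```

Describing a thing is just collecting the triples that share its identifier, and the graph appears as soon as a value is itself the subject of other triples.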

The difficult part has been how to make it easy and intuitive for people to enter data in this format. We started from the idea that people like their data in tables: spreadsheets, tables in text documents, or database tables.  And when you have lots of data following a similar pattern, a table is an effective way to present it.  We wanted as far as possible to ‘insulate’ users from the fact that we are using RDF behind the scenes, to make it simple to use the system with the minimum of specialist knowledge.  However, if users want to explore and browse through their ‘semantic graph’ we want to make that possible too.

So we organise our data in data sets.  A data set is a table of data embedded in a web page.  We make the simple link that each row of a table represents a ‘thing’, or ‘resource’ in RDF-speak and each of the rows in a data set represents a thing of the same type.  Each column represents a property of that resource.  Each cell in a table is therefore an RDF statement: the cell holds the value, the row is associated with the subject resource and the column is associated with the property.  The value of a cell can be either a literal value, or a reference to another resource.
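The row/column/cell mapping described above can be sketched as follows. This is a hedged illustration rather than Swirrl's code: the namespace, the `to_uri` and `table_to_triples` helpers, the convention that the first column holds the row's name, and the `link_columns` parameter are all assumptions made for the example.

```python
from urllib.parse import quote

# Hypothetical namespace standing in for "a namespace associated with your wiki".
NAMESPACE = "http://example.org/mywiki/"


def to_uri(name, namespace=NAMESPACE):
    """Turn a human-readable name into an identifier inside the namespace."""
    return namespace + quote(name.strip().replace(" ", "_"))


def table_to_triples(row_type, columns, rows, link_columns=frozenset()):
    """Convert a table to triples.

    Assumptions of this sketch: the first column names the row's resource,
    every row has the same type, and columns listed in `link_columns` hold
    references to other resources rather than literal values.
    """
    triples = []
    for row in rows:
        name, *cells = row
        subject = to_uri(name)
        triples.append((subject, "rdf:type", to_uri(row_type)))
        for column, cell in zip(columns[1:], cells):
            if cell is None or cell == "":
                continue  # an empty cell makes no statement
            value = to_uri(cell) if column in link_columns else cell
            triples.append((subject, to_uri(column), value))
    return triples


columns = ["Name", "Colour", "Made by"]
rows = [
    ["Widget A", "red", "Acme"],
    ["Widget B", "blue", ""],
]
triples = table_to_triples("Product", columns, rows, link_columns={"Made by"})
for triple in triples:
    print(triple)
```

Each non-empty cell yields exactly one statement, so the table remains the user-facing view while the triples accumulate behind the scenes.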

We also allow properties of linked resources to be included in the row, so a row can in fact be a linearised graph fragment representing multiple resources.  (See Figure 1 and Figure 2, illustrating the same data as a row in a Swirrl data set and as an RDF graph).  When you give a name to a thing or a property, we use that as a unique identifier in a namespace associated with your wiki.  So you can hold information about one thing in lots of different data sets.  To try to encourage a group of users to use consistent naming for things and properties, we use autocompletion when the user is typing such names.  Nonetheless it is inevitable that some items will end up with different names in different places, and we plan to add more tools for marking things as equivalent to each other, and other ways of ‘refactoring’ the data.
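As a rough sketch of how autocompletion and 'marking things as equivalent' might work, consider the snippet below. The function names, the `equivalents` mapping and the sample names are hypothetical and chosen only for illustration; nothing here is taken from Swirrl itself.

```python
def suggest(prefix, known_names, limit=5):
    """Case-insensitive prefix autocompletion over names already in use."""
    folded = prefix.casefold()
    matches = [
        name
        for name in sorted(known_names, key=str.casefold)
        if name.casefold().startswith(folded)
    ]
    return matches[:limit]


def canonical(name, equivalents):
    """Follow 'is equivalent to' links until a name has no further link.

    `equivalents` maps a duplicate name to the name it should be merged into.
    The `seen` set guards against accidental cycles in that mapping.
    """
    seen = set()
    while name in equivalents and name not in seen:
        seen.add(name)
        name = equivalents[name]
    return name


known = {"Acme Ltd", "Acme Limited", "Apex", "Widget A"}
suggestions = suggest("ac", known)

# Two users ended up with different names for the same company.
equivalents = {"Acme Limited": "Acme Ltd"}
merged = canonical("Acme Limited", equivalents)
print(suggestions, merged)
```

Autocompletion nudges people towards existing names before a duplicate is created; the equivalence mapping is the fallback for duplicates that slip through anyway.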

Our idea is that this should allow the data model behind the data to be as simple or complex as is required and to grow and evolve as people add and work with their data.  The power of RDF comes with the ease with which you can extend the data model: if you need to model the entire world up front, then you’ll never start.

The end result of this should be a collection of data sets that are interlinked, allowing you to explore the data from different perspectives and to run queries across it.  It also provides a sound base for exporting the data to other systems or publishing it: we’re currently working on Linked Data compliant publishing of RDF data, with a more comprehensive API to follow.
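A query across interlinked data sets can be sketched with simple pattern matching over triples. This is a toy stand-in for a real query language such as SPARQL, and the two data sets, the property names and the helper functions are invented for the example.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Two hypothetical data sets that share identifiers ("Acme", "Bolt").
products = [
    ("Widget_A", "made_by", "Acme"),
    ("Widget_B", "made_by", "Bolt"),
    ("Widget_C", "made_by", "Acme"),
]
suppliers = [
    ("Acme", "based_in", "Leeds"),
    ("Bolt", "based_in", "York"),
]


def products_from(city, triples):
    """Join the two data sets: products whose maker is based in `city`."""
    makers = {s for s, _, _ in match(triples, p="based_in", o=city)}
    return sorted(s for s, _, o in match(triples, p="made_by") if o in makers)


combined = products + suppliers
leeds_products = products_from("Leeds", combined)
print(leeds_products)
```

The join only works because both data sets use the same identifier for each supplier, which is the practical payoff of consistent naming.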

We think the wiki approach has some big advantages for working with data: when it comes to explaining the meaning and context of your data, there’s always a place for good old-fashioned documentation. We tackle this via regular text-based wiki pages, and by making it easy to link between text and data (in both directions). This makes it easy for anyone in the group to add to and improve documentation as they work with the data and to make new links. One feature in the pipeline is blog-style comments on data sets, allowing a discussion about the data to take place and be recorded as context for the future. Together these should leave a trail that future users can follow to help them understand and work with the data further.

We launched the first version of Swirrl in late September 2008, so we’re still in the very early days. We’re gathering feedback to find out which aspects of the system are easy to work with and which are not, and to identify which new features are most important for us to add. There’s a long way to go, but we think we’re heading in an interesting direction.

Getting Connected

This post will feature in Nodalities Magazine, Issue 5

Web 2.0, social networking, cloud computing, SaaS, PaaS, Web 3.0, the Semantic Web, Smart Phones, 3G, wifi, convergence… the list of buzzwords or memes goes on, meme being the buzzword for buzzwords.

There is nothing new in a long list of industry buzzwords. However, I think this list is different. It comprises a set of ideas, each of which is huge and transformative in its own right. The fact that they are all happening more or less at once and are all interconnected should give us serious pause for thought.

Maybe they are better considered as symptoms of some deeper, more fundamental change. It is tempting to focus in on a single symptom and try to understand what that will mean for the future—perhaps even take a risk and build a new business around it. But to focus on a single aspect is to miss the bigger picture. The interaction of several different trends tends to produce serious game-changing disruption. In this climate, it is dangerous to become myopic.

Here is how I would describe the fundamental shift:

“Everything is getting connected.”

Obvious? Just to be sure, let me put it another way: EVERYTHING IS GETTING CONNECTED! And I mean everything. I don’t mean every blog, every piece of software, every web page, every database—those are just pieces of which software people think everything is made. I mean everything in the world outside the computer screen.

Since the birth of the computer we have begun to build open, generalised infrastructures. The PC is an open and generalised infrastructure for digitising, processing and materialising data. We use the keyboard to digitise text, a mouse to digitise a set of hand gestures, monitors and printers to turn the data back into physical reality; and software organises all of these processes. After all, software is nothing more than a set of instructions which affects data. But the key word here is generalised. We have built machines for thousands of years but they have tended to address specific needs. The PC is a generalised infrastructure for interacting with digital representations. We might use it to manage content such as pictures, music or video. We might use it to write a novel or a business plan. We might use it to organise a supply chain between people and organisations, track financial information, and assess and analyse inventories.

A generalised infrastructure can reduce or eliminate huge costs involved in getting a job done: factoring out some fixed costs and affecting the residual marginal costs of the project. Another way of saying this is that generalised, open infrastructures have huge spill-over effects. If I buy a computer equipped with MS Office in order to organise my personal accounts, my accounts have cost me maybe £1,000. But, of course, I can now word-process a business plan at a marginal rate (i.e. my time). I can also play a game, listen to music and surf the web. That £1,000 actually buys me a generalised piece of infrastructure for a huge range of tasks and functions. I’ll leave further discussion of the economics of the spill-over effects created by generalised infrastructures for another time.

Due to the complex nature of these infrastructures, they work much better together when we can agree on some standards. MS Windows and Intel formed a de facto standard which allowed hardware and software to work well together. This partnership has factored out huge complexity by delivering a set of software instructions and processing power to an end user which enables them to manage their data and content. You may argue how much better the world would have been if this standard had been open rather than proprietary, but the point is that the use is generalised and part of the user’s infrastructure.

The internet was the birth of an open, generalised infrastructure for connecting computers. Following this, standards have made these networks work much better and the World Wide Web has provided a set of open standards which made the job of connecting human-readable documents much easier. So, the web has provided a generalised infrastructure for connecting documents.

Yet the web isn’t limited to connecting human-readable documents. Although it may have come to be thought of as an extension of the PC, it is actually a generalised infrastructure for connecting data: html, mp3, streaming video, xml, rdf—anything in fact. To date, it has mostly been used for html and media content but that is changing rapidly. Connected data is the next logical step and with that we must think of devices and standards.

Take a look at the list of buzzwords again. We are in the process of building a generalised infrastructure for connecting anything to everything. Wifi, 3G and Bluetooth allow any electronic device to join the conversation. Smart phones ensure the human being is always connected. Think about all the digital devices in your life: I expect most of them are currently disconnected. They have to solve all the problems themselves: user interaction achieved through some obscure buttons and a tiny display with odd symbols. They are conceived to be isolated.

But wouldn’t it be much better to program your central heating timer with a nice iPhone app that can react to the fact you have left the office? 10 years ago, it would have been impossibly expensive for a heating manufacturer to build a proprietary system allowing customers remotely to programme and adjust their heating. Now, the generalised infrastructure to connect anything is being built, and the huge fixed-cost barrier is being removed. Adding a wifi connection to the central heating controller and exposing the sensor and input data for third party control is already economically doable.

To illustrate this further, imagine how much more valuable it would be for a mass manufacturer to be a bit more connected with their customers. Why, for example, isn’t there a big red help button on my washing machine I can press to talk directly to customer support? The washing machine would know its own model number and any error codes it may be displaying and how to contact help. This morning, I was looking at my washing machine and wondering how to control the temperature. You have to select a specific programme, and each has a certain temperature; but it also boasts a separate temperature dial. Does this override the temperature of the programme? Does it add this temperature to the programme, and I’ll end up with washing soup? Why can’t I literally press a button on the washing machine and immediately ask someone that question and get an answer?

For a products company, that kind of intimate connection with the actual users of its products could be very valuable. For a start, after getting the same question from many users, it would undoubtedly redesign the temperature control. If the system malfunctioned, the customer service person could give specific advice for an error code, saving time and reducing complaints and dissatisfied customers. This kind of direct relationship has been impossible up to now. With a generalised infrastructure for connecting everything, however, it becomes practicable.

It would cost very little to put a wifi connection on a washing machine and a little Skype-like piece of software which also relays machine status data. I am sure the question is answered in the printed manual, but I dutifully lost that 5 minutes after opening the box. Further, I would not want to go through the hassle of finding the customer services number, finding the model number then reading all that out to the service staff when the machine itself should know all of that. The difference between the effort of simply pressing a button and having to find and relay all that information is functionally massive.
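As a hedged sketch of what the machine might relay when the help button is pressed, the snippet below packages the details the machine already knows into a single message. Every field name, the model number and the error code are hypothetical; no real appliance protocol is implied.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HelpRequest:
    """What the machine already knows, so the customer need not read it out."""
    model_number: str
    selected_programme: str
    temperature_dial_c: int
    error_codes: list = field(default_factory=list)


def build_help_message(request):
    """Serialise the machine's status as JSON, ready to send alongside the call."""
    return json.dumps(asdict(request), sort_keys=True)


# Hypothetical values for illustration only.
request = HelpRequest(
    model_number="WM-0000",
    selected_programme="cottons",
    temperature_dial_c=40,
    error_codes=["E00"],
)
message = build_help_message(request)
print(message)
```

The point is the difference in effort: the customer presses one button, and the support agent starts the conversation already holding the model number, the current settings and any error codes.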

When you think about it, data and devices are everywhere. But they are not connected and they are dumb: they don’t know each other; they don’t know me; and many can’t even recognise themselves.

Almost by definition, there is vastly more localised personal data in the world than generally useful data. Wikipedia is generally useful, and it’s helpful to be able to access a postcode-to-location database. But when I think about the data that is really important to me, it is practically all localised to me.

I would dearly like to have a record of my blood pressure and heart rate from the past year. I would share it with my doctor and maybe my pharmacy. But as useful as it would be, there is no way I would spend the effort of taking my blood pressure and writing it down to put in a computer. However, I do take my blood pressure at home with a digital gauge. It is a device that knows something about me, but it isn’t connected. My blood pressure, the temperature of my house, the location of my children, miles before a service on my car, the error code on my washing machine, the channels I watch on television, all the local restaurants I have been to in the last year—these are all more valuable to me, my family or my friends than they are generally useful to the world. Just think of the number of people in the world times the personal data relevant to them. I would hazard a guess that it is vastly larger than the generic data in the world.

You can see this effect on Facebook, if you look at most of what is “published”. It is easy to dismiss Facebook as just being full of useless rubbish. If you do, I expect you have fallen into the trap of thinking that just because something is “published” it is meant for you. In fact, most social network content is not a publication, but a conversation that happens to have been digitised. It is intended for, and meaningful to, only a few. It is the same with data. Most data about what I, my family and friends are doing, the things we have, the places we go and the things we need are localised and personal. They are relevant to the few people, organisations and companies I interact with and I want to choose who gets to know what about me.

Up to this point, connecting the devices and data closest to us has been prohibitively expensive, and only a very few people have ever bothered to gather and use this kind of data in a useful way. As the costs of this technology are driven ever further down, it is becoming increasingly feasible for anyone to have access to these bits and pieces of their lives in a form which can interact with and benefit from the infrastructure we build with our devices, houses, cars and companies.

As EVERYTHING gets connected, it is time to get up close and personal.