Nodalities

From Semantic Web to Web of Data

License

Creative Commons License

Archive for the 'Conference Reports' Category

Linked Data Meetup

On Wednesday, I had the privilege to attend the first Linked Data Meetup down in Hammersmith. The day was a storming success, with talks and presentations from all over the Linked Data community: from academia to startups. I think the organisers were slightly overwhelmed, because in the end there were nearly 200 people there, making use of the Talis-sponsored bar well into the evening. Apart from being a good opportunity to catch up with people, this meetup had the feeling of a guild-meet of Linked Data professionals—with lots of different perspectives over similar problems.

The two panel discussions gave the opportunity for quite a range of different views and topics to be covered, and seemed to go well. The first was about Government Data and was chaired by Carol Tullo from the Office of Public Sector Information (OPSI) and included Sir Tim Berners-Lee on a panel of five. The topics covered a swathe of issues with public data, licensing, rights and infrastructure. This panel had a certain gravitas I wasn’t expecting from a semi-formal “meetup”, probably because it was representing the UK’s actual public sector data workers. After much discussion about what it means to “link data” and what counts as “Linked Data”, I was left with the important point from the discussion: there are important and well-placed people currently working to make public data public, and I look forward to the potential benefits this will have.

The second panel covered a topic which has become very important to me, and which is strongly tied up with the first: the Future of Journalism. Although I was unable to hear much of this discussion (there were a fair few of us in that hall!), I certainly found the questions asked of the panel particularly acute. There was a particular emphasis on advertising and the future of revenue for news media in an online world. From this panel, I took the view that Journalists report on the public happenings of their nations and worlds, and often what they’re working with is made available by the very institutions “making the news”. So, the work on public data has a strong bearing on journalism and on citizens’ collective knowledge of what’s going on in their worlds. Paul Bradshaw, who chaired this panel, published his notes from the session, which will give a good overview of the topics there!

I won’t report on every talk that happened here, though the programme is still available on the Meetup site, and if anyone has any links to slides or photos they’d like to share, just ping them in the comments. I had a great time, and I left feeling hugely excited by many of the projects and trends discussed there.

Linked Data and News Innovation

Whilst attending the recent NewsInnovation event I gave a lightning talk about Linked Data. The talk was preceded by an introduction to the Guardian Open Platform which reviewed their content and data publishing system, and some of their plans for future development. This set the scene really well as I argued that Linked Data was a natural extension of what the Guardian are doing, and in my half of the session gave a quick overview of Linked Data and its relevance for driving innovation around news reporting. The session was really successful: we had a 25 minute slot and ended up having an interesting discussion about Linked Data, trust, provenance and related issues that ran on for a whole hour. I’m really pleased with how well it went, especially as I only put the slides together on the way to the event!

My short deck of slides is now up on Slideshare, and in the rest of this blog post I’ll briefly summarise the talk.

I opened by speaking about the fundamental idea behind Linked Data: that data be put online, in a very fine-grained way. This takes us beyond having stable links for datasets or just articles, and yields web identifiers for the Who, Why, What, Where and When of the content: every person; place; category; and event can each be identified, annotated and ultimately linked together into a navigable whole. RDF, as the core technology for Linked Data, is very simple to get to grips with, with the notion of resources and their connections being something that anyone can intuitively grasp in a few minutes.
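To make the "resources and their connections" idea concrete, here is a minimal sketch in Python rather than in any real RDF toolkit. Each fact is a (subject, predicate, object) triple, and finding everything about a person is just a pattern match over those triples. The `example.org` URIs and the `mentions`, `locatedIn` and `name` properties are made up for illustration and were not part of the talk.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# All URIs below are invented examples, not real identifiers.
EX = "http://example.org/"

triples = {
    (EX + "story/42", EX + "mentions", EX + "person/alice"),
    (EX + "story/42", EX + "locatedIn", EX + "place/london"),
    (EX + "story/43", EX + "mentions", EX + "person/alice"),
    (EX + "person/alice", EX + "name", "Alice Example"),
}

def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

def stories_about(resource):
    """Follow 'mentions' links backwards to find every story about a resource."""
    return [s for (s, _, _) in match(p=EX + "mentions", o=resource)]

print(stories_about(EX + "person/alice"))
```

Because the person, the place and the stories all have their own identifiers, the same `match` function answers "what does this story mention?" and "which stories mention this person?" without any extra modelling.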

Readers of this blog will already be aware of the success of the Linked Data movement, and a large and growing amount of data is available for people to use and re-use in their applications. Quality varies considerably across the Linked Data web, but ultimately this is the nature of any web based system. With the growing engagement from organizations like the BBC, Library of Congress, and the New York Times, the availability of good quality data is only going to increase.

So in what way is Linked Data useful for driving increasing innovation and change in the way that news is created, reported and accessed?

Well there are some obvious answers around providing new ways to search and discover relevant content, e.g. everything about a specific individual or place. But there are two specific areas where I think Linked Data is important to driving innovation around news. The first is context, the second provenance.

Using Linked Data we can take a mesh of inter-related facts and figures and wrap it in a narrative that can help others understand that information and its relationships. Trends can be observed and reported on; data can be summarized along with a particular perspective. What’s important about Linked Data is that this contextualisation can happen without losing the association between the narrative and the underlying resources: the Who, What, Why, Where and When. Because those links are preserved, the reader has the ability to drill down into the underlying data in order to inspect that data for themselves. The reader can also find other narratives that draw on the same set of data, discovering extra context and alternate viewpoints much more easily. This creates a rich fabric for navigation between stories and their referents.

The other aspect is Provenance, or more simply: the ability to back-track to the source of some content. If the news were presented as Linked Data then we would be able to explore not just relationships between the content, but also journalists and their affiliations. As readers we’ll be able to gain context not just on the stories, but also on the people that are producing them. Through the ability to drill down into the underlying data, we are presented with the opportunity to confirm conclusions; we can fact check stories for ourselves. The ability to identify and ignore questionable sources, or identify stories that are drawn from inaccurate data or analyses, is something that has previously been very hard to do.
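A rough sketch of what back-tracking to the source could look like, assuming each story or dataset records what it was derived from. The identifiers and the `derived_from` mapping are hypothetical; a real deployment would express these links in RDF rather than a Python dictionary.

```python
# Hypothetical provenance graph: each item lists what it was derived from.
derived_from = {
    "story:budget-analysis": ["dataset:treasury-2009", "story:wire-report"],
    "story:wire-report": ["dataset:press-release"],
    "dataset:treasury-2009": [],
    "dataset:press-release": [],
}

def sources(item, seen=None):
    """Return the original sources of an item (those with no further provenance)."""
    seen = set() if seen is None else seen
    if item in seen:            # already visited: guards against cycles
        return set()
    seen.add(item)
    parents = derived_from.get(item, [])
    if not parents:
        return {item}
    found = set()
    for parent in parents:
        found |= sources(parent, seen)
    return found

print(sorted(sources("story:budget-analysis")))
```

A reader who distrusts one of the returned sources can then discount every story whose trail leads back to it.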

Issues like context, provenance, and trust are all areas that the Linked Data and semantic web community are actively exploring and have been for some time. I don’t see any other approaches that are really addressing that space. There is clearly lots of interesting work happening around helping people tell stories with data, and understand the context of news stories (e.g. journalisted), but these are largely disconnected efforts: Linked Data should provide a framework for connecting all that together. IMO, this is an area where Linked Data can add real value in a number of different ways.

Talis’ Tour

It’s been a busy couple of months for the Semantic Web research community. At the very end of May the European Semantic Web Conference returned to Crete, where the series began in 2004. Now in its sixth year, the conference reflected the vibrancy of the research community in this area, the progress made to date, and the increased emphasis on deployment and uptake of Semantic Web technologies. The latter aspect was noticeable in many parts of the conference, not least in the Semantic Web In Use track, a new addition to the ESWC series, co-chaired by Talis Researcher Tom Heath.

With adoption of Semantic Web technologies and Linked Data principles increasing rapidly, many members of the research community met in late June at Schloss Dagstuhl in Germany for a seminar titled “Semantic Web: Reflections and Future Directions”. Almost ten years since the first Dagstuhl seminar on the Semantic Web, the goal of this event was to learn lessons from the past and map out the research agenda for the next ten years of the field. Again acknowledging the practical aspects of the field, there were lengthy and productive discussions on the topics of hosting and persistence of RDF vocabularies, and the urgent need to examine how Linked Data and the Semantic Web can enhance Human-Computer Interaction; both of which are topics close to our hearts at Talis.

The natural question that arises from exploring the next ten years of research in any field is “who’s going to do all the work?” Fortunately, in early July the Seventh Summer School on Ontological Engineering and the Semantic Web took place in Cercedilla, Spain, part-sponsored by Talis. This annual event, directed by Enrico Motta (The Open University) and Asun Gomez Perez (Univ. Politécnica de Madrid), provides over 50 students from Europe and beyond with lectures, invited talks and group projects in cutting edge areas of the Semantic Web field, supported by a team of leading researchers. In addition to the knowledge gained from this intense week of study, students of the summer school get to network with their peers and build the very community that will drive forward the Semantic Web research agenda over the next ten years.

Amazon Web Services Start-Up Tour

Last week I was at the London leg of the Amazon startup tour. The afternoon began with a short talk from Adam Selipsky, VP of Amazon Web Services, who gave an overview of the origins and principles of AWS and a basic lesson in the utility and economics of cloud computing. Next up was Simone Brunozzi, Technology Evangelist for AWS Europe (http://twitter.com/simon), who got into more depth about the specifics of the more commonly discussed Amazon services (i.e. not Flexible Payment System/Mechanical Turk etc). He noted that there are currently upwards of 400,000 registered developers in the AWS program.

S3

There are currently over 29,000,000,000 objects stored in S3, and the service has seen growth of around 3600% in the past 2 years.
One of the lesser known features of S3 is its automatic scaling. S3 automatically places replicas of each object stored into multiple datacentres for redundancy and fault tolerance. What it also does is automatically increase and decrease the number of distributed replicas in step with demand. So if a particular file suddenly becomes popular, S3 will create more replicas to handle the higher download rate. When that demand subsides, the number of replicas is reduced.

EC2

EC2 is probably the service we make most use of at the moment, mainly for creating test lab environments as and when we need them. I think EC2 is probably the best understood of the AWS services right now, as it provides a resource that most of us are really familiar with already; it just does it really, really well. As a case in point, Simone highlighted Animoto who, using EC2, were able to ramp up the server farm running their slideshow application from 80 to 3500 servers in around 48 hrs following the unexpected success of their Facebook app.

SQS

Most well designed distributed systems employ some kind of queueing as the glue that sticks together loosely coupled component services. SQS was developed by Amazon for precisely this reason, and I often wonder whether we ought to be making more use of this particular service. However, we’re not alone in our hesitancy to embrace SQS: its lack of strictly deterministic behaviour (an SQS queue is not a straightforward FIFO pipe, messages may arrive out of order) seems to be keeping many external developers from using it more (I think that the lack of a standard queuing interface makes people uneasy too as it increases the lock-in to AWS as a provider – a point that was touched on in the Q&A later in the day). My feeling is that this is one of those problems that can be solved by applying a little lateral thinking to the design. The case study detailing the architecture of the GrepTheWeb application built to process data harvested from the Alexa service is a great example of using queues to coordinate a workflow through multiple, independent components.
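One example of the kind of lateral thinking that can cope with out-of-order delivery is to have the producer number its messages and let the consumer hold early arrivals until their turn comes. The sketch below is plain Python and makes no SQS calls; the sequence numbers and the `Resequencer` class are assumptions for illustration, not something described in the talk or the GrepTheWeb case study.

```python
class Resequencer:
    """Release messages in sequence order, whatever order they arrive in."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq   # next sequence number we may release
        self.pending = {}           # seq -> payload, held until its turn

    def receive(self, seq, payload):
        """Accept one message; return the payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []               # duplicate delivery: ignore it
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

# Messages arrive shuffled, with one duplicate.
r = Resequencer()
arrivals = [(1, "b"), (0, "a"), (0, "a"), (3, "d"), (2, "c")]
delivered = []
for seq, payload in arrivals:
    delivered.extend(r.receive(seq, payload))
print(delivered)
```

The alternative is to design each step so that ordering does not matter at all, which is often the simpler option when every message describes an independent unit of work.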

SimpleDB

Maybe it was just me, but I thought that Simone skipped over SimpleDB a little. It’s a shame because SimpleDB feels to me as though it’s the least well understood of the AWS services (possibly due to it being the most recently unveiled), and I’d like to see more exploration of which use cases it’s suited to, what its strengths and limitations are, how (if?) people are actually using it etc.

Futures

Simone closed with a brief view of the AWS roadmap, which in the near future includes more security features, continuation of the internationalisation of services with EC2 joining S3 in Europe, the upcoming Content Delivery Network offering and the suite of management-tools-as-a-service (MTaaS?) slated for rollout early in 2009.

There was something of interest in each of the customer talks, and I was pleasantly surprised by the way that they all presented balanced assessments of the capabilities of the various Amazon services; there certainly didn’t seem to be any pet developers on show. The presentations that I got the most from were the ones from PutPlace founder Joe Drumgoole, Alan Williamson from MediaFed and Tom St.John of Kontexto.

Joe Drumgoole : PutPlace

http://putplace.com is essentially an online backup service, built on AWS. When they started in 2006, their initial business plan included plans to spend $1,000,000 on datacenters and hosting, a plan they ditched in favour of moving to an architecture based on EC2 and S3. They run both application and task server grids, as well as their customer data db on EC2, with just their service monitoring being hosted outside Amazon’s datacenters.

Using EC2 allows them to quickly and easily reproduce their setup both for increasing capacity, and for testing (they currently run 2 grids – one for production and a second for OAT – Operational Acceptance Testing). Joe mentioned that they’ve spent a lot of effort on getting the automation right here, something we’re also doing, and that this enables them to set up a grid in 10 minutes, and tear it down in 30 seconds when they’re done with it.

Some stats on PutPlace:

  • Running on EC2/S3 in production since January 08
  • Backing up ~15000 user files per day
  • Currently spending around $1200 per month on EC2
  • And $500 on S3 – as you’d expect, usage of S3 is increasing constantly (doubling on a monthly basis), but their EC2 usage is largely static
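As a back-of-the-envelope check on those last two figures, the snippet below counts how many months it would take for the S3 bill to overtake a flat EC2 bill. It assumes, purely for illustration, that S3 spend grows in step with usage and that the monthly doubling continues unchanged; neither was claimed in the talk.

```python
def months_until_overtake(s3_cost, ec2_cost, growth=2.0):
    """Count whole months until a growing S3 spend exceeds a flat EC2 spend."""
    months = 0
    while s3_cost <= ec2_cost:
        s3_cost *= growth   # assumed: spend grows by the same factor as usage
        months += 1
    return months

# Figures quoted above: $500/month on S3, $1200/month on EC2.
print(months_until_overtake(500, 1200))
```

Under those assumptions storage becomes the larger cost within a couple of months, which may be part of why cheaper storage for rarely accessed data would be attractive.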

Joe finished off with a wish list for AWS, including a request for more stats, the ability to create EC2 instances in European datacentres (something we’d really like too), and a stable, offline storage service for backups and other data with low frequency access patterns (again, something we’d find very useful too).

Alan Williamson : MediaFed

MediaFed provide federation of premium online media from large publishers, such as The BBC, the Guardian and Le Monde. They also monitor and manage content as well as providing demographics and monetization services (could do with some of those!). Their original architecture was of the traditional variety in that they outsourced the hosting of real, physical hardware to a managed service provider. Rapid growth prompted a move to cloud services, and AWS in particular, when it proved impossible to economically scale on demand within the constraints of their hosting arrangement, where adding capacity meant long lead times of around 10 days.

The MediaFed application is composed of a number of frontend webservers, backend servers for RSS crawling, plus a whole bunch of servers doing things like ad insertion and analysing click through data. All of this is supported by Amazon infrastructure, with all of the processing being carried out on EC2 and S3 used for long term storage of logs, database snapshots etc. What I found most interesting about the MediaFed setup is the way they manage deployment as a single application stack. According to Alan, MediaFed is basically a Java app, running on Linux, which helps mitigate cloud vendor lock-in (a topic that’s rightly getting a lot of airtime just now). In the past, we’ve taken a similar tack with development of the core platform codebase: a single Java deployment that we just squirt onto fairly vanilla Linux boxes, then just start the required bits. It simplifies the deployment considerably, and for us that’s crucial. I have wondered lately though how long this strategy will continue to be viable as we add more services (and therefore code) and as our codebase becomes more modular with better componentisation (through continual refactoring at both the code and design levels). In some of our most recent development, we’ve been using Puppet to manage the deployment of both our code and third-party dependencies like Java to a bunch of machines both internally and running on EC2. So far, this has worked well for us, though it’ll be interesting to see how it develops along with our software.

Another interesting point Alan made was that even though EC2 now comes with a shiny SLA, instances DO go down, and you have to live with it. This calls for some thought when developing your application: handling failures is a core competency for any distributed application, especially one running in someone else’s datacentre. MediaFed’s solution to this is that when an instance falls over, they just spin up another to take its place. However, as services running on one node need to be able to reach services running on other nodes, they make what I thought is a novel use of SimpleDB. SimpleDB acts as a global, highly available service registry: when an instance boots, one of the steps in its automatic configuration is to register itself to a known location in SimpleDB. The lack of services that compete more or less directly with SimpleDB seems to reduce MediaFed’s potential portability somewhat, although thanks to the open APIs, other providers could always implement a compatible service.
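The registry pattern itself is easy to sketch. The version below uses an in-memory dictionary as a stand-in for the shared store and makes no SimpleDB calls. The heartbeat and time-to-live behaviour is my own assumption about how dead instances might be aged out (Alan only described the register-at-boot step), and the service names and addresses are invented.

```python
import time

class ServiceRegistry:
    """In-memory stand-in for a shared, highly available key-value store."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.entries = {}   # (service, instance_id) -> (address, last_seen)

    def register(self, service, instance_id, address, now=None):
        """Called by an instance at boot, and again periodically as a heartbeat."""
        now = time.time() if now is None else now
        self.entries[(service, instance_id)] = (address, now)

    def lookup(self, service, now=None):
        """Return addresses of instances seen within the TTL window."""
        now = time.time() if now is None else now
        return sorted(
            addr for (svc, _), (addr, seen) in self.entries.items()
            if svc == service and now - seen <= self.ttl
        )

registry = ServiceRegistry(ttl_seconds=60)
registry.register("crawler", "i-1", "10.0.0.5:8080", now=0)
registry.register("crawler", "i-2", "10.0.0.6:8080", now=50)
# At t=70 the first instance has missed its heartbeat and drops out.
print(registry.lookup("crawler", now=70))
```

A replacement instance only has to call `register` at boot for its peers to find it on their next lookup.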

Tom St.John : Kontexto

Kontexto is another player in the media analysis space, who aim to provide an on demand media measurement and analysis platform. Essentially, they run a large text collection and analysis infrastructure to provide categorization, storage, search and analytics (think data profiling, topic stats, trends & sentiment analysis etc) services.

By Tom’s admission, he’s ‘not the technical guy’ so his talk focused on the business aspects of cloud computing, especially from the point of view of a start up seeking investment. Tom told us that Kontexto’s cloud based architecture was a big selling point for early stage investors as it reassured them that their money would be spent on developing Kontexto’s USP(s), and not burned up by capital expenditure on ever depreciating hardware. Tom’s talk was the last of the day, so I guess that time was tight, but he did get a chance to list out some of the other things building their service atop AWS has enabled – most of which are particularly pertinent to a startup, but all of which we’ve found relevant ourselves:

  • Experiment and make mistakes without burning money
  • Try out new business models
  • Focus on core software development, not system administration
  • JIT scaling
  • The ability to attack big market opportunities without needing a large capital war chest

The day finished up with a panel involving the previous speakers, followed by a Q&A session with Adam Selipsky and Amazon’s CTO, the legendary Werner Vogels. The Amazon guys were fairly cagey about the AWS roadmap beyond what’s already been published (to be expected, really), but Werner seemed intrigued by a question regarding GPUs on demand for applications with über-high processing requirements (read into that what you will). The bones of the message I think they were trying to put across though was this: the current suite of web services provided by Amazon are the very lowest level blocks that they think are essential to builders of large (and small) scale applications with the “Internet Inside”. They’re the product of building these sorts of applications many many times over, and the current AWS APIs have evolved organically from that process. The implications of this are twofold. Firstly, it’s not a finished work: as Amazon gather more information about what are useful services to provide, their offerings will be refined and expanded over time (so feedback from the user community is essential for the long term success of AWS). Secondly, we can expect higher level services to emerge as their requirements and commonalities gain clarity, something that we’re already starting to see.

SWIG-UK

Tomorrow a group of us are off to visit Bristol for the SWIG-UK meetup that HP Labs are kindly hosting. Leigh is giving a talk on using the Talis Platform to publish data and I am running a lightning talks session which should be fun and, hopefully, informative. This time there is a single track with some top quality content which makes things a lot simpler. It should be a good day with lots of time to meet people and catch up with the vast amount of things going on in the Semantic Web space.

Next Generation Business Intelligence at ISWC 2008

The second pre-conference session I attended at ISWC 2008 was a tutorial session on “Knowledge Representation and Extraction for Business Intelligence”.

I attended the session as I was curious to learn about more applied uses of Semantic Web technology particularly in the financial and business context. In terms of content the tutorial veered wildly from overview material through to some quite detailed looks at linguistic and semantic analysis to extract information from business reports. To that end I’m not going to attempt to summarize the full content of the tutorial but will pick out a few areas of interest.

Some time was spent looking at XBRL, the standard business reporting language which is becoming increasingly adopted around the world as a standard means to publish and share business reports. The initiative, which began in 1999, was extended this year to include a European XBRL consortium. The broad goal of the project is to standardize the means and structure of publishing business financial reports with the goal of making it easy to compare and collate reports for regulatory and other purposes. The current financial crisis was referenced as an illustration of the need for greater transparency in business reporting and is an obvious driver for adoption of the technology.

XBRL draws on many of the same concepts as the Semantic Web, in particular the use of “taxonomies” that can be customized by specific businesses, sectors and regulatory areas, but uses XML technologies like XML Schema rather than RDF. There is growing interest in being able to capture this information using RDF and in mapping XBRL taxonomies into Semantic Web ontologies. For example there has been some early work on an XBRL ontology, as well as some independent exploration and signs that a W3C incubator or interest group might be formed. The speaker at the tutorial also suggested that before long some standard GRDDL connectors would be available to automate the transformation of XBRL documents into RDF.
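To give a feel for what such a transformation involves, here is a minimal sketch that turns a heavily simplified, made-up XML report into subject-predicate-object triples. The `<report>` and `<fact>` elements, their attributes and the `example.org` vocabulary are all invented for illustration: real XBRL instance documents are considerably richer, and GRDDL transformations are normally written in XSLT rather than Python.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up report format; real XBRL instances are far richer.
DOC = """
<report company="http://example.org/company/acme">
  <fact concept="Revenue" period="2008" unit="GBP">1200000</fact>
  <fact concept="NetProfit" period="2008" unit="GBP">150000</fact>
</report>
"""

VOCAB = "http://example.org/vocab/"   # hypothetical vocabulary namespace

def to_triples(xml_text):
    """Turn each <fact> element into a (subject, predicate, object) triple."""
    root = ET.fromstring(xml_text)
    company = root.get("company")
    triples = []
    for fact in root.findall("fact"):
        predicate = VOCAB + fact.get("concept")
        value = f'{fact.text.strip()} {fact.get("unit")} ({fact.get("period")})'
        triples.append((company, predicate, value))
    return triples

for triple in to_triples(DOC):
    print(triple)
```

The interesting work in practice lies in the mapping from taxonomy concepts to ontology terms, which this sketch reduces to simple string concatenation.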

Much of the tutorial was discussion of applied uses of RDF data and ontologies within the context of the Musing Project, an EU funded project exploring “next-generation business intelligence” in the areas of financial risk management, internationalisation and IT operational risk. Some of the applications that have been explored have been collecting company info from a range of multilingual sources; attempting to assess chances of success of a business in a specific region; semi-automated form filling, e.g. for returns; identifying appropriate business partners; and reputation tracking and opinion mining.

Many of the issues faced in the Musing project deal with how to assemble this data with a historical context: while XBRL data may be present for current or recent years, text mining is required to extract this data from historical reports. The last part of the tutorial was a general introduction to Information Extraction using the Gate toolkit (this starts from around Slide 75 in the Powerpoint slides). This was a good overview of the capabilities of the toolkit and showed some nice use cases. OpenCalais certainly isn’t the only game in town and, while Gate requires more effort to set up, it looks like it could provide a great deal more customisation options for businesses that really need the extra power.

One of the telling things about the overall process was the need to collate useful data from a number of different sources in order to drive the information extraction process. In order to do Named Entity Extraction a good set of reference material is required, e.g. Gazetteers for place names, or lists of people’s names. While much of this data is already available (in Musing they drew on Wikipedia and the CIA World Factbook, for example), a lot more information was either available only by crawling the web or from commercial resources. This suggests to me that there’s still some groundwork to be done in unlocking more data sets that can help drive the business intelligence use cases. There’s essentially a domino effect here: exposing often small, focused datasets can end up unlocking huge potential value further down the line.
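The dependence on reference data is easy to see in even the simplest gazetteer-based tagger. The sketch below is plain Python and has nothing to do with Gate’s actual API; the two tiny gazetteers are hand-made stand-ins for the kind of lists that would come from the sources mentioned above.

```python
import re

# Tiny hand-made gazetteers; real ones would be loaded from reference datasets.
GAZETTEERS = {
    "PLACE": {"london", "karlsruhe", "crete"},
    "ORG": {"talis", "hp labs"},
}

def tag_entities(text, max_len=2):
    """Return (phrase, label) pairs for gazetteer entries found in the text."""
    tokens = re.findall(r"\w+", text)
    found = []
    i = 0
    while i < len(tokens):
        # Prefer the longest phrase starting at this token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            label = next(
                (lab for lab, names in GAZETTEERS.items() if phrase.lower() in names),
                None,
            )
            if label:
                found.append((phrase, label))
                i += n
                break
        else:
            i += 1   # no entry starts here; move on one token
    return found

print(tag_entities("HP Labs hosted a meetup near London."))
```

The tagger can only ever recognise what is in its lists, which is why small, openly published datasets of names and places can have a disproportionate effect further down the line.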

Jim Hendler at the INSEMTIVE 2008 Workshop

Along with a number of my colleagues, I’m currently attending the ISWC 2008 conference in Karlsruhe, Germany. Yesterday I attended the INSEMTIVE workshop (“Incentives for the Semantic Web”) which aimed to explore incentives for the creation of semantic web content, i.e. encourage the creation of more structured metadata. The workshop papers are available to browse online or you can download the complete proceedings. There was a real mix of papers, covering specific issues such as extraction of semantics from tagging, and identifying information needs of a community by analysing search patterns, through to position papers that attempted to highlight shortcomings in current semantic web applications that deter people from creating metadata.

I found the position papers most interesting if only because they provided confirmation of something that I’ve been thinking for a while now: that people will (and do) create metadata when there are obvious and immediate benefits in them doing so. No-one really consciously sits down to share or create metadata: they sit down to do a specific task and metadata drops out as a side-effect. For me this makes much of the problem highlighted by the workshop one of interaction design: how do we build good task-oriented user interfaces that encourage the creation of semantic web metadata, and how can we illustrate the benefits of semantic web technologies in an incremental fashion? In my opinion solving this will require close collaboration between semantic web researchers and developers, and interaction designers.

The end of the workshop was a discussion session chaired by Jim Hendler. Hendler chose to do a retrospective of some older presentations to explore how thinking has evolved (or not!) with respect to drivers towards the development of the semantic web.

Starting in 1999, Hendler showed some slides from DAML strategy talks that emphasised the need for a number of different areas to align before a real marketplace can be created for semantic web content and applications. These areas were tools, users, and languages (e.g. OWL, etc). Hendler noted that the Semantic Web community had mistakenly focused too heavily on languages and not enough on the other areas. He also thought that “Web 2.0” had focused primarily on the users, to a lesser extent on the tools, and very little on the language aspects. Hendler thought that this alignment was now taking place.

Moving forward in time to show some slides from 2001-2002, Hendler introduced the idea that the development of the web itself will “force” the evolution of the semantic web, i.e. that internal pressures, such as the need to better manage and extract value from the massive amounts of online information, will require the semantic web to solve specific problems. Hendler observed that the web has demonstrated that people will do more work to share information with others than they will do to help themselves; i.e. people are lazy. When people want to, need to, or are rewarded for sharing information and content then they will work much harder than they would do to manage and organize information purely for their own uses. Hendler noted that there is a tendency to say “we’ll solve the data creation problem at the individual level, as solving it at a group level is harder to manage”, but a look at web history illustrates that the opposite is in fact the case.

Hendler also shared what he thought was the best piece of advice he’d been given by Tim Berners-Lee: start small but viral and you can change many things. Hendler’s slides characterized this as: “My friend sees it, wants one; My competitor sees it, needs one”.

Looking at slides from 2002, Hendler introduced the “Value proposition” supporting the creation of semantic web data & content, i.e. that there has to be some immediate return on the investment in creating metadata.

Hendler finished his retrospective with a slide from a 2008 talk that showed the range of commercial companies, government projects and vertical sectors that were now heavily engaged in the Semantic Web (I was happy to see Talis mentioned in the list!). In Hendler’s opinion there is a growing excitement that the “next big thing” is going to come from the Semantic Web; not a “Google Killer”, but the next big revolutionary idea or service. The incentive here is the obvious one: money.

Hendler noted that there is a huge amount of data out there and that finding anything in the mess can be a win. So even a little semantics can make a difference here and could provide some competitive advantages. We don’t need perfect answers or solutions, just incremental improvements on what we have now.

I was also happy to see Hendler encourage researchers to “compete in the real world”, noting that they have to work within the context of a real world that is moving very fast, that they can’t really compete with the resources of commercial firms in creating semantic web applications and demonstrators and should instead try to work within that context to demonstrate real value from the technology. Hendler encouraged them to focus on issues of scalability. Does the fundamental technology scale? Do the concepts and ideas scale to a real user base? As an illustration Hendler noted that he was working with a number of companies that were using some simple OWL constructs in order to add semantics to applications, but that none of them were using a formal reasoner, just “little pieces of procedural code that scale really well”.
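As a guess at what such a little piece of procedural code might look like (Hendler gave no examples, so this is purely illustrative), the sketch below handles one simple construct, a class hierarchy, by walking subclass links directly instead of invoking a reasoner. The class names are hypothetical.

```python
# Hypothetical class hierarchy, as subclass-of style (child, parent) pairs.
SUBCLASS_OF = {
    ("BlogPost", "Article"),
    ("Article", "Document"),
    ("Podcast", "Document"),
}

def superclasses(cls):
    """Return every ancestor of cls by following subclass links transitively."""
    result = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for child, parent in SUBCLASS_OF:
            if child == current and parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result

def is_a(instance_type, cls):
    """True if something typed as instance_type also counts as a cls."""
    return instance_type == cls or cls in superclasses(instance_type)

print(sorted(superclasses("BlogPost")))
print(is_a("Podcast", "Article"))
```

Code like this covers only the one inference it was written for, which is exactly the trade-off being described: far less general than a reasoner, but cheap and predictable at scale.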

Overall, an interesting workshop!

Paul Miller did a podcast with Jim Hendler back in March if you want to hear more about his thoughts on the Semantic Web.

FOWA

I’ve been out of the Talis office for nearly two weeks, the last of which was on holiday without internet connection of any description. The week before that, however, I spent at FOWA (Future Of Web Apps) in London. Aside from being really good fun, the conference had a bunch of brilliant speakers and an interesting range of topics covered. I followed the business track, which focused on startups, entrepreneurial aspects of the tech space, and related areas. (Twitter feed here)

Some highlights from my perspective were talks by Gavin Bell, Simon Wardley, and Ben Huh. I enjoyed the tactical and people-centric approach to the business stream, and liked the fact that there was more than just technical descriptions or how-tos. Many of the speakers brought genuine insight to their fields, and I liked hearing their stories.

There was a very entrepreneurial quality to the event, not only in the speakers and their talks on business plans, but also in the fact that the event organisers were part of that community. Many questions asked inevitably focused on the downturn in the economic climate, and Jason Calacanis and Julie Meyer both spoke at length to these concerns. The Startup community is particularly sensitive to the situation, and I was very much interested in hearing where they’re coming from.

Two slightly other-side points came to mind, though. First, it felt a little like the past of web apps, judging mostly by comments from the development side of the event. There was a lot of talk about stuff we’ve heard about already (though those people who’ve done interesting things make the best news, I guess!). Secondly, I failed to hear the words “Semantic” and “Web” strung together the whole time, except maybe in passing.

I wonder who, among the many speakers and stand-holders, is working on the premise that the future of web applications is semantic?

Whisky, Space Missions and Evidence: What’s the Connection?

No, these aren’t the necessary precursors for a conspiracy theory about the moon landings, but three of the topics touched upon at the first VoCamp, which took place recently in Oxford. VoCamps are events where motivated individuals can come together and spend some dedicated time creating vocabularies/ontologies for describing data on the Web.

You may have heard of these vocabulary things before. Two popular examples that have been around for some time and are in widespread usage are FOAF (as in Friend of a Friend), for describing people and who they know, and SIOC, for describing the contents of ‘social media’ sites such as blogs and discussion fora. But why do we need more vocabularies, and why do we need VoCamps?

We need more vocabularies because people are increasingly motivated to share their data online, and need some way of describing the data itself in a structured fashion. If people use the same vocabularies when describing data of the same type, or at least some of the same terms, it makes sharing and integrating those data sets much easier. For example, imagine you and I both run online shops selling sports equipment, and we want to describe the stock we hold. If we use the same vocabulary to describe that stock data, then anyone wanting to cross-search our two shops will benefit by not having to map my data structure to yours: we’ll have saved them the job by converging on the same vocabulary from day one.
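The sports shop example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `example.org` vocabulary terms, the stock items and the `cross_search` helper are all invented, and a real deployment would publish the data as RDF rather than as Python dictionaries.

```python
# Hypothetical shared vocabulary: these term URIs are invented for illustration.
NAME = "http://example.org/shop-vocab#name"
PRICE = "http://example.org/shop-vocab#price"
CATEGORY = "http://example.org/shop-vocab#category"

# Two independent shops describing their stock with the same terms.
my_shop = [
    {NAME: "Club tennis racket", PRICE: 45.0, CATEGORY: "tennis"},
    {NAME: "Match football", PRICE: 20.0, CATEGORY: "football"},
]
your_shop = [
    {NAME: "Pro tennis racket", PRICE: 120.0, CATEGORY: "tennis"},
]


def cross_search(category, *shops):
    """Return item names in a category across all shops, cheapest first."""
    matches = [
        item
        for shop in shops
        for item in shop
        if item[CATEGORY] == category
    ]
    return [item[NAME] for item in sorted(matches, key=lambda item: item[PRICE])]
```

Because both shops use the same keys, `cross_search("tennis", my_shop, your_shop)` needs no per-shop mapping step. Had each shop invented its own field names, the caller would first have to translate one structure into the other.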

At this point in time there just aren’t enough vocabularies around to describe the wealth of data in the world. Left to their own devices people will simply create ad-hoc vocabularies which do little to aid data sharing. It’s for these reasons that we need VoCamps, where people can put day-to-day distractions to one side and concentrate on creating technically sound vocabularies in domains that interest them, according to some of the best practices in the field.

VoCampOxford2008 was the start of this process. I used the time to work with Ian and others on a vocabulary/ontology for describing Whisky. Leigh created his Space Flight vocabulary, which is not just a flippant bit of fun but a crucial component in his desire to make NASA data more widely accessible and easily archived. Other groups at VoCampOxford2008 worked on vocabularies for describing IRC discussions, evidence, discourse, participation, votes, journeys and scientific data. See this page for more information on the vocabs we created.

Now, while some people would no doubt argue that whisky and space flight constitute the two most important topics around, there’s still some way to go in creating the rich ecosystem of vocabularies required for a Web of data. That’s why the second VoCamp will take place in Galway, Ireland in late November. Anyone interested in getting their hands (metaphorically) dirty and creating some vocabs should register now before the event fills up — it’s free. Given the location I’ll have to spend a little time in Galway refining the Whisky ontology, but no doubt there’ll be plenty of scope for creating vocabularies in other areas. I may even attempt a vocabulary for describing conspiracy theories, but I imagine that no-one would be able to agree on the details!

The Web’s Rich Tapestry

We’ve all read books that linger in our memories. And there are any number of reasons why they might do so: a stirring tale or thought-provoking argument, for example. One book that has stayed with me over the years is House of Leaves by Mark Z. Danielewski. It’s been described as “the Blair Witch” of haunted house tales, being the story of a house, the people who live there, and those who attempt to document the strange events and structure of the building. The book is quite a challenging read as it is made up of overlapping narratives, documentary evidence from the investigators, etc. As a reader you’re assembling a narrative out of the interlocking pieces of text that the author presents you with.

But, while the tale is one of those slow-burning horror stories that does linger at the back of the mind, that’s not the primary reason why the book has stayed with me. It was the actual structure of the text that was so intriguing: the author has played with the printed form, including the basic layout of the print on the page, in an attempt to further promote the mythology of the story and to help convey the labyrinthine nature of the house. For example, a typical page might contain several different blocks of text, and much of the story is told through footnotes, footnotes to footnotes, and footnotes to those footnotes. Certain words are coloured differently throughout the text. There are even blocks of text embedded in the page which you have to read downwards through several pages before returning to your starting point. As a reader you’re physically exploring the text much like the characters are exploring the house.

The book is basically a hypertext novel and while certainly not the first to play with the printed form in this way, it was the first that I’d personally encountered. As a hypertext the book appeals to the technologist in me: I’ve given a number of talks over the past few years and in many of these I’ve explored the evolution of hypertext systems. But I’ve also attempted to challenge people’s pre-conceptions about the medium of the web, just as the House of Leaves challenged my conceptions about the printed medium.

My most recent talk was at the ALPSP International Conference 2008, which took place last week in Old Windsor. The talk, titled “The Web’s Rich Tapestry”, discussed the link as the basic medium of the web and reviewed how the blurring of boundaries between websites, services and data (aka “Web 2.0”) is enabled by increasingly richer linking between resources. This is part of a move from old broadcast models of information publishing to a more web-like network of interconnected peers, each contributing to a dense information medium. The ultimate endpoint of this is inherent in the vision of the Semantic Web, and will complete the change from a document-centric to a data-centric world. The Semantic Web, which is just a layer on top of the existing web, is still based on linking, albeit linking of a more fine-grained and meaningful nature.

The Semantic Web, just like the existing Web, will arrive through the actions of individuals, organizations and businesses, each contributing to the whole by sharing linked data sets; this process is already happening. And, like the Web, the more data is available, the more value there will be for everyone involved. I urged society publishers to begin more openly sharing their metadata and exploring the potential inherent in the Web of Data. I also attempted to do more than just evangelize the potential benefits of the Semantic Web, and tried to provide a few pointers towards where those benefits might be realized.

One obvious benefit relates to the generation of more traffic to content and services. For many publishers a sizeable share, if not the majority, of their website traffic is driven by Google referrals. This is an inherently fragile situation, but one that I believe is ultimately temporary. The scale of this traffic generation is obviously due in major part to the popularity of the Google search engine, but it is enabled by Google’s ability to quickly and efficiently crawl websites in order to index content. This provides a large “surface area” to which Google can generate links. By publishing open data, information providers will be able to grow this surface area by at least an order of magnitude due to the more fine-grained data publishing that the Semantic Web entails. All of this data can potentially generate new, highly relevant traffic to content and services.

The other area in which the Semantic Web will pay off is in enabling much more sophisticated research and analysis tools, not just for academic researchers and students, but also for all of us in our everyday consumption of information. In my view there is too much of a focus on search and not enough on information visualisation and analysis tools. I pointed towards some very recent experiments which I think illustrate some of this potential, including Ubiquity and Freebase Parallax. Talis’s own Project Xiphos is also exploring the innovation that can follow from re-purposing publishing metadata, a topic that was particularly relevant to the ALPSP audience. In my new role as Programme Manager for the Talis Platform, I’m excited to begin exploring how we can start helping businesses to draw value from the rapidly growing Web of Data.