Nodalities

From Semantic Web to Web of Data

License

Creative Commons License

Archive for the 'Tech Talk' Category

Linked Data and News Innovation

Whilst attending the recent NewsInnovation event I gave a lightning talk about Linked Data. The talk was preceded by an introduction to the Guardian Open Platform which reviewed their content and data publishing system, and some of their plans for future development. This set the scene really well as I argued that Linked Data was a natural extension of what the Guardian are doing, and in my half of the session gave a quick overview of Linked Data and its relevance for driving innovation around news reporting. The session was really successful: we had a 25 minute slot and ended up having an interesting discussion about Linked Data, trust, provenance and related issues that ran on for a whole hour. I’m really pleased with how well it went, especially as I only put the slides together on the way to the event!

My short deck of slides is now up on Slideshare, and in the rest of this blog post I’ll briefly summarise the talk.

I opened by speaking about the fundamental idea behind Linked Data: that data be put online in a very fine-grained way. This takes us beyond having stable links for datasets or just articles, and yields web identifiers for the Who, Why, What, Where and When of the content: every person, place, category and event can each be identified, annotated and ultimately linked together into a navigable whole. RDF, as the core technology for Linked Data, is very simple to get to grips with: the notion of resources and their connections is something that anyone can intuitively grasp in a few minutes.
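
To make the "resources and their connections" idea concrete, here is a minimal sketch in plain Python that treats each statement as a (subject, predicate, object) triple. The example.org identifiers and the `mentions` and `locatedIn` properties are made up for illustration; they are not taken from the Guardian's data or from any real vocabulary.

```python
# Hypothetical identifiers, for illustration only.
EX = "http://example.org/"

# Each statement links one resource to another via a named connection.
triples = [
    (EX + "article/1", EX + "mentions", EX + "person/alice"),
    (EX + "article/1", EX + "locatedIn", EX + "place/london"),
    (EX + "article/2", EX + "mentions", EX + "person/alice"),
]


def resources_linked_to(target, statements):
    """Return the subjects of every statement whose object is `target`."""
    return sorted({s for s, _p, o in statements if o == target})


# Everything that points at one person, regardless of which article it is in.
about_alice = resources_linked_to(EX + "person/alice", triples)
print(about_alice)
```

Because the person has her own identifier, finding "everything about this individual" is just a matter of following links rather than searching article text.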

Readers of this blog will already be aware of the success of the Linked Data movement, and a large and growing amount of data is available for people to use and re-use in their applications. Quality varies considerably across the Linked Data web, but ultimately this is the nature of any web based system. With the growing engagement from organizations like the BBC, Library of Congress, and the New York Times, the availability of good quality data is only going to increase.

So in what way is Linked Data useful for driving increasing innovation and change in the way that news is created, reported and accessed?

Well there are some obvious answers around providing new ways to search and discover relevant content, e.g. everything about a specific individual or place. But there are two specific areas where I think Linked Data is important to driving innovation around news. The first is context, the second provenance.

Using Linked Data we can take a mesh of inter-related facts and figures and wrap it in a narrative that can help others understand that information and its relationships. Trends can be observed and reported on; data can be summarized along with a particular perspective. What’s important about Linked Data is that this contextualisation can happen without losing the association between the narrative and the underlying resources: the Who, What, Why, Where and When. Because those links are preserved, the reader has the ability to drill down into the underlying data in order to inspect that data for themselves. The reader can also find other narratives that draw on the same set of data, discovering extra context and alternate viewpoints much more easily. This creates a rich fabric for navigation between stories and their referents.

The other aspect is Provenance, or more simply: the ability to back-track to the source of some content. If the news were presented as Linked Data then we would be able to explore not just relationships between the content, but also journalists and their affiliations. As readers we’ll be able to gain context not just on the stories, but also on the people that are producing them. Through the ability to drill down into the underlying data, we are presented with the opportunity to confirm conclusions; we can fact check stories for ourselves. The ability to identify and ignore questionable sources, or identify stories that are drawn from inaccurate data or analyses, is something that has previously been very hard to do.

Issues like context, provenance, and trust are all areas that the Linked Data and semantic web community are actively exploring and have been for some time. I don’t see any other approaches that are really addressing that space. There is clearly lots of interesting work happening around helping people tell stories with data, and understand the context of news stories (e.g. journalisted), but these are largely disconnected efforts: Linked Data should provide a framework for connecting all that together. IMO, this is an area where Linked Data can add real value in a number of different ways.

Growing the Web of Data with Data Incubator

At Talis we’re huge fans of Linked Data, especially when it’s freely available for reuse too. However, we also realise that not everyone has been smitten by the Linked Data bug yet so we’re always thinking about new ways to help others use, publish and discover the benefits of connecting their data together.

Recently we were wondering how we could help organise the skill and expertise of people who love Linked Data to show data publishers how their data could be even more useful and effective. As the Linking Open Data project has shown, actions speak louder than words so we wanted to do something with practical and visible results.

One problem we face is that until it is available in open and reusable formats it’s not possible to show data owners the power locked up in their own data. Conversely it is hard for the data owner to justify investment in opening up their data without concrete demonstrations of that power. A classic deadlock situation! The goal of our new project is to break this deadlock. We plan to do this by organising people around popular datasets to create mappings to RDF, write conversion code and openly publish the resulting data. The result will be a huge reduction in the investment needed by the data owner: they can simply adapt the work and emit the Linked Data themselves.

We call our new project the Data Incubator and if you love Linked Data then we encourage you to join in and help grow the web of data. Although this project is entirely independent of Talis, we are supporting it through the Talis Connected Commons scheme, providing free hosting and services for public domain data.

Already we have started projects to convert the Open Library dataset including much-loved books such as The Hobbit and to convert journal metadata provided by CrossRef, Highwire and the National Library of Medicine. Many more projects are being incubated and we are discussing how we create a repeatable process for contacting and encouraging data owners to take part.

Join the Data Incubator mailing list and get involved.

Amazon Web Services Start-Up Tour

Last week I was at the London leg of the Amazon startup tour. The afternoon began with a short talk from Adam Selipsky, VP of Amazon Web Services, who gave an overview of the origins and principles of AWS and a basic lesson in the utility and economics of cloud computing. Next up was Simone Brunozzi, Technology Evangelist for AWS Europe (http://twitter.com/simon), who got into more depth about the specifics of the more commonly discussed Amazon services (i.e. not Flexible Payment System/Mechanical Turk etc). He noted that there are currently upwards of 400,000 registered developers in the AWS program.

S3

There are currently over 29,000,000,000 objects stored in S3, and the service has seen growth of around 3600% in the past 2 years.
One of the lesser known features of S3 is its automatic scaling. S3 automatically places replicas of each object stored into multiple datacentres for redundancy and fault tolerance. It also automatically increases and decreases the number of distributed replicas in step with demand. So if a particular file suddenly becomes popular, S3 will create more replicas to handle the higher download rate. When that demand subsides, the number of replicas is reduced.

EC2

EC2 is probably the service we make most use of at the moment, mainly for creating test lab environments as and when we need them. I think EC2 is probably the best understood of the AWS services right now, as it provides a resource that most of us are really familiar with already; it just does it really, really well. As a case in point, Simone highlighted animoto who, using EC2, were able to ramp up the server farm running their slideshow application from 80 to 3500 servers in around 48 hrs following the unexpected success of their Facebook app.

SQS

Most well designed distributed systems employ some kind of queueing as the glue that sticks together loosely coupled component services. SQS was developed by Amazon for precisely this reason, and I often wonder whether we ought to be making more use of this particular service. However, it seems we’re not alone in our hesitancy to embrace SQS: its lack of strictly deterministic behaviour (an SQS queue is not a straightforward FIFO pipe, and messages may arrive out of order) seems to be keeping many external developers from using it more. (I think that the lack of a standard queuing interface makes people uneasy too, as it increases the lock-in to AWS as a provider – a point that was touched on in the Q&A later in the day.) My feeling is that this is one of those problems that can be solved by applying a little lateral thinking to the design. The case study detailing the architecture of the GrepTheWeb application, built to process data harvested from the Alexa service, is a great example of using queues to coordinate a workflow through multiple, independent components.
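
As one illustration of the kind of lateral thinking I mean, here is a minimal sketch of a consumer that copes with out-of-order delivery by having the producer attach a sequence number to each message. This is my own assumption about one possible design, not how SQS or GrepTheWeb work; the `Resequencer` class is hypothetical, it makes no AWS calls, and the handling of repeated messages is an extra safeguard I've added.

```python
import heapq


class Resequencer:
    """Buffer messages that arrive early and release them in sequence order."""

    def __init__(self, first_seq=0):
        self._next = first_seq   # sequence number we are waiting for
        self._pending = []       # min-heap of (seq, body) that arrived early
        self._buffered = set()   # sequence numbers currently in the heap

    def accept(self, seq, body):
        """Take one message; return the bodies that can now be delivered."""
        if seq < self._next or seq in self._buffered:
            return []  # already delivered or already buffered: ignore repeat
        self._buffered.add(seq)
        heapq.heappush(self._pending, (seq, body))
        ready = []
        while self._pending and self._pending[0][0] == self._next:
            _seq, ready_body = heapq.heappop(self._pending)
            self._buffered.discard(_seq)
            ready.append(ready_body)
            self._next += 1
        return ready


# Simulated arrival order: message 1 turns up before message 0,
# and message 0 is delivered twice.
arrivals = [(1, "b"), (0, "a"), (0, "a"), (2, "c")]
resequencer = Resequencer()
delivered = []
for seq, body in arrivals:
    delivered.extend(resequencer.accept(seq, body))
print(delivered)
```

The trade-off is that the consumer has to hold early messages in memory until the gap is filled, so this only suits workflows where ordering really matters.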

SimpleDB

Maybe it was just me, but I thought that Simone skipped over SimpleDB a little. It’s a shame because SimpleDB feels to me as though it’s the least well understood of the AWS services (possibly due to it being the most recently unveiled), and I’d like to see more exploration of which use cases it’s suited to, what its strengths and limitations are, how (if?) people are actually using it etc.

Futures

Simone closed with a brief view of the AWS roadmap, which in the near future includes more security features, continuation of the internationalisation of services with EC2 joining S3 in Europe, the upcoming Content Delivery Network offering and the suite of management-tools-as-a-service (MTaaS ?) slated for rollout early in 2009.

There was something of interest in each of the customer talks, and I was pleasantly surprised by the way that they all presented balanced assessments of the capabilities of the various Amazon services; there certainly didn’t seem to be any pet developers on show. The presentations that I got the most from were the ones from PutPlace founder Joe Drumgoole, Alan Williamson from MediaFed and Tom St.John of Kontexto.

Joe Drumgoole : PutPlace

http://putplace.com is essentially an online backup service, built on AWS. When they started in 2006, their initial business plan included plans to spend $1,000,000 on datacenters and hosting, a plan they ditched in favour of moving to an architecture based on EC2 and S3. They run both application and task server grids, as well as their customer data db on EC2, with just their service monitoring being hosted outside Amazon’s datacenters.

Using EC2 allows them to quickly and easily reproduce their setup both for increasing capacity, and for testing (they currently run 2 grids – one for production and a second for OAT – Operational Acceptance Testing). Joe mentioned that they’ve spent a lot of effort on getting the automation right here, something we’re also doing, and that this enables them to set up a grid in 10 minutes, and tear it down in 30 seconds when they’re done with it.

Some stats on PutPlace:

  • Running on EC2/S3 in production since January 08
  • Backing up ~15000 user files per day
  • Currently spending around $1200 per month on EC2
  • And $500 on S3 – as you’d expect, usage of S3 is increasing constantly (doubling on a monthly basis), but their EC2 usage is largely static
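
As a rough back-of-the-envelope sketch of what those last two figures imply, the snippet below projects when the S3 bill would overtake the EC2 bill. It assumes that the $500 is a monthly figure, that S3 cost doubles in step with usage, and that EC2 spend stays flat; none of that was stated by Joe, so treat it as arithmetic on my assumptions rather than a forecast.

```python
EC2_MONTHLY = 1200  # dollars per month, from the stats above
S3_MONTHLY = 500    # dollars, assumed here to be per month


def months_until_s3_exceeds_ec2(s3=S3_MONTHLY, ec2=EC2_MONTHLY, growth=2.0):
    """Count months of compounding S3 growth until S3 spend passes EC2 spend."""
    months = 0
    while s3 <= ec2:
        s3 *= growth  # assumption: cost doubles along with usage
        months += 1
    return months


crossover = months_until_s3_exceeds_ec2()
print(crossover)
```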

Joe finished off with a wish list for AWS, including a request for more stats, the ability to create EC2 instances in European datacentres (something we’d really like too), and a stable, offline storage service for backups and other data with low frequency access patterns (again, something we’d find very useful too).

Alan Williamson : MediaFed

MediaFed provide federation of premium online media from large publishers, such as the BBC, the Guardian and Le Monde. They also monitor and manage content as well as providing demographics and monetization services (could do with some of those!). Their original architecture was of the traditional variety in that they outsourced the hosting of real, physical hardware to a managed service provider. Rapid growth prompted a move to cloud services, and AWS in particular, when it proved impossible to scale economically on demand within the constraints of their hosting arrangement, and adding capacity meant long lead times of around 10 days.

The MediaFed application is composed of a number of frontend webservers, backend servers for RSS crawling, plus a whole bunch of servers doing things like ad insertion and analysing click through data. All of this is supported by Amazon infrastructure, with all of the processing being carried out on EC2 and S3 used for long term storage of logs, database snapshots etc. What I found most interesting about the MediaFed setup is the way they manage deployment as a single application stack. According to Alan, MediaFed is basically a Java app, running on Linux, which helps mitigate cloud vendor lock-in (a topic that’s rightly getting a lot of airtime just now). In the past, we’ve taken a similar tack with development of the core platform codebase: a single Java deployment that we just squirt onto fairly vanilla Linux boxes, then just start the required bits. It simplifies the deployment considerably, and for us that’s crucial. I have wondered lately though how long this strategy will continue to be viable as we add more services (and therefore code) and as our codebase becomes more modular with better componentisation (through continual refactoring at both the code and design levels). In some of our most recent development, we’ve been using Puppet to manage the deployment of both our code and third-party dependencies like Java to a bunch of machines both internally and running on EC2. So far, this has worked well for us, though it’ll be interesting to see how it develops along with our software.

Another interesting point Alan made was that even though EC2 now comes with a shiny SLA, instances DO go down, and you have to live with it. This calls for some thought when developing your application: handling failures is a core competency for any distributed application, especially one running in someone else’s datacentre. MediaFed’s solution to this is that when an instance falls over, they just spin up another to take its place. However, as services running on one node need to be able to reach services running on other nodes, they make what I thought is a novel use of SimpleDB. SimpleDB acts as a global, highly available service registry: when an instance boots, one of the steps in its automatic configuration is to register itself to a known location in SimpleDB. The lack of services that compete more or less directly with SimpleDB seems to reduce MediaFed’s potential portability somewhat, although thanks to the open APIs, other providers could always implement a compatible service.
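
To show the shape of that registry pattern, here is a minimal sketch in which an in-memory dictionary stands in for SimpleDB. The class, the function names, the service name and the addresses are all hypothetical; Alan didn't show MediaFed's code, and this makes no calls to any AWS API.

```python
class InMemoryRegistry:
    """Stand-in for a shared, highly available store such as SimpleDB."""

    def __init__(self):
        self._items = {}  # service name -> {instance id: address}

    def put(self, service, instance_id, address):
        self._items.setdefault(service, {})[instance_id] = address

    def remove(self, service, instance_id):
        self._items.get(service, {}).pop(instance_id, None)

    def lookup(self, service):
        """Return the addresses of every instance registered for a service."""
        return sorted(self._items.get(service, {}).values())


def on_boot(registry, service, instance_id, address):
    """Boot-time configuration step: announce this instance to the registry."""
    registry.put(service, instance_id, address)


registry = InMemoryRegistry()
on_boot(registry, "crawler", "i-1", "10.0.0.1")
on_boot(registry, "crawler", "i-2", "10.0.0.2")

# Instance i-1 falls over; a replacement boots and registers itself.
registry.remove("crawler", "i-1")
on_boot(registry, "crawler", "i-3", "10.0.0.3")

crawlers = registry.lookup("crawler")
print(crawlers)
```

Other nodes only ever ask the registry where a service lives, so a replacement instance with a new address becomes reachable without any reconfiguration elsewhere.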

Tom St.John : Kontexto

Kontexto is another player in the media analysis space, who aim to provide an on demand media measurement and analysis platform. Essentially, they run a large text collection and analysis infrastructure to provide categorization, storage, search and analytics (think data profiling, topic stats, trends & sentiment analysis etc) services.

By Tom’s admission, he’s ‘not the technical guy’ so his talk focused on the business aspects of cloud computing, especially from the point of view of a start up seeking investment. Tom told us that Kontexto’s cloud based architecture was a big selling point for early stage investors as it reassured them that their money would be spent on developing Kontexto’s USP(s), and not burned up by capital expenditure on ever depreciating hardware. Tom’s talk was the last of the day, so I guess that time was tight, but he did get a chance to list out some of the other things building their service atop AWS has enabled – most of which are particularly pertinent to a startup, but all of which we’ve found relevant ourselves:

  • Experiment and make mistakes without burning money
  • Try out new business models
  • Focus on core software development, not system administration
  • JIT scaling
  • The ability to attack big market opportunities without needing a large capital war chest

The day finished up with a panel involving the previous speakers, followed by a Q&A session with Adam Selipsky and Amazon’s CTO, the legendary Werner Vogels. The Amazon guys were fairly cagey about the AWS roadmap beyond what’s already been published (to be expected, really), but Werner seemed intrigued by a question regarding GPUs on demand for applications with über-high processing requirements (read into that what you will). The bones of the message I think they were trying to put across though was this: the current suite of web services provided by Amazon are the very lowest level blocks that they think are essential to builders of large (and small) scale applications with the “Internet Inside”. They’re the product of building these sorts of applications many many times over, and the current AWS APIs have evolved organically from that process. The implications of this are twofold. Firstly, it’s not a finished work, so as Amazon gather more information about what are useful services to provide, their offerings will be refined and expanded over time (and feedback from the user community is essential for the long term success of AWS). Secondly, we can expect higher level services to emerge as their requirements and commonalities gain clarity, something that we’re already starting to see.

OPO: modelling dynamic online presence

At Talis, we’re very interested in the development of the Semantic Web, and we’re always happy when other members of this space share what they’re doing with us. I was contacted a couple of weeks ago by Milan Stankovic, a member of the Good Old Ai research from Belgrade. He’s been working on the OPO (Online Presence Ontology), which aims to model the dynamic aspects of a user’s presence online: taking a leaf out of Twitter’s book, but tying it in semantically with the rest of the web. I’ve asked him to share a bit about their project with us.

So, Milan, what is “online presence”, and in what way is it “dynamic”?

I think that expansion of socialising services, like social networks, Twitter, lifestreaming services, etc. has significantly changed the way we socialize. When our friends publish custom messages on social networks, send tweets or set their IM statuses, we become more aware of their current activities and thoughts. When we assemble all that information we get a rich image of their presence in the online world.

Since the data that forms this image is spread over different services (and often repeated) we came up with the idea that it could be useful to make a model for its semantic representation and meaningful exchange. So we created an ontology – the Online Presence Ontology (OPO) to enable the integration of those pieces of information about a user’s online presence. Apart from that, OPO also enables the transfer of online presence related data from one service to another without the loss of semantics.

We believe that with the expansion of internet-enabled mobile devices, as users are more and more online, the topic of online presence will gain even more importance. Maybe even new ways to express your state of being present online will arise in this context. For this reason we did our best to make OPO flexible and extensible enough to survive the evolution of the online presence concept itself.

So, does this have anything to do with the already-existing FOAF ontology?

For understanding OPO and the notion of online presence itself, a comparison to FOAF might be essential. It is very important to distinguish the static and more persistent properties modeled by FOAF (like name, gender, homepage, etc.) from frequently changing properties addressed by the OPO (like custom message and IM status). The OPO is actually meant for representing dynamic aspects of user profiles, and we may say that it complements FOAF in a way. It is therefore quite natural that OPO is connected to FOAF through some properties.

How do you see this actually being implemented?

Apart from facilitating the integration of online presence data from various sources, OPO can also be beneficial for transferring data from one service to another. I personally know users who copy-paste their custom messages from gTalk to Facebook. This manual work is an annoyance we can easily relieve users from by introducing a meaningful data exchange between services. The first thing we need is a semantic representation and then the exchange mechanisms can be built on top of the ideas outlined by the Data Portability initiative.

The domain where we consider OPO’s contribution to be of greatest importance is the exchange of IM statuses. Currently different IM platforms use different status scales, and when users from different platforms meet in inter-platform chat (on services like Meebo, Digsby, etc.) their statuses are exchanged over the XMPP protocol by mapping them all to the very poor status scale used in XMPP. In those mappings the semantics of the original statuses is largely lost. To address this issue, OPO allows precise descriptions of IM status characteristics so that they can be meaningfully exchanged between platforms.
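
A small sketch can illustrate the loss Milan describes. XMPP presence offers four availability values (`away`, `chat`, `dnd` and `xa`); the platform-specific statuses and the mapping below are invented for illustration and are not taken from OPO or from any particular IM client.

```python
# The availability values defined for XMPP presence.
XMPP_SHOW_VALUES = {"away", "chat", "dnd", "xa"}

# Hypothetical mapping from richer platform statuses to XMPP values.
TO_XMPP = {
    "be right back": "away",
    "out to lunch": "away",
    "on the phone": "dnd",
    "busy": "dnd",
}


def collapsed_statuses(mapping):
    """Group original statuses that become indistinguishable after mapping."""
    groups = {}
    for original, show in mapping.items():
        groups.setdefault(show, []).append(original)
    return {show: sorted(names) for show, names in groups.items() if len(names) > 1}


lost = collapsed_statuses(TO_XMPP)
print(lost)
```

Once "out to lunch" and "be right back" have both become `away`, the receiving platform has no way to recover which one the user actually set.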

So, where are you taking this next?

We are currently working to extend the ontology with new features. One of the improvements will be the ability to add geographical location to your Online Presence. This will support travel twitting and will have its applications in recently emerged location based social networks.

Another interesting extension will be the support for describing the current music track that users sometimes state on IM platforms. Compared to the existing possibility to see the name of the song my IM contacts are listening to, semantic representation of music should bring the functionality to a higher level, by allowing IM programs to find and let me play that music. The infrastructure for this is already provided by the Music Ontology project as well as DBTune; we just have to connect it with OPO.

We will soon put this new version of the ontology for public review on the project website and we hope to get community comments and attract the community to participate in making the ontology even more usable.

In parallel we are working on plugins for some social networks and IM programs in order to bring the enabled interoperability to life.

Thanks, Milan.

If you’d like to check out the ontology yourself, or to read more about it, you can find it here:
OPO Website : http://www.milanstankovic.org/opo/
OPO URI : http://ggg.milanstankovic.org/opo/ns/

A passing observation on SaaS

Back in January, I noticed an intriguing idea from Jeff Jarvis : @twitcrit: instareviews. Basically to use the Twitter microblogging tool to post mini reviews. I couldn’t resist having a quick go at an implementation of what Jeff described. Fast mover that he is, Dave Winer got an implementation together ahead of me – see Jeff’s subsequent post.

Now programming skill doesn’t really come into this: the application is pretty straightforward and only took me a couple of hours to write. I assume Dave used his own platform based on Frontier, the service being maintained by himself. I used the Talis Platform. Although I work for Talis, I have nothing to do with the maintenance of the service – in effect I’m a 3rd party coding against a Web API (one based entirely on standard HTTP, but that’s another story).

Five months later, the twitcrit idea didn’t really catch on, and to be honest I’d pretty much forgotten about it. But checking back, my app is still live. Also in the meantime it’s been happily aggregating the data that’s passed through. I never got around to a proper search interface, but because the store is SPARQL-enabled, it is all searchable. Now check Dave’s version.

So my passing observation on SaaS is that in delegating infrastructure maintenance, you can just write your app and forget about it.

Google App Engine and the Joy of WebArch

[Image: Google App Engine logo]
Responses to the announcement of the Google App Engine have been mixed, from Tim Bray’s somewhat negative Sharecropping, to an awful lot of “very cool”s, with Niall Kennedy’s tech description providing a reasonably neutral common ground. I’ve been meaning to post about it, but I’ve a couple of pressing deadlines and haven’t had time. I didn’t think “Python – great! But this thing really isn’t forward-looking” would be doing it justice. However this morning I ran across a couple of blog posts on which I felt obliged to comment, and I just realised that most of my main points about the Google App Engine leaked out into those comments. So with apologies in lieu of better treatment, here goes -

Comment on Gabe Wachob’s Google App Engine: Its the Architecture Stupid! :

Nice post! The first I’ve seen to highlight the significance of the architecture.

While I think your analysis is generally on the nail, I’m not so sure about the conclusions. The thing is, App Engine architecture isn’t Web architecture.

As you point out there are nice reusable abstractions (like events etc), but the primary interfaces are all down at the code level.

“If you build your app on the Google App Engine architecture, it will scale to unlimited levels without any extra effort.” – yes, but only on the Google App Engine.

Rather than hoping for open source implementations of similar toolkits, if a HTTP facade were put over things like BigTable, the specific implementation wouldn’t matter – to change that you’d only have to change a few URIs, not all your code. (One for the LazyWeb).

Commoditization (commodification?) works best where there are common standards. A railroad engine isn’t a commodity if you have to build your own track :-)

See also: Cloud: commodity or proprietary?

Comment on Swaroop C H’s Web dev frameworks vs RIA :

[On the question of how one develops both client- and server-side with frameworks] I’d suggest that if Web standards are used as a common interface, it really doesn’t matter!

Ok, an example. A while ago I needed an easy personal activity tracker. I wanted it in my face a bit on the desktop, which called for something RIA-ish. I wanted the data available in a form that’s reusable, and I want a straightforward view on the web (so my colleagues could see what I was working on).

So I wrote a little desktop app in Java. It’s essentially MVC, with a fairly trivial domain-specific model – I have activity items with title, description and tags.

Server-side I have a Talis Platform store. The desktop app communicates with the server by POSTing a chunk of the domain-specific information expressed as an RDF/XML doc – the stores have this kind of interface out of the box.

For my simple Web view of the data, I have a little bit of PHP which does a SPARQL query on the store (standard SPARQL-over-HTTP endpoint also comes out of the box) and uses XSLT to transform it into the JSON consumed by SIMILE’s Timeline viewer.

Unfortunately I broke the Timeline viewer bit of the app (I think I got out of sync with SIMILE’s scripts). But hopefully you get the idea – small domain-specific components, loosely-coupled using a standard general-purpose protocol (HTTP) and standard general-purpose data model (RDF).
For reuse, I can query the store however I like.
[I got distracted and forgot to link to implementation note: More Dogfood]
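
For anyone unfamiliar with SPARQL-over-HTTP, here is a minimal sketch of how such a query request can be put together: the query text is simply sent as a `query` parameter to the endpoint. The endpoint URL is a placeholder rather than a real store, and the query uses the Dublin Core title property as a guess at how activity titles might be modelled; my actual app was PHP, so this Python is purely illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint, not a real store.
ENDPOINT = "http://example.org/store/sparql"

# Assumed modelling: each activity item has a Dublin Core title.
QUERY = """
SELECT ?item ?title
WHERE { ?item <http://purl.org/dc/elements/1.1/title> ?title }
LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Build the GET URL for a query, passing the text as the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)

# Round-trip check: the encoded parameter decodes back to the original query.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

Any HTTP client in any language can issue that request, which is really the point: the desktop app, the PHP view and anything I write later all share the store without sharing code.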

Ok, I’m showing my bias towards a data-oriented shared model in these comments. But if you wanted to narrow things down a little and be more content-oriented (and maybe placate Mr. Bray a little), swap out the RDFisms and replace them with Atom/AtomPub. The key point is providing a common interface based on standard models, message formats and protocols. (Interop between Atom, RDF and any other systems which respect WebArch is generally doable because of that common interface).

One other point I’d like to add which I suspect speaks volumes about Google’s mentality is the difference between a real aeroplane and Google App Engine’s snazzy logo. Compare and contrast with the image above:

[Image: ecojet.png]

A Chat with Dave Beckett

Today’s podcast is an interview with Dave Beckett (blog), Software Architect at Yahoo!


Dave’s been a contributor to the Semantic Web initiative since before it had that name, originally coming from a background in parallel computing. As well as having worked on many of the key specifications around RDF, he’s responsible for the Redland toolkit, a comprehensive set of open source libraries for RDF. Dave maintains Planet RDF, an aggregation of Semantic Web blogs, as well as various tools in support of Semantic Web Interest Group (SWIG) communications. Until the quantity of material got out of hand, his RDF Resource Guide was the definitive collection. He derived the human-friendly RDF notation Turtle, which recently appeared as a W3C Team Submission, co-authored with Tim Berners-Lee. It was Dave, as a member of the Data Access Working Group (DAWG), who coined the acronym SPARQL: SPARQL Protocol and RDF Query Language (which incidentally solved another naming problem).

The topics covered include how he got involved in these technologies in the first place, Redland and a couple of Dave’s experiments: the triplr service (“Stuff in, triples out”) and Flickcurl, a C library for the Flickr API. He offers his thoughts around some of the technologies and specifications he’s been involved in, along with other developments around the Web – check the list of links below. While having limits on what he could say in public, he also mentioned the use of RDF inside Yahoo! (more announcements on the way apparently).

There are a couple of quotes I can’t resist pulling out. I asked Dave about how well he thought the Semantic Web was coming along, and he pointed out that, like the Web, there wouldn’t be any specific point in time at which one might say it was a success. But he added:

For me, in the work we’re doing with Yahoo! internally, it’s already a success…we’ve done work better, faster and we’ve done things we couldn’t do before because we were using this style of technology. It’s not always publicly visible because it’s a kind of data technology…but it’s a success for Yahoo! content and metadata problems I’ve been working on.

Dave also talks a little about open data, a nice line being:

The reason I got involved with the Semantic Web was…I wanted control of my data.

If you want to hear more, Dave will be speaking at the Semantic Technology Conference in San Jose in May, where he plans to go deeper into why Yahoo! is using RDF, the benefits and more detail of their projects.

One final quote:

Have fun with the Semantic Web…it’s about connecting things together, about getting the jobs done.

During the conversation, we refer to the following resources;

Nitpicking Alex’s Semantic Web Patterns

Alex Iskold just published quite a lengthy blog post called Semantic Web Patterns: A Guide to Semantic Technologies. Overall it’s good stuff, and Alex has been doing a great job of promoting the Semantic Web over on Read/WriteWeb and elsewhere. He’s also one of the Semantic Gang featuring in the latest podcast series from oor Paul. (I’ve not listened to that yet – I’ll try it with a dogwalk shortly).

Because of all this I feel a little disloyal in being critical, but without clarification some of the points in Alex’s post could lead to misconceptions, the bane of Semantic Web outreach. One thing I can’t disagree with Alex about is the way the Semantic Web means different things to different people (cue elephant analogy). So with that proviso and all due respect etc, here we go:

1. Bottom-Up and Top-Down
Alex says:

“The bottom-up approach is focused on annotating information in pages, using RDF, so that it is machine readable. The top-down approach is focused on leveraging information in existing web pages, as-is, to derive meaning automatically.”

Ok, while one could (and I will) quibble with the content of these definitions, they do make a pretty clear distinction. The only thing is, the phrases “bottom-up”/“top-down” have already been used fairly extensively in the Semantic Web context to describe at least two different (but related) distinctions.

The first of these is with regard to decision-making, in the same sense as within the management hierarchy of an organization. The naive stereotype for this distinction would give, say, top-down = “those in power in standards orgs call the shots” versus bottom-up = “grassroots developers determine the direction”. Given that specifications can appear as authoritative rules, it’s easy to see how this perception might emerge. (This is a naive distinction, because it fails to consider the influence of the community that goes into defining specifications and in determining which survive the natural selection of deployment in the wild).

The second usage of “bottom-up”/“top-down” is more technical, in regard to how you arrive at your world/domain model. Top-down would be starting your model from a generalized level and working towards more specific levels, bottom-up the reverse. Clearly if there’s to be global interoperability, taking the top-down approach would imply there’s one true model that everyone follows. In the past this has led to some awful misconceptions around RDF, where people have assumed that the models (i.e. vocabularies, RDF Schemas, ontologies) are created on high – probably by the W3C. Quite the opposite is true. While RDF is a framework (and hence might be viewed as a top-level language), it’s essentially neutral on who, where and how domain models are created. Because things, classes of things, relationships between things and so on are identified using URIs, anyone can create their own vocabularies. This retains a base level of global interop, and enables web-scale independent development. (I once saw a list email containing a line like “the namespace begins with http://purl.org, so it must be something to do with RSS 1.0 people at the W3C” – no, no, no!).

So basically while Alex’s “bottom-up”/“top-down” may be internally consistent, it’s a little idiosyncratic.

2. Annotation Technologies: RDF, Microformats, and Meta Headers
There’s quite a bit I could quibble with in this section, but I’ll stick to the one point I think is most significant. It can be very misleading to think of RDF merely as an annotation and/or metadata tool. While it can be, and very often is, used for annotation (typically descriptions of documents) and metadata (descriptions of data) purposes, it is also used to talk about things directly. Alex provides an example: “Alex IS the father of Alice, Lilly, and Sofia”. This is plain old data. The same data could be expressed in a database table called “fatherOf” with “Alex” appearing three times in the left-hand column and the right-hand column containing “Alice”, “Lilly”, “Sofia”. RDF is a data technology. One big difference from traditional RDBMSs is that relations (tables, properties, “fatherOf”) can only have two values – the subject and object of the relation (2 columns, “fathers”/“children”). Another big difference is that both things and the relationships between things are generally identified using URIs, which enables the Web part of the Semantic Web.
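
To make the table comparison concrete, here is a minimal sketch in plain Python (no RDF library). The example.org URIs and the objects() helper are made up for illustration; they are not from Alex’s post or from any real vocabulary.

```python
# Hypothetical namespaces: example.org is only a placeholder.
PEOPLE = "http://example.org/people/"
FATHER_OF = "http://example.org/vocab/fatherOf"

# Relational view: one two-column table called "fatherOf".
father_of_table = [("Alex", "Alice"), ("Alex", "Lilly"), ("Alex", "Sofia")]

# RDF view: each row becomes a (subject, predicate, object) triple,
# with the things and the relationship all named by URIs.
triples = {
    (PEOPLE + father, FATHER_OF, PEOPLE + child)
    for father, child in father_of_table
}


def objects(graph, subject, predicate):
    """Return the sorted objects of every triple matching subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


children = objects(triples, PEOPLE + "Alex", FATHER_OF)
for child in children:
    print(child)
```

The table name has turned into the predicate URI, and the two columns into subject and object. Nothing here is annotation of a document: it is data about people.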

3. Consumer and Enterprise
I think it’s good that Alex highlights consumer/enterprise and vertical/horizontal aspects of the Semantic Web; they are worthy of discussion. But regarding the “killer app” of the Semantic Web – one might equally well ask “what is the killer app of the Web?” (this is Tim Berners-Lee’s own response in the 2001 Sci Am article).

There’s another source of misconceptions in this section: “RDF offers a way to communicate using XML-based language…”. While strictly speaking that’s probably correct, it gives the impression that RDF is XML-based, which it isn’t. RDF is a data model, an abstract language. Formats and serializations (of which there are several, both XML and non-XML) are secondary. Given the recent work around GRDDL, it’d be more accurate to say “XML offers a way to communicate using RDF-based language…”.
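
As a rough illustration of “data model first, serialization second”, the sketch below reads the same two statements from an N-Triples string and from an RDF/XML string and checks that they come out as the same set of triples. Both parsers are deliberately tiny, written just for this example, and only cope with URI-only statements in exactly these shapes; the example.org URIs are placeholders.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The same two statements, once as N-Triples and once as RDF/XML.
NTRIPLES = """\
<http://example.org/people/Alex> <http://example.org/vocab/fatherOf> <http://example.org/people/Alice> .
<http://example.org/people/Alex> <http://example.org/vocab/fatherOf> <http://example.org/people/Lilly> .
"""

RDFXML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:v="http://example.org/vocab/">
  <rdf:Description rdf:about="http://example.org/people/Alex">
    <v:fatherOf rdf:resource="http://example.org/people/Alice"/>
    <v:fatherOf rdf:resource="http://example.org/people/Lilly"/>
  </rdf:Description>
</rdf:RDF>
"""


def parse_ntriples(text):
    """Toy parser: handles only lines of the form <s> <p> <o> ."""
    pattern = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.")
    return {match.groups() for match in pattern.finditer(text)}


def parse_rdfxml(text):
    """Toy parser: handles only rdf:Description with rdf:resource properties."""
    triples = set()
    root = ET.fromstring(text)
    for description in root.findall("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # ElementTree tags look like "{namespace}local"; joining the two
            # parts gives back the property URI.
            namespace, local = prop.tag[1:].split("}")
            obj = prop.get("{%s}resource" % RDF_NS)
            triples.add((subject, namespace + local, obj))
    return triples


same_graph = parse_ntriples(NTRIPLES) == parse_rdfxml(RDFXML)
print(same_graph)
```

The angle brackets are incidental. What matters is the set of triples each syntax decodes to.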

This confusion around XML messes up Alex’s arguments on scalability somewhat – I’m sure someone somewhere is using an XML DB for RDF, but most I’ve seen are either built on top of RDBMSs or are RDF-native. (Non-generic, domain-specific data can be stored pretty much any way you like – if semweb interfaces were exposed I suppose you could call it an RDF store of sorts…). Also, while RDF storage technology isn’t anywhere near as mature as that of RDBMSs, the two do draw on essentially the same foundations – and sometimes the same people – so the picture isn’t as bad as one might imagine. Genuinely large RDF stores are starting to appear, and even then it’s worth remembering (as Alex points out) the aim is for the big database to be the Web itself. (My own standard line on this is that triplestores are just local caches of chunks of the Semantic Web).

4. Semantic APIs
As Paul Downey put it, Web APIs Are Just Web Sites – the same goes for the Semantic Web. Alex talks about some of the online APIs for extracting RDF from natural language. While these are nifty, potentially any Web site or service could with appropriate tweaking be a Semantic API. The original RSS was a Semantic API – descriptions of news-like items delivered using RDF over HTTP. While the latest syndication format, Atom, might not be RDF, it’s good Web-friendly data that can be mapped to RDF (work is in progress on conventions for that).

Semantic Web technologies also have an ace card up their sleeves here, in the form of SPARQL. RDF stores and (with the appropriate wiring) any online RDF can be queried using a straightforward SQL-like language, operating over standard HTTP. A seriously powerful addition to the Web API toolkit.
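
For a feel of what “SQL-like, over standard HTTP” means in practice, here is a small sketch that builds the GET request URL for a query, following the SPARQL protocol convention of passing the query text in a “query” parameter. The endpoint address and the fatherOf vocabulary are hypothetical, and the code only constructs the URL; it doesn’t send anything.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: substitute a real SPARQL service to try it out.
ENDPOINT = "http://example.org/sparql"

# A SELECT query asking for everything Alex is the father of.
QUERY = """\
SELECT ?child WHERE {
  <http://example.org/people/Alex> <http://example.org/vocab/fatherOf> ?child .
}
"""


def sparql_get_url(endpoint, query):
    """Build a GET URL carrying the query text in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

Fetching that URL from any HTTP client would, against a real endpoint, return the matching bindings. No service-specific client library is involved, which is the point.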

Right now the ability to make mashups (client- or server-side) is limited by the effort needed to integrate across different APIs (the n-squared thing). RDF can make integration trivial. Even without RDF/SPARQL being available, a lot of the pain of integration can be alleviated if the data is mapped to RDF then integrated.
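
Here is a rough sketch of the map-then-integrate idea, again in plain Python. The two payloads, the example.org URIs and the helper names are all invented for the example. Once each source has been mapped to triples, integrating them is just a set union.

```python
# Made-up payloads from two unrelated services with different shapes.
photos_api = [{"id": "p1", "owner": "alex", "title": "Station"}]
events_api = {"alex": {"attending": ["semcamp"]}}

BASE = "http://example.org/"  # placeholder namespace


def photos_to_triples(rows):
    """Map the photo service's list-of-dicts shape to triples."""
    for row in rows:
        photo = BASE + "photo/" + row["id"]
        yield (photo, BASE + "vocab/owner", BASE + "person/" + row["owner"])
        yield (photo, BASE + "vocab/title", row["title"])


def events_to_triples(data):
    """Map the event service's nested-dict shape to triples."""
    for person, info in data.items():
        for event in info["attending"]:
            yield (BASE + "person/" + person,
                   BASE + "vocab/attending",
                   BASE + "event/" + event)


# One mapping per source, then a plain union: no pairwise glue code.
merged = set(photos_to_triples(photos_api)) | set(events_to_triples(events_api))


def about(graph, uri):
    """Return every triple that mentions uri as subject or object."""
    return sorted(t for t in graph if uri in (t[0], t[2]))


alex_facts = about(merged, BASE + "person/alex")
for fact in alex_facts:
    print(fact)
```

Each new source needs one mapper rather than one adapter per existing source, which is where the n-squared cost goes away. The merge works because both mappers agreed on the same URI for the same person.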

I don’t think we’ll ever see every single service offering Semantic Web-friendly APIs. But to the Web 2.0 style sites, the Web is a competitive environment. Services which do support RDF and/or SPARQL will be able to benefit from the lowering of the integration barrier, and over time increasingly tend to have a commercial advantage over services which don’t. The ball is rolling and the field is wide open.

5. Search Technologies
“Perhaps the first significant blow to the Semantic Web has been the inability thus far to improve search.” – er, well, no. Search, at least as we know and love it today, is an artifact of the document Web. Success for the Semantic Web wouldn’t be improving search, but marginalizing it.

The information carried by the document Web, the stuff we’re interested in, is generally expressed in human-readable text inside the documents. There’s a semantic air gap between the protocols and languages of the current Web (HTTP, HTML…) and the information that’s being conveyed. Search engines bridge that gap through the use of heuristics based around string matching on queries and indexed documents. Semantic Web technologies offer a couple of ways of minimizing the gap. Through the increased use of metadata, more explicit matching can be made. Before anyone throws the metacrap arguments at me, consider the improvements already brought by metadata-rich syndication feeds and folksonomy tagging.

The other way of reducing the gap that comes to mind is…not to create gaps in the first place. Take an online train timetable. Right now it’ll likely be contained in a database somewhere, exposed through HTML with a form or two. To access the data we are at the mercy of whatever specific front-end the service provider has offered. To make a mashup with it we’d be making site-specific calls, at best through a RESTful API. But if the data was also available without the document Web-oriented intermediation, say as RDF/XML documents, or perhaps better still a SPARQL endpoint, mashups would be trivial.
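
As a sketch of what exposing the timetable data directly might involve, the snippet below turns rows of a database table into triples, one URI per record and one property per column. The table layout, the sample rows and the example.org URIs are assumptions made for illustration; a real mapping would need proper datatypes and shared station identifiers.

```python
import sqlite3

BASE = "http://example.org/timetable/"  # placeholder namespace

# A stand-in for the service provider's database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE departure ("
    "id INTEGER PRIMARY KEY, origin TEXT, destination TEXT, departs TEXT)"
)
conn.executemany(
    "INSERT INTO departure VALUES (?, ?, ?, ?)",
    [(1, "Bristol", "London", "09:15"), (2, "Bristol", "Cardiff", "09:40")],
)


def rows_to_triples(connection):
    """Yield one triple per non-key column of each departure row."""
    cursor = connection.execute(
        "SELECT id, origin, destination, departs FROM departure ORDER BY id"
    )
    # Column names after the key become property names.
    columns = [column[0] for column in cursor.description][1:]
    for row in cursor:
        subject = BASE + "departure/" + str(row[0])
        for column, value in zip(columns, row[1:]):
            yield (subject, BASE + "vocab/" + column, value)


triples = list(rows_to_triples(conn))
for triple in triples:
    print(triple)
```

Serve those triples at the record URIs, or put a SPARQL endpoint in front of them, and the mashup author no longer needs to know anything about the HTML forms.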

Incidentally, I remember the train timetable scenario coming up on the microformats list a while back. At the time it seemed nonsensical to me to follow the suggestion over there of having e.g. one microformatted-HTML page for each record in the database. In retrospect I think that was potentially a very good solution – assuming the microformat followed best practices, using a profile etc, then this would be equivalent to publishing all the data as linked RDF. A GRDDL-aware consumer would in fact see it that way. The bonus advantage is having the (inherently in sync) HTML material available too.

Anyhow, back to search. The current Web does contain one notable kind of explicit, machine-readable semantics: the link. This page is related to that page. I don’t think it’s coincidence that the most successful search heuristic to date – Google’s PageRank – is based on this data source.

My standard line on search is “search engines act as indexes of the Web, the Semantic Web is its own index”, or more succinctly “the best way to find things is not to lose them in the first place”.

6. Contextual Technologies
I don’t really disagree with what Alex says in this section, but would add that Semantic Web languages make it much easier to deal with contexts – which can be expressed directly, without the need for interpreting natural language. There are already a few pretty neat faceted browsing tools around, I reckon these things are going to get a lot neater over the next few years.

7. Semantic Databases
See above about triplestores in Consumer and Enterprise.

Twine and Freebase are really nice applications, although I believe Freebase’s connection to the rest of the (Semantic) Web is still pretty suboptimal. Twine’s still in beta, but has already come an awful long way (I put it in my open-in-tabs-regularly bookmarks). What they both demonstrate is that something which looks to the end user like a regular shiny Web 2.0 application can be built at a significant scale using RDF/RDF-like technologies. Where these things have an opportunity to get much more interesting than similar traditional products is in exploiting the Semantic Web angle. I do hope they hook up to the Linking Open Data cloud soon.

Conclusion
The Semantic Web does mean different things to different people, and maybe I’m being overly orthodox in seeing RDF+HTTP as the distinguishing features of these particular Semantic Technologies. But I’m glad I got that off my chest. Now for that dogwalk with Semantic Gang.

A Chat with Richard Cyganiak

Latest recording on technical matters is a chat with Richard Cyganiak, who’s currently working on the Sindice Semantic Web search engine, though is probably best known for his leading role in the Linking Open Data project (maintaining the cloud diagram :-)

In the podcast Richard describes various technical details of these projects, and talks about the nature of data on the Web in the wild, as RDF, microformats and increasingly RDFa. He also discusses some of the practical issues in mapping existing databases to the Semantic Web (the kind of techniques Tim Berners-Lee mentioned in his podcast with Paul a few weeks ago).

Richard naturally mentions the principles of Linked Data:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs, so that they can discover more things.
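
The four principles can be sketched with a toy, in-memory stand-in for the Web. Everything here is made up for illustration (the example.org URIs, the WEB dictionary, the look_up and follow_links helpers), and a real client would fetch each URI over HTTP rather than read a dictionary.

```python
from collections import deque

# Toy stand-in for the Web: looking up an HTTP URI returns triples about it.
WEB = {
    "http://example.org/people/alice": {
        ("http://example.org/people/alice",
         "http://example.org/vocab/knows",
         "http://example.org/people/bob"),
    },
    "http://example.org/people/bob": {
        ("http://example.org/people/bob",
         "http://example.org/vocab/organised",
         "http://example.org/events/camp"),
    },
    "http://example.org/events/camp": set(),
}


def look_up(uri):
    """Principle 3: looking up a name returns useful information about it."""
    return WEB.get(uri, set())


def follow_links(start, limit=10):
    """Principle 4: follow links in what comes back to discover more things."""
    seen, queue, graph = set(), deque([start]), set()
    while queue and len(seen) < limit:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        for triple in look_up(uri):
            graph.add(triple)
            obj = triple[2]
            # Principles 1 and 2: things are named with HTTP URIs,
            # so any object that is one can itself be looked up.
            if obj.startswith("http://") and obj not in seen:
                queue.append(obj)
    return graph


discovered = follow_links("http://example.org/people/alice")
print(len(discovered))
```

Starting from one URI, the client ends up with statements published at another, without having been told in advance that the second source existed.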

Listen Now

Download MP3 [47 mins, 44Mb]

A Chat with Tom Morris

Today’s verbal delight features Semantic Web hacker (and philosopher) Tom Morris, initially talking about using XML to describe real-world things, mentioning the advantages of RDF. He then describes his experiences with the Ruby programming language, and offers thoughts on practical aspects of working in the distributed environment of the Web. Tom tells of ideas he has around using Bluetooth with RDF, before giving his opinion of platforms like Facebook, and related novel aspects of online gaming. He concludes by talking about his recent experience of organizing SemanticCamp London, and encouraging other people to try the BarCamp approach to conferences.

Listen Now

Download MP3 [52 mins, 48Mb]

During the conversation, we refer to the following resources: