Nodalities

From Semantic Web to Web of Data

License

Creative Commons License

Author Archive

Lessons for Ontology Writers

I’m in a session called Taming the Open World, being run by Tim Swanson of Semantic Arts. I’m particularly interested in understanding how we can develop open world applications, and issue 2 of Nodalities contains an article by Nadeem Shabir on just this issue. Since I have power and wifi, which are both in scarce supply, I thought I’d take the opportunity to liveblog the session. Looks like it’s going to be a contrast to the Metaweb/Freebase tutorial earlier since we’re straight into OWL ontologies, running through the notation he’s going to be using. There are no standard notations for diagramming ontologies, which is pretty surprising considering the wealth of research activity in this area. It’s looking rather interesting… must see if I can find some examples on the web.

First example is of two classes: Contractor and Employee and a single instance “Joe” who is an Employee. A reasoner will actually create two interpretations of these three facts: one where Joe is an Employee but not a Contractor and one where Joe is both. As more assertions are added, exponentially more interpretations are generated by the reasoner for all possible combinations. The reasoner is looking at all possible solutions and will assume that any fact is potentially true unless it has been explicitly told otherwise. This is the open world assumption - anything unknown could be true or false and a reasoner has to consider both possibilities.

A fact is provable if it is true in every possible interpretation. It is satisfiable if it is true in at least one interpretation. These are the two main uses of a reasoner: to prove a statement or to discover if a statement is possible. However, the huge number of possible interpretations massively complicates the problem. To make reasoning problems tractable we have to clean up the open world by removing facts that cannot possibly be true.
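The Joe example can be sketched in a few lines of Python. This is only a brute-force illustration of the idea, not how a real reasoner works internally; the function names (`interpretations`, `provable`, `satisfiable`) are made up for this sketch.

```python
from itertools import product

CLASSES = ["Employee", "Contractor"]
INDIVIDUALS = ["Joe"]


def interpretations(asserted):
    """Yield every true/false assignment of class memberships
    that agrees with the asserted facts (open world: anything
    not asserted may be either true or false)."""
    atoms = [(i, c) for i in INDIVIDUALS for c in CLASSES]
    for values in product([True, False], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(world[a] for a in asserted):
            yield world


def provable(fact, worlds):
    """A fact is provable if it holds in every interpretation."""
    return all(w[fact] for w in worlds)


def satisfiable(fact, worlds):
    """A fact is satisfiable if it holds in at least one interpretation."""
    return any(w[fact] for w in worlds)


worlds = list(interpretations({("Joe", "Employee")}))
print(len(worlds))                                # 2
print(provable(("Joe", "Employee"), worlds))      # True
print(provable(("Joe", "Contractor"), worlds))    # False
print(satisfiable(("Joe", "Contractor"), worlds)) # True
```

With one individual and two classes there are only four assignments to check, but each extra individual or class doubles the number of atoms, which is where the exponential growth comes from.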

Some techniques that can be used when writing ontologies:

  • Class disjointness - saying that two classes have no members in common such as a Living Thing and a Scheduled Event. This is especially useful with deep ontologies where the root classes are declared disjoint. This disjointness then cascades down the ontology tree to the more specific classes at the bottom eliminating many possible interpretations.
  • Domain and range - this makes the disjointness more effective by adding more information about instance types.
  • Individual differentness - OWL provides a differentFrom predicate but you have to state, one by one, that every individual is different from every other. Some of this can be inferred using functional properties, so if two individuals have different values for a functional property then they can be inferred to be different individuals. Also we can use inverse functional properties but this is not possible with datatype properties, e.g. social security numbers. A workaround is to create a URI scheme for the value.
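To get a feel for how much a single disjointness axiom prunes, here is a rough brute-force count in Python. It is a sketch under stated assumptions: the individuals Ann and Raj are invented, and treating Employee and Contractor as disjoint is chosen purely for illustration (the session's own example was Living Thing and Scheduled Event).

```python
from itertools import product


def count_interpretations(individuals, classes, asserted=(), disjoint=()):
    """Count membership assignments that satisfy the asserted facts
    and never put one individual in two disjoint classes."""
    atoms = [(i, c) for i in individuals for c in classes]
    count = 0
    for values in product([True, False], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if not all(world[a] for a in asserted):
            continue
        if any(world[(i, a)] and world[(i, b)]
               for i in individuals for a, b in disjoint):
            continue
        count += 1
    return count


people = ["Joe", "Ann", "Raj"]  # Ann and Raj are hypothetical
classes = ["Employee", "Contractor"]
facts = [("Joe", "Employee")]

open_count = count_interpretations(people, classes, facts)
pruned_count = count_interpretations(
    people, classes, facts, disjoint=[("Employee", "Contractor")]
)
print(open_count, pruned_count)  # 32 9
```

Without the axiom Joe has two possible states and the others four each (2 × 4 × 4 = 32); with it Joe has one and the others three each (1 × 3 × 3 = 9), and the gap widens with every individual added.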

Some more advanced techniques include stating that an individual does not have a particular property. To do this you have to create a class for the individual resource and define that class as the complement of things that have the property in question. You have to do that for every individual, a massive explosion of triples, but a corresponding reduction in possible interpretations.

In the discussion after the session a few reasoner implementors were discussing some of these ideas. I learnt that a tableau reasoner will take all the URIs in a graph and combine them all to create all possible triples and then start eliminating them using the OWL constraints! I wonder what implications that has for Linked Data’s assignment of URIs to everything?

Working in the open world enforces a different kind of discipline in data modelling. You need to define what is not true as well as what is true. It’s best to work at the highest level possible which ends up being a supporting case for upper ontologies.

The Best is Yet to Come

The next generation of the web, this Semantic Web, is in its infancy but already we’re seeing some fantastic glimpses of its potential.

We saw some of that potential recently at DrupalCon 2008 where Dries Buytaert used his keynote to share a vision of the future… one that is built on RDF (read more on our sister blog).

Imagine every Drupal installation as a Linked Data source. Wow!

This would be a massive step towards the Semantic Web’s maturation and I hope the Drupal people can pull it off. My advice would be to remember that these are still early days and to tackle it with pragmatic baby steps. Just like the early days of the Web there’ll be plenty of stop energy trying to drag you back, but hold your nerve and see it through.

Adoption of the technologies by significant projects like Drupal really shows that we’re entering a new generation of the Web, one that is much more data-centric. The few billion triples online right now are just a drop in the ocean of what we’ll need for a useful Semantic Web so this news from Drupal is hugely important.

It feels like the Web did back when being able to launch a website on Geocities was a liberating experience. There wasn’t much to link to back then either but fifteen years later the Web is unrecognisable in terms of its diversity and effect on the World. I’m willing to bet that in another decade it will have changed again way beyond our expectations and predictions today.

The best is yet to come.

Web 3G: The Third Generation of the Web

I’m at the BlogTalk conference in Cork where I’m meeting an eclectic mix of bloggers, technologists and “Interesting People” gathering to share a common interest in the social web. There’s also a good representation from the Semantic Web folks including a group from DERI Galway.

Paul gave a talk on the potential of the web of relationships, alluding to the possibilities we’re seeing the more things become connected. It’s not just about connecting pages together with hyperlinks but using Semantic Web technologies we can also connect people with the things they produce, need and use. Tomorrow Nova Spivack is giving a talk on semantic social software, hopefully giving us a new view of his company’s application Twine.

Twine and our own Talis Engage are the first in a new breed of applications founded on Semantic Web technologies that expose large parts of their data for reuse by other similar applications. We were discussing all this over dinner tonight and I suggested that a good label for this would be Web 3G since these applications were part of what we were calling the third generation of the Web.

Web 3G is what happens when you fuse the social participation of Web 2.0 with the decentralized structured information of the Semantic Web. The result is a smarter way of organising information in a network of interwoven semantic links and content, enhanced with feedback from usage and participation. We’re coming up to the end of two decades of the Web, the first of which was spent seeding the bare essentials of the web of documents. The second decade saw widespread broadband adoption enable mass participation and creation of content by millions. The next decade is going to radically change how we find, create, use and relate to that information.

Three generations of the Web

The Web right now is built from the generic hyperlink, which says nothing more than “look over here”. But even this weak semantic was enough to enable Google’s Pagerank to organise and score the Web. Imagine how much more powerful the hyperlink could be if it were possible to express sentiment or meaning in the link. Even if that were limited to positive or negative endorsement of the target of the link, the value to the relevance ranking of search engines and applications would be huge. However, the possibilities for expressing the intention of a link between two pages are endless. For example, it could be possible for writers to say whether they support or reject the views expressed in the target of the link, or whether they are linking to conflicting evidence or alternative versions of the same information. These simple expressions of intention could provide an entirely new dimension of metadata. The links between things are fundamental to the existence of the Web and the value of understanding why things are related is huge.

Web 3G is an evolution of Web 2.0 enhancing it through the appropriate use of light semantics. Links between things become more clearly typed, embedded data on pages becomes more easily understood by machines, all the while retaining the ability for people to connect and link and critique the quality and relevance of the data. It becomes the semantic graph, open to participation by everyone without having to ask anyone’s permission. It is not Artificial Intelligence, there are no formal ontologies or logic reasoning, but some of the tools and techniques of AI are needed: neural networks, classifiers, heuristics, Bayesian networks and statistical analysis.

A whole new generation of applications are emerging that feature huge levels of interconnections and we hope to enable many of those to be built using the Talis Platform. Many of these connections will be internal to the application but by exposing raw data, in the ways suggested by the Linking Open Data project, every application can link to and reuse information managed by every other application. This is a step beyond data portability: rather than copying data from one application to another the norm will be to reuse data in situ. That way the data never gets out of date because it’s shared and we can use the best application to manage each piece of our data, depending on our situation. This is what Tim Berners-Lee meant by the Giant Global Graph: a world-wide network of links with meaning.

I like this generational view of the evolution of the Web. It makes it clear that there is no big bang switchover from one type of application to another. Even now we can see many Web applications being created and used that aren’t socially enabled, but they look hollow when compared to their Web 2.0 peers. The same is likely to be true of the third decade, where we’ll see new applications being created that can’t talk to their peers and they too will feel shallow and unexciting when compared to their Web 3G counterparts. This isn’t an increment to Web 2.0, it’s a radical step forward!

Semantic Spring

It’s going to be a busy spring in Semantic Web land. There are five important conferences that are scheduled in a two month period.

First up is WWW2008 in Beijing which runs from the 21st to the 25th of April. This is the pre-eminent Web conference packed full of presentations, workshops and tutorials covering everything webby. The Semantic Web traditionally has a number of tracks running through the conference programme. This year we’re co-chairing a session called Linked Data on the Web with Tim Berners-Lee, Chris Bizer and Kingsley Idehen. Rob submitted a paper to the workshop on representing MARC in RDF which I recommend to anyone interested in the future of library data. I’ve seen a couple of the other submissions and it looks like it’s going to be a lot of fun.

Then there’s a week off until XTech 2008 which runs from the 6th to the 9th of May in Dublin. XTech is another webby conference that often has a Semantic Web streak through it. Many of the people involved in organising and helping to run it are SemWeb sympathisers. A couple of us at Talis have submitted talk proposals although we won’t find out if we’ve been selected until the end of this month. XTech clashes completely with JavaOne in San Francisco which also has a little Semantic Web session courtesy of Henry Story and friends.

Another break of a week before Semantic Technology in San Jose from 19th to 22nd May. This is the fourth year for this conference whose focus is coming from the world of semantics rather than the Web. Last year saw all kinds of attendees from the CIA and NASA to Lockheed and Citigroup. This is the conference to be at if you want to see commercial applications of the SemWeb.

Again, a week off until we reach ESWC 2008 in Tenerife from the 1st to the 5th June. This is a hardcore SemWeb conference with a strong research angle. Once again we’re involved in organising a workshop: Scripting for the Semantic Web, which aims to cover the use of the Semantic Web with all kinds of scripting languages such as PHP, Perl, Python, Ruby and JavaScript. I’m also looking forward to the first workshop on Collective Intelligence & the Semantic Web which promises to be a very hot topic in coming years.

One more week off, before we reach the new kid on the block: Linked Data Planet being held in New York on the 17th and 18th of June. Details are sketchy at the moment apart from the keynote from Tim Berners-Lee, but Linked Data has a lot of momentum and there is a lot of potential for some cool demonstrations.

As you might expect, a contingent of Talisians are attending each conference, but none of us are going to attempt all five! Personally I’m planning to be at XTech and Semantic Technology and possibly at ESWC and Linked Data Planet. I know for sure that I’m glad of the holiday I have booked for the end of June. I think I’m going to need it.

Rules for a Realistic Semantic Web?

In one of my linkblog entries earlier this week I made the following claim:

IMHO OWL isn’t part of the petatriple future of the semweb. Nor is SPARQL…

A recent post by Chimezie touched on this too:

I’ve been spending quite a bit of time on FuXi mainly because I am interested in empirical evidence which supports a school of thought which claims that Description Logic based inference (Tableaux-based inference) will never scale as well the Logic Programming equivalent - at least for certain expressive fragments of Description Logic (I say expressive because even given the things you cannot express in this subset of OWL-DL there is much more in Horn Normal Form (and Datalog) that you cannot express even in the underlying DL for OWL 1.1). The genesis of this is a paper I read, which lays out the theory, but there was no practice to support the claims at the time (at least that I knew of). If you are interested in the details, the paper is “Description Logic Programs: Combining Logic Programs with Description Logic” and written by many people who are working in the Rule Interchange Format Working Group.

It is not light reading, but is complementary to some of Bijan’s recent posts about DL-safe rules and SWRL.

A follow-up is a paper called “A Realistic Architecture for the Semantic Web” which builds on the DLP paper and makes claims that the current OWL (Description Logic-based) Semantic Web inference stack is problematic and should instead be stacked ontop of Logic Programming since Logic Programming algorithm has a much richer and pervasively deployed history (all modern relational databases, prolog, etc..)

I’m not a DL expert but, based on my research, it seems to me that DL-based inference for OWL isn’t going to deliver for the semantic web any time soon. Of course, by this I mean it’s not going to scale in such a way that makes real-time inferencing over petatriples viable. Besides, OWL and its variations are still very limited in their expressivity and not particularly useful for many classes of applications. Maybe rule systems can deliver instead?
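For readers who haven’t met rule-based inference, here is a deliberately naive forward-chaining sketch in Python. It is not how FuXi or any production rule engine works (those use far smarter matching); the two rules and the sample triples are assumptions chosen only to show the flavour of Horn-style inference over triples.

```python
def forward_chain(facts):
    """Apply two Horn-style rules repeatedly until no new triples
    appear (a fixpoint). Naive O(n^2) matching per pass."""
    facts = set(facts)
    while True:
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                # Rule 1: subClassOf is transitive.
                if p == "subClassOf" and p2 == "subClassOf" and o == s2:
                    new.add((s, "subClassOf", o2))
                # Rule 2: an instance of a class is an instance of its superclasses.
                if p == "type" and p2 == "subClassOf" and o == s2:
                    new.add((s, "type", o2))
        if new <= facts:
            return facts
        facts |= new


# Hypothetical sample data.
asserted = {
    ("Joe", "type", "Employee"),
    ("Employee", "subClassOf", "Person"),
    ("Person", "subClassOf", "Agent"),
}
closure = forward_chain(asserted)
print(("Joe", "type", "Agent") in closure)  # True
print(len(closure))                         # 6
```

Each rule only ever adds facts that follow directly from facts already present, so there is no need to enumerate alternative interpretations the way the tableau approach does.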

Mashup on Location

I spent the evening at the Mashup* event in London. Tonight’s theme was location and specifically location-based services. We were treated to a long sales pitch from TeleAtlas as the opener which was followed by a rather lacklustre panel session. This was a shame because the others have been pretty good (such as the one on identity).

The evening was saved, in my opinion, by Tom Heinersdorff who posed a question about passive data matching in the middle of the panel session and enquired whether any of the panel were working on such a system. The example he gave was of someone walking past a supermarket which happened to be broadcasting information about a special offer on chocolate which the person’s phone could detect and alert them if it knew they liked chocolate. Unfortunately this useful question got buried by the moderator who used it as an opportunity to talk about how he’d heard that Tesco gather so much information on its customers that it can tell that a woman is pregnant before she knows herself. The audience laughed and moved straight onto the next question, leaving Tom’s point behind.

Which was a pity, because when I spoke to him afterwards it turned out that there is a very compelling privacy story in this passive matching idea. Most location-based systems fall down here because they require the user’s device to transmit its location which, in the age of data protection acts, can’t be used without the user’s consent, which often requires them to sign a form. Putting any sort of barrier in front of this data simply reduces its uptake and hence usefulness. The whole issue isn’t helped by the mobile operators’ insistence on charging extortionate sums for access to the APIs around this data.

The beauty of Tom’s suggestion is that there is no transmission of personal information between the vendor and the consumer. Thus it works without all that hassle of getting consumer consent because no private data actually leaves the user’s mobile device and the user is fully in control of the experience. I can imagine this working in other domains too, such as online advertising. In that case the ad server would suggest some pertinent advertisements based on the page content and the user’s browser would select which ones might be of interest to them. The browser would be configurable with user preferences or perhaps adaptive over time.
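A minimal sketch of the passive matching idea, in Python. Everything here is hypothetical (the `Offer` type, the `match_locally` function and the sample offers); the point is only that the filter runs on the device against locally held preferences, so nothing about the user goes back to the vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    vendor: str
    topic: str
    text: str


def match_locally(broadcast, interests):
    """Filter broadcast offers on the user's device.
    The interests never leave this function's caller."""
    wanted = {topic.lower() for topic in interests}
    return [offer for offer in broadcast if offer.topic.lower() in wanted]


# What the supermarket broadcasts to every passing device.
broadcast = [
    Offer("Supermarket", "chocolate", "2 for 1 on chocolate"),
    Offer("Supermarket", "nappies", "10% off nappies"),
]

# Preferences stored only on the phone.
alerts = match_locally(broadcast, {"Chocolate", "coffee"})
for offer in alerts:
    print(f"{offer.vendor}: {offer.text}")
```

The same shape would apply to the advertising case: the ad server plays the role of the broadcaster and the browser runs the equivalent of `match_locally`.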

It’s not a win-win scenario though since the advertisers and marketers don’t get their hands on that lovely profile data that they so cherish. However, it’s a big win for consumer privacy.

Apparently the next Mashup event is on TV 2.0 so maybe we’ll see some Joost or miniweb presence.

Web Application Authentication

Google just launched their Account Authentication mechanism:

Google Accounts authentication for web-based applications allows the application to access a Google service protected by a user’s Google account. To maintain a high level of security, the Authentication Proxy interface, AuthSub, enables the application to get an authentication token without ever handling the user’s account login information. Using the proxy, the user of the web application logs into their account through a Google-supplied login page and consents to grant limited access to the web application.

This comes while a post from Dare Obasanjo was fresh in my mind:

The devil is in the details when talking about authentication, authorization and Web APIs. When I first heard about the Yahoo’s proposed authentication model for Web APIs at their ETech 2006 talk entitled Building a Participation Platform: Yahoo! Web Services Past, Present, and Future, I thought it sounded similar to the model used by Passport Windows Live ID. In both approaches instead of applications prompting users for their credentials (username/password combo), the user signs in to the primary service which then returns an opaque token to the target application that identifies the user and gives the application permission to access the user’s data. However, having a fine grained access that can give applications access only specific services and can revoke permission given to specific applications seems to be richer than what I’ve seen offered by Passport Windows Live ID. This is nice but it’s to be seen how easy this will be for users to understand or for applications to manage.

Dare then goes on to define two characteristics of web application authentication that he sees as essential:

User credentials are sacred and must be protected at all costs: A security mechanism is only as strong as its weakest link. This means that it is extremely unwise to build an authentication model that has applications built on your APIs to request username/passwords or other credentials from users directly

and

Do not discriminate against any platform or any device: In todays world, end users interact with online services using a variety of devices and platforms. Each device and platform has different strengths and limitations but is important in its own right.

As far as I can tell, Google’s authentication appears to satisfy both points, provided you read Dare’s words as meaning “don’t discriminate so long as the platform or device can speak HTTP”. The Google approach is almost identical to the established Flickr authentication API, the only functional difference being that Flickr returns the login page and consent form in two steps rather than Google’s single step. Google also supports secure access using certificates which is a welcome addition.

The Google site includes this diagram of the interactions which at first glance would suggest that the web application somehow asks the Google service to contact the user directly, which of course is unlikely in the web architecture:

Authsub_sml.png

I drew my own diagram of the interactions taking place which I think clarifies the situation. The web application redirects the user to Google’s service, passing along the URI that it wants Google to send the user back to once they’ve been authenticated. In this final redirection of the user’s request Google includes a one-off token which the application can use to get a longer duration session key for use with other Google services. This is exactly the same model as Flickr’s, who call the initial token a “frob”.
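The two redirects can be sketched with nothing more than URL handling. This is an illustration of the pattern rather than Google’s or Flickr’s actual API: the endpoint `auth.example.com` is made up, and the parameter names `next`, `scope` and `token` are assumptions for the sketch. The later exchange of the one-off token for a session key is left out.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical authentication endpoint, not a real service URL.
AUTH_ENDPOINT = "https://auth.example.com/request"


def build_login_redirect(return_to, scope):
    """Step 1: the web application redirects the user to the
    authentication service, saying where to send them back."""
    return AUTH_ENDPOINT + "?" + urlencode({"next": return_to, "scope": scope})


def extract_token(callback_url):
    """Step 2: the service redirects the user back with a one-off
    token appended; the application reads it from the query string."""
    params = parse_qs(urlparse(callback_url).query)
    values = params.get("token")
    return values[0] if values else None


redirect = build_login_redirect(
    "https://app.example.org/welcome",
    "https://service.example.com/feeds/",
)
print(redirect)

# What the application might see when the user comes back.
callback = "https://app.example.org/welcome?token=abc123"
token = extract_token(callback)
print(token)
```

At no point does the application see the user’s password; all it ever handles is the return URL it supplied and the opaque token it gets back.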

I’m following development in this space very closely and I’m very encouraged to see two almost identical authentication procedures adopted by these companies. All we need now is a third and we’ve probably got enough for a de-facto standard, which with a bit of will and wrangling could become a nice little IETF draft.

Update: in the time I spent thinking about and writing this post Google appear to have pulled the API completely. Hopefully it’ll return shortly.

Sparql Clipboard

Benjamin Nowack has produced an intriguing demonstration of a live web clipboard with a twist. The twist is that the data to be copied isn’t embedded in the web page; instead there’s a reference to a Sparql server from which that data can be obtained. When you copy a snippet using your browser’s normal clipboard function, a unique identifier for the snippet and a link to the Sparql service are copied. When you subsequently paste, the code passes the identifier to the service and it’s the results of that lookup that are pasted. This really is a novel idea and would certainly work extremely well with our directory which provides a Sparql lookup for each resource listed. Even better, Benjamin’s demo uses embedded RDF to describe the copyable snippets.
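Roughly, the copy and paste halves might look like the Python sketch below. This is not Benjamin’s code: the JSON clipboard format, the function names and the example URIs are all assumptions, and a DESCRIBE query is just one plausible lookup. The only standard piece is the `query` parameter, which is how the SPARQL protocol passes a query over HTTP GET.

```python
import json
from urllib.parse import urlencode, urlparse, parse_qs


def copy_snippet(resource_uri, endpoint):
    """What lands on the clipboard: a reference to the data
    (identifier plus Sparql service), not the data itself."""
    return json.dumps({"resource": resource_uri, "endpoint": endpoint})


def paste_request_url(clipboard_text):
    """On paste, turn the reference into a lookup against the
    Sparql service named in the clipboard payload."""
    ref = json.loads(clipboard_text)
    query = "DESCRIBE <%s>" % ref["resource"]
    return ref["endpoint"] + "?" + urlencode({"query": query})


# Hypothetical snippet identifier and endpoint.
clip = copy_snippet(
    "http://example.org/people/alice",
    "http://example.org/sparql",
)
url = paste_request_url(clip)
print(url)
```

Fetching `url` and rendering the result is then up to whatever application receives the paste, which is what makes the clipboard "live": the data is looked up at paste time rather than frozen at copy time.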

Embedded RDF

The past couple of weeks have seen a burst of activity around Embedded RDF, our method for embedding a subset of RDF into web pages. Earlier this month, I presented eRDF to an audience at XTech 2006 (slides here), the aim of which was to explain more clearly the benefits of the eRDF approach. I’m quite pleased with how it went, especially because that afternoon I encountered Leigh Dodds busy building eRDF support into his XML Army Knife service. Now Sparql queries run through his site can be targeted at HTML pages containing Embedded RDF. For example, here’s a query that lists the blogs I write for. However, instead of the query operating on some separately published RDF this is using the same HTML page that you see when you visit my home page. This is quite an awesome view of the next generation web of data, where the web we know and love becomes a friendly place for machines too.
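To give a flavour of how triples can be pulled out of ordinary HTML, here is a small Python sketch. It is not an eRDF parser and should not be read as the specification: it only handles the long-standing Dublin Core head convention (a `schema.` link declaring a prefix, and dotted `meta` names using it), and the class name, page URI and sample markup are all invented for the example.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect (subject, predicate, value) triples from
    Dublin Core-style <link rel="schema.x"> and <meta name="x.y"> tags."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri
        self.prefixes = {}   # prefix -> namespace URI
        self.pending = []    # (prefix, local name, value)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rel = a.get("rel") or ""
        name = a.get("name") or ""
        if tag == "link" and rel.startswith("schema."):
            self.prefixes[rel[len("schema."):]] = a.get("href") or ""
        elif tag == "meta" and "." in name and a.get("content") is not None:
            prefix, local = name.split(".", 1)
            self.pending.append((prefix, local, a["content"]))

    def triples(self):
        """Resolve prefixes; metadata with undeclared prefixes is ignored."""
        return [
            (self.page_uri, self.prefixes[p] + local, value)
            for p, local, value in self.pending
            if p in self.prefixes
        ]


SAMPLE = """
<html><head>
<link rel="schema.dc" href="http://purl.org/dc/elements/1.1/">
<meta name="dc.title" content="My home page">
<meta name="viewport" content="width=device-width">
</head><body></body></html>
"""

parser = HeadMetadataParser("http://example.org/")
parser.feed(SAMPLE)
result = parser.triples()
print(result)
```

Once a page has been turned into triples like this, a Sparql engine can query it in exactly the same way as RDF published in a separate file.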

Then, over the weekend, Benjamin Nowack announced an eRDF parser written in PHP. I’ve had some great feedback from Benjamin over the past few weeks as he first started learning the ins and outs of eRDF, then began implementing it. I have a number of errata to incorporate into the main specification and I also want to start exploring some of the new ideas Benjamin has around using owl:sameAs to enable eRDF to embed metadata about other documents.

Semantic Web 2.0

I’m on my way to the XTech conference in Amsterdam. The schedule looks to be packed full of fascinating topics around open data, semantic web and web 2.0 - all areas of extreme importance to me. It’s good to see some strong RDF work being showcased such as Ingenta’s huge data store and the BBC’s new programme catalogue.

Some of the presentations I hope to see include:

What an amazing line up of speakers! I’m honoured to have been given the opportunity to present the work I’ve been doing around embedding RDF encoded metadata into web pages using idiomatic HTML. It’s a solution that works now and is backwards compatible with long-standing Dublin Core conventions for adding metadata to pages. It also plays nicely with upcoming technologies such as GRDDL and co-exists happily with microformats in the same page. I’m also chairing sessions on Search engines for Semantic Web knowledge, Building the Semantic Web at NASA, The End of the Open Internet?: Network Service and Security in Web 2.0 and Semantics Through the Tag, the last by Dave Beckett who gave me a sneak preview at the Jena User Conference last week. He’s doing some innovative stuff around the semantics of tagging. Cool to see a big silicon valley company taking on RDF and the Semantic Web.

But, before that, I’ll be taking part in the Ajax lightning demo session tonight. I’m going to show our Library 2.0 demonstrator which illustrates how applications can be simply composed of diverse web services. I rarely link to it because it makes a lot more sense with the narrative that I use when demoing it. So, if you’re at XTech this year, I encourage you to come along tonight to see it being demoed. If you can’t, then have a play anyway. The best search terms are ones that result in books with ISBNs, which generally means words invented in the past twenty years: javascript, google, george w. bush