Nodalities

From Semantic Web to Web of Data

Author Archive

Linked Data and the Public Domain

We love data at Talis and we want as much of it to be freely reusable as possible. In fact, because we wanted to see even more reusable data we recently launched the Talis Connected Commons offering completely free hosting of public domain data. We believe that dedicating data to the public domain is the best way to ensure that data is universally reusable and remixable. When data is public domain it means that it can be reused automatically without needing to check terms and conditions or track the source of every statement to provide attribution. These kinds of things act as friction to reuse, wasting energy that could be better spent creating inspiring things.

We also firmly believe that, in the future, there will be a significant role for other forms of data licensing, including commercial access. We will support those efforts too when the time comes but today the Linked Data web needs more and better data that is freely accessible.

Licensing vs Waivers

You are probably familiar with the process of licensing a creative work, most likely through the great job that Creative Commons have been doing in recent years. The concept of waivers is less well known but highly relevant to the reuse of Linked Data.

Whenever you create something you have automatic rights over it granted to you. The best known of these rights is copyright, which gives you the exclusive right to make copies of your creative work. There are many other rights which can be held over intellectual property such as design rights, trade marks, registered designs, performers’ rights, trade secrets, database rights, publication rights and many more.

Licensing is the process of granting others limited use of rights you possess. For example, when you license your copyright you are granting specific people a limited right to make copies without having to ask you first. Licensing of one right does not affect your possession of the others. For example you could grant the right to copy your work but retain the right to perform it. Creative Commons licenses are mostly concerned with copyright, but they do not usually deal with the other rights such as database rights or trade secrets.

Waivers, on the other hand, are a voluntary relinquishment of a right. If you waive your exclusive copyright over a work then you are explicitly allowing other people to copy it and you will have no claim over their use of it in that way. It gives users of your work huge freedom and confidence that they will not be pursued for license fees in the future.

The Licensing Problem

In general factual data does not convey any copyrights, but it may be subject to other rights such as trade mark or, in many jurisdictions, database right. Because factual data is not usually subject to copyright, the standard Creative Commons licenses are not applicable: you can’t grant the exclusive right to copy the facts if that right isn’t yours to give. It also means you cannot add conditions such as share-alike.

There isn’t a Creative Commons license for every possible right and there probably can’t be because of the huge variation in rights granted in different jurisdictions around the world. Also, when we start to look at licensing compilations of data we find that the situation becomes complex because you have to consider both the database and its contents separately. For example a database of articles would be subject to database right over the whole collection and individual copyrights for each article, quite possibly belonging to many different owners. The Open Data Commons has addressed this particular example with its Open Database License and Database Contents License (based on work originally donated by Talis). If a standard license doesn’t exist then you need to hire lawyers and write one for yourself – a potentially huge cost.

Our collective goal for a successful Linked Data web has to be to protect consumers of the data: the people who are remixing many different sources of data. Our intentions may be very honourable, but people need certainty if they are to build enduring value on data. Creative Commons licenses are irrevocable so even if you lose control over your work through some misfortune, the people reusing it will be protected forever. Imagine this scenario: you allow people to use data you have collated but your company goes bankrupt and the rights to the data collection are sold by the liquidators. If you hadn’t licensed your rights explicitly then every one of your users could be liable to be sued by the new rights holder!

This is where waivers of rights can help. By explicitly waiving your rights over your data you are giving your users the best guarantee of safety that you can. Even if you lost control of the data collection, subsequent owners could not pursue your users because the rights you held have already been waived.

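There are two waivers of rights that can be applied to datasets:

  • The Open Data Commons Public Domain Dedication and Licence (PDDL).
  • The Creative Commons CC0 waiver.
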
Both of these waivers can be used for data intended for submission to the Talis Connected Commons.

Community Norms

When you apply a waiver like CC0 you are relinquishing all your rights over the work to the fullest extent possible under the law. That means that you cannot force people to attribute you or stop them from making commercial use of your work.

The preferred approach is to attach a set of community norms to the work. These are like a code of conduct for use of the work and are usually self-policing. They are not legally enforceable but form part of the ethical or professional requirements for participating in a community. The best known example of community norms is the set of citation standards used in the academic community. Citing pre-existing work is not legally enforceable but those who abuse the norms can find themselves excluded from the academic community.

The Open Data Commons has published a set of attribution and share-alike norms which ask that users of the data:

  • Share work derived from the data.
  • Give credit to the original data publisher.
  • Point others at the source of the data.
  • Publish in open formats.
  • Avoid using digital rights management.

How to Declare Your Waiver

To declare your waiver in a machine readable way, you should first create a voID description of your dataset. VoID, or Vocabulary of Interlinked Datasets, is a vocabulary designed to describe key attributes of your dataset. We created a waiver RDF vocabulary that can be used with voID to declare any waiver of rights and the community norms around a dataset.

In this example we describe a dataset using the void:Dataset class and provide it with a dc:title as a minimal human readable description. You should add other descriptive properties as necessary (some suggestions can be found in the voID guide).

We then use the wv:waiver property (defined in the waiver RDF vocabulary) to link the dataset to the Open Data Commons PDDL waiver. We use the wv:declaration property to include a human-readable declaration of the waiver. This is purely informational, but can immediately be used by a person examining the voID description. Finally we use the wv:norms property to link the dataset to the community norms we suggest for it, in this case the ODC Attribution and Share-alike norms.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:wv="http://vocab.org/waiver/terms/"
  xmlns:void="http://rdfs.org/ns/void#">
  <void:Dataset rdf:about="{{uri of your dataset}}">
    <dc:title>{{name of dataset}}</dc:title>
    <wv:waiver rdf:resource="http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/"/>
    <wv:norms rdf:resource="http://www.opendatacommons.org/norms/odc-by-sa/" />
    <wv:declaration>
      To the extent possible under law, {{your name or organisation}} has waived all
      copyright and related or neighboring rights to {{name of dataset}}
    </wv:declaration>
  </void:Dataset>
</rdf:RDF>

Alternatively if you were to choose the CC0 waiver without any particular norms then you should use the following RDF:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:wv="http://vocab.org/waiver/terms/"
  xmlns:void="http://rdfs.org/ns/void#">
  <void:Dataset rdf:about="{{uri of your dataset}}">
    <dc:title>{{name of dataset}}</dc:title>
    <wv:waiver rdf:resource="http://creativecommons.org/publicdomain/zero/1.0/"/>
    <wv:declaration>
      To the extent possible under law, {{your name or organisation}} has waived all
      copyright and related or neighboring rights to {{name of dataset}}
    </wv:declaration>
  </void:Dataset>
</rdf:RDF>

These examples show that it is very simple to declare your waiver. However, before you do so be sure to read carefully what rights you are irrevocably giving up. For example you would most likely be waiving your publicity and privacy rights, so if your image is included in the dataset you could not later complain that someone is using it in a way you do not approve of. If you are worried about how your work will be used, if you want to legally require attribution, or if you don’t want people to make money from your work, then you should not use a waiver and instead seek legal advice on the creation of a data license specific to your needs.
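For anyone consuming these descriptions, here is a minimal sketch of reading the waiver back out of a voID document with nothing but the Python standard library. It assumes the RDF/XML has exactly the shape of the examples above (a void:Dataset element with child properties); a real consumer would more likely use a proper RDF parser. The function name read_waivers and the example.org values are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes as used in the RDF/XML examples above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/terms/",
    "wv": "http://vocab.org/waiver/terms/",
    "void": "http://rdfs.org/ns/void#",
}
RDF_ABOUT = "{" + NS["rdf"] + "}about"
RDF_RESOURCE = "{" + NS["rdf"] + "}resource"


def _resource(element):
    """Return the rdf:resource URI of an element, or None if it is absent."""
    return element.get(RDF_RESOURCE) if element is not None else None


def read_waivers(rdf_xml):
    """Summarise the waiver information for each void:Dataset element.

    Hypothetical helper: it only understands the simple nesting used in
    the examples, not arbitrary RDF/XML serialisations.
    """
    root = ET.fromstring(rdf_xml)
    summaries = []
    for dataset in root.findall("void:Dataset", NS):
        declaration = dataset.findtext("wv:declaration", default="", namespaces=NS)
        summaries.append({
            "dataset": dataset.get(RDF_ABOUT),
            "title": dataset.findtext("dc:title", default=None, namespaces=NS),
            "waiver": _resource(dataset.find("wv:waiver", NS)),
            "norms": _resource(dataset.find("wv:norms", NS)),
            # Collapse the indentation inside the declaration to single spaces.
            "declaration": " ".join(declaration.split()),
        })
    return summaries


# The CC0 example with made-up placeholder values filled in.
EXAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:wv="http://vocab.org/waiver/terms/"
  xmlns:void="http://rdfs.org/ns/void#">
  <void:Dataset rdf:about="http://example.org/dataset">
    <dc:title>Example Dataset</dc:title>
    <wv:waiver rdf:resource="http://creativecommons.org/publicdomain/zero/1.0/"/>
    <wv:declaration>
      To the extent possible under law, Example Org has waived all
      copyright and related or neighboring rights to Example Dataset
    </wv:declaration>
  </void:Dataset>
</rdf:RDF>
"""

summary = read_waivers(EXAMPLE)[0]
print(summary["title"], "->", summary["waiver"])
```

Because the example declares no norms, the "norms" entry comes back as None, which is a handy way for a consumer to tell the two styles of declaration apart.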

Growing the Web of Data with Data Incubator

At Talis we’re huge fans of Linked Data, especially when it’s freely available for reuse too. However, we also realise that not everyone has been smitten by the Linked Data bug yet so we’re always thinking about new ways to help others use, publish and discover the benefits of connecting their data together.

Recently we were wondering how we could help organise the skill and expertise of people who love Linked Data to show data publishers how their data could be even more useful and effective. As the Linking Open Data project has shown, actions speak louder than words so we wanted to do something with practical and visible results.

One problem we face is that until it is available in open and reusable formats it’s not possible to show data owners the power locked up in their own data. Conversely it is hard for the data owner to justify investment in opening up their data without concrete demonstrations of that power. A classic deadlock situation! The goal of our new project is to break this deadlock. We plan to do this by organising people around popular datasets to create mappings to RDF, write conversion code and openly publish the resulting data. The result will be a huge reduction in the investment needed by the data owner: they can simply adapt the work and emit the Linked Data themselves.

We call our new project the Data Incubator and if you love Linked Data then we encourage you to join in and help grow the web of data. Although this project is entirely independent of Talis, we are supporting it through the Talis Connected Commons scheme, providing free hosting and services for public domain data.

Already we have started projects to convert the Open Library dataset including much-loved books such as The Hobbit and to convert journal metadata provided by CrossRef, Highwire and the National Library of Medicine. Many more projects are being incubated and we are discussing how we create a repeatable process for contacting and encouraging data owners to take part.

Join the Data Incubator mailing list and get involved.

SWIG-UK

Tomorrow a group of us are off to visit Bristol for the SWIG-UK meetup that HP Labs are kindly hosting. Leigh is giving a talk on using the Talis Platform to publish data and I am running a lightning talks session which should be fun and, hopefully, informative. This time there is a single track with some top quality content which makes things a lot simpler. It should be a good day with lots of time to meet people and catch up with the vast amount of things going on in the Semantic Web space.

Welcoming Leigh Dodds to Talis

I’m proud and excited to announce that from the 1st September Leigh Dodds will be joining Talis as our Platform Programme Manager. His background, experience and skill make him an ideal candidate to develop and advance our Platform ideas. Over the years Leigh has made many contributions to the development of the Semantic Web with a particular emphasis on the Web and REST. He has written extensively on these subjects for O’Reilly Media and on his blog. In fact Leigh’s writings on REST were very influential in the design of our Platform APIs. I first collaborated with him back in 2000 as part of the RSS 1.0 working group when RDF was barely a year old. Shortly after that he developed the FOAF-a-matic which has probably done more to advance adoption of FOAF and RDF than any other application. Most people’s first introduction to FOAF has been via software created by Leigh. He has also been a regular face at XTech and other conferences presenting on topics such as SPARQL and connecting social content. We’re all eagerly anticipating learning from and working with Leigh. Welcome aboard!

Lessons for Ontology Writers

I’m in a session called Taming the Open World, being run by Tim Swanson of Semantic Arts. I’m particularly interested in understanding how we can develop open world applications and issue 2 of Nodalities contains an article by Nadeem Shabir on just this issue. Since I have power and wifi, which are both in scarce supply, I thought I’d take the opportunity to liveblog the session. Looks like it’s going to be a contrast to the Metaweb/Freebase tutorial earlier since we’re straight into OWL ontologies, running through the notation he’s going to be using. There are no standard notations for diagramming ontologies which is pretty surprising considering the wealth of research activity in this area. It’s looking rather interesting… must see if I can find some examples on the web.

First example is of two classes: Contractor and Employee and a single instance “Joe” who is an Employee. A reasoner will actually create two interpretations of these three facts: one where Joe is an Employee but not a Contractor and one where Joe is both. As more assertions are added, exponentially more interpretations are generated by the reasoner for all possible combinations. The reasoner is looking at all possible solutions and will assume that any fact is potentially true unless it has been explicitly told otherwise. This is the open world assumption – anything unknown could be true or false and a reasoner has to consider both possibilities.

A fact is provable if it is true in every possible interpretation. It is satisfiable if it is true in at least one interpretation. These are the two main uses of a reasoner: to prove a statement or to discover if a statement is possible. However the huge number of possible interpretations massively complicates the problem. To make reasoning problems tractable we have to clean up the open world by removing facts that cannot possibly be true.
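To convince myself of the Joe example, here is a toy sketch that brute-forces every interpretation of a handful of class-membership facts. It is only an illustration of the definitions of provable and satisfiable, not how a real reasoner is implemented; the function names and the optional disjoint argument are my own assumptions.

```python
from itertools import product


def interpretations(individuals, classes, asserted, disjoint=()):
    """Enumerate every membership assignment consistent with what we know.

    Each interpretation maps (individual, class) to True or False.
    Anything not asserted is left open, which is the open world assumption.
    """
    pairs = [(i, c) for i in individuals for c in classes]
    models = []
    for values in product([True, False], repeat=len(pairs)):
        model = dict(zip(pairs, values))
        # Asserted facts must hold in every interpretation.
        if not all(model[fact] for fact in asserted):
            continue
        # Classes declared disjoint may not share a member.
        if any(model[(i, a)] and model[(i, b)]
               for i in individuals for a, b in disjoint):
            continue
        models.append(model)
    return models


def provable(models, fact):
    """A fact is provable if it is true in every interpretation."""
    return all(model[fact] for model in models)


def satisfiable(models, fact):
    """A fact is satisfiable if it is true in at least one interpretation."""
    return any(model[fact] for model in models)


facts = [("Joe", "Employee")]
open_world = interpretations(["Joe"], ["Employee", "Contractor"], facts)
print(len(open_world))                                  # 2
print(provable(open_world, ("Joe", "Employee")))        # True
print(satisfiable(open_world, ("Joe", "Contractor")))   # True

# Ruling out an impossible combination removes interpretations.
pruned = interpretations(["Joe"], ["Employee", "Contractor"], facts,
                         disjoint=[("Employee", "Contractor")])
print(len(pruned))                                      # 1
```

The enumeration doubles with every extra unknown membership, which is the exponential growth described above, and every constraint that removes impossible facts cuts that space down.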

Some techniques that can be used when writing ontologies:

  • Class disjointness – saying that two classes have no members in common such as a Living Thing and a Scheduled Event. This is especially useful with deep ontologies where the root classes are declared disjoint. This disjointness then cascades down the ontology tree to the more specific classes at the bottom eliminating many possible interpretations.
  • Domain and range – this makes the disjointness more effective by adding more information about instance types.
  • Individual differentness – OWL provides a differentFrom predicate but you have to state that every individual is different from every other, one pair at a time. Some of this can be inferred using functional properties, so if two individuals have different values for a functional property then they can be inferred to be different individuals. Also we can use inverse functional properties but this is not possible with datatype properties, e.g. social security numbers. A workaround is to create a URI scheme for the value.
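The functional property trick in the last item can be sketched in a few lines. This is a simplified illustration under my own assumptions: each individual has exactly one known value for a single functional property, and the ex: identifiers and birth years are invented. Note that sharing a value proves nothing either way; only differing values let us conclude the individuals are different.

```python
from itertools import combinations


def infer_different(values):
    """Infer differentFrom pairs from the values of one functional property.

    `values` maps each individual to its single value for the property.
    Two individuals with different values cannot be the same individual.
    """
    different = set()
    for (a, value_a), (b, value_b) in combinations(sorted(values.items()), 2):
        if value_a != value_b:
            different.add((a, b))
    return different


# Hypothetical data: birth year treated as a functional property.
birth_year = {"ex:joe": 1970, "ex:joseph": 1970, "ex:mary": 1982}
different_pairs = infer_different(birth_year)
print(sorted(different_pairs))
```

Here ex:joe and ex:joseph share a birth year, so nothing is inferred about them, while ex:mary is inferred to be different from both without anyone having to write those statements out by hand.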

Some more advanced techniques include stating that an individual does not have a particular property. To do this you have to create a class for the individual resource and define that class as the complement of things that have the property in question. You have to do that for every individual, a massive explosion of triples, but a corresponding reduction in possible interpretations.

In the discussion after the session a few reasoner implementors were discussing some of these ideas. I learnt that a tableau reasoner will take all the URIs in a graph and combine them all to create all possible triples and then start eliminating them using the OWL constraints! I wonder what implications that has for Linked Data’s assignment of URIs to everything?

Working in the open world enforces a different kind of discipline in data modelling. You need to define what is not true as well as what is true. It’s best to work at the highest level possible which ends up being a supporting case for upper ontologies.

The Best is Yet to Come

The next generation of the web, this Semantic Web, is in its infancy but already we’re seeing some fantastic glimpses of its potential.

We saw some of that potential recently at DrupalCon 2008 where Dries Buytaert used his keynote to share a vision of the future… one that is built on RDF (read more on our sister blog).

Imagine every Drupal installation as a Linked Data source. Wow!

This would be a massive step towards the Semantic Web’s maturation and I hope the Drupal people can pull it off. My advice would be to remember that these are still early days and to tackle it with pragmatic baby steps. Just like the early days of the Web there’ll be plenty of stop energy trying to drag you back, but hold your nerve and see it through.

Adoption of the technologies by significant projects like Drupal really shows that we’re entering a new generation of the Web, one that is much more data-centric. The few billion triples online right now are just a drop in the ocean of what we’ll need for a useful Semantic Web so this news from Drupal is hugely important.

It feels like the Web did back when being able to launch a website on Geocities was a liberating experience. There wasn’t much to link to back then either but fifteen years later the Web is unrecognisable in terms of its diversity and effect on the World. I’m willing to bet that in another decade it will have changed again way beyond our expectations and predictions today.

The best is yet to come.

Web 3G: The Third Generation of the Web

I’m at the BlogTalk conference in Cork where I’m meeting an eclectic mix of bloggers, technologists and “Interesting People” gathering to share a common interest in the social web. There’s also a good representation from the Semantic Web folks including a group from DERI Galway.

Paul gave a talk on the potential of the web of relationships, alluding to the possibilities we’re seeing as more things become connected. It’s not just about connecting pages together with hyperlinks: using Semantic Web technologies we can also connect people with the things they produce, need and use. Tomorrow Nova Spivack is giving a talk on semantic social software, hopefully giving us a new view of his company’s application Twine.

Twine and our own Talis Engage are the first in a new breed of applications founded on Semantic Web technologies that expose large parts of their data for reuse by other similar applications. We were discussing all this over dinner tonight and I suggested that a good label for this would be Web 3G since these applications were part of what we were calling the third generation of the Web.

Web 3G is what happens when you fuse the social participation of Web 2.0 with the decentralized structured information of the Semantic Web. The result is a smarter way of organising information in a network of interwoven semantic links and content, enhanced with feedback from usage and participation. We’re coming up to the end of two decades of the Web, the first of which was spent seeding the bare essentials of the web of documents. The second decade saw widespread broadband adoption enable mass participation and creation of content by millions. The next decade is going to radically change how we find, create, use and relate to that information.

[Figure: Three generations of the Web]

The Web right now is built from the generic hyperlink, which says nothing more than “look over here”. But even this weak semantic was enough to enable Google’s Pagerank to organise and score the Web. Imagine how much more powerful the hyperlink could be if it were possible to express sentiment or meaning in the link. Even if that were limited to positive or negative endorsement of the target of the link, the value to the relevance ranking of search engines and applications would be huge. However, the possibilities for expressing the intention of a link between two pages are endless. For example, it could be possible for writers to say whether they support or reject the views expressed in the target of the link, or whether they are linking to conflicting evidence or alternative versions of the same information. These simple expressions of intention could provide an entirely new dimension of metadata. The links between things are fundamental to the existence of the Web and the value of understanding why things are related is huge.
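As a rough sketch of what typed links might buy us, here is a toy scorer that treats each link as a (source, relation, target) triple and nets positive against negative endorsements. The relation names, the weights and the example.org URLs are all hypothetical; this is not PageRank or any real ranking algorithm, just an illustration of how much more a typed link says than "look over here".

```python
from collections import defaultdict

# Hypothetical link types and the weight each contributes to its target.
LINK_WEIGHTS = {"supports": 1, "rejects": -1, "mentions": 0}


def endorsement_scores(links):
    """Sum the endorsement weight of every typed link pointing at each target."""
    scores = defaultdict(int)
    for source, relation, target in links:
        # Unknown relations behave like today's untyped hyperlink: no sentiment.
        scores[target] += LINK_WEIGHTS.get(relation, 0)
    return dict(scores)


links = [
    ("http://example.org/a", "supports", "http://example.org/report"),
    ("http://example.org/b", "supports", "http://example.org/report"),
    ("http://example.org/c", "rejects", "http://example.org/report"),
    ("http://example.org/a", "mentions", "http://example.org/other"),
]
scores = endorsement_scores(links)
print(scores)
```

An untyped link count would score the report at three inbound links; with typed links we can see that one of those three is a rejection.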

Web 3G is an evolution of Web 2.0 enhancing it through the appropriate use of light semantics. Links between things become more clearly typed, embedded data on pages becomes more easily understood by machines, all the while retaining the ability for people to connect and link and critique the quality and relevance of the data. It becomes the semantic graph, open to participation by everyone without having to ask anyone’s permission. It is not Artificial Intelligence, there are no formal ontologies or logic reasoning, but some of the tools and techniques of AI are needed: neural networks, classifiers, heuristics, Bayesian networks and statistical analysis.

A whole new generation of applications are emerging that feature huge levels of interconnections and we hope to enable many of those to be built using the Talis Platform. Many of these connections will be internal to the application but by exposing raw data, in the ways suggested by the Linking Open Data project, every application can link to and reuse information managed by every other application. This is a step beyond data portability: rather than copying data from one application to another the norm will be to reuse data in situ. That way the data never gets out of date because it’s shared and we can use the best application to manage each piece of our data, depending on our situation. This is what Tim Berners-Lee meant by the Giant Global Graph: a world-wide network of links with meaning.

I like this generational view of the evolution of the Web. It makes it clear that there is no big bang switchover from one type of application to another. Even now we can see many Web applications being created and used that aren’t socially enabled, but they look hollow when compared to their Web 2.0 peers. The same is likely to be true of the third decade, where we’ll see new applications being created that can’t talk to their peers and they too will feel shallow and unexciting when compared to their Web 3G counterparts. This isn’t an increment to Web 2.0, it’s a radical step forward!

Semantic Spring

It’s going to be a busy spring in Semantic Web land. There are five important conferences that are scheduled in a two month period.

First up is WWW2008 in Beijing which runs from the 21st to the 25th of April. This is the pre-eminent Web conference packed full of presentations, workshops and tutorials covering everything webby. The Semantic Web traditionally has a number of tracks running through the conference programme. This year we’re co-chairing a session called Linked Data on the Web with Tim Berners-Lee, Chris Bizer and Kingsley Idehen. Rob submitted a paper to the workshop on representing MARC in RDF which I recommend to anyone interested in the future of library data. I’ve seen a couple of the other submissions and it looks like it’s going to be a lot of fun.

Then there’s a week off until XTech 2008 which runs from the 6th to the 9th of May in Dublin. XTech is another webby conference that often has a Semantic Web streak through it. Many of the people involved in organising and helping to run it are SemWeb sympathisers. A couple of us at Talis have submitted talk proposals although we won’t find out if we’ve been selected until the end of this month. XTech clashes completely with JavaOne in San Francisco which also has a little Semantic Web session courtesy of Henry Story and friends.

Another break of a week before Semantic Technology in San Jose from 19th to 22nd May. This is the fourth year for this conference whose focus is coming from the world of semantics rather than the Web. Last year saw all kinds of attendees from the CIA and NASA to Lockheed and Citigroup. This is the conference to be at if you want to see commercial applications of the SemWeb.

Again, a week off until we reach ESWC 2008 in Tenerife from the 1st to the 5th June. This is a hardcore SemWeb conference with a strong research angle. Once again we’re involved in organising a workshop: Scripting for the Semantic Web, which aims to cover the use of the Semantic Web with all kinds of scripting languages such as PHP, Perl, Python, Ruby and JavaScript. I’m also looking forward to the first workshop on Collective Intelligence & the Semantic Web which promises to be a very hot topic in coming years.

One more week off, before we reach the new kid on the block: Linked Data Planet being held in New York on the 17th and 18th of June. Details are sketchy at the moment apart from the keynote from Tim Berners-Lee, but Linked Data has a lot of momentum and there is a lot of potential for some cool demonstrations.

As you might expect a contingent of Talisians are attending each conference, but none of us are going to attempt all five! Personally I’m planning to be at XTech and Semantic Technology and possibly at ESWC and Linked Data Planet. I know for sure that I’m glad of the holiday I have booked for the end of June. I think I’m going to need it.

Rules for a Realistic Semantic Web?

In one of my linkblog entries earlier this week I made the following claim:

IMHO OWL isn’t part of the petatriple future of the semweb. Nor is SPARQL…

A recent post by Chimezie touched on this too:

I’ve been spending quite a bit of time on FuXi mainly because I am interested in empirical evidence which supports a school of thought which claims that Description Logic based inference (Tableaux-based inference) will never scale as well the Logic Programming equivalent – at least for certain expressive fragments of Description Logic (I say expressive because even given the things you cannot express in this subset of OWL-DL there is much more in Horn Normal Form (and Datalog) that you cannot express even in the underlying DL for OWL 1.1). The genesis of this is a paper I read, which lays out the theory, but there was no practice to support the claims at the time (at least that I knew of). If you are interested in the details, the paper is “Description Logic Programs: Combining Logic Programs with Description Logic” and written by many people who are working in the Rule Interchange Format Working Group.

It is not light reading, but is complementary to some of Bijan’s recent posts about DL-safe rules and SWRL.

A follow-up is a paper called “A Realistic Architecture for the Semantic Web” which builds on the DLP paper and makes claims that the current OWL (Description Logic-based) Semantic Web inference stack is problematic and should instead be stacked ontop of Logic Programming since Logic Programming algorithm has a much richer and pervasively deployed history (all modern relational databases, prolog, etc..)

I’m not a DL expert but, based on my research, it seems to me that DL based inference for OWL isn’t going to deliver for the semantic web any time soon. Of course, by this I mean it’s not going to scale in such a way that makes real-time inferencing over petatriples viable. Besides, OWL and its variations are still very limited in their expressivity and not particularly useful for many classes of applications. Maybe rule systems can deliver instead?

Mashup on Location

I spent the evening at the Mashup* event in London. Tonight’s theme was location and specifically location-based services. We were treated to a long sales pitch from TeleAtlas as the opener which was followed by a rather lacklustre panel session. This was a shame because the others have been pretty good (such as the one on identity).

The evening was saved, in my opinion, by Tom Heinersdorff who posed a question about passive data matching in the middle of the panel session and enquired whether any of the panel were working on such a system. The example he gave was of someone walking past a supermarket which happened to be broadcasting information about a special offer on chocolate which the person’s phone could detect and alert them if it knew they liked chocolate. Unfortunately this useful question got buried by the moderator who used it as an opportunity to talk about how he’d heard that Tesco gather so much information on its customers that it can tell that a woman is pregnant before she knows herself. The audience laughed and moved straight onto the next question, leaving Tom’s point behind.

Which was a pity, because when I spoke to him afterwards it turned out that there is a very compelling privacy story in this passive matching idea. Most location based systems fall down here because they require the user’s device to transmit its location which, in the age of data protection acts, can’t be used without the user’s consent, which often requires them to sign a form. Putting any sort of barrier in front of this data simply reduces its uptake and hence usefulness. The whole issue isn’t helped by the mobile operators’ insistence on charging extortionate sums for access to the APIs around this data.

The beauty of Tom’s suggestion is that there is no transmission of personal information between the vendor and the consumer. Thus it works without all that hassle of getting consumer consent because no private data actually leaves the user’s mobile device and the user is fully in control of the experience. I can imagine this working in other domains too, such as online advertising. In that case the ad server would suggest some pertinent advertisements based on the page content and the user’s browser would select which ones might be of interest to them. The browser would be configurable with user preferences or perhaps adaptive over time.
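Here is how I picture the passive matching idea in code. It is a minimal sketch under my own assumptions rather than anything Tom described in detail: the Offer type, its fields and the match_offers function are all hypothetical. The point is that the vendor broadcasts the same offers to everyone and the filtering happens entirely on the device, so the preferences are never sent anywhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One broadcast message; the vendor sends the same offers to everyone."""
    vendor: str
    category: str
    message: str


def match_offers(broadcast, preferences):
    """Filter broadcast offers against preferences held on the device.

    Runs locally on the phone (or in the browser): the preferences are only
    read here and never transmitted back to the vendor.
    """
    return [offer for offer in broadcast if offer.category in preferences]


# What the supermarket broadcasts to every passing device.
broadcast = [
    Offer("Supermarket", "chocolate", "Special offer on chocolate today"),
    Offer("Supermarket", "wine", "Two bottles for the price of one"),
]

# What this particular user's device knows about them.
preferences = {"chocolate", "coffee"}

alerts = match_offers(broadcast, preferences)
for offer in alerts:
    print(offer.vendor + ": " + offer.message)
```

The same shape would work for the advertising case: the ad server plays the part of the broadcaster and the browser plays the part of the device.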

It’s not a win-win scenario though since the advertisers and marketers don’t get their hands on that lovely profile data that they so cherish. However, it’s a big win for consumer privacy.

Apparently the next Mashup event is on TV 2.0 so maybe we’ll see some Joost or miniweb presence.