Talis Research

Talis Research Blog
Enriching Linked Data in Brazil – Interview with SSSW2011 student Kelli de Faria Cordeiro

As part of our support for SSSW2011, the 8th Semantic Web Summer School, taking place this week in Spain, Talis is sponsoring a student to attend the school, who through lack of funds would not otherwise be able to attend. After a significant number of applications for the funded place and a challenging selection process, the grant was awarded to Kelli de Faria Cordeiro. Kelli is a PhD student in the Knowledge Engineering Group (GRECO) of the Computer Science Department / Institute of Mathematics at the Federal University of Rio de Janeiro.

Kelli de Faria Cordeiro

Just before we arrived in Spain for the Summer School, I caught up with Kelli to find out about her research and her hopes for the Summer School.

Tom: In simple terms, what is the focus of your research?

Kelli: My research is centred on Advanced Conceptual Modelling of Complex Information Systems, focusing on Linked Open Data as a Complex Information System. The main issue is how to semantically enrich Linked Data keeping the flexibility of Linked Data Principles.

Tom: How did you come to be doing research in this area? What led you here?

Kelli: I have been studying semantic web for the past 5 years, and I have been an enthusiast on data integration and analysis during my whole professional life. The Linked Data Principles as an approach to integrate heterogeneous and dynamic data called my attention. The possibilities to create a data analysis environment with the data available on the web took my thoughts since I started to study it.

Tom: Could you explain a bit about the context in which your research is taking place?

Kelli: Currently, I am working on the LinkedDataBR Project – Expose, Share and Connection of Open Data Resources on the Web, supported by the National Education and Research Network (RNP) of Brazil. Central functionalities to be included are data cleaning, transformation, association, annotation and referencing to terminology mechanisms. At this project, we have been facing critical issues about the role of ontologies on Linked Open Data Publish Process, and I hope to address some of these issues with my research work. Moreover, there has been a Brazilian governmental movement to open data, leading to the development of applied projects to support it with tools and guidelines with a broad scope. The consequence is the need for human resources qualification, and I also expect to meet this demand with my studies on the subject.

Tom: What are you most looking forward to about the Semantic Web Summer School?

Kelli: I want to improve my learning on Ontology Engineering, Knowledge Representation and Linked Open Data by developing my technical and social skills, and by having a great time with tutors and other students. Besides the contributions to my PhD Thesis, one of the main objectives of attending the school is to learn as much as I can and then share the knowledge with my colleagues in Brazil with whom I am working on Linked Data Research Projects.

Tom: In what ways do you expect the Summer School to benefit your research?

Kelli: The development of my research can benefit as I learn the current key topics in the field. It can also be improved with discussion and validation of the ideas and approaches to solve my research problems with tutors, invited speakers and other students. I look forward to discussing the description of aggregate data as well as analytical processing over Linked Data.

‘Follow Your Nose’ Across the Globe

Imagine you’ve just arrived in an unfamiliar place, perhaps on a business trip (or recently beamed down from the Starship Enterprise). One of the first things you’ll probably want to do is find out what things are nearby. Google Maps provides a great “search nearby” function (try entering just a * to get everything), but this is geared more towards businesses, and the data isn’t exactly open, making it hard to reuse in other applications. We wanted to try something similar, using the growing range of liberally-licensed Linked Data sets with a geographic component. Here’s what we did…

We started with the Geonames data set (helpfully converted to RDF by Ian Davis and loaded into a Talis Platform store), all the geo-coded entries from DBpedia, and data about schools from data.gov.uk — we’ll call all the things described in these data sets ‘points of interest’. In truth the geocoded entries in DBpedia are frequently not points but regions/areas, but that’s an issue for another day.

Next, we implemented some cheap and cheerful secret sauce to identify, for each point of interest, all the other PoIs that are ‘nearby’. For each of these pairwise relationships we create two new RDF triples, one stating ‘x near y’, the other ‘y near x’. For example, our algorithm may reveal that Birkby Junior School is near to Kirklees Incinerator, in which case we simply create the RDF triples:

<http://education.data.gov.uk/id/school/107626> <http://open.vocab.org/terms/near> <http://dbpedia.org/resource/Kirklees_Incinerator> .
<http://dbpedia.org/resource/Kirklees_Incinerator> <http://open.vocab.org/terms/near> <http://education.data.gov.uk/id/school/107626> .

We do the same for all pair-wise relations in each cluster of ‘nearby’ items.
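As a rough illustration of that last step, here is a minimal Python sketch that emits both directions of the ‘near’ relation for every pair in a cluster. It is not the MapReduce code we actually ran; the function name near_triples and the idea of passing a cluster as a plain list of URIs are assumptions made for the example.

```python
from itertools import permutations

# The predicate used in the example triples above.
NEAR = "http://open.vocab.org/terms/near"


def near_triples(cluster):
    """Yield one N-Triples line per ordered pair of distinct URIs.

    `cluster` is assumed to be an iterable of URI strings that have
    already been judged to be 'nearby' one another.
    """
    for subject, obj in permutations(sorted(set(cluster)), 2):
        yield f"<{subject}> <{NEAR}> <{obj}> ."


cluster = [
    "http://education.data.gov.uk/id/school/107626",
    "http://dbpedia.org/resource/Kirklees_Incinerator",
]
lines = list(near_triples(cluster))
for line in lines:
    print(line)
```

A cluster of n points of interest yields n × (n − 1) triples, which goes some way to explaining how quickly the link count grows.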

The process runs as a series of MapReduce jobs on top of Hadoop, creating a new data set of more than 700 million triples, all of which are links within and between data sets, i.e. that figure doesn’t include any literals or rdf:type statements describing the points in the original data sets. What we do create along the way are URIs for all the intersection points in the lat/long grid (down to a specific level of granularity). These are linked to the URIs of nearby things from the input data sets, and to each other. We haven’t done the sums yet to identify the proportion of inter- versus intra-data set links, though judicious use of grep and wc -l should address that.
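For anyone who prefers a script to grep and wc -l, the sketch below shows one way the inter- versus intra-data set sums might be done over an N-Triples dump. It assumes, purely for illustration, that two URIs belong to the same data set when they share a host name, and that every line is a URI-to-URI link (as is the case for the links described here). The dbpedia.org/resource/A and /B URIs are hypothetical placeholders.

```python
from collections import Counter
from urllib.parse import urlparse


def link_kind(ntriples_line):
    """Classify a link as 'intra' (same host) or 'inter' (different hosts).

    Assumes subject and object are both URIs, so a whitespace split is safe.
    """
    subject, _predicate, obj = ntriples_line.split()[:3]
    subject_host = urlparse(subject.strip("<>")).netloc
    object_host = urlparse(obj.strip("<>")).netloc
    return "intra" if subject_host == object_host else "inter"


def count_links(lines):
    """Count inter- and intra-data set links, skipping blank lines."""
    return Counter(link_kind(line) for line in lines if line.strip())


sample = [
    "<http://education.data.gov.uk/id/school/107626> "
    "<http://open.vocab.org/terms/near> "
    "<http://dbpedia.org/resource/Kirklees_Incinerator> .",
    # Hypothetical pair within a single data set:
    "<http://dbpedia.org/resource/A> "
    "<http://open.vocab.org/terms/near> "
    "<http://dbpedia.org/resource/B> .",
]
counts = count_links(sample)
print(counts["inter"], counts["intra"])
```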

Our definition of ‘near’ was deliberately left loose. It’s far too subjective and context-dependent to define more precisely, so we kept it vague and left plenty of scope for refinement according to the needs of specific applications. We achieved this in the following ways:

Firstly, imagine an application uses this data set to find all the other things near to a particular point of interest, but also wants to know the distances to each and which is the closest. The data to answer these questions doesn’t need to be materialised in the data set, because the consuming application can dereference the URIs of the nearby points of interest and perform the post-processing of choice on the original geo-coordinates, such as computing the distances between points and finding which is nearest.
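To make that post-processing concrete, here is a small sketch of what a consuming application might do once it has dereferenced the nearby URIs and extracted their coordinates. It uses the haversine great-circle formula, which is one reasonable choice rather than anything the data set prescribes. The example.org URIs and all the coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest(origin, candidates):
    """Return the (uri, (lat, lon)) entry closest to `origin`.

    `origin` is a (lat, lon) tuple; `candidates` maps URI -> (lat, lon).
    """
    return min(candidates.items(), key=lambda item: haversine_km(*origin, *item[1]))


# Hypothetical coordinates gathered by dereferencing the 'near' links.
origin = (53.65, -1.78)
candidates = {
    "http://example.org/poi/a": (53.66, -1.78),
    "http://example.org/poi/b": (53.70, -1.70),
}
closest_uri, closest_coords = nearest(origin, candidates)
print(closest_uri, round(haversine_km(*origin, *closest_coords), 2), "km")
```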

Extending this principle, it’s also trivial to use this data as input to a matching process that identifies when the same entity is described in different data sets (with each assigning a different URI). For example, our approach finds that the resource identified by the URI <http://dbpedia.org/resource/Gad%27s_Hill_School> is near to a number of other things, including the resource identified by the URI <http://education.data.gov.uk/id/school/118944>. In fact, both these URIs identify the same resource, "Gad's Hill School", and could be connected using the owl:sameAs property. The cost of assessing whether these two URIs identify the same resource, perhaps by computing the string similarity of the labels assigned to them in each data set, is much lower when using the set of ‘near’ links as input compared to using both data sets in their entirety, as the number of pairwise comparisons that must be made is significantly reduced.
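A sketch of that matching step might look like the following. It compares labels only for pairs already connected by a ‘near’ link, using the SequenceMatcher class from Python’s standard library as the similarity measure. The threshold of 0.9, the exact label strings, and the example.org URI are assumptions for illustration; any pairs it returns are candidates for review, not confirmed owl:sameAs links.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """Return a similarity score between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def same_as_candidates(near_pairs, labels, threshold=0.9):
    """Keep only the 'near' pairs whose labels are similar enough to review."""
    return [
        (a, b)
        for a, b in near_pairs
        if a in labels and b in labels
        and label_similarity(labels[a], labels[b]) >= threshold
    ]


DBPEDIA_GADS = "http://dbpedia.org/resource/Gad%27s_Hill_School"
EDUCATION_GADS = "http://education.data.gov.uk/id/school/118944"
OTHER = "http://example.org/poi/other"  # hypothetical nearby resource

# Label strings are assumed; each data set may word them differently.
labels = {
    DBPEDIA_GADS: "Gad's Hill School",
    EDUCATION_GADS: "Gad's Hill School",
    OTHER: "Some Other Place",
}
near_pairs = [(DBPEDIA_GADS, EDUCATION_GADS), (DBPEDIA_GADS, OTHER)]
candidates = same_as_candidates(near_pairs, labels)
print(candidates)
```

Because only the pairs in near_pairs are ever compared, the number of string comparisons is bounded by the number of ‘near’ links rather than by the product of the two data set sizes.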

The second mechanism we introduced to ensure the data set could be reused and refined where required was to make the data set as navigable as possible. A typical Web API requires each query to be formulated afresh before it is sent, meaning the data set isn’t easily browsed in an ad-hoc fashion. Instead, logic is required at the client/application-side in order to formulate the next query, before the client or user can move from one record to another. In contrast, this new data set applies — to the physical world — the same ‘follow your nose’ concept that’s so central to Linked Data. The concept is blindingly obvious when you stop and think about it. We navigate the physical world by following our noses, looking for landmarks, finding our bearings, and adjusting our courses — why should it be any different when we navigate the Web from a geographical perspective?

To support this mechanism, the data set includes links that connect each intersection point in the lat/long grid to those that surround it. These links are expressed in terms of compass bearings, allowing an application to move from grid point to grid point in whatever direction they choose, potentially traversing the globe in the process. In reality there are some gaps in coverage, primarily corresponding to oceans and areas of low population density that are typically under-represented in the input data sets; as we currently only materialise data for intersection points that have nearby PoIs, these areas are generally not covered by the grid.
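The idea of stepping between grid points by compass bearing can be sketched as below. Everything specific here is an assumption: the 0.01 degree grid spacing is inferred from a single example URI, the eight bearings are a guess at the directions offered, and the real data set expresses these moves as RDF links rather than arithmetic. The sketch also ignores the gaps in coverage described above.

```python
# Grid offsets in units of one grid step (assumed to be 0.01 degrees).
BEARINGS = {
    "N": (1, 0), "NE": (1, 1), "E": (0, 1), "SE": (-1, 1),
    "S": (-1, 0), "SW": (-1, -1), "W": (0, -1), "NW": (1, -1),
}


def neighbour(lat, lon, bearing):
    """Return the adjacent grid point in the given compass direction."""
    dlat, dlon = BEARINGS[bearing]
    # Work in integer hundredths of a degree to avoid float drift.
    lat_i = round(lat * 100) + dlat
    lon_i = round(lon * 100) + dlon
    if not -9000 <= lat_i <= 9000:
        raise ValueError("cannot step past a pole")
    # Wrap longitude into the range [-180, 180).
    lon_i = (lon_i + 18000) % 36000 - 18000
    return lat_i / 100, lon_i / 100


def follow(start, bearings):
    """Walk from `start` through a sequence of bearings, returning the path."""
    path = [start]
    for bearing in bearings:
        path.append(neighbour(*path[-1], bearing))
    return path


path = follow((9.02, 38.75), ["N", "N", "E"])
print(path)
```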

We have a number of plans for enhancements to the data set (not least of which are licensing statements and a proper voiD description), but in the meantime we’ve made an initial release of the data set, which is hosted in a Talis Platform store and exposed through the API at http://rdfize.com/geo/point/. The goal of this API is to provide a simple access mechanism for application developers wishing to find all points of interest near to a particular location. To achieve this, developers simply need to construct URIs of the form http://rdfize.com/geo/point/latitude/longitude/ e.g. http://rdfize.com/geo/point/9.022762/38.746719/ (a point close to the centre of Addis Ababa).

Clients performing an HTTP GET on one of these URIs will be redirected to a URI that identifies the nearest grid intersection point (in this case http://rdfize.com/geo/point/9.02/38.75/). From there the client will be redirected (following the widely used HTTP 303 + Content Negotiation pattern) to an HTML or RDF description of that point. Currently we support the RDF/XML, Turtle, N-Triples and RDF/JSON serialisations of RDF, as well as a simple HTML view for human users.
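From a client’s point of view, the interaction might be prepared along these lines. The sketch builds the request but deliberately does not send it. Snapping to two decimal places is inferred from the single example above and may not match the service’s real rule, and the text/turtle media type is an assumption about what the service recognises.

```python
from urllib.request import Request

BASE = "http://rdfize.com/geo/point/"


def point_uri(lat, lon):
    """Build the lookup URI for an arbitrary latitude/longitude."""
    return f"{BASE}{lat}/{lon}/"


def predicted_grid_uri(lat, lon):
    """Guess the grid-point URI the service would redirect to.

    Assumption: grid points sit at two decimal places, as in the
    9.02/38.75 example; the real snapping rule may differ.
    """
    return f"{BASE}{lat:.2f}/{lon:.2f}/"


def rdf_request(uri, media_type="text/turtle"):
    """Prepare (but do not send) a GET request asking for an RDF format."""
    return Request(uri, headers={"Accept": media_type})


request = rdf_request(point_uri(9.022762, 38.746719))
print(request.full_url)
print(request.get_header("Accept"))
print(predicted_grid_uri(9.022762, 38.746719))
```

Passing the request to urllib.request.urlopen would follow the redirects, including the 303, automatically; swapping the media type for application/rdf+xml would ask for RDF/XML instead.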

Those looking for a more polished interaction with the data set should sign up for the beta of Kasabi, where we plan to release enhanced versions of the data set in the coming months.

Introducing Talis Research

When I joined Talis in 2008, a number of my peers in the Semantic Web community commented on the fact that a relatively small company was employing recent Ph.D. graduates. Any development of this sort is a useful data point for gauging the commercial interest in a particular research field (not to mention a whisper of reassurance for those toiling on the rocky road through a Ph.D.!). But what seemed to raise the most eyebrows was that I was joining Talis as a Researcher. Everyone seemed to agree this was a pretty bold move for a company of our size.

So what’s in it for us? Why do we invest in research rather than increasing the dividend we pay to shareholders each year? We do this because we believe that we can be more successful as a company by investing in research, as a driver for medium- and long-term growth of existing strands of the business, and as a potential source of new areas of business we haven’t yet imagined.

Many people who have worked closely with us will know that we value innovation across all areas of the business, while recognising that this means different things depending on the maturity of the corresponding market. So if we value innovation across the business as a whole, why invest in creating a dedicated research team? Shouldn’t all teams be responsible for conducting their own research, as their work priorities demand?

There will always be a strong element of this at Talis, as many of our recent developments demonstrate (take Aspire and Kasabi for example). However, as any researcher will tell you (as will anyone else who has to combine research with other responsibilities), conducting considered, rigorous, fundamental research requires certain freedoms that other teams at Talis don’t always have, because their work is too important: freedom to explore new ideas without fear of failure, and freedom to dig deep into those ideas without distraction. Talis provides a superb environment for the former, and I’m working on my own abilities to deliver on the latter!

This highlights a fundamental contrast between Talis Research and my experience of research in an academic context (admittedly this is limited to my time as a Ph.D. student in the early 21st century, and is certainly no reflection on KMi and The Open University, one of the most supportive and creative research environments I’ve ever set foot in). At Talis we will need to spend significant periods of time with our ‘research attention’ far closer to the coal-face than a typical university research group is used to. This can be unsettling if we measure ourselves in conventional academic terms.

The flip side is this: with no requirement on publications as a metric of success, unless we want them to be, we can free ourselves up to ask the questions to which we as a company want answers, safe in the knowledge that it’s the quality of the answers that matters, irrespective of the outcome.

Do we know precisely how to answer these questions, in this specific context? Do we know precisely how Research fits within the growing, evolving organism that is Talis? No. Not fully. But as we’re growing we’re working it out. This blog is the start of our account of the journey. We hope to see you along the way.

Tom Heath.