Talis Research

Talis Research Blog

‘Follow Your Nose’ Across the Globe

Imagine you’ve just arrived in an unfamiliar place, perhaps on a business trip (or recently beamed down from the Starship Enterprise). One of the first things you’ll probably want to do is find out what things are nearby. Google Maps provides a great “search nearby” function (try entering just a * to get everything), but this is geared more towards businesses, and the data isn’t exactly open, making it hard to reuse in other applications. We wanted to try something similar, using the growing range of liberally-licensed Linked Data sets with a geographic component. Here’s what we did…

We started with the Geonames data set (helpfully converted to RDF by Ian Davis and loaded into a Talis Platform store), all the geo-coded entries from DBpedia, and data about schools from data.gov.uk — we’ll call all the things described in these data sets ‘points of interest’. In truth the geocoded entries in DBpedia are frequently not points but regions/areas, but that’s an issue for another day.

Next, we implemented some cheap and cheerful secret sauce to identify, for each point of interest, all the other PoIs that are ‘nearby’. For each of these pairwise relationships we create two new RDF triples, one stating ‘x near y’, the other ‘y near x’. For example, our algorithm may reveal that Birkby Junior School is near to Kirklees Incinerator, in which case we simply create the RDF triples:

<http://education.data.gov.uk/id/school/107626> <http://open.vocab.org/terms/near> <http://dbpedia.org/resource/Kirklees_Incinerator> .
<http://dbpedia.org/resource/Kirklees_Incinerator> <http://open.vocab.org/terms/near> <http://education.data.gov.uk/id/school/107626> .

We do the same for all pair-wise relations in each cluster of ‘nearby’ items.
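As a rough illustration of that last step (not the actual MapReduce code, which isn't shown here), the symmetric triples for a cluster can be generated by emitting one statement per ordered pair of distinct URIs. The function name near_triples and the idea of passing a cluster as a simple list are assumptions made for this sketch.

```python
from itertools import permutations

# Property URI taken from the example triples above.
NEAR = "http://open.vocab.org/terms/near"


def near_triples(cluster):
    """Yield one N-Triples line per ordered pair of distinct URIs.

    Ordered pairs give both 'x near y' and 'y near x' for every pair.
    """
    for subject, obj in permutations(sorted(set(cluster)), 2):
        yield f"<{subject}> <{NEAR}> <{obj}> ."


cluster = [
    "http://education.data.gov.uk/id/school/107626",
    "http://dbpedia.org/resource/Kirklees_Incinerator",
]
triples = list(near_triples(cluster))
for line in triples:
    print(line)
```

A cluster of n items produces n × (n − 1) triples, which goes some way to explaining how quickly the link count grows.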

The process runs as a series of MapReduce jobs on top of Hadoop, creating a new data set of more than 700 million triples, all of which are links within and between data sets; that figure doesn’t include any literals or rdf:type statements describing the points in the original data sets. What we do create along the way are URIs for all the intersection points in the lat/long grid (down to a specific level of granularity). These are linked to the URIs of nearby things from the input data sets, and to each other. We haven’t done the sums yet to identify the proportion of inter- versus intra-data set links, though judicious use of grep and wc -l should address that.

Our definition of ‘near’ was deliberately left loose. It’s far too subjective and context-dependent to define precisely, so we kept it vague and left plenty of scope for refinement according to the needs of specific applications. We achieved this in the following ways:

Firstly, imagine an application uses this data set to find all the other things near to a particular point of interest, but also wants to know the distances to each and which is the closest. The data to answer these questions doesn’t need to be materialised in the data set, because the consuming application can dereference the URIs of the nearby points of interest and perform the post-processing of choice on the original geo-coordinates, such as computing the distances between points and finding which is nearest.
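To make that concrete, here is a minimal sketch of the kind of post-processing a consuming application might do once it has dereferenced the nearby URIs and extracted their coordinates. It uses the standard haversine formula on a spherical Earth; the example.org URIs and the coordinates are made up for illustration and don't come from the data set.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest(origin, candidates):
    """Return the URI in `candidates` (uri -> (lat, lon)) closest to `origin`."""
    return min(candidates, key=lambda uri: haversine_km(*origin, *candidates[uri]))


# Hypothetical coordinates, standing in for values fetched from each URI.
origin = (53.650, -1.780)
nearby = {
    "http://example.org/poi/a": (53.655, -1.781),
    "http://example.org/poi/b": (53.700, -1.900),
}
closest = nearest(origin, nearby)
print(closest, round(haversine_km(*origin, *nearby[closest]), 2), "km")
```

An application with different needs (walking distance, say, or travel time) could swap in its own measure without any change to the published links.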

Extending this principle, it’s also trivial to use this data as input to a matching process that identifies when the same entity is described in different data sets (with each assigning a different URI). For example, our approach finds that the resource identified by the URI <http://dbpedia.org/resource/Gad%27s_Hill_School> is near to a number of other things, including the resource identified by the URI <http://education.data.gov.uk/id/school/118944>. In fact, both these URIs identify the same resource, "Gad's Hill School", and could be connected using the owl:sameAs property. The cost of assessing whether these two URIs identify the same resource, perhaps by computing the string similarity of the labels assigned to them in each data set, is much lower when using the set of ‘near’ links as input compared to using both data sets in their entirety, as the number of pairwise comparisons that must be made is significantly reduced.
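A sketch of that matching step might look like the following, using Python's difflib as one possible string similarity measure. The labels, the third example.org resource and the 0.9 threshold are all assumptions made for illustration; the real labels in each data set may differ.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """Case-insensitive similarity ratio between two labels (0.0 to 1.0)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def same_as_candidates(near_pairs, labels, threshold=0.9):
    """Return sorted unordered pairs of 'near' URIs whose labels look alike.

    `near_pairs` holds (x, y) links; because 'near' is symmetric, each
    unordered pair is compared only once.
    """
    unique = {tuple(sorted(pair)) for pair in near_pairs}
    return sorted(
        (x, y) for x, y in unique
        if label_similarity(labels[x], labels[y]) >= threshold
    )


dbpedia = "http://dbpedia.org/resource/Gad%27s_Hill_School"
edubase = "http://education.data.gov.uk/id/school/118944"
other = "http://example.org/poi/church"  # hypothetical neighbour

labels = {  # assumed labels, for illustration only
    dbpedia: "Gad's Hill School",
    edubase: "Gads Hill School",
    other: "St Mary's Church",
}
near_pairs = [(dbpedia, edubase), (edubase, dbpedia), (dbpedia, other), (other, dbpedia)]
matches = same_as_candidates(near_pairs, labels)
print(matches)
```

Only pairs already linked by ‘near’ are ever compared, which is where the saving over an all-against-all comparison comes from.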

The second mechanism we introduced to ensure the data set could be reused and refined where required was to make the data set as navigable as possible. A typical Web API requires each query to be formulated afresh before it is sent, meaning the data set isn’t easily browsed in an ad-hoc fashion. Instead, logic is required at the client/application-side in order to formulate the next query, before the client or user can move from one record to another. In contrast, this new data set applies — to the physical world — the same ‘follow your nose’ concept that’s so central to Linked Data. The concept is blindingly obvious when you stop and think about it. We navigate the physical world by following our noses, looking for landmarks, finding our bearings, and adjusting our courses — why should it be any different when we navigate the Web from a geographical perspective?

To support this mechanism, the data set includes links that connect each intersection point in the lat/long grid to those that surround it. These links are expressed in terms of compass bearings, allowing an application to move from grid point to grid point in whatever direction they choose, potentially traversing the globe in the process. In reality there are some gaps in coverage, primarily corresponding to oceans and areas of low population density that are typically under-represented in the input data sets; as we currently only materialise data for intersection points that have nearby PoIs, these areas are generally not covered by the grid.
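The sketch below shows the idea of stepping between grid points by compass bearing. It computes neighbours arithmetically rather than following the links in the data, and it assumes a grid spacing of 0.01 degrees and eight bearings; neither detail is confirmed here, and the names used are invented for the example.

```python
from decimal import Decimal

STEP = Decimal("0.01")  # assumption: grid spacing in degrees

# Assumed set of bearings, as (latitude steps, longitude steps).
OFFSETS = {
    "N": (1, 0), "NE": (1, 1), "E": (0, 1), "SE": (-1, 1),
    "S": (-1, 0), "SW": (-1, -1), "W": (0, -1), "NW": (1, -1),
}


def wrap_longitude(lon):
    """Keep longitude within [-180, 180) when crossing the antimeridian."""
    if lon >= 180:
        return lon - 360
    if lon < -180:
        return lon + 360
    return lon


def neighbours(lat, lon):
    """Map each bearing to the adjacent grid point, skipping any past a pole."""
    lat, lon = Decimal(lat), Decimal(lon)
    result = {}
    for bearing, (dlat, dlon) in OFFSETS.items():
        new_lat = lat + dlat * STEP
        if abs(new_lat) > 90:
            continue
        result[bearing] = (new_lat, wrap_longitude(lon + dlon * STEP))
    return result


def walk(lat, lon, bearing, steps):
    """Follow one bearing for a number of steps from a starting grid point."""
    point = (Decimal(lat), Decimal(lon))
    for _ in range(steps):
        point = neighbours(*point)[bearing]
    return point


print(neighbours("9.02", "38.75")["N"])
print(walk("9.02", "38.75", "E", 3))
```

A real client would follow the bearing links published in the data set instead, and would need to cope with missing links wherever the grid has gaps.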

We have a number of plans for enhancements to the data set (not least of which are licensing statements and a proper voiD description), but in the meantime we’ve made an initial release of the data set, which is hosted in a Talis Platform store and exposed through the API at http://rdfize.com/geo/point/. The goal of this API is to provide a simple access mechanism for application developers wishing to find all points of interest near to a particular location. To achieve this, developers simply need to construct URIs of the form http://rdfize.com/geo/point/latitude/longitude/, e.g. http://rdfize.com/geo/point/9.022762/38.746719/ (a point close to the centre of Addis Ababa).

Clients performing an HTTP GET on one of these URIs will be redirected to a URI that identifies the nearest grid intersection point (in this case http://rdfize.com/geo/point/9.02/38.75/). From there the client will be redirected (following the widely used HTTP 303 + Content Negotiation pattern) to an HTML or RDF description of that point. Currently we support the RDF/XML, Turtle, N-Triples and RDF/JSON serialisations of RDF, as well as a simple HTML view for human users.
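For developers who want to see the shape of a client, here is a small sketch that builds a point URI, predicts the grid URI the service would redirect to, and prepares a request asking for Turtle. The rounding rule (two decimal places, half up) is inferred from the single example above and may not match the service exactly; the text/turtle media type is likewise an assumption, and the request is prepared but deliberately not sent.

```python
from decimal import Decimal, ROUND_HALF_UP
from urllib.request import Request

BASE = "http://rdfize.com/geo/point/"  # API root given in the post
GRID = Decimal("0.01")  # assumption: spacing inferred from the 9.02/38.75 example


def point_uri(lat, lon):
    """URI for an arbitrary location, in the latitude/longitude/ form."""
    return f"{BASE}{lat}/{lon}/"


def snap(value):
    """Round a coordinate to the assumed grid spacing."""
    return Decimal(str(value)).quantize(GRID, rounding=ROUND_HALF_UP)


def grid_uri(lat, lon):
    """URI of the grid intersection point we expect to be redirected to."""
    return point_uri(snap(lat), snap(lon))


def rdf_request(uri, media_type="text/turtle"):
    """Prepare (but do not send) a GET that asks for an RDF serialisation."""
    return Request(uri, headers={"Accept": media_type})


print(point_uri("9.022762", "38.746719"))
print(grid_uri("9.022762", "38.746719"))
request = rdf_request(grid_uri("9.022762", "38.746719"))
print(request.get_header("Accept"))
```

In practice there is no need to snap coordinates on the client side, since the service handles that through its redirect; the function is included only to show what the redirect is doing.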

Those looking for a more polished interaction with the data set should sign up for the beta of Kasabi, where we plan to release enhanced versions of the data set in the coming months.

Introducing Talis Research

When I joined Talis in 2008, a number of my peers in the Semantic Web community commented on the fact that a relatively small company was employing recent Ph.D. graduates. Any development of this sort is a useful data point for gauging the commercial interest in a particular research field (not to mention a whisper of reassurance for those toiling on the rocky road through a Ph.D.!). But what seemed to raise the most eyebrows was that I was joining Talis as a Researcher. Everyone seemed to agree this was a pretty bold move for a company of our size.

So what’s in it for us? Why do we invest in research rather than increasing the dividend we pay to shareholders each year? We do this because we believe that we can be more successful as a company by investing in research, as a driver for medium- and long-term growth of existing strands of the business, and as a potential source of new areas of business we haven’t yet imagined.

Many people who have worked closely with us will know that we value innovation across all areas of the business, while recognising that this means different things depending on the maturity of the corresponding market. So if we value innovation across the business as a whole, why invest in creating a dedicated research team? Shouldn’t all teams be responsible for conducting their own research, as their work priorities demand?

There will always be a strong element of this at Talis, as many of our recent developments demonstrate (take Aspire and Kasabi for example). However, as any researcher will tell you (as will anyone else who has to combine research with other responsibilities), conducting considered, rigorous, fundamental research requires certain freedoms that other teams at Talis don’t always have, because their work is too important: freedom to explore new ideas without fear of failure, and freedom to dig deep into those ideas without distraction. Talis provides a superb environment for the former, and I’m working on my own abilities to deliver on the latter!

This highlights a fundamental contrast between Talis Research and my experience of research in an academic context (admittedly this is limited to my time as a Ph.D. student in the early 21st century, and is certainly no reflection on KMi and The Open University, one of the most supportive and creative research environments I’ve ever set foot in). At Talis we will need to spend significant periods of time with our ‘research attention’ far closer to the coal-face than a typical university research group is used to. This can be unsettling if we measure ourselves in conventional academic terms.

The flip side is this: with no requirement to treat publications as a metric of success, unless we want them to be, we can free ourselves up to ask the questions to which we as a company want answers, safe in the knowledge that it’s the quality of the answers that matters, irrespective of the outcome.

Do we know precisely how to answer these questions, in this specific context? Do we know precisely how Research fits within the growing, evolving organism that is Talis? No. Not fully. But as we’re growing we’re working it out. This blog is the start of our account of the journey. We hope to see you along the way.

Tom Heath.