Talis Research

Talis Research Blog
Archive for April, 2011

Visualising and Analysing Massive Data – trip report Konstanz April 2011

I have recently returned from a trip that included a PhD viva in Southampton and a visit to the University of Athens, and ended with three days in the heart of the south German countryside in the company of Daniel Keim’s Data Analysis and Visualization research group at Konstanz, one of the key international groups in visualisation and visual analytics. This was the group’s annual retreat, held at an intimate conference hotel run by a relative of one of the group.

Daniel had for a long time been the ‘Mr Pixels’ of visualisation with his groundbreaking work on pixel plotting techniques, and following on from his early work, his group grew to be the foremost research group on visualisation in Europe. However, in recent years Daniel has become the chief European proponent of the emerging field of visual analytics, including being scientific lead of the VisMaster EU Coordinated Action, which led to the recent roadmap “Mastering the Information Age: Solving Problems with Visual Analytics”.

Visual Analytics is defined as “the science of analytical reasoning facilitated by interactive human-machine interfaces” (Wong and Thomas. Visual Analytics), and is all about harnessing the combined power of the best visualisation and the latest machine learning techniques to tackle some of the hardest data-oriented problems, from gene sequence matching to disaster management. During the retreat Daniel recounted a recent meeting with a major politician. He demonstrated a system visualising historic news data, including sentiment analysis. He entered the politician’s name to filter the stream, and she instantly recognised periods of high and low popularity that she already knew about following high-profile stories, but then she zeroed in on a place on the timeline with negative sentiment. As they drilled into the data she saw that this was focused on a single country, Kazakhstan, and she found a particular major story there of which she had previously been unaware.

During the two and a half days of the retreat there were around 20 different talks and presentations, and each had something of interest.

One broad area that arose in different settings was how to deal with complex time series data. Traditionally, time series data has been based on regular discrete numerical measurements such as hourly stock prices or tidal flows. However, data now often does not fit this model: it may be infrequent but bursty, event-based and non-numeric, such as sentiment in Twitter feeds, and it often arrives in vast quantities, as with network analysis data. Visualisations often include multiple views, and ways to drill down from aggregated views based on structural features such as geography, into particular facets and eventually individual events.

Another area that interested me was the analysis and visualisation of textual data, including streaming real-time text such as Twitter and news stories. While numerical information can often be reduced to simple lines or points in visualisations, text needs to be readable to be meaningful, which creates special challenges for the visualisation of very large data sets. In addition, textual data often has further attributes, such as the temporal and geographic context of a news story.

As well as plain visualisations, the visual analytics nature of the group was evident in many presentations where different forms of clustering, natural language processing and machine learning were being used as part of the analytic process.  These were applied to a variety of application areas, including the sentiment analysis already mentioned, and also network security, bio-informatics and a large joint German–US project in the final stages of negotiation that will address the resilience of logistic, power and communications networks in the face of natural, technological and human failures.

Peter Bak was at the retreat. He is an ex-member of the group and is now at IBM Research in Haifa. He outlined some of the visualisation challenges he is finding at IBM, from shipping logistics to Watson, the Jeopardy-playing computer, which recently won on live television against two past Jeopardy champions. I had read about the latter before, during its earlier development, but it was fascinating to hear again about the combination of massively parallel and data-intensive hypothesis generation followed by more orchestrated ranking and selection. Whilst still very simple in comparison, it does capture some of the richness of our own ways of tackling problems, and also shows a tantalising glimpse of what can be possible through harnessing the web as data.

While the running of the Jeopardy computer happens in milliseconds, the digital forensics to understand what went wrong on certain questions is expected to last nearly a year. The former is purely the role of automated processing, but the latter is a job for human analytics. I have found similar problems on much smaller datasets (GB rather than TB) when using spreading activation algorithms: when emergent results are not as expected, it can be a real challenge to drill into massively distributed processes and make sense of the chains of tiny events that gave rise to the visible effect. However, this seems a core issue for the future of data-intensive applications.

Maybe the issue will hinge around layers of control. In our own minds, many thoughts bubble almost arbitrarily into consciousness, and I am sure there are many more that we are never aware of, but our conscious processes filter and manage these into a coherent whole: for our own sense of what we are thinking about, for our action in the world, and for communicating to others. It may be the same with vast emergent parallel data processing applications such as Watson; at some levels we may have to accept that things just work or don’t work and not be able to fully ‘debug’ them, but at a higher level, like our own conscious thoughts, we should expect more control, more robustness and more ability to explain and justify actions and decisions.

My own role as the retreat guest was to try to give fresh ideas and hopefully disrupt and inspire the group. This included two talks, running a Bad Ideas creativity session with Geoff, taking part in a session focused on evaluation and a panel on ‘self-marketing’ in research.

For one of my talks I focused on the potential challenges that semantic web data poses for visualisation (see slides). While there is some work in the area (e.g. Jean-Daniel Fekete’s Aviz group at Inria), it is still under-explored. The talk was structured around three phases, starting with raw non-semantic data (CSV, RDBMS, etc.), through transformation into RDF, and finally to linked open data. Some issues, such as heterogeneity, arise at both ends of the spectrum; some are particular to the semantic nature of the data (e.g. the combination of structure and small units of free text in literals); and some are particular to RDF (e.g. schema-less-ness).

The second talk gave some of the theoretical background to Bad Ideas and related creativity techniques (see slides).  As well as the divergent nature of the bad ideas themselves, I focused on the more convergent analytic aspects, in particular the importance of externalisation for external cognition and reflection.

‘Follow Your Nose’ Across the Globe

Imagine you’ve just arrived in an unfamiliar place, perhaps on a business trip (or recently beamed down from the Starship Enterprise). One of the first things you’ll probably want to do is find out what things are nearby. Google Maps provides a great “search nearby” function (try entering just a * to get everything), but this is geared more towards businesses, and the data isn’t exactly open, making it hard to reuse in other applications. We wanted to try something similar, using the growing range of liberally-licensed Linked Data sets with a geographic component. Here’s what we did…

We started with the Geonames data set (helpfully converted to RDF by Ian Davis and loaded into a Talis Platform store), all the geo-coded entries from DBpedia, and data about schools from data.gov.uk — we’ll call all the things described in these data sets ‘points of interest’. In truth the geocoded entries in DBpedia are frequently not points but regions/areas, but that’s an issue for another day.

Next, we implemented some cheap and cheerful secret sauce to identify, for each point of interest, all the other PoIs that are ‘nearby’. For each of these pairwise relationships we create two new RDF triples, one stating ‘x near y’, the other ‘y near x’. For example, our algorithm may reveal that Birkby Junior School is near to Kirklees Incinerator, in which case we simply create the RDF triples:

<http://education.data.gov.uk/id/school/107626> <http://open.vocab.org/terms/near> <http://dbpedia.org/resource/Kirklees_Incinerator> .
<http://dbpedia.org/resource/Kirklees_Incinerator> <http://open.vocab.org/terms/near> <http://education.data.gov.uk/id/school/107626> .

We do the same for all pair-wise relations in each cluster of ‘nearby’ items.
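As a minimal sketch of this step (not the actual MapReduce code, which isn't shown here), the following Python generates the symmetric pair of ‘near’ triples for every pair of URIs in a cluster. The function name `near_triples` and the representation of a cluster as a plain list of URIs are assumptions made for illustration.

```python
from itertools import combinations

NEAR = "http://open.vocab.org/terms/near"


def near_triples(cluster):
    """Yield N-Triples lines stating 'x near y' and 'y near x'
    for every unordered pair of distinct URIs in the cluster."""
    for x, y in combinations(sorted(set(cluster)), 2):
        yield f"<{x}> <{NEAR}> <{y}> ."
        yield f"<{y}> <{NEAR}> <{x}> ."


cluster = [
    "http://education.data.gov.uk/id/school/107626",
    "http://dbpedia.org/resource/Kirklees_Incinerator",
]
triples = list(near_triples(cluster))
for line in triples:
    print(line)
```

A cluster of n items yields n × (n − 1) triples, which goes some way to explaining how the output grows so quickly.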

The process runs as a series of MapReduce jobs on top of Hadoop, creating a new data set of more than 700 million triples, all of which are links within and between data sets, i.e. that figure doesn’t include any literals or rdf:type statements describing the points in the original data sets. What we do create along the way are URIs for all the intersection points in the lat/long grid (down to a specific level of granularity). These are linked to the URIs of nearby things from the input data sets, and to each other. We haven’t done the sums yet to identify the proportion of inter- versus intra-data set links, though judicious use of grep and wc -l should address that.
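For anyone who would rather not reach for grep, here is a rough Python sketch of the same sum. It assumes, purely for illustration, that a data set can be identified by the host name of its URIs, which won't hold for every data set. The `count_links` helper and the example.org URIs are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches an N-Triples line whose subject, predicate and object are all URIs.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.$")


def count_links(lines):
    """Count links whose subject and object share a host ('intra')
    versus those that cross hosts ('inter')."""
    counts = Counter()
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything with a literal object
        subject, _, obj = match.groups()
        same_host = urlparse(subject).netloc == urlparse(obj).netloc
        counts["intra" if same_host else "inter"] += 1
    return counts


sample = [
    "<http://education.data.gov.uk/id/school/107626> <http://open.vocab.org/terms/near> <http://dbpedia.org/resource/Kirklees_Incinerator> .",
    "<http://dbpedia.org/resource/Kirklees_Incinerator> <http://open.vocab.org/terms/near> <http://education.data.gov.uk/id/school/107626> .",
    # Hypothetical intra-data set link, for illustration only.
    "<http://example.org/poi/a> <http://open.vocab.org/terms/near> <http://example.org/poi/b> .",
]
counts = count_links(sample)
print(counts)
```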

We deliberately left our definition of ‘near’ loose. Nearness is too subjective and context-dependent to pin down precisely, so we kept it vague and left plenty of scope for refinement according to the needs of specific applications. We achieved this in the following ways:

Firstly, imagine an application uses this data set to find all the other things near to a particular point of interest, but also wants to know the distances to each and which is the closest. The data to answer these questions doesn’t need to be materialised in the data set, because the consuming application can dereference the URIs of the nearby points of interest and perform the post-processing of choice on the original geo-coordinates, such as computing the distances between points and finding which is nearest.
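To illustrate the kind of post-processing a consuming application might do, here is a small Python sketch that computes great-circle distances with the haversine formula and picks the closest point. It assumes the application has already dereferenced the nearby URIs and extracted their coordinates; the example.org URIs and their coordinates are made up for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest(origin, candidates):
    """Return the URI in candidates (a dict of URI -> (lat, lon))
    that is closest to origin, a (lat, lon) pair."""
    return min(candidates, key=lambda uri: haversine_km(*origin, *candidates[uri]))


origin = (9.022762, 38.746719)
# Hypothetical points of interest with made-up coordinates.
candidates = {
    "http://example.org/poi/a": (9.03, 38.75),
    "http://example.org/poi/b": (9.10, 38.80),
}
closest = nearest(origin, candidates)
print(closest)
```

Because the distances are computed on demand, an application is free to swap in a different measure (travel time, say) without the data set having to change.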

Extending this principle, it’s also trivial to use this data as input to a matching process that identifies when the same entity is described in different data sets (with each assigning a different URI). For example, our approach finds that the resource identified by the URI <http://dbpedia.org/resource/Gad%27s_Hill_School> is near to a number of other things, including the resource identified by the URI <http://education.data.gov.uk/id/school/118944>. In fact, both these URIs identify the same resource, "Gad's Hill School", and could be connected using the owl:sameAs property. The cost of assessing whether these two URIs identify the same resource, perhaps by computing the string similarity of the labels assigned to them in each data set, is much lower when using the set of ‘near’ links as input compared to using both data sets in their entirety, as the number of pairwise comparisons that must be made is significantly reduced.
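A hedged sketch of such a matching process, using the standard library's difflib as one possible string similarity measure: it only compares labels for pairs that are already linked by ‘near’. The labels below, the 0.9 threshold and the `same_as_candidates` helper are assumptions for illustration; the real labels in each data set may differ.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def label_similarity(a, b):
    """Case-insensitive similarity ratio between two labels, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def same_as_candidates(near_pairs, labels, threshold=0.9):
    """Return owl:sameAs triples for 'near' pairs whose labels are similar enough."""
    return [
        f"<{x}> <{OWL_SAME_AS}> <{y}> ."
        for x, y in near_pairs
        if label_similarity(labels[x], labels[y]) >= threshold
    ]


# Labels are assumed for illustration, not taken from the data sets.
labels = {
    "http://dbpedia.org/resource/Gad%27s_Hill_School": "Gad's Hill School",
    "http://education.data.gov.uk/id/school/118944": "Gads Hill School",
    "http://dbpedia.org/resource/Kirklees_Incinerator": "Kirklees Incinerator",
    "http://education.data.gov.uk/id/school/107626": "Birkby Junior School",
}
near_pairs = [
    ("http://dbpedia.org/resource/Gad%27s_Hill_School",
     "http://education.data.gov.uk/id/school/118944"),
    ("http://dbpedia.org/resource/Kirklees_Incinerator",
     "http://education.data.gov.uk/id/school/107626"),
]
matches = same_as_candidates(near_pairs, labels)
print(matches)
```

In practice the output would be treated as candidate links for review rather than asserted blindly, since two differently named things can share a site and two similarly named things can sit side by side.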

The second mechanism we introduced to ensure the data set could be reused and refined where required was to make the data set as navigable as possible. A typical Web API requires each query to be formulated afresh before it is sent, meaning the data set isn’t easily browsed in an ad-hoc fashion. Instead, logic is required at the client/application-side in order to formulate the next query, before the client or user can move from one record to another. In contrast, this new data set applies — to the physical world — the same ‘follow your nose’ concept that’s so central to Linked Data. The concept is blindingly obvious when you stop and think about it. We navigate the physical world by following our noses, looking for landmarks, finding our bearings, and adjusting our courses — why should it be any different when we navigate the Web from a geographical perspective?

To support this mechanism, the data set includes links that connect each intersection point in the lat/long grid to those that surround it. These links are expressed in terms of compass bearings, allowing an application to move from grid point to grid point in whatever direction it chooses, potentially traversing the globe in the process. In reality there are some gaps in coverage, primarily corresponding to oceans and areas of low population density that are typically under-represented in the input data sets; as we currently only materialise data for intersection points that have nearby PoIs, these areas are generally not covered by the grid.
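The sketch below shows how a client might reason about these compass links locally. It assumes a grid spacing of 0.01 degrees and eight compass bearings; neither is stated above, and the real data set expresses the links as RDF properties whose names aren't given here. A real client would follow the links in the data rather than compute neighbours itself, not least because of the coverage gaps.

```python
from decimal import Decimal

STEP = Decimal("0.01")  # assumed grid spacing in degrees

# (latitude steps, longitude steps) for each assumed compass bearing
BEARINGS = {
    "N": (1, 0), "NE": (1, 1), "E": (0, 1), "SE": (-1, 1),
    "S": (-1, 0), "SW": (-1, -1), "W": (0, -1), "NW": (1, -1),
}


def neighbour(lat, lon, bearing):
    """Return the adjacent grid point in the given direction,
    or None if the move would pass beyond a pole."""
    dlat, dlon = BEARINGS[bearing]
    new_lat = lat + dlat * STEP
    new_lon = lon + dlon * STEP
    if abs(new_lat) > 90:
        return None
    # Wrap longitude into the range [-180, 180).
    if new_lon >= 180:
        new_lon -= 360
    elif new_lon < -180:
        new_lon += 360
    return new_lat, new_lon


def walk(lat, lon, bearings):
    """Follow a sequence of bearings from a starting grid point."""
    point = (lat, lon)
    for bearing in bearings:
        point = neighbour(*point, bearing)
        if point is None:
            break
    return point


end = walk(Decimal("9.02"), Decimal("38.75"), ["N", "N", "E"])
print(end)
```

Decimal arithmetic is used so that repeated steps stay exactly on the grid instead of drifting through floating-point error.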

We have a number of plans for enhancements to the data set (not least of which are licensing statements and a proper voiD description), but in the meantime we’ve made an initial release of the data set, which is hosted in a Talis Platform store and exposed through the API at http://rdfize.com/geo/point/. The goal of this API is to provide a simple access mechanism for application developers wishing to find all points of interest near to a particular location. To achieve this, developers simply need to construct URIs of the form http://rdfize.com/geo/point/latitude/longitude/ e.g. http://rdfize.com/geo/point/9.022762/38.746719/ (a point close to the centre of Addis Ababa).

Clients performing an HTTP GET on one of these URIs will be redirected to a URI that identifies the nearest grid intersection point (in this case http://rdfize.com/geo/point/9.02/38.75/). From there the client will be redirected (following the widely used HTTP 303 + Content Negotiation pattern) to an HTML or RDF description of that point. Currently we support the RDF/XML, Turtle, N-Triples and RDF/JSON serialisations of RDF, as well as a simple HTML view for human users.
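As a client-side sketch, the Python below predicts the grid point a location would be redirected to and prepares a request that asks for Turtle. The rounding to two decimal places is inferred from the Addis Ababa example above rather than from any published rule, and `text/turtle` is an assumed media type; the service may expect a different one. The request is built but not sent.

```python
from decimal import Decimal
from urllib.request import Request

BASE = "http://rdfize.com/geo/point/"
GRID = Decimal("0.01")  # granularity inferred from the example URIs


def grid_point_uri(lat, lon):
    """Build the URI of the grid intersection point nearest to a location."""
    lat_q = Decimal(str(lat)).quantize(GRID)
    lon_q = Decimal(str(lon)).quantize(GRID)
    return f"{BASE}{lat_q}/{lon_q}/"


def build_request(uri, media_type="text/turtle"):
    """Prepare a GET request that uses the Accept header for content negotiation."""
    return Request(uri, headers={"Accept": media_type})


uri = grid_point_uri(9.022762, 38.746719)
request = build_request(uri)
print(uri)
print(request.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would follow the redirects, including the 303, automatically, so a client could equally start from the full-precision URI and let the server do the snapping.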

Those looking for a more polished interaction with the data set should sign up for the beta of Kasabi, where we plan to release enhanced versions of the data set in the coming months.