Visualising and Analysing Massive Data – trip report Konstanz April 2011
I have recently returned from a trip that included a PhD viva in Southampton and a visit to the University of Athens, and ended with three days in the heart of the south German countryside in the company of Daniel Keim’s Data Analysis and Visualization research group at Konstanz, one of the key international groups in visualisation and visual analytics. This was the group’s annual retreat, held at an intimate conference hotel run by a relative of one of the group.
Daniel had for a long time been the ‘Mr Pixels’ of visualisation with his groundbreaking work on pixel plotting techniques, and following on from his early work, his group grew to be the foremost research group on visualisation in Europe. However, in recent years Daniel has become the chief European proponent of the emerging field of visual analytics, including being scientific lead of the VisMaster EU Coordinated Action, which led to the recent roadmap “Mastering the Information Age: Solving Problems with Visual Analytics”.

Visual Analytics is defined as “the science of analytical reasoning facilitated by interactive human-machine interfaces” (Wong and Thomas. Visual Analytics), and is all about harnessing the combined power of the best visualisation and latest machine learning techniques to tackle some of the hardest data-oriented problems, from gene sequence matching to disaster management. During the retreat Daniel recounted a recent meeting with a major politician. He demonstrated a system visualising historic news data, including sentiment analysis. He entered the politician’s name to filter the stream, and she instantly recognised periods of high and low popularity that she already knew about following high-profile stories, but then she zeroed in on a place on the timeline with negative sentiment. As they drilled into the data she saw that this was focused on a single country, Kazakhstan, and she found a particular major story there of which she had previously been unaware.
During the two and a half days of the retreat there were around 20 different talks and presentations, and each had something of interest.
One broad area that arose in different settings was how to deal with complex time series data. Traditionally, time series data has been based on regular discrete numerical measurements such as hourly stock prices or tidal flows. However, data now often does not fit this model: it may be infrequent but bursty, event-based and non-numeric, such as sentiment in Twitter feeds, and it often arrives in vast quantities, as with network analysis data. Visualisations often include multiple views, and ways to drill down from aggregated views based on structural features such as geography into particular facets and eventually individual events.
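As a rough illustration of the aggregate-then-drill-down pattern described above, here is a minimal Python sketch. It is not taken from any system shown at the retreat: the event records, the region codes and the function names `aggregate` and `drill_down` are all made up for the example, and hourly buckets are just one possible choice of aggregation.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event records: (timestamp, region, sentiment label).
# Note the irregular, bursty timing: two events a minute apart, then gaps.
events = [
    (datetime(2011, 4, 1, 9, 2), "DE", "negative"),
    (datetime(2011, 4, 1, 9, 3), "DE", "negative"),
    (datetime(2011, 4, 1, 9, 47), "UK", "positive"),
    (datetime(2011, 4, 1, 14, 30), "DE", "positive"),
]


def aggregate(events):
    """Group events by (region, hour) so an overview can show one cell per bucket."""
    buckets = defaultdict(list)
    for ts, region, label in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(region, hour)].append((ts, label))
    return buckets


def drill_down(buckets, region, hour):
    """Return the individual events behind one aggregated cell (empty if none)."""
    return buckets.get((region, hour), [])


buckets = aggregate(events)

# Aggregated view: number of events per (region, hour) cell.
summary = {key: len(items) for key, items in buckets.items()}

# Drill-down: the raw events behind the busiest German hour.
details = drill_down(buckets, "DE", datetime(2011, 4, 1, 9))
```

The point of keeping the raw events inside each bucket, rather than only the counts, is that the overview and the detail view stay consistent: any cell the analyst clicks on can be expanded back into the events that produced it.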
Another area that interested me was the analysis and visualisation of textual data, including streaming real-time text such as Twitter and news stories. While numerical information can often be reduced to simple lines or points in visualisations, text needs to be readable to be meaningful, which creates special challenges for the visualisation of very large data sets. In addition, textual data often carries further attributes, such as the temporal and geographic context of a news story.
As well as plain visualisations, the visual analytics nature of the group was evident in many presentations where different forms of clustering, natural language processing and machine learning were being used as part of the analytic process. These were applied to a variety of application areas, including the sentiment analysis already mentioned, and also network security, bio-informatics and a large joint German–US project in the final stages of negotiation that will address the resilience of logistics, power and communications networks in the face of natural, technological and human failures.
Peter Bak was at the retreat. He is an ex-member of the group and now at IBM Research in Haifa. He outlined some of the visualisation challenges he is finding at IBM, from shipping logistics to Watson, the Jeopardy-playing computer, which recently won on live television against two past Jeopardy champions. I had read about the latter before, during its earlier development, but it was fascinating to hear again about the combination of massively parallel and data-intensive hypothesis generation followed by more orchestrated ranking and selection. Whilst still very simple in comparison, it does capture some of the richness of our own ways of tackling problems, and also shows a tantalising glimpse of what can be possible through harnessing the web as data.
While the running of the Jeopardy computer happens in milliseconds, the digital forensics to understand what went wrong on certain questions is expected to last nearly a year. The former is purely a matter of automated processing, but the latter calls for human analytics. I have found similar problems on much smaller datasets (GB rather than TB) when using spreading activation algorithms: when emergent results are not as expected, it can be a real challenge to drill into massively distributed processes and make sense of the chains of tiny events that gave rise to the visible effect. This seems a core issue for the future of data-intensive applications.
Maybe the issue will hinge around layers of control. In our own minds, many thoughts bubble almost arbitrarily into consciousness, and I am sure there are many more that we are never aware of, but our conscious processes filter and manage these into a coherent whole: for our own sense of what we are thinking about, for our action in the world, and for communicating to others. It may be the same with vast emergent parallel data processing applications, such as Watson; at some levels we may have to accept that things just work or don’t work and not be able to fully ‘debug’ them, but at a higher level, like our own conscious thoughts, we should expect more control, more robustness and more ability to explain and justify actions and decisions.
My own role as the retreat guest was to try to give fresh ideas and hopefully disrupt and inspire the group. This included two talks, running a Bad Ideas creativity session with Geoff, taking part in a session focused on evaluation and a panel on ‘self-marketing’ in research.
For one of my talks I focused on the potential challenges that semantic web data poses for visualisation (see slides). While there is some work in the area (e.g. Jean-Daniel Fekete’s Aviz group at Inria), it is still under-explored. The talk was structured around three phases, starting with raw non-semantic data (CSV, RDBMS, etc.), moving through transformation into RDF, and ending with linked open data. Some similar issues arise at the two ends of the spectrum, including heterogeneity; some are particular to the semantic nature of the data (e.g. the combination of structure and small units of free text in literals); and some are particular to RDF (e.g. schema-less-ness).
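To make the middle phase concrete, here is a minimal sketch of turning rows of raw CSV into RDF triples, written out in an N-Triples-like form. It is my own illustration rather than anything from the slides: the `http://example.org/` namespace, the sample data and the function names are all assumptions, every value is treated as a plain string literal, and only the Python standard library is used.

```python
import csv
import io

# Hypothetical namespace and sample data, for illustration only.
BASE = "http://example.org/"

CSV_TEXT = """id,name,city
1,Alice,Konstanz
2,Bob,Athens
"""


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for a quoted string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_triples(row, key="id"):
    """Turn one CSV row into triples: the key column names the subject,
    every other non-empty column becomes a predicate with a string literal."""
    subject = f"<{BASE}person/{row[key]}>"
    triples = []
    for column, value in row.items():
        if column == key or value == "":
            continue  # skip the key itself and missing values
        predicate = f"<{BASE}prop/{column}>"
        triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


def csv_to_ntriples(text):
    """Convert a whole CSV document (as a string) into a list of triple lines."""
    reader = csv.DictReader(io.StringIO(text))
    lines = []
    for row in reader:
        lines.extend(row_to_triples(row))
    return lines


triples = csv_to_ntriples(CSV_TEXT)
for line in triples:
    print(line)
```

Even this toy version shows where the visualisation issues come from: a missing cell simply produces no triple, so nothing in the output says which properties a subject "should" have, and the city names stay as free-text literals until someone links them to shared identifiers.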
The second talk gave some of the theoretical background to Bad Ideas and related creativity techniques (see slides). As well as the divergent nature of the bad ideas themselves, I focused on the more convergent analytic aspects, in particular the importance of externalisation for external cognition and reflection.




