
ScaleCamp 2010

On the 10th of December 2010 a couple of us from the Platform Engineering team attended ScaleCamp 2010 at the Guardian offices in London. Very much like its bigger, older (second?) cousin Velocity, ScaleCamp is a gathering of developers, operations folk and other people with an interest in scaling systems to support increasing numbers of data-hungry users in the post-Web 2.0 age. ScaleCamp aims to fill the gap for UK-based peeps who want to get in on the scalability chin-wagging and knowledge-sharing act. Smaller than Velocity or new-kid-on-the-block Surge, ScaleCamp is now in its second year and still small enough to use the unconference format, allowing attendees to self-organise around whatever subjects float their scalability boats.


Pastries & Scaling your team

The day began with an empty timetable with slots for 40-minute sessions across 5 rooms of varying sizes. And some cheeky pastries. By lunchtime the board was pretty much full, with some intriguing sessions on the cards. First one to tickle my personal fancy was a discussion on how to scale teams. Talk of scaling teams made me remember the phrase “meat cloud”, which still makes me giggle. Like many engineering teams, we pretty much always have more work to do than we can get through, or at least get through for some value of “now”. Adding a good engineer or two (and if you’re a good engineer, we’d love to hear from you) would help us to go a little bit faster, and who doesn’t want that? So we’re certainly searching for the mythical “elastic meat cloud”; turn up the dial, add a few more people, and hey presto, you’re a team-scaling guru!

Hmmmm, pastries!

The discussion touched on areas including technical architecture, how to attract and retain good people, and which working practices scale up best in different environments. We pretty much unanimously preferred a modular architecture to a monolithic “big ball of mud”. Loosely coupled components and services make it easier for multiple developers to work concurrently on the same system. An additional benefit is that you don’t need to understand the whole system before you can start to work on part of it, making it easier for new people to contribute earlier.

Good unit and acceptance test suites were also raised as technical concerns that can reduce the friction of adding new people to a project. The lurking fear of silently breaking something you don’t yet understand will certainly slow down new hires.

Handily, we managed to avoid any serious dogma wars while discussing process and methodology, although most of the talk was about various forms of agile approach and what size of team they scale to. Interesting to hear the experiences of people who had been using Scrum with teams of around 20 developers, which appears to be pushing the limits a bit, judging from their testimony. Also discussed was the question of when you need to start some form of line management, whether technical, admin-focused or both. How many people can usefully report directly to the same person? At what point does this start to become unworkable?

File Systems are shiny too!

Next up was a man standing in front of a room full of techies and inviting them to pull his system architecture to pieces. In a nice way. Richard Jones is building a browser-based IRC client that maintains user sessions even when the browser is closed. Richard outlined the requirements and characteristics of his app: it is append-only (no edits), needs no joins between users and no search, and lets users download logs, page back to see chat they missed, and so on. His goal was to get some ideas to help him scale the app, which he expected might entail replacing the PostgreSQL back-end with something else.

The architecture currently uses table inheritance in Postgres to achieve horizontal partitioning. There is one RDBMS table per day’s worth of data, so the data is basically sharded by day. This allows cheap deletes via SQL “DROP TABLE”, as opposed to “DELETE FROM”.
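To make the table-per-day idea concrete, here is a minimal sketch. It is not Richard's code: it uses Python's built-in SQLite module rather than Postgres (so there is no table inheritance, just one plain table per day), and the table and column names are made up for illustration.

```python
import sqlite3
from datetime import date


def table_for(day: date) -> str:
    """Name of the (hypothetical) per-day table, e.g. messages_2010_12_10."""
    return f"messages_{day:%Y_%m_%d}"


def append_message(conn: sqlite3.Connection, day: date, nick: str, line: str) -> None:
    """Append one chat line to that day's table, creating the table on first use."""
    table = table_for(day)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" (nick TEXT NOT NULL, line TEXT NOT NULL)'
    )
    conn.execute(f'INSERT INTO "{table}" (nick, line) VALUES (?, ?)', (nick, line))


def drop_day(conn: sqlite3.Connection, day: date) -> None:
    """Expire a whole day of chat in one cheap statement instead of a row-by-row DELETE."""
    conn.execute(f'DROP TABLE IF EXISTS "{table_for(day)}"')


def day_tables(conn: sqlite3.Connection) -> list:
    """List the per-day tables that currently exist, oldest first."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE 'messages_%' ORDER BY name"
    )
    return [name for (name,) in rows]
```

Expiring old data then means calling `drop_day` for each day that has aged out, with no need to scan or rewrite the tables that remain.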

Shiny!

A brief discussion of various sharding strategies took place. The well-documented foursquare outage was mentioned to illustrate the potential pitfalls of sharding randomly on user name; this can lead to hotspots in the cluster that can be tricky to manage. There was a certain irony in the fact that I was expecting this discussion to focus on one or more of the shiny new NoSQL databases as a replacement for Postgres, but ultimately it took a turn towards solutions that used good old file systems to manage data storage. Clearly we can find shiny new work in the file system space too, but I suppose the takeaway here is to use whatever tool does the specific job you need, shiny or otherwise.
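A rough way to see how hotspots arise: even if users are spread evenly across shards, their activity may not be. The sketch below is purely illustrative, with made-up user names and message counts, and has nothing to do with how foursquare actually shards. It hashes user names onto shards and adds up the load each shard ends up carrying.

```python
import hashlib
from collections import Counter


def shard_for(user_name: str, shard_count: int) -> int:
    """Pick a shard from a stable hash of the user name."""
    digest = hashlib.md5(user_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def load_per_shard(messages_by_user: dict, shard_count: int) -> Counter:
    """Sum each user's message count onto the shard that owns that user."""
    load = Counter({shard: 0 for shard in range(shard_count)})
    for user, message_count in messages_by_user.items():
        load[shard_for(user, shard_count)] += message_count
    return load


# Invented workload: one very chatty user among a hundred quiet ones.
activity = {"chatty": 10_000, **{f"user{i}": 10 for i in range(100)}}
load = load_per_shard(activity, shard_count=4)
hot_shard = shard_for("chatty", 4)
```

Whichever shard `chatty` lands on carries at least ten times the load of the other three put together, however evenly the quiet users are distributed.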

Analysing droppings using Hadoop

Matt Biddulph of Nokia hosted a session where he outlined work he has been doing to analyse massive datasets about cities. Matt described the process of collecting log files from assorted Nokia applications and analysing them as “inspecting their droppings”. Using these “droppings”, Matt has been able to do things like produce heat maps that visualise which map locations people inspect most regularly on their phones. In general terms, the approach he has used for this is to analyse these massive datasets in Hadoop, then take the resulting, much smaller data and load that into an RDBMS for querying. This approach seems to be the most popular one right now for finding interesting relationships and patterns in big data, although we were all hoping somebody in the room had been doing something different and funky we could learn about, analysing massive data in a more online fashion. Maybe next year.

Eventually Matt wants to be able to use Hadoop to calculate various types of ground truths offline, for example the “normal” number of active Nokia devices in the Notting Hill area. A comparison of streaming data against these ground truths could then highlight interesting patterns, for example how much busier are various locations in Notting Hill during carnival weekend? The possibilities of using the streaming data could extend even further, for example to answer questions like “Which bars in the area are currently too crowded to bother going to, and which are worth a visit?”. Now that’s an app I’d snap up from the Android Market without a second thought.
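The comparison of live data against an offline baseline might look something like the toy sketch below. This is my own illustration rather than anything Matt showed: the baseline is a plain average standing in for whatever Hadoop would compute, and the locations, counts and the "twice as busy" threshold are all invented.

```python
from statistics import mean


def baselines(history: dict) -> dict:
    """Offline step: the 'normal' device count per location, taken as a simple mean."""
    return {location: mean(counts) for location, counts in history.items()}


def busier_than_usual(baseline: dict, live: dict, factor: float = 2.0) -> dict:
    """Online step: locations whose live count is at least `factor` times normal.

    Returns a mapping of location to the ratio live / normal.
    """
    flagged = {}
    for location, count in live.items():
        normal = baseline.get(location)
        if normal and count >= factor * normal:
            flagged[location] = count / normal
    return flagged


# Invented numbers, for illustration only.
history = {"Notting Hill": [100, 120, 80], "Camden": [200, 200]}
live = {"Notting Hill": 450, "Camden": 210}
flagged = busier_than_usual(baselines(history), live)
```

With these made-up figures only Notting Hill is flagged, at 4.5 times its usual level.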

Gentlemen, let’s broaden our minds

As a developer who has spent most of his career working on various back-end applications, I enjoyed attending a couple of sessions that covered subject matter outside my usual domain. Firstly, Spike Morelli described a systems configuration approach to managing a cluster of several thousand nodes by using a config management tool to roll out only entire images. The QA department apparently loved this, because the release as rolled out was exactly the same as the thing they signed off after testing.

Secondly, Premasagar Rose hosted a session on design patterns for JavaScript performance. Topics covered included jQuery tips, caching data in the browser as JSON values, and making as few DOM calls as possible. A couple of interesting tools were mentioned in the form of jsperf.com and Web Inspector.

Fail at failing

I also enjoyed Andrew Betts’ session on handling errors at scale. Although initially PHP-focused, there was a lot of general wisdom covered in the discussion. People compared notes on logging strategies, monitoring tools, and assorted low-level nitty-gritty. One such hard-won nugget was the value of assigning a unique ID to each request in a distributed system so you can follow it as it moves from one component to the next. We have learned this the hard way here at Talis while attempting to trace SPARQL queries from the Platform web servers through to the RDF stores at the back-end. The “X-TALIS-RESPONSE-ID” header you see in your HTTP response to a SPARQL query is a unique identifier that enables us to see what went on with an individual request all the way through the Platform’s stack. Big Brother sees all, innit?
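For anyone who hasn't tried the request ID trick, here is a minimal sketch of the general idea. It is not how the Platform does it: the header name, the two pretend components and the in-memory log are all assumptions made up for the example.

```python
import uuid

# Hypothetical header name for this sketch; any agreed name would do.
REQUEST_ID_HEADER = "X-Request-Id"


def ensure_request_id(headers: dict) -> dict:
    """Return a copy of the headers that carries a request ID, minting one if absent."""
    tagged = dict(headers)
    tagged.setdefault(REQUEST_ID_HEADER, uuid.uuid4().hex)
    return tagged


def store(headers: dict, log: list) -> str:
    """Pretend back-end store: logs under the ID it was handed and returns that ID."""
    headers = ensure_request_id(headers)
    log.append((headers[REQUEST_ID_HEADER], "store", "executed query"))
    return headers[REQUEST_ID_HEADER]


def web_server(headers: dict, log: list) -> str:
    """Pretend front-end: tags the request, logs it, and passes the same ID downstream."""
    headers = ensure_request_id(headers)
    log.append((headers[REQUEST_ID_HEADER], "web", "received query"))
    return store(headers, log)


def trace(log: list, request_id: str) -> list:
    """Everything any component logged for one request, in order."""
    return [(component, message) for rid, component, message in log if rid == request_id]
```

Because every component logs under the same ID, `trace(log, request_id)` pulls one request's whole journey out of an interleaved log.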

That’s all very well, but when do I get the X-Ray glasses & exploding cigars?

ScaleCamp organiser Michael Brunton-Spall, who deserves enormous credit for his creation, hosted a session at the tail-end of the day. Michael introduced an approach used by the tech team at the Guardian to analyse a technical crisis after the event. The Analysis of Competing Hypotheses is a technique formulated by the CIA in the 1970s to help identify a wide set of hypotheses and provide a means to evaluate each when looking for explanations of complex problems. Interestingly, there is an open source project providing software to help you do this. The CIA and open source – strange bedfellows indeed, no? Whatever next, the FBI opening a sustainable hemp farm?

A spy

To illustrate the process, Michael used a real example from the Guardian so fresh it was still warm. A week or so before ScaleCamp, the Guardian’s website had slowed to a crawl just before a scheduled live Q&A with WikiLeaks’ Julian Assange. We were asked to shout out possible causes, e.g. “Denial of service attack”, “Too many comments on a page”, and so on. Then we attempted to think of what evidence would prove or disprove each, a lightweight version of the full CIA methodology. Our own root cause analysis usually incorporates the 5 whys, but ACH looks like another useful tool to have at our disposal. Plus, we get to pretend we’re spies, although we’ll probably stop just short of the waterboarding.
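The lightweight version we played with boils down to a grid of hypotheses against evidence. The sketch below is my own toy rendering of that idea, not the open source ACH tool and not what actually happened at the Guardian: the two hypotheses come from the session, but the evidence items and their ratings are invented. Each piece of evidence is rated consistent ("C"), inconsistent ("I") or neutral ("N") with each hypothesis, and the hypotheses with the least evidence against them float to the top.

```python
def inconsistency_scores(matrix: dict) -> dict:
    """Count, per hypothesis, how many pieces of evidence are rated inconsistent ('I')."""
    return {
        hypothesis: sum(1 for rating in ratings.values() if rating == "I")
        for hypothesis, ratings in matrix.items()
    }


def rank_hypotheses(matrix: dict) -> list:
    """Hypotheses ordered from least to most contradicted by the evidence."""
    scores = inconsistency_scores(matrix)
    return sorted(scores, key=scores.get)


# Invented evidence and ratings, for illustration only.
example = {
    "Denial of service attack": {
        "traffic volume was normal": "I",
        "slow pages all had long comment threads": "N",
    },
    "Too many comments on a page": {
        "traffic volume was normal": "C",
        "slow pages all had long comment threads": "C",
    },
}
ranking = rank_hypotheses(example)
```

The point of scoring by inconsistency rather than support is that evidence which fits every hypothesis tells you nothing, while a single solid contradiction can rule one out.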

Vocamp Glasgow 2009

This week saw the first Vocamp in Scotland, held at the University of Strathclyde, Glasgow.


Attendees came from a wide range of different and interesting problem-spaces and domains and gave a lot of great presentations on their work. The range was too broad, perhaps, for us to find enough commonality to collaborate on creating/fixing any vocabularies (the focus of the previous vocamps I’ve attended), but it was great to have together so many people with an interest in the semantic web in the locality, and the presentations were all really good.

Jeff Pan and Edward Thomas from Aberdeen University presented some great tutorials that covered a lot of ground, from RDFa and OWL 2 to data-modeling methodology with Protégé.
Jeff Pan on OWL 2. (I especially liked the slide explaining how machines understand markup.)

Norman Gray and Stuart Chalmers presented their work on creating SKOS mappings between astronomy vocabularies.

Norman Gray on vocabulary mapping with SKOS

Jenny Ure from Edinburgh University talked about some of her work on the Socio-technical aspect of collaborative ontologies and knowledge systems.

Jenny Ure

Peter Winstanley talked about some of the data curated by the Scottish Government, and showcased Semantic Mediawiki for ontology development, and some different options for ontology visualisation.

Peter also pointed to the Communities Of Practice for local Government Scottish Group: Shared Representation using Semantic Technologies, inviting anyone with an interest in Semantic technologies to join and contribute to the discussion forums.

Peter Winstanley on Ontology visualisation and Scottish Gov Data

Serge Boucher from Brussels talked about some of the exciting possibilities for location and context-aware semantic web services.

Serge Boucher on Location Based Semantic Services

Gordon Dunsire from the Centre for Digital Library Research presented on vocabularies, standards, and linked data in the library domain, making particular mention of the dramatic tale of the development of the Library of Congress Subject Headings Dataset.

Gordon Dunsire on Linked Data, vocabularies, and library metadata

Martin Dempster from the University of Dundee presented his research into Assistive Technologies that help people who have difficulty talking to communicate, his use of ontologies to manage the data in his prototype system, and how it consumes data from popular social web 2.0 sites to generate conversational choices.

Martin Dempster on Semantic enhanced Assistive Technology

The event was hosted and facilitated by Paola Di Maio from the University of Strathclyde; thanks to Paola for organising the event, the university for laying on wifi and tea and coffee, and Talis for sponsoring the lunches.

European Semantic Web Conference 2008 @ Tenerife

Last week Tom and I were in Tenerife for the European Semantic Web Conference, where he was chairing a session, and I was presenting a short paper on RDF/JSON, both at the Semantic Scripting Workshop.

Scripting Workshop

The scripting workshop itself was excellent – I enjoyed all of the others’ papers a lot, and look forward to playing with the code, ideas, and applications they presented.

The paper just before me was about using RDFa and JavaScript to allow in-page editing of resources. One particularly nice thing about it was that they implemented our RDF/JSON specification in their API (cheers guys! ;) ).
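For those who haven't seen RDF/JSON, the gist is a nested object keyed by subject, then by predicate, with a list of object descriptions at the bottom. The sketch below is only a rough illustration of that shape: the helper function and the example resource URI are made up, and it leaves out the optional language and datatype keys.

```python
import json


def to_rdf_json(triples) -> dict:
    """Group (subject, predicate, value, kind) tuples into the nested RDF/JSON shape.

    `kind` is "uri", "literal" or "bnode".
    """
    document = {}
    for subject, predicate, value, kind in triples:
        objects = document.setdefault(subject, {}).setdefault(predicate, [])
        objects.append({"type": kind, "value": value})
    return document


# Made-up example resource.
triples = [
    ("http://example.org/about", "http://purl.org/dc/elements/1.1/title",
     "Example Page", "literal"),
    ("http://example.org/about", "http://purl.org/dc/elements/1.1/creator",
     "http://example.org/people/anna", "uri"),
]
document = to_rdf_json(triples)
serialised = json.dumps(document, indent=2)
```

Since the result is plain JSON, a script in the page can look up `document[subject][predicate]` directly without needing an RDF parser.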

The scripting challenge was won by the highly deserving but sadly absent Benjamin Nowack with SPARQLBot – his IRC bot that can read and answer questions from RDF data sources. The second prize was won by Alexandre Passant and co for their Semantic Microblogging system, which looks very interesting indeed.

SPARQL

There were quite a few interesting papers on SPARQL – extending it in various ways, or extending SPARQL into other technologies. There were two papers on using SPARQL to bridge the XML and RDF divide: one on embedding SPARQL in XSLT extension functions (‘enabling the developer to combine XSLT and RDF in a way that doesn’t suck’) ; another on combining XQuery and SPARQL.

One paper I thought was especially interesting was about extending SPARQL to work on streams of data.

The best paper award was on SPARQL, and won by Christoph Kiefer, Abraham Bernstein, and André Locher for “Adding Data Mining Support to SPARQL via Statistical Relational Learning Methods”.

Vocabularies and Ontologies

Richard Cyganiak presented Neologism – an open source Drupal-based web application for creating and publishing vocabularies. What is great about this application is not its features, but its philosophy. While desktop applications like Protégé may provide lots of features for designing, creating and reasoning with ontologies, they don’t help with the publishing of them – which can be a rather tricky issue. The idea behind Neologism is to make it easy for vocabulary authors to do the right thing, and author and publish their vocabularies according to best practice – an aim I really applaud.

I also attended a tutorial on developing ontologies with the use of patterns – starting by reusing some basic modeling patterns, which could be used as a mold and later discarded when the design was complete. The tutorial incorporated this into an XP system of development, involving test-driven ontology design – which I thought was an interesting idea.

voiD

Michael Hausenblas presenting voiD

One thing from the conference I particularly enjoyed was Michael Hausenblas, Richard Cyganiak, Jun Zhao, and I developing Michael’s idea of metaLOD into a vocabulary for describing datasets like those in Richard’s famous LOD diagram (see the slide in the photo above).

The idea was to come up with a light-weight vocabulary that would enable RDF descriptions of interlinked datasets; these descriptions (and so the various access points to the datasets) can be made discoverable via the Semantic Sitemaps extension, and aggregated via services like Sindice. I have a practical interest in this as well, since with the Platform (and our work on the Open Data License), we want to enable people to publish lots of datasets on the web. We already have lots of interesting datasets in the platform (which I have been doing some work on describing in the silkworm-dev store), and we are really keen to make our publicly available datasets discoverable by machines and humans, and available for reuse.

We spent about an hour discussing the weighty issues of scope, vocabulary reuse, and ontology modeling patterns – then at least a further hour trying to come up with a suitable acronym. Finally Laurian advised us not to think of an acronym, but of a cool-sounding word that encapsulated what the vocabulary would give to the world. So we came up with voiD: a Vocabulary Of Interlinked Datasets.

(see also Orri’s “VOID, Or will the LOD Cloud Bring Rain”)

Semantic Games

I saw two papers on games and the semantic web at ESWC. The first was Knud Möller’s highly entertaining talk on World of WebCraft – Mashing up World of Warcraft and the Web at the Semantic Scripting Workshop, where he showed how he gleaned semantics from the game by scripting add-ons, mashing it up with data from DBpedia and screenshots from Flickr.

OntoGame presentation
The second was Katharina Siorpaes’ presentation of her work on OntoGame, which applies Luis von Ahn’s Games With a Purpose concept, using online multiplayer games to get people to perform what might otherwise be rather dull tasks in ontology creation and alignment, and data annotation. The idea centers around using blind collaborative game-play to achieve consensus and accuracy on what is common knowledge to humans, yet opaque to machines. I wondered if such game-play would be compelling to mainstream users, but according to the paper’s authors, the social aspect of these games can provide plenty of interest and incentive to keep playing. It’s a thought-provoking concept anyway, and it will be interesting to see what develops in this area, and in which niches these techniques will work best.

Demos + Posters

The demos were really good. From asking the other attendees, I think the favourites were QuiKey, a Quicksilver-like interface for entering and searching through triples (which won best poster); xOperator (a really intriguing combination of Jabber, SPARQL, and Agents to bring you trusted answers to questions via Instant Messaging); OntoGame (described above); and Konduit, a Semantic Pipes-like application for visual programming for the Semantic Desktop (which made me reconsider my position in an IRC discussion with iand about whether to describe application flow in code or data; it also won best demo).

Lightning Talks

The lightning talks were, as ever, a popular and light-hearted, yet thought-provoking event. The format was a tight 2 minutes, 1 slide, which was strictly adhered to: Andraž Tori of Zemanta gave a very good presentation that was roundly and deservedly booed when he tried to slip in 3 slides. At first I thought 1 slide would be a bit limited, but it was actually pretty good – giving each speaker a chance to present only one single idea at a time. All the talks were entertaining, but some that stood out for me were:

  • Jenny Green from the Ordnance Survey explaining that, in one database, they currently held enough data to overwhelm any triple-store in existence, and would need a large server-farm to store and serve it all.
  • Laurian Gridinoc advocating the use of RDFa

    Laurian
  • Andrew Green explaining How He Learned to Relax and Love the Bnodes: use them when you only need a ‘glue’ node that isn’t a ‘thing’ in its own right, and doesn’t deserve an identifier (I’m still not convinced: bnodes, bah!).
    ESWC2008 Lightning Talks - Dr. Strange Semantic Unit

The only great pity was that the lightning talks were run in parallel with other tracks, so I missed out on the start of the Applications track. Next year, hopefully they will run the lightning talks separately from the rest, and record them for posterity (the other talks had video-cameras in attendance).

Industry

ESWC is a pretty academic conference, but it was really interesting to meet people from other companies making great use of semantic technologies to their competitive advantage, like ProKarriere, an Austrian online recruitment service that uses tools like Crowbar and Solvent to scrape semantics from partner web-sites, together with Natural Language Processing, and ontologies they have developed, to intelligently match graduates up with appropriate vacancies. Or like Net7, who have built Talia, a semantically backed digital library for Philosophy Scholars (described in their Scripting Workshop paper). Or like Garlik, who had a whole keynote about them.

Interdisciplinary

A theme that appealed to me was using semantic technologies outside of Computer Science departments to aid scholars in other fields – there was quite a strong presence from Finland with their work in the cultural heritage sector, visualising time and space. I also really enjoyed seeing Jun Zhao’s presentation about using off-the-shelf semantic software like Exhibit to help zoologists navigate a repository of research images, the Net7 guy’s demonstration of their Digital Philosophy library Talia, and hearing about Norman Gray’s work using RDF with astronomers.

While the quality of the presentations was really high, the best bits, as usual, were the socializing and informal discussions in between, meeting names I’d long been familiar with from the various semweb mailing lists, blogs and IRC channels (#swig, #swhack, #sioc etc), and new people besides.

SWIG-Scotland

It was also nice to meet a few other people living in Scotland doing semweb stuff – there don’t seem to be that many of us. So I set up http://groups.google.com/group/swig-scotland in the hope that we can all arrange to meet up some time and talk triples (please join if it’s of any interest).

Looking over my copy of the proceedings, I realise that there’s so much stuff I didn’t see that I would have liked to (the tragedy of parallel sessions), and so much stuff I did see that I haven’t done justice to here – all the Nepomuk semantic desktop stuff for instance, or DERI’s research into sensors connected to the web, or the Vapour tool for testing HTTP conneg, or … but I have to stop now :) . Suffice to say, it was great, and I’ve got a lot to think about and try out over the coming weeks.