
ScaleCamp 2010

On the 10th of December 2010 a couple of us from the Platform Engineering team attended ScaleCamp 2010 at the Guardian offices in London. Very much like its bigger, older (second?) cousin Velocity, ScaleCamp is a gathering of developers, operations folk and other people with an interest in scaling systems to support increasing numbers of data-hungry users in the post-Web 2.0 age. ScaleCamp aims to fill the gap for UK-based peeps who want to get in on the scalability chin-wagging and knowledge-sharing act. Smaller than Velocity or new-kid-on-the-block Surge, ScaleCamp is now in its second year and still small enough to use the unconference format, allowing attendees to self-organise around whatever subjects float their scalability boats.


Pastries & Scaling your team

The day began with an empty timetable with slots for 40 minute sessions across 5 rooms of varying sizes. And some cheeky pastries. By lunchtime the board was pretty much full, with some intriguing sessions on the cards. First one to tickle my personal fancy was a discussion on how to scale teams. Talk of scaling teams made me remember the phrase “meat cloud”, which still makes me giggle. Like many engineering teams, we pretty much always have more work to do than we can get through, or at least get through for some value of “now”. Adding a good engineer or two (and if you’re a good engineer, we’d love to hear from you) would help us to go a little bit faster, and who doesn’t want that? So we’re certainly searching for the mythical “elastic meat cloud”; turn up the dial, add a few more people, and hey presto, you’re a team scaling guru!

Hmmmm, pastries!

The discussion touched on areas including technical architecture, how to attract and retain good people, and which working practices scale up best in different environments. We pretty much unanimously preferred a modular architecture to a monolithic “big ball of mud”. Loosely coupled components and services make it easier for multiple developers to work concurrently on the same system. An additional benefit is that you don’t need to understand the whole system before you can start to work on part of it, making it easier for new people to contribute earlier.

Good unit and acceptance test suites were also raised as technical concerns that can reduce the friction of adding new people to a project. The lurking fear of silently breaking something you don’t yet understand will certainly slow down new hires.

Handily, we managed to avoid any serious dogma wars while discussing process and methodology, although most of the talk was about various forms of agile approach and what size of team they scale to. Interesting to hear the experiences of people who had been using Scrum with teams of around 20 developers, which appears to be pushing the limits a bit, judging from their testimony. Also discussed was the question of when you need to start some form of line management, whether technical, admin-focused or both. How many people can usefully report directly to the same person? At what point does this start to become unworkable?

File Systems are shiny too!

Next up was a man standing in front of a room full of techies and inviting them to pull his system architecture to pieces. In a nice way. Richard Jones is building a browser-based IRC client that maintains user sessions even when the browser is closed. Richard outlined the requirements and characteristics of his app: append only (no edits), no joining between users, no search, allows users to download logs, page back to see chat they missed, and so on. His goal was to get some ideas to help him scale the app, which he expected might entail replacing the PostgreSQL back-end with something else.

The architecture currently uses table inheritance in Postgres to achieve horizontal partitioning. There is one RDBMS table per day’s worth of data, so the data is basically sharded by day. This allows cheap deletes via SQL “DROP TABLE”, which removes a whole day in one go, as opposed to “DELETE FROM”, which has to remove rows one by one.
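The table-per-day idea can be sketched in a few lines of Python that generate the SQL. This is only an illustration, not Richard's actual schema: the parent table name `messages`, the `sent_at` column and the helper names are all assumptions. The generated DDL relies on PostgreSQL's `INHERITS` clause plus a `CHECK` constraint that pins each child table to a single day.

```python
from datetime import date, timedelta


def child_table_name(parent: str, day: date) -> str:
    """Name of the per-day child table, e.g. messages_20101210."""
    return f"{parent}_{day:%Y%m%d}"


def create_day_table_sql(parent: str, day: date, ts_column: str = "sent_at") -> str:
    """DDL for a child table that inherits from the parent and holds one day of rows."""
    next_day = day + timedelta(days=1)
    return (
        f"CREATE TABLE {child_table_name(parent, day)} (\n"
        f"    CHECK ({ts_column} >= DATE '{day.isoformat()}' "
        f"AND {ts_column} < DATE '{next_day.isoformat()}')\n"
        f") INHERITS ({parent});"
    )


def drop_day_table_sql(parent: str, day: date) -> str:
    """Expiring a whole day of data is a single cheap statement."""
    return f"DROP TABLE {child_table_name(parent, day)};"
```

Queries against the parent table see the rows of every child, so expiring old chat history is just a matter of dropping the oldest child tables.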

Shiny!

A brief discussion of various sharding strategies took place. The well-documented foursquare outage was mentioned to illustrate the potential pitfalls of sharding randomly on user name; this can lead to hotspots in the cluster that can be tricky to manage. There was a certain irony in the fact that I was expecting this discussion to focus on one or more of the shiny new NoSQL databases as a replacement for Postgres, but ultimately it took a turn towards solutions that used good old file systems to manage data storage. Clearly we can find shiny new work in the file system space too, but I suppose the takeaway here is to use whatever tool does the specific job you need, shiny or otherwise.
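To make the hotspot point a bit more concrete, here is a minimal sketch, assuming shards are chosen by hashing the user name. It is not a description of foursquare's setup; the function names and the activity figures you feed it are purely illustrative. The point is that users can be spread evenly across shards while the load is not, because a handful of very active users can land on the same shard.

```python
import hashlib
from collections import Counter
from typing import Mapping


def shard_for(user_name: str, shard_count: int) -> int:
    """Pick a shard by hashing the user name (stable across runs)."""
    digest = hashlib.md5(user_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def load_per_shard(activity: Mapping[str, int], shard_count: int) -> Counter:
    """Sum each user's activity (e.g. messages written) onto the shard they hash to."""
    load = Counter({shard: 0 for shard in range(shard_count)})
    for user, events in activity.items():
        load[shard_for(user, shard_count)] += events
    return load


def hotspot_ratio(load: Counter) -> float:
    """Busiest shard's load divided by the mean load (1.0 means perfectly even)."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean if mean else 0.0
```

A ratio well above 1.0 is the warning sign: one shard is doing far more than its share of the work, whatever the user counts say.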

Analysing droppings using Hadoop

Matt Biddulph of Nokia hosted a session where he outlined work he has been doing to analyse massive datasets about cities. Matt described the process of collecting log files from assorted Nokia applications and analysing them as “inspecting their droppings”. Using these “droppings”, Matt has been able to do things like produce heat maps that visualise which map locations people inspect most regularly on their phones. In general terms, the approach he has used for this is to analyse these massive datasets in Hadoop, then take the resulting, much smaller data and load that into an RDBMS for querying. This approach seems to be the most popular one right now for finding interesting relationships and patterns in big data, although we were all hoping somebody in the room had been doing something different and funky we could learn about, analysing massive data in a more online fashion. Maybe next year.

Eventually Matt wants to be able to use Hadoop to calculate various types of ground truths offline, for example the “normal” number of active Nokia devices in the Notting Hill area. A comparison of streaming data against these ground truths could then highlight interesting patterns, for example how much busier are various locations in Notting Hill during carnival weekend? The possibilities of using the streaming data could extend even further, for example to answer questions like “Which bars in the area are currently too crowded to bother going to, and which are worth a visit?”. Now that’s an app I’d snap up from the Android market place without a second thought.
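The ground truth comparison could look something like the sketch below. It assumes the offline job (Hadoop in Matt's case) has already boiled the logs down to a list of historical device counts per location; the function names and data shapes are my own invention, not anything Matt showed.

```python
from statistics import mean


def build_baselines(history):
    """history: mapping of location -> list of past device counts (offline job output)."""
    return {location: mean(counts) for location, counts in history.items() if counts}


def busyness(baselines, location, live_count):
    """How many times busier than normal a location is right now (1.0 means normal)."""
    baseline = baselines.get(location)
    if not baseline:
        return None  # no ground truth for this location, so nothing to compare against
    return live_count / baseline
```

The expensive part (building the baselines) happens offline, while the comparison against streaming counts is a cheap lookup and a division, which is what would make the "which bar is too crowded?" style of question feasible.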

Gentlemen, let’s broaden our minds

As a developer who has spent most of his career working on various back-end applications, I enjoyed attending a couple of sessions that covered subject matter outside my usual domain. Firstly, Spike Morelli described a systems configuration approach to managing a cluster of several thousand nodes by using a config management tool to roll out only entire images. The QA department apparently loved this, because the release as rolled out was exactly the same as the thing they signed off after testing.

Secondly, Premasagar Rose hosted a session on design patterns for JavaScript performance. Topics covered included jQuery tips, caching data in the browser as JSON values, and making as few DOM calls as possible. A couple of interesting tools were mentioned in the form of jsperf.com and Web Inspector.

Fail at failing

I also enjoyed Andrew Betts’ session on handling errors at scale. Although initially PHP-focused, there was a lot of general wisdom covered in the discussion. People compared notes on logging strategies, monitoring tools, and assorted low-level nitty-gritty. One such hard-won nugget was the value of assigning a unique ID to each request in a distributed system so you can follow it as it moves from one component to the next. We have learned this the hard way here at Talis while attempting to trace SPARQL queries from the Platform web servers through to the RDF stores at the back-end. The “X-TALIS-RESPONSE-ID” header you see in your HTTP response to a SPARQL query is a unique identifier that enables us to see what went on with an individual request all the way through the Platform’s stack. Big Brother sees all, innit?
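The request ID pattern is simple enough to sketch. The header name below is the one mentioned above; everything else (the helper names, using a random UUID as the identifier, the log format) is an assumption for illustration rather than a description of how the Platform actually does it.

```python
import uuid

REQUEST_ID_HEADER = "X-TALIS-RESPONSE-ID"  # header name taken from the post


def with_request_id(headers):
    """Return a copy of headers carrying a request ID, reusing an upstream one if present."""
    tagged = dict(headers)
    # Assumption: a random UUID is unique enough to act as the identifier.
    tagged.setdefault(REQUEST_ID_HEADER, uuid.uuid4().hex)
    return tagged


def log_line(component, headers, message):
    """Prefix every log message with the request ID so logs can be joined across components."""
    return f"[{headers[REQUEST_ID_HEADER]}] {component}: {message}"
```

The first component to see a request mints the ID, and every component after that passes it along untouched and includes it in its logs. Note that real HTTP header names are case-insensitive, which this dictionary-based sketch ignores.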

That’s all very well, but when do I get the X-Ray glasses & exploding cigars?

ScaleCamp organiser Michael Brunton-Spall, who deserves enormous credit for his creation, hosted a session at the tail-end of the day. Michael introduced an approach used by the tech team at the Guardian to analyse a technical crisis after the event. The Analysis of Competing Hypotheses is a technique formulated by the CIA in the 1970s to help identify a wide set of hypotheses and provide a means to evaluate each when looking for explanations of complex problems. Interestingly, there is an open source project providing software to help you do this. The CIA and open source – strange bedfellows indeed, no? Whatever next, the FBI opening a sustainable hemp farm?

A spy

To illustrate the process, Michael used a real example from the Guardian so fresh it was still warm. A week or so before ScaleCamp, the Guardian’s website had slowed to a crawl just before a scheduled live Q & A with WikiLeaks’ Julian Assange. We were asked to shout out possible causes, e.g. “Denial of service attack”, “Too many comments on a page”, and so on. Then we attempted to think of what evidence would prove or disprove each. It was a lightweight version of the full CIA methodology. Our own root cause analysis usually incorporates the 5 whys, but ACH looks like another useful tool to have at our disposal. Plus, we get to pretend we’re spies, although we’ll probably stop just short of the waterboarding.
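For the curious, the core of the technique boils down to a matrix of evidence against hypotheses, where the hypothesis with the least evidence against it comes out on top. Here is a minimal sketch under that reading of ACH; the rating labels, function names and data shape are my own, and it skips refinements such as weighting evidence by credibility.

```python
CONSISTENT, NEUTRAL, INCONSISTENT = "C", "N", "I"


def inconsistency_scores(matrix):
    """matrix: {evidence: {hypothesis: rating}} -> {hypothesis: count of inconsistent evidence}."""
    scores = {}
    for ratings in matrix.values():
        for hypothesis, rating in ratings.items():
            scores.setdefault(hypothesis, 0)
            if rating == INCONSISTENT:
                scores[hypothesis] += 1
    return scores


def rank_hypotheses(matrix):
    """Hypotheses ordered from least to most contradicted (ties broken alphabetically)."""
    scores = inconsistency_scores(matrix)
    return sorted(scores, key=lambda hypothesis: (scores[hypothesis], hypothesis))
```

The emphasis on counting contradicting evidence, rather than supporting evidence, is deliberate: plenty of evidence is consistent with several explanations at once, so it is the evidence that rules a hypothesis out that does the real work.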

Velocity 2010

Two Planes, an IT-related Sitcom, and a Shuttle-related Ruckus (a Shuckus?)

After setting off from Digbeth coach station some 25+ hours earlier, 2 tired Talisians (Matt and me) finally arrived at the Hyatt Regency, Santa Clara, California for O’Reilly’s Velocity 2010 conference for Web Performance and Operations. We first flew from Heathrow to Dallas Fort Worth on a surprisingly cramped American Airlines flight, then on to San Francisco; by the time we hit SFO we were looking forward to getting to the hotel to freshen up and sleep. Night had fallen as our plane approached SFO, which made a nice change after nothing but daylight since waking up in Birmingham well over a day earlier. Not a journey for vampires, I can tells ya.

On the flight from Dallas to San Francisco I sat next to Patrick Wilson, CEO of http://www.vitalsignstechnology.com/ who was on his way back from a conference in Florida. The Valley and Bay Area are chock full of people involved in one area or another of the Tech industry; you can’t swing a cat without hitting a developer, an ops guy and a couple of venture capitalists. And probably doing some damage to the cat.

IT Crowd's "Moss"

Patrick has a pretty extensive knowledge of the area and served as a high altitude tour guide of sorts as we made our approach over the lights of San Francisco (“There’s the Googleplex. That complex there is Sun” and so on). He also revealed a recently acquired taste for the Channel 4 sitcom “The I.T. Crowd”, even going as far as to fire up his laptop and share an episode. I was more than happy to help him with his query as to the “exchange rate between pounds and quid?”, given his obvious delight at one of our most surprising cultural exports.

We knew a cab from SFO to Santa Clara would run us well over 100 dollars; Irish genes kicked in and we started looking for a cheaper alternative, despite our by now zombie-like state of fatigue. At this point we discovered the magic of SFO “Shuttles”. These are basically big truck-like minivans that provide a (relatively) cheap way to get from the Airport to anywhere in the general Bay Area. When enough people want to go in roughly the same direction, you jump in and off you go. Shuttles are cheaper than a cab and organised to cover broad geographical regions (e.g. the South Bay area), but there is an element of pot luck in how long it will take you to reach your destination, depending on the route the driver chooses in order to encompass all stops.

SFO Shuttle

SFO employees are on hand to ensure fair play from Shuttle drivers. The lady we spoke to attempted to find a way to tactfully express the fact (“…it depends on…err… the… size of the people…”) that a family of very large Americans waiting expectantly inside the shuttle we needed meant there should really be 2 spare seats, but realistically there was now only 1 seat with access to a seatbelt. This was causing a problem.  Presumably potential litigation in the case of an accident was a worry. A brief Mexican (or at least Californian) stand-off ensued. SFO lady went to find a supervisor.

The driver attempted to slyly squeeze us in anyway now SFO lady’s back was turned. Oversized family took immediate exception to this blatant disregard for their safety (if only they showed the same concern for their health when passing a Dunkin’ Donuts). Dummies were spat, rattles were thrown, they retrieved their baggage and stormed (actually more like waddled) off in a huff (“You ****** up, Buddy! **** you!”). Too tired to find this episode as amusing as we really should have, we slid into the space vacated by their ample collective bulk and an hour or so later we were in Santa Clara, via Oakland. The driver was relying heavily on his sat nav, which caused some concern when it failed spectacularly. He switched it for an identical one that worked, prompting lame, tired jokes from us about redundancy and switching to a warm standby. First performance lesson of the trip: geek humour degrades dramatically after 26 hours without sleep.

A Serendipitous Meeting between Scribes

On Sunday we had dinner with fellow Velocity attendees Patrick Debois, and Torben Graversen. Matt met Patrick at Puppet Camp a few weeks ago. Patrick is the originator of the term “Devops”; his blog is highly recommended. As we chatted in the bar afterwards, Sean Power approached and asked if we were here for Velocity. He introduced himself and his friend Tracy Lee.

We chewed the fat for a while and talked lean start-ups, performance monitoring, Silicon Valley, and Sean’s upcoming talk at Velocity. Sean mentioned that he had contributed to an O’Reilly book that was due out any day now; Patrick asked which one. It turned out to be “Web Operations”; this book contains a chapter on Monitoring written by none other than one Mr. Patrick Debois! You could have knocked us over with an O’Reilly “In a Nutshell” book, such was the strength of the minor coincidence.

Lack of nerve scuppers a tour of Twitter (or possibly just a beating from security)

Since the conference didn’t start until Tuesday, we hired a car on Monday and drove to San Francisco. Whilst there we met with a couple of friends of Patrick, one of whom was something of a veteran of the Bay Area Tech scene. He told us he had calculated that there were at least 400 tech companies in a 3 block area of San Francisco; pretty mind blowing when you think of it. He also informed us the Twitter offices were just around the corner and suggested we should go in, ask at reception for his friend John Adams, tell him we were here for Velocity, (“Mike told us to ask for you”) and see if he would give us a tour.

We were all pretty surprised to find we were able to stroll into the Twitter building unannounced, take the lift up to the 6th floor, wander into reception and hang around without anybody once challenging us; I kept expecting to be thrown out any second. Ultimately, we were all far too European and reserved to ask for someone we didn’t know, tell him we were friends of somebody else we didn’t really know (“Does anybody know that guy Mike’s surname?”), and cheekily ask for a tour of Twitter, so we just hung around for a bit looking goofy and then left. So much for the meek inheriting the Earth; we couldn’t even blag a tour of Twitter.

Down to Business

Velocity traditionally covers two broad areas:

  • performance of Web applications
  • operations

At first glance some folks may not see the connection between these two topics, but they are increasingly intertwined as engineers seek to build highly available, scalable and fast applications that operate at Internet scale. Here in Platform Engineering & Operations, our development and operations functions work together closely in the same team, so it made perfect sense to us that these tracks had been combined into a single conference looking at performance in a holistic way. It also seemed fitting for us to send one developer and one ops person.

The conference was sold out, with over 1200 people in attendance, and up to 3 tracks at once at various times. Between the 2 of us, we tried to arrange our schedules to cover as many of the presentations and sessions as we could. Some of the sessions were billed as “workshops”, but in reality they were way too big to be anything other than long presentations; 400 people is far too many for anything “workshoppy”. Nevertheless, the content was generally of a very high standard; informative and well presented.

DO go chasing waterfalls…

Quite a number of sessions focussed on optimisation of applications that are delivered to the browser. Although this is not a problem we face directly in delivering the Platform (which is an API), it is an area that has come on in leaps and bounds over the last couple of years and it was interesting to see the current state-of-the-art. Annie Sullivan of Google gave a very good presentation covering many of the techniques engineers turn to when tuning performance of their web pages from the point after server-side processing is complete.

Waterfall charts are a common tool for analysing performance; during the course of the conference we saw many variations created by assorted tools including Webpagetest, Google Page Speed, DynaTrace, WebKit, Gomez, and a host of others. In fact, Steve Souders mentioned that there are twice as many of these types of tools as there were this time last year, which underlines the growth of this area of performance tuning. Performance is arguably more important now than ever before; even Google page ranking is now partially dependent on the speed of your site.

Techniques mentioned by Annie and others included various ways to optimise and minify JavaScript, CSS & HTML, including getting JavaScript into a build system to help you identify dead code, code that can be modularised, and code that could be loaded asynchronously. Asynchrony, along with progressive rendering techniques to ensure the most important parts of the page load first, also featured heavily in an exploration of how Facebook made their site twice as fast.

Engineering for the win!

One of my favourite presentations came from Theo Schlossnagle. Theo’s “Scalable Internet Architectures” was 90 minutes of wisdom covering a vast array of material, from analysing network packet size, to choosing between SQL and NoSQL databases, to version control, caching, monitoring, service decoupling, mastering tools and the importance of engineering maths. A truly wide-ranging and ambitious presentation, skilfully delivered. Unfortunately there appears to be no video, so you can’t really appreciate the moments when Theo worked himself into a righteous engineering rage as he dismissed various bone-headed architectural decisions. However, the slides are still well worth a look.

How do they do that?

Undoubtedly the most over-subscribed session of the week was “A Day in the Life of Facebook Operations” by Tom Cook; I literally had to watch this one while standing in the doorway to the lecture theatre. The room was full to bursting point and Tom did not disappoint. The sheer scale of the job at Facebook is daunting; more than 400 million users, tens of thousands of servers, 300+ TB of data served from RAM alone via Memcached, and multiple software releases and configuration changes every single day across this gigantic stack. A great example of operating on a massive scale and yet still moving quickly and keeping risk small.

Similarly popular and insightful were John Adams’ “In the Belly of the Whale: Operations at Twitter”, John Allspaw’s “Ops Meta-Metrics: The Currency You Use to Pay For Change” and Paul Hammond’s “Always Ship Trunk: Managing Change In Complex Websites”. All of these presenters have real, in-the-trenches experience of managing development and operations in very large, very fast moving Web applications, servicing mind-boggling numbers of users via staggering amounts of code and infrastructure. Much can be learned from them.

Dev what now?

I have become increasingly aware of the Devops movement over the last few months. I believe this kind of thinking has the potential to change the face of Operations the way agile approaches have changed Software Development over the last 10 years, so it was good to see it well represented at Velocity. I particularly enjoyed Andrew Shafer’s “Change Management: A Scientific Classification”, which sounds almost like it could be espousing a very buttoned-down, paperwork and process heavy approach to managing change, but in fact stresses the importance of agile thinking (high-bandwidth communication, version control, small changes deployed regularly and monitored heavily, automation and configuration management tools) in safely managing change. Adam Jacob also touched on Devops during his innovative “Choose Your Own Adventure” session.

There is no spoon

Wedged in amongst all the good stuff on performance in the browser there were a couple of sessions that took different approaches to looking at performance. Firstly, there was Yahoo Search’s Stoyan Stefanov with “The Psychology of Performance”, offering fascinating insights into how humans perceive the duration of various things and what that means for web applications.

Secondly, Neil Gunther and Shanti Subramanyam used performance testing analysis of Memcached in “Hidden Scalability Gotchas in Memcached and Friends” to introduce Neil’s Universal Scalability Law and explain what mathematical modelling can do to help performance tuning in the Brave New World of multi core machines. This was truly eye-opening stuff; the material was accessible enough to pull you in, but deep enough that I will be digesting bits of it and delving into this further for a long time to come. It was also good to see that server-side performance was being addressed at Velocity, albeit on a much smaller scale than the browser-side.
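As a taster, the Universal Scalability Law models relative capacity at load N as C(N) = N / (1 + α(N − 1) + βN(N − 1)), where α captures contention (queueing for shared resources) and β captures coherency delay (the cost of keeping things consistent). The sketch below is simply that formula in Python; the function names are mine, and any coefficient values you plug in are for illustration only, not measurements from the talk.

```python
import math


def usl_capacity(n, alpha, beta):
    """Relative capacity at load n under the USL: n / (1 + alpha*(n-1) + beta*n*(n-1))."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak(alpha, beta):
    """Load at which capacity peaks; only meaningful when beta > 0."""
    return math.sqrt((1 - alpha) / beta)
```

With α and β both zero you get perfect linear scaling. Any non-zero β means capacity eventually peaks and then goes backwards as you add more threads or cores, which is exactly the kind of gotcha the session title refers to.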

A recurring theme for me was the additional material Velocity has pointed me towards; the performance-related blogs of Neil and Shanti being great examples.

What has all that got to do with the Platform?

Common high level threads amongst all these cool kids on the Tech block were being process light but review heavy and making frequent small changes with enough testing, automation and monitoring around them to keep the risk of change minimal, yet keep the pace punchy. I found it encouraging to see how much of this stuff we already do in Platform Engineering & Operations, e.g.:

  • version controlling everything (Subversion and Git)
  • always shipping trunk
  • using configuration management tools (Puppet)
  • stressing peer review
  • extensive automated testing (JUnit, Grinder)
  • monitoring and alerting (Ganglia, Nagios, Cacti, etc.)
  • Continuous Integration (Hudson)
  • dark deployment
  • service decoupling
  • using switches in code to enable/disable features
  • frequent small releases
  • appropriate use of asynchrony
  • judicious use of cloud technologies (EC2, S3 and various other bits of AWS)
  • having ops and devs work closely together.
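To illustrate the “switches in code” item, here is a minimal sketch of a feature switch. The class, the switch name and the search example are all hypothetical; they are not taken from the Platform codebase.

```python
class FeatureSwitches:
    """Tiny in-memory registry of named on/off switches."""

    def __init__(self, enabled=()):
        self._enabled = set(enabled)

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def is_enabled(self, name):
        return name in self._enabled


def search(query, switches):
    """Route to the new code path only when its switch is on (hypothetical example)."""
    if switches.is_enabled("new_search_backend"):
        return f"new:{query}"
    return f"old:{query}"
```

This is what makes always shipping trunk and dark deployment practical: unfinished or risky code can go out with its switch off, then be turned on (and quickly off again if it misbehaves) without another release.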

We don’t yet face the problems of scale that have led Facebook and Twitter to turn to BitTorrent as a means to roll out software quickly to thousands of servers (that would be one of those “nice” problems to have, given what it would represent in terms of take-up of the Platform), and we have some way to go before we can truly say we deploy continuously. However, I left feeling confident in the way we work, primed with new areas for us to explore, and inspired at having gained an insight into how some of the leading lights of Internet-scale engineering make it all hang together.

Puppet Camp 2010

In May, Amanda, Rich and I attended Puppet Camp 2010 in Ghent. Here in the Platform team we have been using Puppet for Infrastructure Configuration Management for just under two years. We first started to look at Puppet when we required an automated process to deploy software to ~100 virtual nodes in Amazon EC2; we knew straight away that Puppet was a cool tool and that it would make our lives as engineers a lot easier! Soon afterwards we implemented Puppet in our testing and staging environments, before finally using Puppet in production.

So what happened @ Puppet Camp? Well, it was really a chance for us to connect with other Puppet users and the developers of Puppet, and to have a voice in the direction Puppet is going. Patrick Debois did a wonderful job organising the event, and really got discussions going by choosing the open spaces format. We heard that the next version of Puppet, which would have been 0.26, will be known as version 2.6 and includes a huge list of new features. Puppet 2.6 will feature a full REST API, pure Ruby DSL, Run Stages (resource/class ordering!), and Class parameter passing. These are all really cool features that will provide even easier methods to automate our infrastructure.

A good talk was given by Jeff McCune on Auditing Change Management Policies with Puppet and Splunk, slides here.  This was a live demo (that worked!) on tracking a change from the commit in Git all the way to the execution by Puppet.  This is definitely something we will be looking at using to streamline our own change management process.

Last but not least, Puppet Forge was announced. This is a publicly available repository for users of Puppet to share common modules for installing and managing software such as apache, nginx, sudoers etc. I think we’ll be contributing modules ourselves when we get round to rewriting them for version 2.6.

It was refreshing to see Luke Kanies, the CEO of Puppet, get heavily involved with discussions during the open spaces sessions, which really gave the whole event an honest and open feel. This was an excellent two days; if you’re looking for something similar coming up, then keep an eye out for DevOps Days EU.