An Internet Archive for Data? A YouTube for data? Or something else?
Over on TechCrunch, Mike Arrington has a piece on a new service called Swivel.
Variously described as “The Internet Archive for Data”, or “YouTube for Data”, the premise is absolutely fascinating.
“…the site allows users to upload data - any data - and display it to other users visually”
“Uploaded data can be rated, commented and bookmarked by other users, helping to sort the interesting (and accurate) wheat from the chaff. And graphs of data can be embedded into websites. So it is in fact a bit like a YouTube for Data.
But then the real fun begins. You and other users can then compare that data to other data sets to find possible correlation (or lack thereof). Compare gas prices to presidential approval ratings or UFO sightings to iPod sales. Track your page views against weather reports in Silicon Valley. See if something interesting occurs.”
Great as this sounds, the potential for deliberate mischief or innocent confusion is truly immense. I remember the palpitations that fellow researchers in the GIS world used to suffer whenever anyone proposed overlaying two pieces of map data of (slightly) differing scale. Imagine the possibilities for discovering that nine out of ten cat owners have a cat, or that ‘everyone’ who voted for Bush lives in Florida… Being able to combine two sets of data and produce an ‘answer’ doesn’t mean that the answer has any value or meaning whatsoever. And at one or more removes from the data, who’s to know what’s ‘true’ and what isn’t?
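To make the point concrete, here is a minimal sketch in Python of how easily an ‘answer’ falls out of two unrelated series. The figures are invented purely for illustration (they are not real UFO or iPod numbers, and nothing here reflects how Swivel itself works); the only thing the two lists share is that both happen to rise over time.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical, made-up yearly figures: both simply trend upwards.
ufo_sightings = [110, 125, 130, 150, 160, 185]
ipod_sales_millions = [1.0, 2.1, 4.4, 8.2, 14.0, 22.5]

r = pearson(ufo_sightings, ipod_sales_millions)
print(f"correlation: {r:.2f}")
```

The coefficient comes out very high, yet it says nothing about UFOs or iPods; it only reflects that both series grew over the same period.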
Important as the capability to re-use data most certainly is, various scientific archives around the world are investing quite vast chunks of their budgets in documenting the detail and the premises behind each data set. When was it collected? By whom? Why? According to what method? Was there a sampling strategy? What proportion of any given population was sampled? How, if at all, were results normalised? And so on. And on. And, yes, on. Collecting data is hard. Manipulating data properly is also hard.
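As a rough sketch of what answering those questions might look like alongside an uploaded data set, the Python below defines a small provenance record and reports which questions are still unanswered. Every name here (`DatasetProvenance`, `missing_fields`, the field names and the sample values) is hypothetical and chosen for illustration; it is not a description of Swivel or of any archive's actual schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DatasetProvenance:
    """Hypothetical record of the questions an archive asks about a data set."""
    collected_on: Optional[str] = None       # when was it collected?
    collected_by: Optional[str] = None       # by whom?
    purpose: Optional[str] = None            # why?
    method: Optional[str] = None             # according to what method?
    sampling_strategy: Optional[str] = None  # was there a sampling strategy?
    sample_fraction: Optional[float] = None  # what proportion was sampled?
    normalisation: Optional[str] = None      # how, if at all, were results normalised?


def missing_fields(record: DatasetProvenance) -> List[str]:
    """Return the names of the provenance questions left unanswered."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# An illustrative upload that answers only some of the questions.
upload = DatasetProvenance(
    collected_on="2006-11",
    collected_by="an anonymous contributor",
    method="web survey",
)

print("Unanswered:", ", ".join(missing_fields(upload)))
```

A service could show that list of gaps next to any graph built from the data, so that someone combining two sets can at least see what is not known about each.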
All of that said, I am truly intrigued by the possibilities when people are given a capability such as this to experiment and to explore their own data along with that contributed by others. The test, in my view, will be to see whether or not Swivel and its participants can find ways to meaningfully and accessibly answer the sorts of questions that will allow others to combine and recombine with confidence.
A further missing piece is, of course, the way in which data are adequately and properly attributed. Might something like the TCL play a part there, giving the owners of potentially valuable data the confidence to contribute it to the pool?
I look forward to seeing where this particular idea goes. Having worked with large - and often tightly controlled - sets of data at various points in my career, I’ve always pushed to make them more accessible. Maybe Swivel is what I was searching for all along. Maybe. I’ll not be able to say more until I’ve had an opportunity to try it for myself, but I certainly look forward to that day.
Looking at the comments on Mike’s post, can I echo ‘Dave’ in asking:
“if the founders are reading this, i’d love to add one critical request (though it might already be in there) - user driven reviews of data integrity and reliability (e.g. ‘this data set has been marked ‘riddled with errors’ by 42 users)…likewise, will there be indications that the data set has come from an authority or expert resource (a la google coop)?”
[and yes, the founders were listening…]
And, in echoing some thoughts of my own, they’re clearly clever people…