23 March 2007
Toward the Web of Data
Posted by Paul Miller at March 23, 2007 11:34 PM
All around us, data-rich applications are moving out of the enterprise and into the (network) cloud. As they move from intranet to internet, expectations as to the manner in which dispersed and disparate data should be available for use and reuse are evolving rapidly, and challenging the previously robust models of database design that have served the traditional enterprise so well for many years.
Let us explore some of these changes in technology and in expectation, and propose a solution more suited to manipulating heterogeneous data spread throughout the cloud; a solution engineered from the outset to fulfil the requirements of that amalgam of ‘Web 2.0’ approaches and ‘Semantic Web’ technologies popularly labelled ‘Web 3.0’.
Let us reach toward the prospect of realising the world wide database, built upon Semantic Web principles and well-established web protocols, and incorporating the same massively scalable, partitioned and decentralised architecture as the web itself. Let us support distributed manipulation of structured, unstructured and binary data, served up via an open source semantic web server that allows existing content and databases to be woven into the emerging web of data with ease.
History is littered with examples in which the previously innovative and differentiating become commoditised. For a short time, Thomas Edison’s companies monopolised the supply of long-lasting incandescent light bulbs - and electricity - before their competitors caught up and moved competition to a new level. In the automobile industry we see repeated examples in which high-end differentiators such as power steering, airbags, or satellite navigation progress from concept through high-end to mainstream, and the same pattern can be seen in the rapidly changing mobile phone environment with increasing penetration for cameras, music players, and now navigation systems. With each advance, financial and technical barriers fall and innovation and competition move to a new level, built upon now commoditised foundations.
On the internet, there is significant interest in providing the 'platform' foundations upon which the next wave of innovation will be built. We continue to see computer operating systems such as Microsoft Windows, Mac OS and the various flavours of Unix subsume and mediate the complex interactions between hardware, software, and user, offering increasingly rich yet intelligible hooks to which developers can pin their own applications. The operating system is an increasingly capable platform, and it is one that succeeds in binding developers and users around it in a complex yet largely self-sustaining ecosystem of mutual benefit. A level down, hardware providers such as Intel work harder to bind customers to their platform by building additional capabilities such as digital rights management (DRM) or graphics processing into their hardware. A level up, and increasingly squeezed as the operating system seeks differentiation by ‘bundling’ applications and capabilities previously provided by others within the surrounding development ecosystem, sit applications developers. In an attempt to increase their value proposition, and to bind their customers increasingly tightly, these companies too are making a play to be recognised as platform providers within their particular vertical market. Oracle, Salesforce, and similar companies are prime examples of this trend, and it remains to be seen how successful their efforts will prove in the longer term. Apple could also be considered to seek a platform position in the media space; the marketing and (relative) affordability of their iPod range of devices is focussed upon binding consumers ever more tightly to the far more lucrative iTunes store, and recent announcements see this store extending its reach into the living room, onto the mobile phone, and beyond.
Above the level of such network fundamentals as TCP/IP, we have yet to see the emergence of a truly generic platform on the internet. Providers of vertical applications increasingly describe their technologies as platforms, whilst remaining focussed upon the needs of the market segment (finance, libraries, customer management, etc) from which they emerged. However, if the picture painted so long ago by Tim Berners-Lee and others of a Semantic Web is to be realised and put to work in underpinning the next generation of b2b, b2c, c2b and c2c applications then a new platform is required; a pragmatic platform engineered to facilitate timely, robust and relevant interactions with semantically rich data, densely interconnected yet physically distributed throughout the cloud and subject to diverse models of ownership and control.
Work around the Talis Platform bridges the divide between the rigorous ontological approaches of the Semantic Web and the free-form exuberance associated with the cream of Web 2.0, bringing together the best of both worlds so that data may be manipulated and consumed flexibly, reliably, and powerfully. Internet-strength generic approaches and components are combined with the capability for vertical market specialisation, with existing community standards and specifications employed to ease adoption, engagement, and ongoing development.
The history of the web is littered with examples of great ideas, poorly executed or never adopted sufficiently to ensure the emergence of a sustainable community. There is a fundamental - and increasingly significant - difference between that which new technologies make possible and that of which those same technologies are capable once their adoption passes some critical point on the curve.
Take-up of web-based behaviours continues to grow apace, as high speed and always-on connections to the internet become ever-more ubiquitous and affordable. Users of these connections behave very differently online, taking advantage of an effectively instantaneous availability of distant resources to blur the boundaries between resources on their own computer and those remote to them. In turn, we see the sites to which these users flock themselves changing, adapting to the ways in which evolving behaviour fundamentally alters interactions between a site, its users, and their broader community of interested peers. These social interactions, so closely aligned with the current enthusiasm for ‘Web 2.0’, are part of a significant shift in the web and the behaviour of those using it; we are seeing a quite unprecedented move from often passive consumption toward active engagement and participation. This participation is manifest in many forms, including the explosion of blogs, the popularity of Wikipedia, MySpace’s position high in the global rankings for web sites, and the value placed in customer reviews of items on Amazon.
Whilst these trends continue to be significant, and point toward an online experience in which participation - if you wish it - could become as easy as consumption, the current generation of participative services has difficulty reconciling two competing needs. Users wish to use and reuse their content - and even their clickstream - in diverse scenarios of relevance to themselves. Businesses need to monetise that engagement, often by constraining the ways in which outputs from one application might meaningfully be enmeshed with outputs from another (potentially competing) service. The user might wish to offer a single ‘wish list’, capable of fulfilment via either Amazon or Barnes & Noble, but how does such an apparently reasonable desire fit with the current business model and infrastructure of either retailer?
Increasingly, we are coming to recognise the value of online communities. This value is not new, but recent technological and practical advances have made it far easier for those communities to form, and to grow. The expensive purchases of web properties such as YouTube and MySpace have largely been about seeking ownership of a community rather than any technology upon which that community may depend. Although they would certainly appear to be reaping dividends at present, such purchases are inherently risky; web-based communities may form with remarkable rapidity, but they are equally capable of dissolving - or moving elsewhere - just as fast.
At the moment, sites for which data are fundamental (whether participatively assembled or procured from some third party provider) face a difficult challenge. As has been remarked previously, the cost of initiating a technology start up has fallen dramatically, largely because of the rise of commodity hardware, the widespread availability of network bandwidth, and the quality of open source software upon which to build. No such trend can really be observed for the data upon which so many of these sites depend, and (with some exceptions) the APIs that provide access to much of the value locked within these applications still tend toward the superficial. As such, companies are required to invest significantly, either in procuring content with which to seed their service, or in covering their ongoing costs whilst a community of use gradually grows around them, seeded by early adopters who see the potential and are not dissuaded by an early paucity of content. Lowering barriers to the controlled sharing of data, and underpinning alternative business models that need not restrict reuse to the degree we often see in current applications, is a recognition that value is shifting and that costs need to be driven out of the mundane work of gathering, exposing, orchestrating and moving data. New opportunities then present themselves, both for meeting a user’s desire for data portability and for meeting a company’s need to cover its costs and make a profit.
As Google CEO Eric Schmidt is reported to have said at the 2006 Web 2.0 Summit,
“The more we can, for example, let users move their data around, never trap the data of an end-user, let them move it if they don't like us, the better.”
It is important to note that most of the necessary technology and standards already exist. The issue is primarily one of adoption, and of disruption to existing modes of operation, allied with the capability to assemble the pieces in such a way as to facilitate accelerated deployment.
Moving beyond partisanship to the Semantic Web, Web 2.0, or any of the other labels currently prevalent within the technology space, our work on the Talis Platform adopts a pragmatic approach to providing a robust and cost-effective set of infrastructure services upon which a wide range of applications may be built, both within and across vertical market segments.
Today, it is difficult for applications to fully realise the potential of structured data distributed across the wider web, yet if we are to approach the opportunities presented by the Semantic Web - the Data Web - these obstacles must be overcome, and that promise must move from the world’s research laboratories to the web sites that we all frequent on a daily basis. There are numerous reasons for these difficulties, but the principal ones are:
- diverse data structures
- scalability
- performance issues, given the lack of access to common indexes
- the absence of an effective means of uniquely and reliably addressing specific searches or their results
- a restrictive attitude to data sharing.
Fundamentally, rapid and complex interactions with data remain the preserve of the database; a carefully constructed and heavily optimised application that depends for its success upon the generation of indices far in advance of any query being submitted. A search through a large database for a single piece of information does not, in fact, mean the software examining each record in turn until those matching the query are found. Instead, pre-computed indices that have been generated in order to allow the rapid narrowing of a search are queried, and a significantly smaller set of records is then selected for closer examination in answering the query. Similarly, in searching for a street address in London, the searcher does not move from house to house, linearly. Instead, they consult the index of an A to Z guide, find the page on which the street is mapped, then find the grid square within that page. Upon arrival in the physical street, the searcher is unlikely to start at one end of the street and continue to the end, checking every house until they reach the address they seek. Rather, prior intelligence is applied to work out roughly how far along the street a particular number - 100, say - is likely to fall. The searcher moves rapidly to the approximate area, and then examines houses more carefully in order first to validate their guesstimate and second to locate the exact property they seek.
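The difference between examining every record and consulting a pre-computed index can be sketched in a few lines of Python. This is only an illustration of the principle, not a description of how any particular database is built; the records, field names and function names below are all invented for the example.

```python
from collections import defaultdict

# Hypothetical records; the field names are illustrative assumptions.
records = [
    {"id": 1, "street": "Baker Street", "number": 221},
    {"id": 2, "street": "Downing Street", "number": 10},
    {"id": 3, "street": "Baker Street", "number": 100},
]


def linear_scan(rows, street):
    """Examine every record in turn (the approach a database avoids)."""
    return [row for row in rows if row["street"] == street]


def build_index(rows, field):
    """Pre-compute a mapping from each field value to positions in rows."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[field]].append(position)
    return index


def indexed_lookup(rows, index, value):
    """Use the pre-computed index to jump straight to candidate records."""
    return [rows[position] for position in index.get(value, [])]


# The index is built once, in advance of any query being submitted.
street_index = build_index(records, "street")
matches = indexed_lookup(records, street_index, "Baker Street")
```

Both functions return the same records, but the indexed lookup touches only the candidates the index names, at the price of having to rebuild or update the index whenever the records change.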
Databases work very well, but they continue to rely upon this pre-computation of indices and, as such, they tend to require both detailed prior knowledge of any structure to the data they hold and reasonably tight control over the movement and evolution of the data. When individual records change, indices need to be updated or recalculated. In large databases, where more than one index is involved in some complex intersecting query, even the physical location of individual records and indices becomes highly significant, as those minuscule delays required to pass information from one part of a network to another rapidly add up. For these reasons, and more, the effective database has remained very much outside the network cloud. Web based applications certainly query databases, but those queries are passed - in their entirety - to databases that discover an answer and then report it back out onto the web. All of the internal processes that turn a query into a result take place offline, and this reality of modern database design ensures that interaction with data - as envisaged in the Semantic Web - is currently incapable of taking place within the cloud at scale whilst retaining acceptable rates of performance.
Consider, instead, a notional ‘database’ running inside the cloud. This database is agnostic as to data provenance, format and structure, and is exposed via a set of discoverable web services in such a way that any third party is able both to declare their own data table and to expose it to others for query and other manipulation. The name of each table is uniquely and universally addressable and each item within the table - each data record - is itself a logical refinement of that address. Now anyone can create a table which has pointers to data in any other public table, and those pointers create a web of intermeshed data. Any table can be queried if you have the appropriate permissions. Sets of tables can be combined in a union query or turned into a materialised aggregation or view. There is a special web of data query which, if given a dataset, will retrieve other records by following the web of data pointers. In this way you can navigate around the web of data from record to record much like you can between documents by following hyperlinks on today’s web. As well as offering access to the original data, it becomes feasible to offer direct access to the aggregations, and for those aggregations to evolve as the underlying data upon which they are based changes. Permission to query or otherwise use tables you do not own is based on the sharing permission of those tables and may require a subscription or contract with the data owner, or the data may be under an open data license or otherwise free to use.
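To make the idea of addressable tables and followable pointers a little more concrete, here is a toy in-memory sketch in Python. It is not the Talis Platform API; the example.org addresses, the record contents and the function names are all assumptions made purely for illustration.

```python
# Hypothetical table addresses; every name here is an illustrative assumption.
BOOKS = "http://example.org/tables/books"
AUTHORS = "http://example.org/tables/authors"

# Each record's address is a refinement of its table's address.
web_of_data = {
    f"{BOOKS}/1": {"title": "Weaving the Web", "author": f"{AUTHORS}/7"},
    f"{BOOKS}/2": {"title": "Untitled draft", "author": f"{AUTHORS}/9"},
    f"{AUTHORS}/7": {"name": "Tim Berners-Lee"},
}


def is_pointer(value):
    """Treat any string that looks like a web address as a pointer."""
    return isinstance(value, str) and value.startswith("http://")


def follow(start, store):
    """Return the addresses of all records reachable from start via pointers."""
    seen, queue = [], [start]
    while queue:
        address = queue.pop(0)
        if address in seen or address not in store:
            continue  # already visited, or not resolvable in this store
        seen.append(address)
        queue.extend(v for v in store[address].values() if is_pointer(v))
    return seen


def union(store, *tables):
    """Combine the records of several tables into one result set."""
    return {
        address: record
        for address, record in store.items()
        if any(address.startswith(table + "/") for table in tables)
    }
```

Calling `follow(f"{BOOKS}/1", web_of_data)` walks from the book record to its author record, much as a browser moves between documents by following hyperlinks. A real service would also need to check sharing permissions before resolving each pointer, which this sketch leaves out.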
Now consider that individual items in these notional database tables can in fact be structured data like XML or RDF, semi-structured like HTML, or binary like a JPEG image. This means the database can operate as a file system with the ability to add metadata to its contents. The database is optimised for extremely rapid query and data mining rather than for update transaction processing. Queries can be fired into the database as simple REST URLs by pasting them into a browser and the data is returned as RSS or JSON for easy integration into a third party application’s interface.
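As a rough sketch of what firing a query as a simple URL and consuming a JSON result might look like, consider the following Python. The endpoint, the parameter names and the shape of the response are assumptions invented for the example; none of them is taken from a published interface, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; the address and parameter names are assumptions.
BASE = "http://example.org/tables/books/items"


def build_query_url(base, **params):
    """Encode a query as a plain URL that could be pasted into a browser."""
    return f"{base}?{urlencode(sorted(params.items()))}"


url = build_query_url(BASE, query="semantic web", output="json")

# A canned response standing in for what such a service might return.
sample_response = (
    '{"items": [{"title": "Weaving the Web", '
    '"link": "http://example.org/tables/books/1"}]}'
)


def titles(response_text):
    """Pull the titles out of a JSON result for display in an application."""
    return [item["title"] for item in json.loads(response_text)["items"]]
```

Because the whole query lives in the URL, it can be bookmarked, shared or embedded in a page, and the JSON that comes back can be dropped into a third party interface with very little code.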
Existing applications and databases can be connected into this world wide database (W2DB) via a ‘semantic web server’, allowing data in a legacy system to be made available as part of the wider web of data. The W2DB recognises three kinds of table: fully managed (the W2DB holds the master data and handles updates), partially managed (the W2DB is synchronised with an external master system), and remote (the data stays in an external system which can only support certain query modes).
The Talis Platform supports the creation of domain-specific data platforms on top of this core W2DB. Once available, the data can then be orchestrated with domain specific web services to create powerful applications very simply; so simply, indeed, that website owners and bloggers can build applications as well as professional software developers.
The Talis Platform is a domain-neutral technology base for building distributed information services. Groups of these services can be combined to form vertically specialised platforms such as that already being demonstrated in the library domain. Instances of the Platform services may be deployed privately to provide for the needs of some niche application, or exposed publicly for leveraging in an open and global context.
Contrast the generic Talis Platform - a horizontal Information Platform - with its domain-specific variants as vertical Application Platforms. The former may be thought of as an Information Grid providing abstracted facilities for storing, distributing and coordinating large quantities of information, whereas the latter brings the additional complexity - and opportunity - of domain knowledge in the form of specific data structures, business logic, etc.
The Talis Platform is intended from the outset to be modular, scalable, and extensible. Each core component is exposed via a set of documented APIs suitable for communicating with other Platform services, or enmeshing within some third party application independent of all other areas of the Platform.
To be continued...
Comments
Paul,
interesting topic.
I'll await 4 part 2 but wondering where Talis differs from the XML and OO so-called "open" databases that proport 2 having the same capabilties.
The 'verticalisation' and 'knowledge' bundling/rewrapping of data within the databases seems like u're onto something there. But how 2 make easy 2 use ....
Lal
Posted by: Lal at March 30, 2007 06:13 PM

