by Lee Feigenbaum and Mike Cataldo
This article features in Nodalities magazine, Issue 7
As the old adage goes: Time is money.
Ultimately, information systems are about saving time. One could argue that technology enables analysis that facilitates competitive differentiation or improved product quality, but the fact of the matter is that these things and others could all be done without computers; they would just take much, much longer.
A lot has been said and written about information overload. Ultimately, though, the issue with ever-expanding data is that the data we need becomes hidden in mountains of other data. Typically, these mountains take the form of relational databases where the data is neatly stored in rows and columns, and we find the data in one of two ways. Either we directly look up data by its “address” within the database, or else we use a simple text search. But if we don’t know what table or column the data resides in, we can’t look it up. And as the quantity of data grows, text searching the mountain of data itself yields a mountain of results. Combing through these results then compromises the real benefit of information technology: time savings.
This leads to the greatest challenge facing IT organisations across industries: how to provide users the data they need when they need it, visualised in a way that is understandable and useful. Or put more simply: get the right data, for the right people, at the right time. Traditionally, this is much easier said than done, as the data lives in multiple databases, exists in various formats, and no user interface exists to present the information in a way that is helpful to the user.
Typically, the approach to solving these problems involves some sort of data warehouse. Atop the warehouse, we’d probably deploy a business intelligence (BI) solution to surface the answers to common queries to the people who need them.
Another tactic might be to install a document management system that stores documents in a central repository, where employees can use search and basic metadata to better locate individual pieces of information.
Or we might build a portal to allow people to view the right data from multiple silos in a timely fashion. By defining a collection of portlets as views into specific sources of data, we can provide a one-stop location for people to view information from business-critical data sources.
Pursuing any of these typical solutions means spending 6-18 months at a time solving a single problem. And even worse, all of these approaches are doomed to obsolescence from the start. As requirements change, the fixed schemas and the complex ETL processes inherent to data warehouses must be recreated from scratch. The canned queries and views that define BI- and portal-based approaches must be constantly re-evaluated. And the limited search and query capabilities of a document management system mean that new requirements demand a new installation.
In short, traditional approaches all suffer from the dreaded Shampoo Syndrome: the only workable long-term solution is to constantly lather, rinse, and repeat. And when we do, we just create another mountain of data, another place where what we really need can hide.
The solution is to find data by its meaning rather than its location
The key to eliminating many of the inefficiencies of today’s information technology solutions is to access data by its meaning—what it is—rather than its location—where it is. With meaning, we can quickly find what we need simply by describing what it is. This enables information to be shared and consumed at the data level, a paradigm known as data collaboration.
With data collaboration, the data is much more granular, more accessible, and more consumable. In contrast, data warehouse, BI, and portal solutions, in addition to contact tracking (CRM), supply-chain management (SCM), employee management (HR), and all-in-one enterprise bundles (ERP), all fall into the category of data containment. While these applications (commonly known as data silos) excel in capturing extremely structured data, they make it almost impossible to get the data out to be re-used by other users and in other applications.
Document management systems, on the other hand, attempt to make information more shareable, but essentially end up creating many mini-silos in the form of Word documents, PDFs, Excel spreadsheets, or Web pages. This is the world of document collaboration, in which information is readily shared, but the data we need is locked within the mini-silo.
Data collaboration is the best of both worlds. By combining the ease of access to information that is the hallmark of document collaboration with the highly structured nature of data from data containment solutions, we can begin to answer the IT challenge. The key to success is to ensure that the meaning of every data element is surfaced so that it can be easily accessed by any person or application that needs it.
Data Collaboration and the Semantic Web
It’s no coincidence that the technology standards developed over the past ten years in support of Tim Berners-Lee’s vision of a Semantic Web are the key elements for building data collaboration solutions. For as with data collaboration, the Semantic Web relies on explicitly capturing the meaning of data. As such, the core Semantic Web standards pave the way for:
- Flexible, define-as-it-arrives, data structures
- Explicit relationships that travel with the data
- Data that is accessed by its definition rather than its address
- Distributed query
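To make the first three points concrete, here is a minimal sketch in Python of data held as subject-predicate-object statements, the model that underlies RDF. It is a toy illustration only: the `match` helper, the `ex:` identifiers, and the two sample sources are assumptions made up for this example and are not part of any product or standard library.

```python
# Toy illustration only: a list of (subject, predicate, object) statements.
# All identifiers below (ex:acme, ex:region, ...) are hypothetical.

def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Facts arriving from two imagined sources that share no schema.
crm = [
    ("ex:acme", "rdf:type", "ex:Customer"),
    ("ex:acme", "ex:region", "EMEA"),
]
sheet = [
    ("ex:acme", "ex:q3Revenue", "1200000"),
    ("ex:globex", "rdf:type", "ex:Customer"),
]

# Combining the sources is plain concatenation: no table has to be altered
# when a new kind of fact shows up.
graph = crm + sheet

# Ask for data by what it is ("every customer"), not by table and column.
customers = sorted(t[0] for t in match(graph, p="rdf:type", o="ex:Customer"))

# Everything known about one thing, whichever source it came from.
acme_facts = {p: o for _, p, o in match(graph, s="ex:acme")}
```

The relationships travel with each statement, so a new source can contribute facts about `ex:acme` without anyone redesigning a schema first.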
As with all standards, Semantic Web technologies lay the groundwork that makes improvement possible. It is up to application developers to build solutions that make the standards practical.
Practical Data Collaboration to Solve IT’s Challenge
Cambridge Semantics is one of the first companies to develop practical business solution enablers based on Semantic Web standards. In short, the Anzo products allow businesses to layer a semantic fabric over existing data that:
- Virtualizes the data so that it is accessible by its description regardless of location.
- Lets users create their own views of data.
- Fills in the views by traversing the fabric and picking out the relevant information.
- Keeps everything in synch by allowing updates that occur anywhere to update information everywhere.
The Right Data…
At the heart of the Anzo suite of products is the Anzo Data Collaboration Server. This acts as a central gateway that provides a consistent interface for applications to read, write, and query RDF data, regardless of the actual source of the data. While RDF provides the flexibility to incorporate new data as it is virtualised, it’s all for naught without the proper adaptors for existing data sources. To facilitate access to the right data, the Anzo Data Collaboration Server can connect to data sources including LDAP directories, HTTP-accessible Linked Data, and standard relational databases.
But perhaps one of the most useful connectors is Cambridge Semantics’ Anzo for Excel. With Anzo for Excel, data inside spreadsheets with arbitrary layouts can be linked into the Anzo Data Collaboration Server. By breaking down the walls of spreadsheet mini-silos, Anzo for Excel weaves information from thousands (or more) spreadsheets scattered across a business, dramatically increasing the availability of the right data.
…For The Right People
Getting the data in front of the right people relies on three things: context, security, and “reach”.
Context. It’s not enough simply to have the right data. People must have access to views of the data that depict exactly what they need to see, whether it be an executive dashboard, a regional summary map, or a customer-by-customer detailed report. Cambridge Semantics’ visualisation product, Anzo on the Web, allows the same information to be rendered in many different ways via semantic lenses. Lenses provide context-appropriate user interfaces to render a particular type of data, meaning that the right people see the right data in the right way.
Security. In many ways, security is the converse of context. While context ensures that the right data surfaces properly to the right people, robust security makes sure data does not surface to the wrong people. The Anzo Data Collaboration Server provides security by layering a role-based access control model atop the semantic fabric. All data access is gated through this security model, which defers to the permissions schemes of legacy data sources where appropriate. The result is that only the right people can ever see (or change) the right data.
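The idea of gating every data access through a role-based check can be sketched in a few lines of Python. This is a generic illustration of the pattern, not a description of how Anzo implements it; the `GuardedStore` class, the role names, and the permission table are all hypothetical.

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {"analyst": {"read"}, "manager": {"read", "write"}}


class GuardedStore:
    """Wraps a data store so that every read and write passes a role check."""

    def __init__(self, data, user_roles):
        self._data = data
        self._user_roles = user_roles  # user name -> role name

    def _require(self, user, action):
        # Unknown users have no role and therefore no permissions.
        role = self._user_roles.get(user)
        if action not in ROLE_PERMISSIONS.get(role, set()):
            raise PermissionError(f"{user!r} may not {action}")

    def read(self, user, key):
        self._require(user, "read")
        return self._data[key]

    def write(self, user, key, value):
        self._require(user, "write")
        self._data[key] = value


store = GuardedStore(
    {"q3Revenue": 1200000},
    {"ann": "analyst", "max": "manager"},
)
```

Because callers can reach the data only through `read` and `write`, there is no path to it that skips the check. A fuller version would also consult the permissions of the underlying legacy source, which this sketch leaves out.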
Reach. The right data needs to be able to be brought to the right person, whether that person is a technical staff member, a line-of-business manager, a “power user,” or a senior executive. As such, the software must be within reach of all users, without the need to call on IT. Research analysts must be able to collect and share spreadsheet data themselves. Anzo for Excel reaches these users by allowing spreadsheets to be visually linked with just a few clicks. Supply-chain managers must be able to drill through data on warehouses, suppliers, and distributors on their own terms. Anzo on the Web reaches these users via a simple and customisable faceted browsing paradigm, whereby anyone can add their own filters, add their own lenses, query their data however they like, and save the results to re-run later or share with colleagues.
…At The Right Time
Finally, it’s not enough to just bring the right data to the right people. It also needs to be done in a timely fashion.
First, data access against existing data sources is accomplished via federated (distributed) query. SPARQL is explicitly designed to enable queries that access multiple data sources at once, and the Anzo Data Collaboration Server includes a SPARQL engine that does exactly that. By querying the source data directly, Anzo eliminates the cycle time typically associated with a data warehouse’s ETL processes.
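As a rough picture of what a federated query does, the Python sketch below asks each source for the facts it holds and joins the answers on a shared identifier, without first copying anything into a warehouse. It is a deliberately simplified stand-in for a SPARQL engine: the function names, the `ex:` identifiers, and the two in-memory sources are assumptions invented for this example.

```python
# Hypothetical sources. In practice each would be a live database,
# directory, or spreadsheet rather than an in-memory list.
hr_db = [
    ("ex:alice", "ex:department", "Research"),
    ("ex:bob", "ex:department", "Sales"),
]
ldap = [
    ("ex:alice", "ex:email", "alice@example.com"),
]


def query_source(triples, predicate):
    """Ask one source for every subject/value pair with the given predicate."""
    return {s: o for s, p, o in triples if p == predicate}


def federated_join(sources, predicates):
    """Evaluate each predicate against every source, then join on subject."""
    rows = {}
    for predicate in predicates:
        for triples in sources:
            for subject, value in query_source(triples, predicate).items():
                rows.setdefault(subject, {})[predicate] = value
    # Keep only subjects for which every requested predicate was found.
    return {s: r for s, r in rows.items() if len(r) == len(predicates)}


result = federated_join([hr_db, ldap], ["ex:department", "ex:email"])
```

Here `result` contains a single joined row for `ex:alice`, because only she appears in both sources. The data is read where it lives at query time, which is the property that removes the ETL cycle.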
Second, data updates performed via the Anzo Server are broadcast in real time to wherever the data resides. This means that if a value is changed in a spreadsheet cell, the value instantly updates anywhere else it appears, including Web pages or a relational database. This is essential because, as semantic tools are made available to users across the business enterprise, many spreadsheets, Web pages, and databases will share the same piece of data, and users must be able to rely on every copy with confidence.
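The broadcast behaviour resembles the familiar publish/subscribe pattern, sketched below in Python. This is an illustration of the general technique under simplifying assumptions (one process, synchronous callbacks), not the Anzo Server's actual mechanism; `UpdateHub` and the three stand-in views are hypothetical names.

```python
class UpdateHub:
    """Minimal publish/subscribe hub: one change is pushed to every subscriber."""

    def __init__(self):
        self._subscribers = {}  # data key -> list of callbacks

    def subscribe(self, key, callback):
        self._subscribers.setdefault(key, []).append(callback)

    def publish(self, key, value):
        for callback in self._subscribers.get(key, []):
            callback(value)


# Three stand-ins for places where the same value is displayed or stored.
spreadsheet_cell = {}
web_page = {}
database_row = {}

hub = UpdateHub()
key = ("ex:acme", "ex:q3Revenue")  # hypothetical identifier for one data element

for view in (spreadsheet_cell, web_page, database_row):
    # Bind `view` as a default argument so each callback updates its own view.
    hub.subscribe(key, lambda value, view=view: view.__setitem__("value", value))

# A single edit reaches every copy.
hub.publish(key, 1250000)
```

After the `publish` call, all three views hold the new value. A real deployment would additionally have to deal with network delivery, ordering, and conflicting edits, which this sketch does not attempt.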
Data Collaboration in the Days to Come
Imagine a world in which this challenge has been solved. End users—whether knowledge workers, line of business managers, or executives—can simply draw a picture of what they want to see and then choose the data that should fill in the picture. Within minutes rather than months the right data shows up on the right people’s screens. Now imagine that the data is live as well: you make a correction to the data and your changes are reflected in real-time in whatever legacy database or application the data comes from. You’ve managed to maintain a single source of truth for your key information assets, while still preserving existing investments in legacy systems and applications.
What sounds miraculous is possible today, in software such as Cambridge Semantics’ Anzo. By combining the revolutionary enabling capabilities of Semantic Web standards with solid, practical engineering, we open the door on a completely new paradigm for enterprise software: data collaboration.
Lee Feigenbaum is VP of Technology and Standards at Cambridge Semantics and co-chairs the W3C SPARQL Working Group.
Mike Cataldo is currently CEO of Cambridge Semantics and a veteran of multiple technology start-up companies.