Searching the BBC Data in the Talis Platform

I’ve previously blogged about how easy it is to create a custom search index using the Platform. So obviously during the process of loading the BBC programmes and music data into the Platform we’ve used this feature to build a search engine across their data.

In this post I wanted to show a few example queries and then review how we’ve configured the search indexes so you can not only get the most from the feature, but also see how it can be used against real-world data.

Sample Queries

Here are some sample queries. The Platform is more of a search engine tool-kit than a search engine per se: the results aren’t a human-readable web page, they’re an RSS 1.0 document that contains enough structured metadata about each item in order to build a presentation of the results. And where additional metadata is required, this can be extracted using the describe service, additional searches, augmentation or a SPARQL query.
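To give a feel for what consuming those results might look like, here is a minimal sketch that pulls the URI and title out of each item in an RSS 1.0 document using REXML. The sample feed is hand-written for illustration and is not real Platform output; a real response will carry more metadata per item, and the parse_results helper is a hypothetical name.

```ruby
require 'rexml/document'

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
RSS_NS = 'http://purl.org/rss/1.0/'

# A made-up RSS 1.0 fragment standing in for a real search response.
SAMPLE_RSS = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://example.org/search">
      <title>Search results</title>
    </channel>
    <item rdf:about="http://example.org/things/1">
      <title>First result</title>
    </item>
    <item rdf:about="http://example.org/things/2">
      <title>Second result</title>
    </item>
  </rdf:RDF>
XML

# Return an array of { uri:, title: } hashes, one per RSS item.
def parse_results(xml)
  doc = REXML::Document.new(xml)
  items = []
  doc.root.each_element do |el|
    # Only RSS 1.0 <item> elements are results; skip <channel> and anything else.
    next unless el.namespace == RSS_NS && el.name == 'item'
    title = el.elements.to_a.find { |c| c.namespace == RSS_NS && c.name == 'title' }
    about = el.attributes.get_attribute_ns(RDF_NS, 'about')
    items << { uri: about && about.value, title: title && title.text }
  end
  items
end

parse_results(SAMPLE_RSS).each { |item| puts "#{item[:title]} <#{item[:uri]}>" }
```

Matching on namespace URIs rather than prefixes keeps the sketch working even if a feed declares its prefixes differently.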

However, for the purposes of this article it’s enough to view the examples in your browser. Application developers will want to dig into the underlying markup to see what extra data is included.

  • A search for “Banksy”
  • A search for “The Prodigy” — returning the artist, the dbpedia entry, and episode titles and descriptions in which they are mentioned
  • A search for “Terry Pratchett” — again produces a mixture of different types.
  • A search for “Prodigy” limited to things of type “http://purl.org/stuff/rev#Review”. Results.
  • A facetted search for “Prodigy” grouping the results by their RDF type. Results. This shows us that we have results not only in episodes but in a variety of other types too. We can drill down into these to form the following search:
  • A search for “Prodigy” limited to Music Segments. Results.

If you want to try out your own queries, then use this simple form.

The Configuration

To show how we’ve configured the Field Predicate Map and Query Profile for the BBC Backstage store, I’ve uploaded them to our public SVN: fmap.rdf and queryprofile.rdf

Looking at the Field Predicate Map, you can see we’ve configured the Platform store to index the key predicates in the BBC data, including titles, labels, descriptions and synopses. You can use any of the named fields in the configuration to refine searches to specific predicates in the data, allowing construction of an “advanced search form”. For example, we can search for name:"Stephen Fry" to find a person called Stephen Fry (results).
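As a rough sketch of how an application might turn a fielded query like that into a request URL, the snippet below URL-encodes the query with Ruby's standard library. The base URL pattern, the query and max parameter names, and the store name are assumptions for illustration, so check them against the Platform documentation for your own store.

```ruby
require 'uri'

# Build an item-search URL for a store.
# The URL pattern and parameter names here are assumed, not taken from the post.
def search_url(store_name, query, max: 10)
  params = URI.encode_www_form(query: query, max: max)
  "http://api.talis.com/stores/#{store_name}/items?#{params}"
end

puts search_url('my-store', 'name:"Stephen Fry"')
# prints http://api.talis.com/stores/my-store/items?query=name%3A%22Stephen+Fry%22&max=10
```

Letting URI.encode_www_form do the escaping means the colon and quotes in the fielded query survive the trip intact.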

The RDF type property is also included in the Field Predicate Map. This allows us to limit searches to particular types of resource, and also enables facetted searches based on type, giving us an alternate view of the data. It’s easy to see how that functionality could be used to build some useful additional options for restricting the search results presented in a user interface.

To configure the relevance ranking we’ve chosen to boost hits in “labels” (names, labels, titles) over “descriptions” (description, synopses, review). We could easily change the boosting to favour one or other type of predicate to further tweak the results, but this configuration provides a reasonable set of search results for the tests we’ve done. Let us know how you get on and whether you think any of this should be changed. We’re happy to alter the configuration to make sure that people can get the most from the BBC data.
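To illustrate the general idea of field boosting, here is a toy scoring sketch in which a hit in a label-like field counts for more than a hit in a description-like field. This is not how the Platform actually computes relevance; the weights, field names and sample documents are all invented for the example.

```ruby
# Hypothetical weights: label-like fields outrank description-like ones.
FIELD_WEIGHTS = {
  'name' => 3.0, 'label' => 3.0, 'title' => 3.0,
  'description' => 1.0, 'synopsis' => 1.0
}.freeze

# Score a document (a Hash of field name => text) against one search term
# by counting case-insensitive hits and multiplying by the field's weight.
def score(doc, term, weights = FIELD_WEIGHTS)
  needle = term.downcase
  doc.sum do |field, text|
    text.downcase.scan(needle).length * weights.fetch(field, 0.0)
  end
end

# Made-up documents, loosely modelled on the kinds of results mentioned above.
docs = {
  'artist'  => { 'name' => 'The Prodigy', 'description' => 'English electronic music group' },
  'episode' => { 'title' => 'Live Lounge', 'synopsis' => 'Featuring a session from The Prodigy' }
}

# Highest score first: the artist wins because the hit is in its name.
ranked = docs.sort_by { |_id, doc| -score(doc, 'prodigy') }.map(&:first)
puts ranked.inspect
```

Changing the weights in FIELD_WEIGHTS is the toy equivalent of tweaking the boosting to favour one kind of predicate over another.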

Building a Custom Search Index

The Platform is more than just a triple store with a SPARQL interface. It provides a number of other services which are useful for application developers. The most useful of these is the built-in search engine. Each Platform Store has its own search engine that can be used to perform queries over the hosted metadata. So as well as having the option to query your data using the SPARQL query language, you also have the ability to do simple queries over the data with results being returned as RSS 1.0 (with the OpenSearch extensions). This is a nice feature as sometimes you don’t need the full power of SPARQL and for some use cases a more specialized text indexing system is a better option.

The Platform API allows you to configure the system to build a full-text index over any or all of the RDF literals in your stored data. The one exception is the RDF type predicate: this is the only predicate whose resource values will be indexed, making it possible for you to construct a search index and queries that find matches in specific types of RDF resource.

The remainder of this post shows how to configure the Platform to build a custom search index, with example Ruby code using Pho.

It’s common in search engine syntax to use a simple friendly name to identify a specific field that you want to search. For example, in a Google search you can use “intitle:Blah” to search for the text “blah” only in the HTML title element of indexed pages. The Platform uses a similar mechanism to allow you to map any RDF property URI to a short friendly name suitable for submitting in a search query.

The complete set of these mappings is referred to as a FieldPredicateMap. The mapping is specific to each store, allowing different stores to have their own mappings. The Platform API exposes these mappings, allowing you to retrieve and update them yourself.

It is the presence of a mapping of a property URI to a friendly name that triggers the Platform to start indexing the literal values associated with that property. To put this another way: all you have to do to start indexing your literals is define a mapping. It’s as simple as that.

Once a mapping is in place, whenever you submit some RDF/XML to your store, the Platform will automatically index all of the mapped triples. The indexing is done asynchronously so there might be a short delay between the deposit of new content and the indexes being updated. Standard stuff.
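The relationship between mappings and indexing can be pictured with a small in-memory sketch: only triples whose predicate appears in the map make it into the index, keyed by the friendly field name. This is purely a conceptual model and says nothing about how the Platform implements its indexes; the triples, the foaf:nick example and the build_index helper are invented for illustration.

```ruby
# Friendly field name => property URI, mirroring the idea of a FieldPredicateMap.
FIELD_MAP = {
  'name' => 'http://xmlns.com/foaf/0.1/name'
}.freeze

# Toy triples as [subject, predicate, literal]. All values are made up.
TRIPLES = [
  ['http://example.org/buzz', 'http://xmlns.com/foaf/0.1/name', 'Buzz Aldrin'],
  ['http://example.org/buzz', 'http://xmlns.com/foaf/0.1/nick', 'Dr. Rendezvous'],
  ['http://example.org/neil', 'http://xmlns.com/foaf/0.1/name', 'Neil Armstrong']
].freeze

# Build field => token => [subjects], indexing only the mapped predicates.
def build_index(triples, field_map)
  field_for = field_map.invert
  index = Hash.new { |h, field| h[field] = Hash.new { |t, token| t[token] = [] } }
  triples.each do |subject, predicate, literal|
    field = field_for[predicate]
    next unless field # unmapped predicates are never indexed
    literal.downcase.scan(/\w+/).each { |token| index[field][token] << subject }
  end
  index
end

index = build_index(TRIPLES, FIELD_MAP)
puts index['name']['buzz'].inspect
```

Adding a 'nick' entry to FIELD_MAP and rebuilding would be the toy equivalent of defining a new mapping to start indexing another property.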

The Pho Ruby API for the Platform provides programmatic access to this functionality, allowing you to script up the management and creation of the FieldPredicateMap. See the rdocs for the FieldPredicateMap class for details.

Here’s an example Ruby script that illustrates how to manage mappings.

To run the script you’ll first need to fill in the name of your store and your admin username and password. You’ll also need to make sure you’ve installed Pho: gem install pho should do the necessary.

The script does several things. Once a store object has been created, the script creates two new mappings. One for the FOAF name predicate, and one for RDF type:


#create the mappings we want
name = Pho::FieldPredicateMap.create_mapping(store, "http://xmlns.com/foaf/0.1/name", "name")

type = Pho::FieldPredicateMap.create_mapping(store, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "type")

The create_mapping method allows you to quickly generate a mapping suitable for adding to a specific store. In order to fetch the current list of mappings the script then does:


#read the existing mappings
mappings = Pho::FieldPredicateMap.read_from_store(store)

#remove anything for this uri
mappings.remove_by_uri("http://xmlns.com/foaf/0.1/name")
mappings.remove_by_uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

#append the new field-name mappings
mappings << name
mappings << type

The read_from_store method does the actual work, doing a GET request to the Platform to retrieve the mappings as JSON, which are then parsed into some useful Ruby objects. The remaining lines then add the newly created mappings to the current collection, after first ensuring that any previous mappings for those URIs have been removed. At this stage we’ve updated our local copy of the mappings but have not yet saved them back to the Platform.

Storing the updated mappings in the Platform is then just a matter of calling the upload method on the mapping object. This serializes the list of mappings as RDF/XML and then PUTs them back to the store. This will overwrite any of the current configuration with the updated copy we’ve got locally: this is one reason why we fetch the current copy before making the changes, to ensure the rest of the configuration is preserved.


resp = mappings.upload(store)
if resp.status_code != 200
  abort("Failed to upload mappings!")
end

The upload method, like many of the lower-level method calls in the Pho library, returns an HTTP::Message object that you can inspect to determine if the Platform request was successful.

The remaining lines in the sample script simply upload some test data to your store: astronauts.rdf contains a short list of a few astronauts, modelled as simple foaf:Person instances with a foaf:name property. This allows you to test out your newly created search index.

You can now construct item searches with syntax like “name:Buzz” to search for the name Buzz in any foaf:name predicate. Or you can find all foaf:Person instances by performing a search for:

type:"http://xmlns.com/foaf/0.1/Person"

Note that you have to quote the URI. And you can obviously combine those to find only foaf:Person resources with a specific foaf:name.
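A small helper can take care of the quoting and of combining terms. The sketch below assumes the query syntax accepts a Lucene-style AND between terms, which is worth confirming against the Platform documentation; the field_term and all_of helper names are hypothetical.

```ruby
# Build a single field:value term, quoting the value when it contains
# whitespace, colons or slashes (as a URI does).
def field_term(field, value)
  value.match?(%r{[\s:/]}) ? %(#{field}:"#{value}") : "#{field}:#{value}"
end

# Combine terms so that all of them must match (assumes Lucene-style AND).
def all_of(*terms)
  terms.join(' AND ')
end

query = all_of(
  field_term('name', 'Buzz'),
  field_term('type', 'http://xmlns.com/foaf/0.1/Person')
)
puts query
# prints name:Buzz AND type:"http://xmlns.com/foaf/0.1/Person"
```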

I’ve run the script against the n2-examples store, so you can use the item search form to test it out. Or just click here to list all the people.

If you peek at the source of the returned RSS feed you’ll find that the essential metadata for each result — in this case the foaf:name property and the rdf:type — is automatically included. Incidentally if you have a FieldPredicateMapping defined with a property name of title then this will automatically be used as the title for the RSS item, allowing you some minor degree of control over the feed structure if you wanted to make it more human-readable.

The Platform provides you with a few more options for managing your search indexes than I’ve covered here. For example the FieldPredicateMap can also be used to associate an Analyzer with the field allowing you to control the indexing rules. You can also control the relevance ranking of the search results through the use of a Query Profile (which is also exposed through an API, and is manageable using Pho). The query profile lets you associate a weighting with a field, so that when a user performs a search without indicating which field they want to search (thereby searching all fields), then the Platform will alter the relevance ranking of the results to suit your preferences.

That concludes our look at the basic steps involved in building a custom search index over the Platform. While the Pho library provides some useful support, it’s worth remembering that it’s simply a thin veneer over several HTTP operations, so achieving the same effects in another language (or even from the console using plain old curl) should be easy enough. Hopefully the examples have also illustrated the simplicity of working with the Platform to create some quite powerful features and, importantly, that developing against the Platform doesn’t require you to be a SPARQL wizard: there are other ways to get data out of the system, but the power of SPARQL is there when you need it.
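To make the “thin veneer over HTTP” point concrete, here is a sketch of the read-then-write cycle expressed as plain Net::HTTP request objects. The requests are only constructed, not sent. The resource path for the field predicate map is an assumption on my part, and authentication is left out entirely, so treat this as an outline rather than a working client.

```ruby
require 'net/http'
require 'uri'

# Assumed location of a store's field predicate map; verify before relying on it.
def fpmap_uri(store_name)
  URI("http://api.talis.com/stores/#{store_name}/config/fpmaps/1")
end

# GET the current mappings so that existing configuration can be preserved.
def build_get(store_name)
  req = Net::HTTP::Get.new(fpmap_uri(store_name))
  req['Accept'] = 'application/rdf+xml'
  req
end

# PUT the full, updated set of mappings back as RDF/XML.
def build_put(store_name, rdf_xml)
  req = Net::HTTP::Put.new(fpmap_uri(store_name))
  req['Content-Type'] = 'application/rdf+xml'
  req.body = rdf_xml
  req
end

get = build_get('my-store')
puts "#{get.method} #{get.path}"
```

Sending either request is then a matter of handing it to Net::HTTP.start along with whatever credentials your store requires.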

If you have any questions, leave a comment and I’ll try to answer them.