
Building a Custom Search Index

The Platform is more than just a triple store with a SPARQL interface. It provides a number of other services which are useful for application developers. The most useful of these is the built-in search engine. Each Platform Store has its own search engine that can be used to perform queries over the hosted metadata. So as well as having the option to query your data using the SPARQL query language, you also have the ability to do simple queries over the data with results being returned as RSS 1.0 (with the OpenSearch extensions). This is a nice feature as sometimes you don’t need the full power of SPARQL and for some use cases a more specialized text indexing system is a better option.

The Platform API allows you to configure the system to build a full-text index over any or all of the RDF literals in your stored data. The one exception is the RDF type predicate: it is the only predicate whose resource values are indexed, which makes it possible to construct a search index and queries that find matches in specific types of RDF resource.

The remainder of this post shows how to configure the Platform to build a custom search index, with example Ruby code using Pho.

It’s common in search engine syntax to use a simple friendly name to identify a specific field that you want to search. For example, in a Google search you can use “intitle:Blah” to search for the text “blah” only in the HTML title element of indexed pages. The Platform uses a similar mechanism to allow you to map any RDF property URI to a short friendly name suitable for submitting in a search query.

The complete set of these mappings is referred to as a FieldPredicateMap. The mapping is specific to each store, allowing different stores to have their own mappings. The Platform API exposes these mappings, allowing you to retrieve and update them yourself.

It is the presence of a mapping of a property URI to a friendly name that triggers the Platform to start indexing the literal values associated with that property. To put this another way: all you have to do to start indexing your literals is define a mapping. It’s as simple as that.

Once a mapping is in place, whenever you submit some RDF/XML to your store, the Platform will automatically index all of the mapped triples. The indexing is done asynchronously so there might be a short delay between the deposit of new content and the indexes being updated. Standard stuff.
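To make the indexing rule concrete, here is a small plain-Ruby model of it. This is only a sketch of the behaviour described above, not how the Platform or Pho implement it: the `FIELD_MAP` hash, the `indexable_fields` helper, the four-element triple layout and the example.org URIs are all made up for illustration.

```ruby
# Plain-Ruby model of the indexing rule; not the Pho or Platform implementation.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical field predicate map: property URI => friendly field name.
FIELD_MAP = {
  "http://xmlns.com/foaf/0.1/name" => "name",
  RDF_TYPE => "type"
}.freeze

# A triple here is [subject, predicate, object, object_kind],
# where object_kind is :literal or :resource.
def indexable_fields(triples, field_map = FIELD_MAP)
  triples.each_with_object([]) do |(subject, predicate, object, kind), out|
    field = field_map[predicate]
    next if field.nil? # unmapped predicates are never indexed
    # Literals are indexed; rdf:type is the only predicate whose resource values are too.
    next unless kind == :literal || predicate == RDF_TYPE
    out << { subject: subject, field: field, value: object }
  end
end

triples = [
  ["http://example.org/buzz", "http://xmlns.com/foaf/0.1/name", "Buzz Aldrin", :literal],
  ["http://example.org/buzz", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person", :resource],
  ["http://example.org/buzz", "http://xmlns.com/foaf/0.1/nick", "Buzz", :literal],
  ["http://example.org/buzz", "http://xmlns.com/foaf/0.1/knows", "http://example.org/neil", :resource]
]

entries = indexable_fields(triples)
entries.each { |e| puts "#{e[:field]}: #{e[:value]}" }
```

Only the foaf:name literal and the rdf:type resource make it into the model index; the unmapped foaf:nick and foaf:knows triples are ignored.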

The Pho Ruby API for the Platform provides programmatic access to this functionality, allowing you to script up the management and creation of the FieldPredicateMap. See the rdocs for the FieldPredicateMap class for details.

Here’s an example Ruby script that illustrates how to manage mappings.

To run the script you’ll first need to fill in the name of your store and your admin username and password. You’ll also need to make sure you’ve installed Pho: gem install pho should do the necessary.

The script does several things. Once a store object has been created, the script creates two new mappings, one for the FOAF name predicate and one for RDF type:


#create the mappings we want
name = Pho::FieldPredicateMap.create_mapping(store, "http://xmlns.com/foaf/0.1/name", "name")

type = Pho::FieldPredicateMap.create_mapping(store, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "type")

The create_mapping method allows you to quickly generate a mapping suitable for adding to a specific store. In order to fetch the current list of mappings the script then does:


#read the existing mappings
mappings = Pho::FieldPredicateMap.read_from_store(store)

#remove anything for this uri
mappings.remove_by_uri("http://xmlns.com/foaf/0.1/name")
mappings.remove_by_uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

#append the new field-name mappings
mappings << name
mappings << type

The read_from_store method does the actual work, doing a GET request to the Platform to retrieve the mappings as JSON, which are then parsed into some useful Ruby objects. The remaining lines then add the newly created mappings to the current collection, after first ensuring that any previous mappings for those URIs have been removed. At this stage we’ve updated our local copy of the mappings but have not yet saved them back to the Platform.
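If the remove-then-append step seems fussy, this stand-alone sketch shows what it buys you. The `Mapping` struct and `LocalFieldMap` class are hypothetical stand-ins written for this example; they only mimic the `remove_by_uri` and `<<` calls used above and are not the Pho classes. The "existing" mappings are invented too.

```ruby
# Stand-in types for illustration only; these are NOT the Pho classes.
Mapping = Struct.new(:property_uri, :name)

class LocalFieldMap
  attr_reader :mappings

  def initialize(mappings = [])
    @mappings = mappings.dup
  end

  # Drop every mapping for the given property URI.
  def remove_by_uri(uri)
    @mappings.reject! { |m| m.property_uri == uri }
    self
  end

  # Append a mapping; returns self so calls can be chained.
  def <<(mapping)
    @mappings << mapping
    self
  end
end

FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Pretend this is what the store already holds: an older mapping for foaf:name
# plus an unrelated one that must survive the update.
current = LocalFieldMap.new([
  Mapping.new(FOAF_NAME, "fullname"),
  Mapping.new("http://purl.org/dc/elements/1.1/title", "title")
])

current.remove_by_uri(FOAF_NAME)
current << Mapping.new(FOAF_NAME, "name")

current.mappings.each { |m| puts "#{m.name} -> #{m.property_uri}" }
```

Removing first means the property ends up with exactly one mapping however many times the script is run, while the unrelated title mapping is left alone.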

Storing the updated mappings in the Platform is then just a matter of calling the upload method on the mapping object. This serializes the list of mappings as RDF/XML and then PUTs them back to the store. This will overwrite any of the current configuration with the updated copy we’ve got locally: this is one reason why we fetch the current copy before making the changes, to ensure the rest of the configuration is preserved.


resp = mappings.upload(store)
if resp.status_code != 200
  abort("Failed to upload mappings!")
end

The upload method, like many of the lower-level method calls in the Pho library, returns an HTTP::Message object that you can inspect to determine if the Platform request was successful.

The remaining lines in the sample script simply upload some test data to your store: astronauts.rdf contains a short list of a few astronauts, modelled as simple foaf:Person instances with a foaf:name property. This allows you to test out your newly created search index.

You can now construct item searches with syntax like “name:Buzz” to search for the name Buzz in any foaf:name predicate. Or you can find all foaf:Person instances by performing a search for:

type:"http://xmlns.com/foaf/0.1/Person"

Note that you have to quote the URI you’re searching for. And you can obviously combine those to find only foaf:Person resources with a specific foaf:name.
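If you want to build these queries from code rather than typing them into the search form, something like the following would do. Treat it as a sketch with several assumptions: the helper names are mine, the store URI and the `/items?query=` endpoint are placeholders you should check against the Platform documentation, and joining terms with `AND` assumes the search syntax supports Lucene-style boolean operators.

```ruby
require "uri"

# Build a single fielded term. Values containing whitespace, colons or
# slashes (URIs, phrases) are wrapped in double quotes.
def field_term(field, value)
  needs_quotes = value.match?(/[\s:\/]/)
  needs_quotes ? %(#{field}:"#{value}") : "#{field}:#{value}"
end

# Combine several field => value pairs into one query string.
# ASSUMPTION: terms can be combined with a Lucene-style AND operator.
def build_query(fields)
  fields.map { |field, value| field_term(field, value) }.join(" AND ")
end

# ASSUMPTION: the items endpoint and its "query" parameter are placeholders.
def search_url(store_uri, query)
  "#{store_uri}/items?" + URI.encode_www_form("query" => query)
end

query = build_query({
  "name" => "Buzz",
  "type" => "http://xmlns.com/foaf/0.1/Person"
})

puts query
puts search_url("http://api.talis.com/stores/my-store", query)
```

The point of the `field_term` helper is simply that nobody has to remember the quoting rule: plain words go through untouched and URIs get quoted automatically.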

I’ve run the script against the n2-examples store, so you can use the item search form to test it out. Or just click here to list all the people.

If you peek at the source of the returned RSS feed you’ll find that the essential metadata for each result — in this case the foaf:name property and the rdf:type — is automatically included. Incidentally if you have a FieldPredicateMapping defined with a property name of title then this will automatically be used as the title for the RSS item, allowing you some minor degree of control over the feed structure if you wanted to make it more human-readable.

The Platform provides you with a few more options for managing your search indexes than I’ve covered here. For example the FieldPredicateMap can also be used to associate an Analyzer with the field allowing you to control the indexing rules. You can also control the relevance ranking of the search results through the use of a Query Profile (which is also exposed through an API, and is manageable using Pho). The query profile lets you associate a weighting with a field, so that when a user performs a search without indicating which field they want to search (thereby searching all fields), then the Platform will alter the relevance ranking of the results to suit your preferences.

That concludes our look at the basic steps involved in building a custom search index over the Platform. While the Pho library provides some useful support, it’s worth remembering that it’s simply a thin veneer over several HTTP operations, so achieving the same effects in another language (or even from the console using plain old curl) should be easy enough. Hopefully the examples have also illustrated the simplicity of working with the Platform to create some quite powerful features and, importantly, that developing against the Platform doesn’t require you to be a SPARQL wizard: there are other ways to get data out of the system, but the power of SPARQL is there when you need it.
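To give a flavour of what that thin veneer amounts to, here is a rough sketch of the two requests using only Ruby’s standard Net::HTTP library. The GET-as-JSON and PUT-as-RDF/XML pairing comes from the description above; everything else is an assumption. In particular the `FPMAP_URI` value is a placeholder, the helper names are mine, and authentication is left out entirely, so check the Platform documentation before relying on any of it.

```ruby
require "net/http"
require "uri"

# ASSUMPTION: placeholder location for a store's field predicate map.
FPMAP_URI = URI("http://api.talis.com/stores/my-store/indexes/default/fpmaps/default")

# Request for reading the current mappings, asking for JSON.
def build_fpmap_get(uri, accept = "application/json")
  request = Net::HTTP::Get.new(uri)
  request["Accept"] = accept
  request
end

# Request for writing the mappings back as RDF/XML.
# Authentication is deliberately omitted from this sketch.
def build_fpmap_put(uri, rdfxml)
  request = Net::HTTP::Put.new(uri)
  request["Content-Type"] = "application/rdf+xml"
  request.body = rdfxml
  request
end

get = build_fpmap_get(FPMAP_URI)
puts "#{get.method} #{get.path}"
```

Nothing is sent here; the requests are only constructed. Sending one is then a matter of `Net::HTTP.start(FPMAP_URI.host, FPMAP_URI.port) { |http| http.request(get) }`, plus whatever authentication your store requires.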

If you have any questions, leave a comment and I’ll try to answer them.

GRDDLing DeWitt’s Friends

DeWitt Clinton has a great write-up of Creating a HTML “friends” page from a Google Reader subscription list, a bit of hackery which leads to a hCard microformat-enriched friends list. A little tweak to the HTML can make it more machine-friendly, just adding a HTML Meta Data profile URI:

<head profile="http://www.w3.org/2006/03/hcard">

That profile is GRDDL-enabled, so any GRDDL-aware agent can interpret the source document as RDF. This part’s easy to demonstrate, thanks to the online W3C GRDDL service. So I’ve put a tweaked version of the HTML online, and here’s DeWitt’s friends page as RDF (in Turtle syntax, rendered a little verbosely).

Having set this up I realised the data wasn’t actually expressing the friend relationship, so went on to put together some SPARQL to sort that out – below. But afterwards I realised that DeWitt’s HTML was actually expressing the relationships using XFN class names, but again without the profile URI to make it machine-friendly. So another tweak:

<head profile="http://www.w3.org/2006/03/hcard http://www.w3.org/2003/g/td/xfn-workalike">

- the corresponding service output (scroll down to see the extra bits). I suppose I should mention that you can have as many space-separated profiles as you like, and the GRDDL-aware agent will interpret them independently, just accumulating all the triples. The second profile URI adds xfn:friend relationships. I think it would have been more useful with foaf:knows as well, but it is only a demo. One of these days the microformats folks might get around to tweaking the official profile appropriately…
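The "interpret independently, accumulate the triples" behaviour is easy to picture in code. This is a toy model only: `profile_uris` and `gather_triples` are names I’ve made up, and the lambdas stand in for the real transformations a GRDDL agent would discover and run for each profile. The triples they return are invented placeholders.

```ruby
# Split an HTML profile attribute into its individual profile URIs.
def profile_uris(profile_attribute)
  profile_attribute.to_s.split
end

# Hypothetical stand-in for a GRDDL agent: each profile is handled on its own
# and the resulting triples are pooled. The lambdas fake the transformations.
def gather_triples(profile_attribute, transforms)
  profile_uris(profile_attribute).flat_map do |uri|
    transform = transforms[uri]
    transform ? transform.call : [] # unknown profiles contribute nothing
  end
end

profile = "http://www.w3.org/2006/03/hcard http://www.w3.org/2003/g/td/xfn-workalike"

transforms = {
  "http://www.w3.org/2006/03/hcard" => -> { [["_:card", "vcard:fn", "Example Friend"]] },
  "http://www.w3.org/2003/g/td/xfn-workalike" => -> { [["_:me", "xfn:friend", "_:them"]] }
}

triples = gather_triples(profile, transforms)
triples.each { |t| puts t.join(" ") }
```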

The SPARQL I mentioned looks like this:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix vcard:
prefix foaf: <http://xmlns.com/foaf/0.1/>

CONSTRUCT
{
  [ a foaf:Person;
    foaf:homepage ;
    foaf:name "DeWitt Clinton" ;
  ]
  foaf:knows
  [ a foaf:Person;
    foaf:homepage ?homepage ;
    foaf:name ?name ] .
}
WHERE
{
  [ a vcard:VCard ;
    vcard:url ?homepage ;
    vcard:fn ?name ]
}

- when applied to DeWitt’s data (as RDF), this will map it across from the vCard vocabulary – finding the appropriate ?variables by matching the pattern in the WHERE clause, inserting those ?variables into the CONSTRUCT clause to produce some new RDF.
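For anyone who finds CONSTRUCT a bit opaque, the same match-then-template idea can be mimicked in a few lines of plain Ruby. This is an illustration rather than a SPARQL engine: triples are just three-element arrays with prefixed names as strings, the input data is invented, the helper names are mine, and the owner’s foaf:homepage is left out. It also simplifies by assuming one value per property on each card.

```ruby
# WHERE: find every vcard:VCard that has both a url and an fn.
# Simplification: assumes at most one value per predicate on each subject.
def vcards(triples)
  triples.group_by(&:first).filter_map do |_subject, ts|
    props = ts.to_h { |_, predicate, object| [predicate, object] }
    next unless props["rdf:type"] == "vcard:VCard" && props["vcard:url"] && props["vcard:fn"]
    { homepage: props["vcard:url"], name: props["vcard:fn"] }
  end
end

# CONSTRUCT: fill the template once per match. Blank nodes in a CONSTRUCT
# template are minted afresh for every solution, hence the numbered labels.
def construct_knows(bindings, owner_name)
  bindings.each_with_index.flat_map do |binding, i|
    owner = "_:owner#{i}"
    friend = "_:friend#{i}"
    [
      [owner, "rdf:type", "foaf:Person"],
      [owner, "foaf:name", owner_name],
      [owner, "foaf:knows", friend],
      [friend, "rdf:type", "foaf:Person"],
      [friend, "foaf:homepage", binding[:homepage]],
      [friend, "foaf:name", binding[:name]]
    ]
  end
end

# Invented sample data: one complete card and one without a url.
input = [
  ["_:c1", "rdf:type", "vcard:VCard"],
  ["_:c1", "vcard:url", "http://example.org/alice"],
  ["_:c1", "vcard:fn", "Alice Example"],
  ["_:c2", "rdf:type", "vcard:VCard"],
  ["_:c2", "vcard:fn", "No Homepage"]
]

output = construct_knows(vcards(input), "DeWitt Clinton")
output.each { |t| puts t.join(" ") }
```

Because every solution gets its own pair of blank nodes, a list of a few dozen friends produces a few dozen separate "DeWitt Clinton" nodes, which goes some way to explaining why the output is so heavy on bnodes.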

I tried this on the Redland SPARQL demo, and I think it’s producing the RDF I wanted. Unfortunately the serialization is really ugly – lots of bnodes, and it’s hard to check visually. It appears to confuse Tabulator too, and the W3C RDF Validator, which is handy for this kind of visualization, appears to be down. (Here’s a copy of the RDF/XML.) Still, it was only a workaround – with the right profiles in place it’s not needed.

I’m not sure if there’s a microformat way of expressing that the source data was a subscription/reading list. To get the richest RDF out it might be easier to do what DeWitt did, but to a full RDF serialization rather than microformatted HTML (which is effectively a CustomRdfDialect), producing something like Planet RDF’s blogroll.