Data Migration using SPARQL and Changesets
A tagline sometimes used for RDF is “self-describing data”. Sometimes though, you make your data describe itself badly; perhaps you’ve used a vocabulary term that has since been deprecated, or perhaps you’ve found a term which is more widely supported, or more appropriate to the data; maybe there was a typo in the script you generated your triples with. At any rate, it’s pretty common to have to fix your data, and if you have a live application with fresh ‘bad’ triples being created all the time, and a lot of bad triples to fix anyway, this can get tricky.
We’re having to do this in a project at the moment, and this is the method Nad and I came up with. We split adding good triples and removing bad triples into separate stages because our application will continue to function the same with both good and bad triples present; but once we roll out the code that expects the good triples, we need the good triples to be there, whereas the bad triples can be removed at our leisure.
Adding New Good Triples
So, say for example, one of the things we want to fix is using dcterms:creator instead of dc:creator. We can get the good triples by querying for:
CONSTRUCT {
# good
?s <http://purl.org/dc/terms/creator> ?o
} WHERE {
# bad
?s <http://purl.org/dc/elements/1.1/creator> ?o
}
And posting that back into the store. (First make sure that your application won’t do anything too weird if you add in these triples without changing any code).
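If there are a lot of triples in the store, you may not be able to retrieve and post them all at once. To scale to large numbers of triples, just page through the results at, say, 1000 triples at a time by adding LIMIT 1000 OFFSET 0 and incrementing the OFFSET by the LIMIT (1000) until you don’t get any triples back. (Strictly speaking, LIMIT and OFFSET only give predictable pages if the query also has an ORDER BY; since adding the good triples is idempotent and the script gets run again later, an occasional missed or repeated triple is not a disaster.)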
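The paging loop might look something like the sketch below. It is in Python and deliberately store-agnostic: run_construct and post_triples are hypothetical callables standing in for whatever you use to query the store and post RDF back to it, and the ORDER BY line is an addition of this sketch (not part of the query above) to keep the pages stable.

```python
from typing import Callable, Iterator, List

# The CONSTRUCT query from above: new dcterms:creator triples for old dc:creator ones.
CONSTRUCT_GOOD = """
CONSTRUCT {
  ?s <http://purl.org/dc/terms/creator> ?o
} WHERE {
  ?s <http://purl.org/dc/elements/1.1/creator> ?o
}
"""


def paged_queries(query: str, limit: int = 1000) -> Iterator[str]:
    """Yield the query with ORDER BY / LIMIT / OFFSET appended, page after page."""
    offset = 0
    while True:
        # ORDER BY is this sketch's assumption, added so that pages do not overlap.
        yield f"{query.strip()}\nORDER BY ?s ?o\nLIMIT {limit} OFFSET {offset}"
        offset += limit


def add_good_triples(
    run_construct: Callable[[str], List],   # hypothetical: runs a query, returns triples
    post_triples: Callable[[List], object],  # hypothetical: posts triples to the store
    limit: int = 1000,
) -> int:
    """Page through the CONSTRUCT results, posting each page; return triples posted."""
    total = 0
    for query in paged_queries(CONSTRUCT_GOOD, limit):
        triples = run_construct(query)
        if not triples:  # an empty page means we have walked off the end
            break
        post_triples(triples)
        total += len(triples)
    return total
```

Because the function only depends on the two callables, the same script can be pointed at a test store first and rerun against the live one after deployment.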
Wrap this little procedure up in a script because you’ll need to run it again.
Deploy Code
As soon as you have finished adding the new /good/ triples, deploy your new code that uses dcterms:creator instead of dc:creator.
Now run the add good triples script again. This is because, while you were deploying the code, users may have been plugging away at your app, happily creating more bad old triples. Running the script again will add good triples for any of these bad triples that have been created in the meantime. And because you’ve now deployed your code changes, the application won’t create any more bad triples.
All we have left to do is get rid of the bad old triples. With any luck (and a bit of foresight, and testing), your application will function perfectly well with both bad and good triples in the store, so we can take our time a bit getting rid of the bad triples.
Removing Bad Old Triples
We’ll write a SPARQL query to give us back the triples we want to remove, and then we’ll create a Changeset to remove them:
CONSTRUCT {
# bad
?s <http://purl.org/dc/elements/1.1/creator> ?o
} WHERE {
# bad
?s <http://purl.org/dc/elements/1.1/creator> ?o
# good
?s <http://purl.org/dc/terms/creator> ?o
}
(You should apply a LIMIT, but so long as you wait for each changeset batch to succeed before sending the next one, you don’t need to page: just keep fetching the first 1000 until you don’t get anything back. It’s worth remembering that the number of triples in a changeset document will be about ten-fold the number of triples you are removing, so you may need to make the limit a bit smaller than before.)
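As a rough sketch of that loop (Python again, with hypothetical callables): fetch_bad runs the query and returns the triples, and apply_changeset is assumed to build and submit a changeset removing them, returning True only once the batch is known to have succeeded. Note there is no OFFSET; each successful batch shrinks the result set, so the first page is always the next one to remove.

```python
from typing import Callable, List

# The CONSTRUCT query from above: bad triples that already have a good twin.
CONSTRUCT_BAD = """
CONSTRUCT {
  ?s <http://purl.org/dc/elements/1.1/creator> ?o
} WHERE {
  ?s <http://purl.org/dc/elements/1.1/creator> ?o .
  ?s <http://purl.org/dc/terms/creator> ?o
}
"""


def remove_bad_triples(
    fetch_bad: Callable[[str], List],         # hypothetical: runs a query, returns triples
    apply_changeset: Callable[[List], bool],  # hypothetical: True once the batch succeeded
    limit: int = 100,                         # smaller than before: changesets are bulky
    max_batches: int = 100_000,               # safety net against looping forever
) -> int:
    """Repeatedly remove the first `limit` bad triples; return how many were removed."""
    query = f"{CONSTRUCT_BAD.strip()}\nLIMIT {limit}"
    removed = 0
    for _ in range(max_batches):
        triples = fetch_bad(query)
        if not triples:  # nothing left to remove
            return removed
        if not apply_changeset(triples):
            # Stop rather than carry on: the next fetch would return the same triples.
            raise RuntimeError("changeset batch failed; stopping")
        removed += len(triples)
    raise RuntimeError("gave up after max_batches batches")
```

Stopping on the first failure matters here, because without an OFFSET a failed batch would otherwise be fetched and resubmitted indefinitely.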
An important point on the platform’s Changesets API: if you send more than 14 changesets in a batch (i.e., in the same document), they will be performed asynchronously and you should get back an HTTP 202 Accepted status code. A potential problem is that if you are trying to remove a statement that doesn’t exist, then all the changes in that batch will fail, but you will still get back a 202 Accepted (because the platform hasn’t tried processing them yet). You need a way of knowing whether the batch has failed or not. One way to do this is to include, in your batch of changes, the addition of a marker triple; you can then poll the store to see whether that triple exists.
If you’re using PHP, you can use Moriarty to create your changesets:
#php
define('STORE_URI', 'http://api.talis.com/stores/sandbox1');
$time = time();
$markerURI = STORE_URI.'/items/'.$time; // one marker URI per batch, using the same timestamp as the literal below
$rdfToAdd = " <{$markerURI}> <http://purl.org/dc/terms/created> \"{$time}\" . ";
$args = array(
'before' => $rdfToRemove, // got this from the CONSTRUCT described above
'after' => $rdfToAdd, // this is the marker triple
);
$cs = new Changeset($args);
$store = new Store(STORE_URI); // you will probably need to add your login credentials - see the moriarty docs
$response = $store->get_metabox()->apply_changeset($cs);
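if(!$response->is_success()){
//log error, and stop (this snippet is assumed to sit inside the loop over batches, hence the break; log_error is your own logging function)
log_error("Changeset failed: ".$response->status_code ."\n " . $response->body ." \n Changeset: \n" . $cs->to_rdfxml());
break;
}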
At this point, you may also want to poll the store for the existence of your marker triple to see if the batch has been processed. Since we minted the marker URI in the store’s URI space, we can just try to dereference it; as soon as we get back a 200 or 303 response, we can move on, but if we still get back a 404 after say 10 seconds, the changeset has probably failed and we need to log that and investigate.
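A minimal sketch of that polling step, in Python using only the standard library. The function names, the one-second interval and the injectable get_status, sleep and clock parameters are all assumptions of this sketch; the 10 second timeout and the 200/303 check come from the description above. One caveat: urllib follows redirects itself, so a 303 from the store will normally show up here as the status of the redirect target.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(uri: str, timeout: float = 5.0) -> int:
    """Dereference the URI and return the final HTTP status code (e.g. 200 or 404)."""
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is on the exception.
        return err.code


def wait_for_marker(
    marker_uri: str,
    get_status: Callable[[str], int] = http_status,
    timeout: float = 10.0,   # give up after roughly this many seconds
    interval: float = 1.0,   # pause between attempts (an assumption of this sketch)
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the marker resource exists; False means the batch probably failed."""
    deadline = clock() + timeout
    while True:
        if get_status(marker_uri) in (200, 303):
            return True   # marker triple is in the store, so the batch was applied
        if clock() >= deadline:
            return False  # still missing: log it and investigate
        sleep(interval)
```

A False return is the signal to stop submitting further batches and look at what went wrong, just as with a failed synchronous response.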
If everything goes OK, you can then make double sure you’ve got rid of all the bad triples by running a quick ASK query against your store’s SPARQL service.
ASK {
# bad
?s <http://purl.org/dc/elements/1.1/creator> ?o
}
And if you get back FALSE, you’re finished.
Well done!


July 10th, 2009 at 9:15 am
Thanks for writing this up. Nice and practical.
Any thoughts on how to deal with code “out there” which might be using the old graph patterns? Can a back-compatible extra store (or named graph) be generated in similar fashion? But I guess SPARQL endpoint details would need to change too. Have you looked into the pros/cons of rewriting incoming queries?
July 10th, 2009 at 9:28 am
Nice one Keith, could come in handy!
July 10th, 2009 at 9:34 am
Dan, I think you could do rewriting of queries or some OWL might help. But practically speaking I would just leave the original triples there for a while. This does raise the question of which side should bear the cost of changing data models: the client or the server. There’s an argument to say that since clients are dealing with arbitrary data then they should be flexible in what they look for in a dataset. I have some notes on “open world” development that I really need to write up as a blog post….
July 10th, 2009 at 10:53 am
Interesting points. Rewriting the incoming queries could provide some backwards compatibility – and I guess, if the query is a CONSTRUCT or a DESCRIBE, you ought to also rewrite the RDF on the way out again, otherwise the client code could still break. But maybe this should be for special cases, where there are specific breakages you want to avoid while still migrating. I’d side in general with the view that the client is responsible for coping with the data it gets, and the server is responsible for a stable API, though not necessarily stable data. But it’s pretty likely there will also be special cases when being backwards-compatible data-wise is also important, and rewriting queries and data is a good solution (and what about when the data is retrieved by dereffing LOD, not SPARQLed? What should happen then?).
You could also leave the old triples in the store instead of rewriting, though there are potential drawbacks with that bloating the size of your store and response document sizes (you could possibly filter out the old triples from the RDF you are returning unless they are specified in the SPARQL query). In some cases having both old and new triples might make your data wrong or inconsistent.
The backwards-compatible store idea could be OK if the data is fairly static – you could sniff the SPARQL query for old triple patterns and redirect to the appropriate endpoint maybe?
There are lots of different ways you might change your data too – it could be a typo or switching vocab terms, or it could be a radically different modeling; in which case, it would be harder for the client to anticipate the change (and might even require more extensive rewriting of the client application).
Assuming that it is the client’s responsibility to adjust to changes in data pulled down from the wild web, there are a couple of things that I think might help:
1. A way of bundling up equivalent terms / graph patterns for specific situations. Maybe some collections of owl:sameAs and rdfs:subPropertyOf statements and the like, and a bit of reasoning, could work, but maybe a better solution would be something like profiles, where terms/graph patterns are stated to be functionally substitutable for a given task, eg: Here are a bunch of predicate URIs you can look through if you want to display something as a label, or here are a bunch of things you can try if you want to show an image associated with a (non-IR) resource; or (trickier) here’s a bunch of patterns you can try if you want to find tags/taggings. These profiles (for want of a better word) could be kept up to date and served up from somewhere, and the client application could retrieve and cache them – so then the client could be vocabulary agnostic, and cope with evolutions in data modeling, provided the profiles are kept up to date.
2. Ways for the server to hint to clients when a data migration has taken place – and what to do about it. I suppose the lightest weight option is to include old and new triples in the response for a while and hope the client picks up on it. The server could include extra metadata in the response document, or in its own (RDF, of course) service description, or dataset description (voiD provides terms to say what vocabularies a dataset is using, for example).