A tagline sometimes used for RDF is “self-describing data”. Sometimes though, you make your data describe itself badly; perhaps you’ve used a vocabulary term that has since been deprecated, or perhaps you’ve found a term which is more widely supported, or more appropriate to the data; maybe there was a typo in the script you generated your triples with. At any rate, it’s pretty common to have to fix your data, and if you have a live application with fresh ‘bad’ triples being created all the time, and a lot of bad triples to fix anyway, this can get tricky.
We’re having to do this in a project at the moment, and this is the method Nad and I came up with. We split adding good triples and removing bad triples into separate stages because our application functions the same with both good and bad triples in the store. Once we roll out the code that expects the good triples, though, the good triples have to be there, whereas the bad triples can be removed at our leisure.
Adding New Good Triples
So, say for example, one of the things we want to fix is using dcterms:creator instead of dc:creator. We can get the good triples by querying for:
CONSTRUCT {
# good
?s <http://purl.org/dc/terms/creator> ?o
} WHERE {
# bad
?s <http://purl.org/dc/elements/1.1/creator> ?o
}
And posting that back into the store. (First make sure that your application won’t do anything too weird if you add in these triples without changing any code.)
If there are a lot of triples in the store, you may not be able to retrieve and post them all at once. To scale to large numbers of triples, just page through the results at, say, 1000 triples at a time by adding LIMIT 1000 OFFSET 0 and incrementing the OFFSET by the LIMIT (1000) until you don’t get triples back anymore.
Wrap this little procedure up in a script because you’ll need to run it again.
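As a rough idea of what that script might look like, here is a minimal Python sketch of the paging loop. The `fetch` and `post` callables are hypothetical stand-ins for however you talk to your store (one runs a CONSTRUCT query and returns the triples as a list, the other posts triples back), and the function names are made up for illustration. The ORDER BY is an addition to the recipe above: SPARQL only promises predictable pages when the results are ordered, so it is included here as a precaution, at some cost in query time.

```python
from typing import Callable, List

GOOD = "http://purl.org/dc/terms/creator"
BAD = "http://purl.org/dc/elements/1.1/creator"


def paged_construct(offset: int, limit: int = 1000) -> str:
    """Build one page of the 'add good triples' CONSTRUCT query."""
    return (
        f"CONSTRUCT {{ ?s <{GOOD}> ?o }} "
        f"WHERE {{ ?s <{BAD}> ?o }} "
        # ORDER BY is an assumption added here to keep pages stable.
        f"ORDER BY ?s ?o LIMIT {limit} OFFSET {offset}"
    )


def copy_in_pages(
    fetch: Callable[[str], List[str]],   # hypothetical: query -> list of triples
    post: Callable[[List[str]], None],   # hypothetical: posts triples to the store
    limit: int = 1000,
) -> int:
    """Page through the results until a page comes back empty.

    Returns the number of non-empty pages that were posted.
    """
    offset = 0
    pages = 0
    while True:
        triples = fetch(paged_construct(offset, limit))
        if not triples:
            break  # no triples back any more, so we are done
        post(triples)
        pages += 1
        offset += limit  # increment the OFFSET by the LIMIT
    return pages
```

Keeping the store access behind two callables means the same loop can be run a second time after the deploy without any changes.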
Deploy Code
As soon as you have finished adding the new /good/ triples, deploy your new code that uses dcterms:creator instead of dc:creator.
Now run the add-good-triples script again. This is because, while you were deploying the code, users may have been plugging away at your app, happily creating more bad old triples. Running the script again will add good triples for any bad triples that were created in the meantime. And because you’ve now deployed your code changes, the application won’t create any more bad triples.
All we have left to do is get rid of the bad old triples. With any luck (and a bit of foresight, and testing), your application will function perfectly well with both bad and good triples in the store, so we can take our time a bit getting rid of the bad triples.
Removing Bad Old Triples
We’ll write a SPARQL query to give us back the triples we want to remove, and then we’ll create a Changeset to remove them:
CONSTRUCT {
# bad
?s <http://purl.org/dc/elements/1.1/creator> ?o
} WHERE {
# bad
?s <http://purl.org/dc/elements/1.1/creator> ?o .
# good
?s <http://purl.org/dc/terms/creator> ?o
}
You should apply a LIMIT here too but, so long as you wait for each changeset batch to succeed before sending the next one, you don’t need to page: each successful batch removes the triples it matched, so you can just keep asking for the first 1000 until you get nothing back. It’s worth remembering that a changeset document contains roughly ten times as many triples as the number you are removing, so you may need to make the limit a bit smaller than before.
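The removal loop can be sketched like this in Python. Again, `fetch` and `apply_removal` are hypothetical callables standing in for your own store access: the first runs the query and returns the matching bad triples, the second sends a changeset removing them and only returns once it knows whether the batch succeeded. The default limit of 100 is just an illustrative smaller batch size, and `max_batches` is a safety net added for this sketch, not part of the method.

```python
from typing import Callable, List

BAD = "http://purl.org/dc/elements/1.1/creator"
GOOD = "http://purl.org/dc/terms/creator"


def removal_query(limit: int) -> str:
    """CONSTRUCT the bad triples that already have a good twin. No OFFSET."""
    return (
        f"CONSTRUCT {{ ?s <{BAD}> ?o }} "
        f"WHERE {{ ?s <{BAD}> ?o . ?s <{GOOD}> ?o }} "
        f"LIMIT {limit}"
    )


def remove_in_batches(
    fetch: Callable[[str], List[str]],           # hypothetical: query -> triples
    apply_removal: Callable[[List[str]], bool],  # hypothetical: True on success
    limit: int = 100,
    max_batches: int = 100000,                   # safety net, an assumption
) -> int:
    """Keep removing the first `limit` bad triples until none are left.

    Returns the number of triples removed.
    """
    query = removal_query(limit)
    removed = 0
    for _ in range(max_batches):
        batch = fetch(query)  # always the first page: removed triples drop out
        if not batch:
            return removed
        if not apply_removal(batch):
            raise RuntimeError("changeset batch failed; stopping")
        removed += len(batch)
    raise RuntimeError("still finding bad triples after max_batches batches")
```

Stopping on the first failed batch matters here: if a failed batch were ignored, the same triples would come back on the next query and the loop would spin on them.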
An important point on the platform’s Changesets API: if you send more than 14 changesets in a batch (i.e., in the same document), they will be performed asynchronously and you should get back an HTTP 202 Accepted status code. A potential problem is that, if you are trying to remove a statement that doesn’t exist, all the changes in that batch will fail, but you will still get back a 202 Accepted (because the platform hasn’t tried processing them yet). You need a way of knowing whether the batch has failed. One way to do this is to include in your batch of changes the addition of a marker triple, which you can then poll the store for to see whether it exists.
If you’re using PHP, you can use Moriarty to create your changesets:
<?php
// Assumes Moriarty's Store and Changeset classes have already been loaded.
define('STORE_URI', 'http://api.talis.com/stores/sandbox1');
$store = new Store(STORE_URI); // you will probably need to add your login credentials - see the moriarty docs

// Everything below runs once per batch, inside the loop that fetches $rdfToRemove.
$time = time();
$markerURI = STORE_URI.'/items/'.$time; // minted in the store's URI space so we can dereference it later
$rdfToAdd = " <{$markerURI}> <http://purl.org/dc/terms/created> \"{$time}\" . ";
$args = array(
    'before' => $rdfToRemove, // got this from the CONSTRUCT described above
    'after' => $rdfToAdd, // this is the marker triple
);
$cs = new Changeset($args);
$response = $store->get_metabox()->apply_changeset($cs);
if (!$response->is_success()) {
    // log the error and stop; log_error() is your own logging function
    log_error("Changeset failed: ".$response->status_code."\n ".$response->body." \n Changeset: \n".$cs->to_rdfxml());
    break;
}
At this point, you may also want to poll the store for the existence of your marker triple to see if the batch has been processed. Since we minted the marker URI in the store’s URI space, we can just try to dereference it; as soon as we get back a 200 or 303 response, we can move on, but if we still get back a 404 after, say, 10 seconds, the changeset has probably failed and we need to log that and investigate.
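The polling step might look something like the following Python sketch. The function names, the one-second interval and the injectable `sleep` and `clock` parameters are all choices made for this illustration (the last two just make the loop easy to test); only the 200/303 check and the 10 second cut-off come from the description above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(uri: str) -> int:
    """Dereference a URI and return the HTTP status code.

    urllib follows redirects, so a 303 normally shows up here as the
    status of the page it redirects to.
    """
    try:
        with urllib.request.urlopen(uri, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 while the marker does not exist yet


def wait_for_marker(
    get_status: Callable[[str], int],
    marker_uri: str,
    timeout: float = 10.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the marker URI resolves (True) or the timeout passes (False)."""
    deadline = clock() + timeout
    while True:
        if get_status(marker_uri) in (200, 303):
            return True  # the batch containing the marker has been applied
        if clock() >= deadline:
            return False  # probably failed: log it and investigate
        sleep(interval)
```

In the batch loop you would call `wait_for_marker(http_status, marker_uri)` after each changeset that came back with a 202, and stop if it returns `False`.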
If everything goes OK, you can then make double sure you’ve got rid of all the bad triples by running a quick ASK query against your store’s SPARQL service.
ASK {
# bad
?s <http://purl.org/dc/elements/1.1/creator> ?o
}
And if you get back FALSE, you’re finished.
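If you want the script to make that final check itself, here is a small Python sketch for reading the answer. It assumes your SPARQL service returns ASK results in the standard SPARQL Query Results XML Format; `ask_result` is a made-up helper name, and fetching the XML from the store is left out.

```python
import xml.etree.ElementTree as ET

SPARQL_RESULTS_NS = "http://www.w3.org/2005/sparql-results#"


def ask_result(xml_text: str) -> bool:
    """Return the boolean answer from a SPARQL ASK result document."""
    root = ET.fromstring(xml_text)
    node = root.find(f"{{{SPARQL_RESULTS_NS}}}boolean")
    if node is None or node.text is None:
        raise ValueError("not a SPARQL ASK result document")
    return node.text.strip().lower() == "true"


# Example: the response you are hoping for once the clean-up is complete.
finished_doc = (
    '<sparql xmlns="http://www.w3.org/2005/sparql-results#">'
    "<head/><boolean>false</boolean></sparql>"
)
still_bad_triples = ask_result(finished_doc)
```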
Well done!