Exploring OpenLibrary Part Two

More than two weeks on from my last look at the OpenLibrary authors data and I’m finally finding some time to look a bit deeper. Last time I finished off thinking about the complete list of distinct dates within the authors file and how to model those.

Where I’ve got to today is tagged as day 2 of OpenLibrary in the n2 subversion.

First off, a correction: foaf:Name should have been foaf:name. Thanks to Leigh for pointing that out. I haven’t fixed it in this tag, because I tagged before I realised I’d forgotten it, but I will next time, honestly.

It’s clear that there is some stuff in the data that simply shouldn’t be there, things that cannot possibly be a birth date, such as [from old catalog] and *. and simply ,. When I came across —oOo— I was somewhat dismayed. MARC data, where most of this data has come from, has a long and illustrious history, but one of the mistakes made early on was to put display data into the records in the form of ISBD punctuation. This, combined with the real inflexibility of most ILSs and web-based catalogs, has forced libraries to hack their records with junk like —oOo— to fix display errors. This one comes from Antonio Ignacio Margariti.

In total there are only 6,156 unique birth date values and 4,936 unique death date values. Of course there is some overlap, so in total there are only 9,566 distinct values to worry about.

So what I plan to do is to set up the recognisable patterns in code and discard anything I don’t recognise as a date or date range. Doing that may mean I lose some date information, but I can add that back in later as more patterns get spotted. So far I’ve found several patterns (shown here using regex notation)…

“^[0-9]{1,4}$” – A straightforward number of 4 digits or fewer, with no letters, punctuation or whitespace. These are simple years; last time I popped them in using bio:date. That’s not strictly within the rules of the bio schema, as that really requires a date formatted in accordance with ISO 8601. Ian had already implied his displeasure with my use of bio:date and suggested I use the more relaxed Dublin Core elements date (dc:date). However, on further chatting, what we actually have is a period within which the event occurred, so we need to show that the event happened somewhere within a date range. This can be solved using the W3C Time Ontology, which allows for a better description.

I spent some time getting hung up on exactly what is being said by these date assertions on a bio:Birth event. That is, are we saying that the birth took place somewhere within that period, or that the event happened over that period? This may seem a daft question to ask, but as others start modelling events in people’s bios the two could easily become indistinguishable. Say I want to model my grandfather’s experience of the second world war. I’d very likely model that as an event occurring over a four year period. So, I feel the need to distinguish between an event happening over a period and an event happening at an unknown time within a period. I thought I was getting too pedantic about this, but Ian assured me I’m not and that the distinction matters.

The model we end up with is like this:


@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/2006/time#> .

<http://example.com/a/OL149323A>
	foaf:Name "Schaller, Heinrich";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL149323A>;
	bio:event <http://example.com/a/OL149323A#birth>;
	a foaf:Person .

<http://example.com/a/OL149323A#birth>
	dc:date <http://example.com/a/OL149323A#birthDate>;
	a bio:Birth .

<http://example.com/names/schallerheinrich>
	mine:name_of <http://example.com/a/OL149323A>;
	a mine:Name .

<http://example.com/dates/gregorian/ad/years/1900>
	time:unitType time:unitYear;
	time:year "1900";
	a time:DateTimeDescription .

<http://example.com/a/OL149323A#birthDate>
	time:inDateTime <http://example.com/dates/gregorian/ad/years/1900>;
	a time:Instant .

The simple year accounts for 731,304 of the 748,291 birth dates and for 13,151 of the 181,696 death dates, about 80% of the dates overall. Following the 80/20 rule almost perfectly, the remaining 20% is going to be painful. It has been suggested I should stop here, but it seems a shame to not have access to the rest if we can dig in, and I can, so…
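The "recognise known patterns, discard the rest" plan can be sketched roughly as follows. This is Python rather than the PHP used for the real scripts, and the function names `parse_simple_year` and `tally` are made up for illustration. It only knows the simple-year pattern, so everything else lands in an "unrecognised" bucket that can shrink as more patterns get added.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# The simple-year pattern from the post. fullmatch() is used instead of
# ^...$ because in Python "$" also matches just before a trailing newline.
SIMPLE_YEAR = re.compile(r"[0-9]{1,4}")


def parse_simple_year(raw: str) -> Optional[int]:
    """Return the year if raw is 1 to 4 digits and nothing else, else None."""
    return int(raw) if SIMPLE_YEAR.fullmatch(raw) else None


def tally(values: Iterable[str]) -> Counter:
    """Count how many values are recognised and how many would be discarded."""
    counts: Counter = Counter()
    for raw in values:
        kind = "simple year" if parse_simple_year(raw) is not None else "unrecognised"
        counts[kind] += 1
    return counts


if __name__ == "__main__":
    # A few values quoted in the post, plus one clean year.
    print(tally(["1900", "[from old catalog]", "*.", "ca. 1753"]))
```

Running a tally like this over the full list of distinct dates is one way to see how much each new pattern actually buys.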

First of the remaining correct entries are the approximate years, recorded as ca. 1753 or (ca.) 1753 and other variants of that. These all suffer from leading and trailing junk, but I’ll catch the clean ones with “^[(]?ca\.[)]? ([0-9]{1,4})$”. The difficulty is that you can’t really convert them into a single year or even a date range, as what people consider to be within the “circa” will vary widely in different contexts. So, the interval can be described in the same way as a simple year, but the relationship with the author’s birth is not simply time:inDateTime. I haven’t found a sensible circa predicate, so for now I’ll drop into mine.


@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mine: <http://example.com/mine/schema#> .
@prefix time: <http://www.w3.org/2006/time#> .

<http://example.com/a/OL151554A>
	foaf:Name "Altdorfer, Albrecht";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL151554A>;
	bio:event <http://example.com/a/OL151554A#birth>;
	bio:event <http://example.com/a/OL151554A#death>;
	a foaf:Person .

<http://example.com/a/OL151554A#birth>
	dc:date <http://example.com/a/OL151554A#birthDate>;
	a bio:Birth .

<http://example.com/a/OL151554A#death>
	dc:date <http://example.com/a/OL151554A#deathDate>;
	a bio:Death .

<http://example.com/names/altdorferalbrecht>
	mine:name_of <http://example.com/a/OL151554A>;
	a mine:Name .

<http://example.com/dates/gregorian/ad/years/1480>
	time:unitType time:unitYear;
	time:year "1480";
	a time:DateTimeDescription .

<http://example.com/a/OL151554A#birthDate>
	mine:circaDateTime <http://example.com/dates/gregorian/ad/years/1480>;
	a time:Instant .
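To make the difference between the two listings concrete, here is a minimal sketch of how the choice of predicate might fall out of the two patterns. It is Python rather than the PHP in svn, the function names are hypothetical, and the URI layout is simply copied from the listings above. Prefix declarations are assumed to be written once at the top of the output file.

```python
import re
from typing import Optional, Tuple

SIMPLE_YEAR = re.compile(r"[0-9]{1,4}")
CIRCA_YEAR = re.compile(r"[(]?ca\.[)]? ([0-9]{1,4})")  # "ca. 1753", "(ca.) 1753"


def parse_year(raw: str) -> Optional[Tuple[str, int]]:
    """Return (predicate, year) for a recognised value, or None to discard it."""
    if SIMPLE_YEAR.fullmatch(raw):
        return "time:inDateTime", int(raw)
    match = CIRCA_YEAR.fullmatch(raw)
    if match:
        # No standard "circa" predicate was found, so this stays in mine:.
        return "mine:circaDateTime", int(match.group(1))
    return None


def birth_date_turtle(author_uri: str, raw: str) -> Optional[str]:
    """Build the year description and the birthDate instant as Turtle."""
    parsed = parse_year(raw)
    if parsed is None:
        return None
    predicate, year = parsed
    year_uri = f"http://example.com/dates/gregorian/ad/years/{year}"
    return "\n".join([
        f"<{year_uri}>",
        "\ttime:unitType time:unitYear;",
        f'\ttime:year "{year}";',
        "\ta time:DateTimeDescription .",
        "",
        f"<{author_uri}#birthDate>",
        f"\t{predicate} <{year_uri}>;",
        "\ta time:Instant .",
        "",
    ])


if __name__ == "__main__":
    print(birth_date_turtle("http://example.com/a/OL151554A", "ca. 1480"))
```

Keeping the year description identical in both cases means only the linking predicate has to change if a better circa term turns up later.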

Ok, it’s time to stop there until next time. I have several remaining forms to look at and some issues of data cleanup.

Next time I’ll be looking at parsing out date ranges of a few years, shown in the data as 1103 or 4. These will go in as longer date time descriptions so no new modelling needed.

Then we have centuries, 7th cent., again just a broader date time description required I hope. There are some entries for works from before the birth of Christ, such as 127 B.C. I’ll have to take a look at how those get described. Then we have entries starting with an l like l854. I had thought that these may indicate a different calendaring system, but it appears not. Perhaps it’s bad OCRing as there are also entries like l8l4. Not sure what to do with those just yet.

In terms of data cleanup, there are dates in the birth_date field of the form d. 1823 which means that it’s actually a death date. There are also dates prefixed with fl. which means they are flourishing dates. These are used when a birth date is unknown but the period in which the creator was active is known. These need to be pulled out and handled separately.
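Pulling those prefixed values out of birth_date could look something like the sketch below. It is only an illustration in Python: the function name `reroute_birth_date` and the "flourished" label are invented here, and the `fl. 1650` value in the demo is a made-up example rather than one taken from the data.

```python
import re
from typing import Tuple

# Prefixes that mean the birth_date field is not holding a birth date at all.
PREFIXES = (
    (re.compile(r"d\.\s*"), "death"),        # "d. 1823" is really a death date
    (re.compile(r"fl\.\s*"), "flourished"),  # "fl." marks a period of activity
)


def reroute_birth_date(raw: str) -> Tuple[str, str]:
    """Return (kind, remainder) for a raw birth_date value.

    kind is "death" or "flourished" when a known prefix is present,
    otherwise "birth" with the value left untouched.
    """
    for pattern, kind in PREFIXES:
        match = pattern.match(raw)  # match() anchors at the start of the string
        if match:
            return kind, raw[match.end():]
    return "birth", raw


if __name__ == "__main__":
    for value in ["d. 1823", "fl. 1650", "1954"]:
        print(value, "->", reroute_birth_date(value))
```

The remainder can then go through the same pattern matching as any other date.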

Of course, I haven’t dealt with the leading and trailing punctuation yet or those that have names mixed in with the dates, so still much work to do in transforming this into a rich graph.

Exploring OpenLibrary Part One

I thought it was about time I got around to taking a better look at what might be possible with the OpenLibrary data.

My plan is to try and convert it into meaningful RDF and see what we can find out about things along the way. The project is an own-time project mostly, so progress isn’t likely to be very rapid. Let’s see how it goes. I’ll diary here as stuff gets done.

To save me typing loads of stuff out here, today’s source code is tagged and in the n2 subversion as day 1 of OpenLibrary.

Day one, 3rd October 2008, I downloaded the authors data from OpenLibrary and unzipped it. I’m also downloading the editions data from OpenLibrary, but that’s bigger (1.8Gb) so I’m playing with the author data while that comes down the tubes.

The data has been exported by OpenLibrary as JSON, so is pretty easy to work with. I’m going to write some PHP scripts on the command line to mess with it and it looks great for doing that.

Each line of the JSON in the authors file represents a single author, although some authors will have more than one entry. Taking a look at Iain Banks (aka Iain M Banks) we have the following entries:


{"name": "Banks, Iain", "personal_name": "Banks, Iain", "key": "\/a\/OL32312A", "birth_date": "1954", "type": {"key": "\/type\/type"}, "id": 81616}
{"name": "Banks, Iain.", "type": {"key": "\/type\/type"}, "id": 3011389, "key": "\/a\/OL954586A", "personal_name": "Banks, Iain."}
{"type": {"key": "\/type\/type"}, "id": 9897124, "key": "\/a\/OL2623466A", "name": "Iain Banks"}
{"type": {"key": "\/type\/type"}, "id": 9975649, "key": "\/a\/OL2645303A", "name": "Iain Banks         "}
{"type": {"key": "\/type\/type"}, "id": 10565263, "key": "\/a\/OL2774908A", "name": "IAIN M. BANKS"}
{"type": {"key": "\/type\/type"}, "id": 10626661, "key": "\/a\/OL2787336A", "name": "Iain M. Banks"}
{"type": {"key": "\/type\/type"}, "id": 12035518, "key": "\/a\/OL3127859A", "name": "Iain M Banks"}
{"type": {"key": "\/type\/type"}, "id": 12078804, "key": "\/a\/OL3137983A", "name": "Iain M Banks         "}
{"type": {"key": "\/type\/type"}, "id": 12177832, "key": "\/a\/OL3160648A", "name": "IAIN M.BANKS"}

In total the file contains 4,174,245 entries. First job is to get a more manageable set of data to work with. So, I wrote a short script to extract 1 line in every 10 from a file. The resulting sample author data file contains 417,424 entries. This is more manageable for quick testing of what I’m doing.
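The sampling step is tiny. A sketch of it in Python (the original script is PHP and isn't shown here, so the details are assumptions) might be the following. Taking the 10th, 20th, 30th line and so on is a guess at how the original counted, though it does give 417,424 lines from 4,174,245.

```python
import sys
from typing import Iterable, Iterator


def sample_lines(lines: Iterable[str], every: int = 10) -> Iterator[str]:
    """Yield every `every`-th line (the 10th, 20th, ...) without loading the file."""
    for number, line in enumerate(lines, start=1):
        if number % every == 0:
            yield line


def main() -> None:
    """Stream STDIN to STDOUT, keeping one line in ten."""
    sys.stdout.writelines(sample_lines(sys.stdin))


if __name__ == "__main__":
    # In-memory demonstration; a command line tool would call main() instead.
    demo = [f"line {n}\n" for n in range(1, 26)]
    print(list(sample_lines(demo)))
```

Because it works on any iterable of lines, the same function can be fed a file handle, STDIN or a test list.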

So now we can start writing some code to produce some RDF. Given the size of these files, I need to stream the data in and out again in chunks. The easiest format I find for that is turtle, which has the added benefit of being human readable. YMMV. Previously I’ve streamed stuff out using n-triples. That has some great benefits too, like being able to generate different parts of the graph, for the same subject, in different parts of the file and then bring them together using a simple command line sort. It’s also a great format for chunking the resulting data into reasonable size files, as breaking on whole lines doesn’t break the graph, whereas with rdf/xml and turtle it does.

So, I may end up dropping back to n-triples, but for now I’m going to use turtle.

I also like working on the command line and love the unix pipes model, so I’ll be writing the cli (command line) tools to read from STDIN and write to STDOUT so I can mess with the data using grep, sed, awk, sort, uniq and so on.

First things first, let’s find out what’s really in the authors data. Reading the JSON line by line and converting each line into an associative array is simple in PHP, so let’s do that, keep track of all the keys we find in the arrays and recurse into the nested arrays to look at them, then dump the result out. The arrays contain this set of keys:

alternate_names
alternate_names
alternate_names\1
alternate_names\2
alternate_names\3
bio
birth_date
comment
date
death_date
entity_type
fuller_name
id
key
location
name
numeration
personal_name
photograph
title
type
type\key
website
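For anyone wanting to repeat the key survey, here is a rough Python equivalent of the idea (the real script is PHP, and these function names are invented). It assumes one JSON object per line and joins nested keys with a backslash to mimic the listing above, treating list positions as keys the way a PHP array would.

```python
import json
from typing import Any, Iterable, List, Optional, Set


def collect_keys(value: Any, prefix: str = "",
                 found: Optional[Set[str]] = None) -> Set[str]:
    """Record every key path in a decoded JSON value, recursing into children."""
    if found is None:
        found = set()
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list indexes stand in for array keys
    else:
        return found  # scalars have no keys of their own
    for key, child in items:
        path = f"{prefix}\\{key}" if prefix else str(key)
        found.add(path)
        collect_keys(child, path, found)
    return found


def keys_in_dump(lines: Iterable[str]) -> List[str]:
    """Union the key paths over a dump with one JSON object per line."""
    found: Set[str] = set()
    for line in lines:
        if line.strip():
            collect_keys(json.loads(line), "", found)
    return sorted(found)


if __name__ == "__main__":
    sample = r'{"name": "Iain Banks", "key": "\/a\/OL2623466A", "type": {"key": "\/type\/type"}}'
    print(keys_in_dump([sample]))
```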

So, they have names, birth dates, death dates, alternate names and a few other bits and pieces. And they have a ‘key’ which turns out to be the resource part of the OpenLibrary url. That means we can link back into OpenLibrary easily. Going back to our previous Iain Banks examples, we want to create something like this for each one:


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix bio: <http://vocab.org/bio/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.com/a/OL32312A>
	foaf:Name "Banks, Iain";
	foaf:primaryTopicOf <http://openlibrary.org/a/OL32312A>;
	bio:event <http://example.com/a/OL32312A#birth>;
	a foaf:Person .

<http://example.com/a/OL32312A#birth>
	bio:date "1954";
	a bio:Birth .

This gives us a foaf:Person for the author and tracks his birth date using a bio:Birth event. While tracking the birth as a separate entity may seem odd it gives the opportunity to say things about the birth itself. We’ll model death dates the same way, for the same reason. I’ve written some basic code to generate foaf from the OpenLibrary authors.
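The shape of that conversion can be sketched as below. This is not the tagged PHP code, just an illustrative Python version with invented function names. It assumes the record has already been decoded from its JSON line and that the @prefix lines are written once at the top of the output. It writes foaf:name (lower case), which is the correct FOAF property.

```python
BASE = "http://example.com"             # placeholder base, as in the listing
OPENLIBRARY = "http://openlibrary.org"


def turtle_string(text: str) -> str:
    """Quote a Python string as a Turtle literal, escaping the basics."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t"))
    return f'"{escaped}"'


def author_to_turtle(record: dict) -> str:
    """Turn one decoded author record into a foaf:Person plus bio events."""
    key = record["key"]                 # e.g. "/a/OL32312A"
    person = f"{BASE}{key}"
    events = [(field, kind)
              for field, kind in (("birth_date", "Birth"), ("death_date", "Death"))
              if record.get(field)]
    out = [f"<{person}>"]
    if record.get("name"):
        out.append(f"\tfoaf:name {turtle_string(record['name'])};")
    out.append(f"\tfoaf:primaryTopicOf <{OPENLIBRARY}{key}>;")
    for _, kind in events:
        out.append(f"\tbio:event <{person}#{kind.lower()}>;")
    out.append("\ta foaf:Person .")
    for field, kind in events:
        # Each event is its own resource so more can be said about it later.
        out += ["", f"<{person}#{kind.lower()}>",
                f"\tbio:date {turtle_string(record[field])};",
                f"\ta bio:{kind} ."]
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(author_to_turtle(
        {"name": "Banks, Iain", "key": "/a/OL32312A", "birth_date": "1954"}))
```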

Linking back to the OpenLibrary url has been done here using foaf:primaryTopicOf. I didn’t use owl:sameAs because the url at OpenLibrary is that of a web page, whereas the uri here (http://example.com/a/OL32312A) represents a person. Clearly a person is not the same as a web page that contains information about them.

The only thing worrying me is that the uris we’re using are constructed from OpenLibrary’s keys. This makes matching them up with other data sources hard. Matching with other data sources requires a natural key, but there’s not enough data in these author entries to create one. The best I can do is to create a natural key that will enable people to discover the group of authors that share a name.


@prefix mine: <http://example.com/mine/schema#> .
<http://example.com/names/banksiain>
	mine:name_of <http://example.com/a/OL32312A>;
	a mine:Name .

These uris will enable me to find authors that share the same name easily, either because they do share the same name or because they’re duplicates. The natural key is simply the author’s name with any casing, whitespace or punctuation stripped out. That might need to evolve as I start looking at the names in more detail later.
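As a small illustration of that key, here is one way it might be computed in Python. The name `natural_key` and the exact regular expression are assumptions, not the tagged code. It lower-cases the name and drops anything that isn't a letter or digit.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def natural_key(name: str) -> str:
    """Lower-case a name and strip whitespace and punctuation."""
    # \W matches anything that is not a letter, digit or underscore.
    return re.sub(r"[\W_]+", "", name.lower())


def group_by_key(names: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket names that collapse to the same natural key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[natural_key(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    names = ["Banks, Iain", "Banks, Iain.", "Iain Banks         ",
             "IAIN M. BANKS", "Iain M Banks", "IAIN M.BANKS"]
    for key, members in group_by_key(names).items():
        print(key, members)
```

Note that "Banks, Iain" and "Iain Banks" still end up under different keys (banksiain and iainbanks), since nothing here normalises word order.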

Next step is to look in more detail at the dates in here. We have some simple cases of trailing whitespace or trailing punctuation, but also some more interesting cases of approximate dates or possible ranges, which occur mostly for historical authors. The complete list of distinct dates within the authors file is in svn. If you know anything about dates, feel free to throw me some free advice on what to do with them…