Wednesday, June 17, 2015

Fetching a DEEP record, ESTC in hand, Part 1

Pretext

I recently began working with an Early Modern English corpus in which the bibliographic metadata was sparse, but all the items were identified with an ESTC number. The Database of Early English Playbooks (DEEP) was recommended to me as a source of description, but I ran into problems almost immediately...

Wait: Let me pause here and be clear: These are valuable online resources doing important work in support of scholarship. When I say "problems", I mean "inconveniences for someone trying to teach a script or program to process them". As valuable as they are, these resources are a little long in the tooth: the DEEP web interface I describe is from 2007, and the ESTC interface is of unclear vintage. Applying the linkable-data, mashable-API aesthetics of the mid-00's to them is not fair. And yet, here I am with work to do. So:

Problems!

  1. DEEP is searchable by STC/Wing numbers, but not by ESTC citation numbers.
  2. DEEP is navigated through pop-ups and whatnot, and requires POSTed form data to search.

Getting the STC/Wing Numbers

To get the STC/Wing numbers I went to the presumptive online source of record: the ESTC online at http://estc.bl.uk/. This interface supports permalinks to identifiers based on the ESTC citation number, which is promising! These links are of the form http://estc.bl.uk/{CITATION}. Unfortunately, these links do not resolve to the items themselves, but redirect to a search page with a single result. Ignoring the session tracking bits of the URLs (which appear to be removable), the search redirects are like so:
  1. http://estc.bl.uk/{CITATION}
  2. http://estc.bl.uk/F/?func=find-b&local_base=BLL06&request={CITATION}&find_code=ESTID
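As a minimal sketch, the second URL can be built directly from a citation number, skipping the redirect. The function name and the sample citation number in the usage note are made up for illustration; the parameter names and values are copied from the redirect shown above.

```python
from urllib.parse import urlencode

ESTC_BASE = "http://estc.bl.uk/F/"


def estc_search_url(citation):
    """Build the search URL that the short permalink redirects to."""
    params = {
        "func": "find-b",
        "local_base": "BLL06",
        "request": citation,
        "find_code": "ESTID",
    }
    # urlencode keeps the dict's insertion order and escapes the citation.
    return ESTC_BASE + "?" + urlencode(params)
```

Calling estc_search_url("S111228") with that illustrative citation number yields a URL of the second form in the list.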
The resulting page communicates the context of a server-managed search, and so the page presumed to map the permalink has to be parsed out of the response markup. The search result set is nested in tables, but the useful data is down in the result rows, whose data cells have the class "td1". Within those rows are some summary metadata (title, author) and a link to the full-record page we want. We can distinguish these URLs because, in addition to a set_number parameter, they have a set_entry parameter. So, presuming we have only XPath:

//td[@class='td1']//a[contains(@href,'&set_entry')]/@href

... which should (and we will cross our fingers that there's just the one entry in the result set) get a URL of the form:
http://estc.bl.uk/F/?func=full-set-set&set_number=NNNNNN&set_entry=000001&format=999
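If an XPath engine is not to hand, roughly the same selection can be sketched with Python's standard-library HTML parser instead. This is a swapped-in technique rather than the XPath itself, and the class and function names are hypothetical. It assumes the td elements in the result page are properly closed, and it treats anything other than exactly one distinct link as an error instead of crossing its fingers.

```python
from html.parser import HTMLParser


class FullRecordLinkFinder(HTMLParser):
    """Collect hrefs of <a> tags inside <td class="td1"> that carry set_entry."""

    def __init__(self):
        super().__init__()
        self.td_stack = []  # one bool per open <td>: does it have class td1?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.td_stack.append(attrs.get("class") == "td1")
        elif tag == "a" and any(self.td_stack):
            # The parser has already unescaped &amp; to & in attribute values.
            href = attrs.get("href") or ""
            if "&set_entry" in href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "td" and self.td_stack:
            self.td_stack.pop()


def full_record_url(html_text):
    """Return the single full-record link in a search result page."""
    finder = FullRecordLinkFinder()
    finder.feed(html_text)
    unique = list(dict.fromkeys(finder.links))  # drop repeats, keep order
    if len(unique) != 1:
        raise ValueError(f"expected 1 full-record link, found {len(unique)}")
    return unique[0]
```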

Result! But where is the STC/Wing number? The table in the full record page's HTML is structured for presentation to humans: Readable, but hard to parse. You could iterate over the rows of the table with ID=estcdata, looking for second rows starting with Wing or STC and trying to parse IDs out of them. A more accurate approach is available if you notice that the full record is available in multiple formats, one of which is MARC. It's linked at the top of the full record page, but you can also get there by changing the format parameter in the URL from 999 to 001. In the MARC format, you can more precisely look for rows in the estcdata table whose first cell's data is '5104'...

Wait, 5104? Yes, this is a combination of the MARC 510 (Citation/References) field tag and its first indicator, 4 (location in source given).

... whose first cell's data is '5104', and parse the subfields out. This will be easiest to do in some kind of scripting language: the subfield delimiter is a pipe '|', the subfield code is a single character (in this case 'a' or 'c'), and the code is followed by whitespace for legibility. We are interested in the value of the c subfield if the a subfield starts with Wing or STC (and we should be case-insensitive to be safe). Whew.
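The last two steps, switching the full-record URL to the MARC view and reading the 5104 rows, might look something like the sketch below. All function names are hypothetical, and it assumes the rows of the estcdata table have already been reduced to (first cell text, second cell text) pairs by whatever HTML parser is in use; the cell values in the usage note are invented for illustration.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def marc_view_url(full_record_url):
    """Swap the format parameter (999 in the default view) for 001, the MARC view."""
    parts = urlsplit(full_record_url)
    query = [(key, "001" if key == "format" else value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def parse_subfields(cell_text):
    """Split pipe-delimited subfields into (code, value) pairs."""
    pairs = []
    # Anything before the first pipe is not a subfield, so skip it.
    for chunk in cell_text.split("|")[1:]:
        if chunk:
            pairs.append((chunk[0], chunk[1:].strip()))
    return pairs


def stc_wing_numbers(rows):
    """Return (catalogue, number) pairs from the 5104 rows citing STC or Wing."""
    found = []
    for tag, data in rows:
        if tag.strip() != "5104":
            continue
        fields = dict(parse_subfields(data))
        # Case-insensitive match on the start of the a subfield.
        match = re.match(r"(wing|stc)\b", fields.get("a", ""), re.IGNORECASE)
        if match and "c" in fields:
            catalogue = "Wing" if match.group(1).lower() == "wing" else "STC"
            found.append((catalogue, fields["c"]))
    return found
```

Given a row like ("5104", "|a STC (2nd ed.) |c 22273"), stc_wing_numbers picks out the pair ("STC", "22273"); rows citing other reference works, or with other tags, are skipped.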

Ok! With a Wing/STC identifier in hand, we can teach a computer to look up a DEEP entry, which at this point is another post.