Wednesday, June 17, 2015

Fetching a DEEP record, ESTC in hand, Part 1

Pretext

I recently began working with an Early Modern English corpus in which the bibliographic metadata was sparse, but all the items were identified with an ESTC number. The Database of Early English Playbooks (DEEP) was recommended to me as a source of description, but I ran into problems almost immediately...

Wait: Let me pause here and be clear: These are valuable online resources doing important work in support of scholarship. When I say "problems", I mean "inconveniences for someone trying to teach a script or program to process them". As valuable as they are, these resources are a little long in the tooth: the DEEP web interface I describe is from 2007, and the ESTC interface is of unclear vintage. Applying the linkable data, mashable API aesthetics of the mid-00's to them is not fair. And yet, here I am with work to do. So:

Problems!

  1. DEEP is searchable by STC/Wing numbers, but not ESTC citation numbers
  2. DEEP is navigated through pop-ups and whatnot, and requires POSTed form data to search.

Getting the STC/Wing Numbers

To get the STC/Wing numbers I went to the presumptive online source of record: the ESTC online at http://estc.bl.uk/. This interface supports permalinks based on the ESTC citation number: promising! These links are of the form http://estc.bl.uk/{CITATION}. Unfortunately, they do not resolve to the items themselves, but redirect to a search page with a single result. Ignoring the session tracking bits of the URLs (which appear to be removable), the search redirects are like so:
  1. http://estc.bl.uk/{CITATION}
  2. http://estc.bl.uk/F/?func=find-b&local_base=BLL06&request={CITATION}&find_code=ESTID
The resulting page communicates the context of a server-managed search, and so the page the permalink is presumed to map to has to be parsed out of the response markup. The search result set is nested in tables, but the useful data is down in the result rows, whose data cells have the class "td1". Within those rows are some summary metadata (title, author) and a link to the full-record page we want. We can distinguish these URLs because, in addition to a set_number parameter, they have a set_entry parameter. So, presuming we have only XPath:

//td[@class='td1']//a[contains(@href,'&set_entry')]/@href

... which should (and we will cross our fingers that there's just the one entry in the result set) get a url of the form:
http://estc.bl.uk/F/?func=full-set-set&set_number=NNNNNN&set_entry=000001&format=999
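
For anyone without an XPath engine handy, here is a minimal sketch of the same two steps using only the Python standard library. The function and class names (search_url, FullRecordLinkFinder, find_full_record_links) are my own inventions for illustration, and the markup shape is assumed from the description above rather than checked against every ESTC result page. It does no network fetching; it only builds the search URL and picks the full-record links out of markup you have already retrieved.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

ESTC_BASE = "http://estc.bl.uk/F/"


def search_url(citation):
    """Build the search URL that the permalink for `citation` redirects to."""
    params = {
        "func": "find-b",
        "local_base": "BLL06",
        "request": citation,
        "find_code": "ESTID",
    }
    return ESTC_BASE + "?" + urlencode(params)


class FullRecordLinkFinder(HTMLParser):
    """Rough equivalent of the XPath above: hrefs of <a> elements inside
    <td class="td1"> whose URL carries a set_entry parameter."""

    def __init__(self):
        super().__init__()
        self.td_stack = []  # one flag per open <td>: is its class td1?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.td_stack.append(attrs.get("class") == "td1")
        elif tag == "a" and any(self.td_stack):
            href = attrs.get("href") or ""
            if "&set_entry" in href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "td" and self.td_stack:
            self.td_stack.pop()


def find_full_record_links(html):
    """Return every full-record link found in a search result page."""
    finder = FullRecordLinkFinder()
    finder.feed(html)
    return finder.links
```

If find_full_record_links returns more than one link, the citation number matched more than one record, and the fingers-crossed assumption above has failed.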

Result! But where is the STC/Wing number? The table in the full record page's HTML is structured for presentation to humans: Readable, but hard to parse. You could iterate over the rows of the table with ID=estcdata, looking for second rows starting with Wing or STC and trying to parse IDs out of them. A more accurate approach is available if you notice that the full record is available in multiple formats, one of which is MARC. It's linked at the top of the full record page, but you can also get there by changing the format parameter in the URL from 999 to 001. In the MARC format, you can more precisely look for rows in the estcdata table whose first cell's data is '5104'...

Wait, 5104? Yes, this is a MARC 510 field (Citation/References Note) combined with its first indicator, 4 (location in source given).

... whose first cell's data is '5104', and parse the subfields out. This will be easiest to do in some kind of scripting language, but the subfield delimiter is a pipe '|', the id is a character (in this case 'a' or 'c'), and there's a whitespace for legibility here. We are interested in the value of the c subfield if the a subfield starts with Wing or STC (and we should be case-insensitive to be safe). Whew.
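
As a rough sketch of that parsing step in Python: the helper names (marc_view_url, parse_subfields, stc_or_wing_citation) are hypothetical, and the cell text format ("|a Wing |c S2913") is assumed from the description above, so treat this as a starting point and not a tested scraper. Finding the '5104' rows in the estcdata table is left out; this only covers switching to the MARC view and picking apart one cell's text.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def marc_view_url(full_record_url):
    """Switch a full-record URL to the MARC view by setting format=001."""
    parts = urlsplit(full_record_url)
    query = [(k, "001" if k == "format" else v) for k, v in parse_qsl(parts.query)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def parse_subfields(cell_text):
    """Split text like '|a Wing |c S2913' into (code, value) pairs."""
    pairs = []
    for chunk in cell_text.split("|")[1:]:
        chunk = chunk.strip()
        if chunk:
            # First character is the subfield code, the rest is its value.
            pairs.append((chunk[0], chunk[1:].strip()))
    return pairs


def stc_or_wing_citation(cell_text):
    """Return (source, number) when subfield a starts with Wing or STC
    (case-insensitive), otherwise None."""
    subfields = dict(parse_subfields(cell_text))
    source = subfields.get("a", "")
    number = subfields.get("c")
    if number and source.lower().startswith(("wing", "stc")):
        return source, number
    return None
```

Returning the a subfield alongside the number keeps track of whether the identifier came from Wing or STC, which seems likely to matter for the lookup on the DEEP side.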

Ok! With a Wing/STC identifier in hand, we can teach a computer to look up a DEEP entry, which at this point is another post.

Saturday, February 13, 2010

Link dump

http://www.ibm.com/developerworks/webservices/library/ws-restwsdl/
http://www.keith-chapman.org/2008/09/restfull-mashup-with-wsdl-20-wso2.html

Saturday, January 30, 2010

Jackrabbit, RMI, etc.

JCR Remote Repo

Should a variation on Server implement javax.jcr.Repository? Or a modularized wrapper? Will need access to either JAAS or internal machinery for authN, as well.

Wednesday, January 20, 2010

Mulgara, allocateDirect, swap space

On sequences of large Mulgara queries, Java was crashing for lack of swap space. Culprit appears to be the lack of reuse of ByteBuffers, all of which are direct byte buffers (and thus outside heap space).

First crack was using pojo byte buffers (fixed swap issue). Second was/is making some of the read only and one-at-a-time classes reuse their buffers by adding a Block recycling method.

edit: Reusing buffers in the find method cuts the live memory for my test "large" searches by a tick over 40%, according to HPROF. Response time for the servlet is decreased as well, but the proportion varies from 40% to 25% according (I suspect) to how big a slice the IO accounts for.

Tuesday, October 13, 2009

Describing a tile

Trying to use the Djatoka jpeg2000 image viewer to display the image tiles / regions served up by an installation of the now-defunct eRez image server underscored the value of a good web API.

eRez Tile API





Parm Name | Type            | Function
----------+-----------------+---------
src       | string          | path to the ptif src, relative to the eRez image root
width     | integer         | width of the resulting image tile (will stretch to fit)
height    | integer         | height of the resulting image tile (will stretch to fit)
top       | float           | the position of the top edge of the tile relative to the entire scaled image, expressed as a decimal fraction
left      | float           | see "top"
bottom    | float           | see "top"
right     | float           | see "top"
scale     | float           | the ratio of the dimensions of an entire image composed of tiles in the requested size to the dimensions of the original image, expressed as a decimal fraction
tmp       | string constant | the value is "ajax-viewer", unquoted

OpenURL getRegion API


Parm Name  | Type                  | Function
-----------+-----------------------+---------
svc.level  | integer               | a scaling indicator, as specified here
svc.region | integer or float list | the top edge position, left edge position, region height and width, concatenated as a comma-delimited value
svc.scale  | integer or float list | scaling factor as either a single value, or a targeted width and height. If the latter, a value of zero for one of the dimensions indicates the original proportions should be maintained.

Translating OpenURL Level to eRez Scale


After calculating the maximum levels, any given level converts to scale as:

scale = 1 / 2^(maxLevels - requestedLevel)
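
A minimal Python sketch of that conversion, assuming each step down from the maximum level halves the image dimensions. The function name level_to_scale is mine, and working out max_levels for a given image is left to the caller.

```python
def level_to_scale(max_levels, requested_level):
    """Convert an OpenURL svc.level into an eRez scale fraction."""
    return 1 / 2 ** (max_levels - requested_level)
```

So with six levels, level 6 is the full-size image (scale 1.0), level 5 is half size (0.5), and level 3 is one eighth (0.125).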

Wednesday, September 30, 2009

you lying, non-ascii bastards

LC_ALL=C grep -l $'[\x80-\xff]' * > nonascii.txt

Monday, September 28, 2009

brain dump

What about a collection of micro-apps that extract linked data from epidoc, a la SNERT/OC? One for date info normalization, one to spit back pleiades, etc.