Sunday, July 26, 2015

Fetching a DEEP record, ESTC in hand, Part 2

Having been recommended the Database of Early English Playbooks (DEEP) as a source for descriptive metadata, and having mapped our ESTC citations to an STC/Wing number, we can look at programmatically querying DEEP.

DEEP as currently released (the 2007 project) is a PHP search app in which a single resource (search.php) presents both the query interface and the results, switching modes according to the HTTP request semantics and form fields. The first point is important: as far as I can tell, DEEP requires the form fields to have been POSTed as multipart data. Issuing a GET or POST to a DEEP URL composed with query parameters will only return the search page.

The DEEP Search Interface

Looking under the hood of the DEEP search interface requires more than just viewing the source: some JavaScript manipulates the form based on user input (in large measure to steer queries on fields with a controlled list of values, like author), but you can get a picture of the effective source in a browser that supports DOM inspection. For example, in Chrome, you can control- or right-click in the search interface, and select 'inspect element'. If we've selected 'STC / Wing Number' as the search type, we'll see something like this in the inspected source:

That tells us quite a bit about how search.php works, but for our purposes we are concerned with only two of the fields:

  1. terms[0][type], which we want to be 'stc_or_wing'
  2. terms[0][val], which we want to be the STC number we're searching for


The other fields pertain to adding a second query, how many results are returned, and how the results are sorted. We're assuming the simplest case (unique match between STC number and description), so the other fields are not relevant (and importantly, not required by the PHP script).

Once we've sorted out these basics, it's not very difficult to execute a DEEP query outside the browser. Here, for example, is some Bash calling cURL, executable from the terminal window or in a shell script:

BASE_URL="http://deep.sas.upenn.edu/search.php"
# Collect the multipart fields in an array so the square brackets are never glob-expanded
FORM=(--form "terms[0][type]=stc_or_wing")
FORM+=(--form "terms[0][val]=$1")
curl "${FORM[@]}" "$BASE_URL"
... where $1 is replaced with the STC/Wing number we are searching for. In the case of cURL, you might also use the --data parameter; this would require concatenating the fields into a single value separated by ampersands (&), and note that --data sends the fields URL-encoded rather than as multipart data. If you were using Python and the requests library, something like:
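import requests

payload = {}
payload['terms[0][type]'] = 'stc_or_wing'
payload['terms[0][val]'] = 'some STC number'
# data= sends URL-encoded fields; for multipart, pass files={k: (None, v) for k, v in payload.items()} instead
r = requests.post("http://deep.sas.upenn.edu/search.php", data=payload)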
... should work, too.
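
To make the ampersand-joined form concrete, here is a minimal sketch using only the Python standard library. The field names are the two from the search form; the number is just a placeholder, and I haven't confirmed that DEEP accepts a URL-encoded body in place of multipart data.

```python
from urllib.parse import urlencode

# Field names come from the DEEP search form; the number is a placeholder.
fields = {
    "terms[0][type]": "stc_or_wing",
    "terms[0][val]": "12345",
}

# urlencode joins the pairs with '&' and percent-encodes the square brackets.
body = urlencode(fields)
print(body)
```

The resulting string is the kind of single value you would hand to cURL's --data parameter.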

Parsing the DEEP Results

First, a note of thanks: the creators of DEEP encode its search results in XHTML, a variant of HTML that further requires documents to be well-formed XML. HTML doesn't have to be XML at all, so choosing XHTML is a sign that the creators care about the documents being parseable with less effort.

As you will be able to see from the output in your terminal, DEEP presents its search results in a TABLE element with the id 'searchresults'. This table has a row (TR) of column labels (id = 'headerrow'), and then presents the search results as pairs of rows, the first with a class 'record', and the second (containing the details of the description) with no class immediately following. If we were parsing this content with (for example) an XPath utility, we would iterate over:
//table[@id='searchresults']/tr[@class='record']
... and refer also to the next sibling of the TR element in our node handling.
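
As a rough sketch of that iteration, here is one way to pair the rows using Python's standard xml.etree.ElementTree. ElementTree has no next-sibling axis, so the sketch walks the table's rows by index instead. The SAMPLE markup is a made-up fragment shaped like the description above, not real DEEP output; in particular it omits the XHTML namespace, which you would have to account for in the element names when parsing an actual results page.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the structure described above.
SAMPLE = """
<html><body>
<table id="searchresults">
  <tr id="headerrow"><th>Author</th><th>Title</th></tr>
  <tr class="record"><td class="authorname">Author One</td><td class="playname">Play One</td></tr>
  <tr><td colspan="2">details for Play One</td></tr>
  <tr class="record"><td class="authorname">Author Two</td><td class="playname">Play Two</td></tr>
  <tr><td colspan="2">details for Play Two</td></tr>
</table>
</body></html>
"""

def paired_rows(root):
    """Yield (record_row, detail_row) pairs from the results table."""
    table = root.find(".//table[@id='searchresults']")
    rows = table.findall("tr")
    for i, row in enumerate(rows):
        if row.get("class") == "record":
            # The detail row is the TR immediately following the record row.
            detail = rows[i + 1] if i + 1 < len(rows) else None
            yield row, detail

root = ET.fromstring(SAMPLE)
pairs = list(paired_rows(root))
for record, detail in pairs:
    print(record.find("td[@class='playname']").text, "->", detail.find("td").text)
```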

The record rows contain the author (./td[@class='authorname']) and title (./td[@class='playname']). The description row is a little more difficult to parse: its nested div elements each hold a span[@class='label'] whose content indicates the type of data (e.g. "Greg #"), followed by a text node containing the data itself.
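
Here is a hedged sketch of pulling out both kinds of data, again with xml.etree.ElementTree. In ElementTree the text node that follows an element is exposed as that element's tail, which fits the label-then-value layout nicely. The sample rows and their values are invented for illustration, and parse_details is my own helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical record/detail pair; labels follow the post, values are invented.
SAMPLE_ROWS = """
<table id="searchresults">
  <tr class="record">
    <td class="authorname">Example Author</td>
    <td class="playname">Example Play</td>
  </tr>
  <tr>
    <td colspan="2">
      <div><span class="label">DEEP #</span> 0000</div>
      <div><span class="label">Greg #</span> 000a</div>
    </td>
  </tr>
</table>
"""

def parse_details(detail_row):
    """Map each label's text to the text node that follows the label span."""
    fields = {}
    for span in detail_row.findall(".//span[@class='label']"):
        label = "".join(span.itertext()).strip()
        # The data sits in the text node right after the span (its tail).
        fields[label] = (span.tail or "").strip()
    return fields

table = ET.fromstring(SAMPLE_ROWS)
record, detail = table.findall("tr")
author = record.find("td[@class='authorname']").text
title = record.find("td[@class='playname']").text
details = parse_details(detail)
print(author, title, details)
```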

The Other DEEP Interface

One of the fields in that row of description is labelled 'DEEP #', and it is tempting to think that this number might be used to refer directly to a document via the unadvertised single-record view at URLs of the form:
http://deep.sas.upenn.edu/viewrecord.php?deep_id={deep_record_number}
Unfortunately, the DEEP citation number is not (yet) the linkable record number, which appears to be a surrogate key from the backing database. However, if the STC number you've searched for has "contained" descriptions, those descriptions are linked in the div labelled 'Collection contains:'. Each of these anchor elements (span[@class='label' and text()='Collection contains:']/../a) has a javascript URI calling a function in its href attribute, and parsing the number argument to that function will provide you with the id necessary to fetch the contained descriptions with the viewrecord.php script. Conversely, from those descriptions you can mine the viewrecord.php id of the original collection: Follow the same pattern as before, but look instead for the label 'In Collection:'.
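
Extracting that id from the href could look something like the sketch below. I haven't reproduced the actual JavaScript function name here, so the pattern accepts any function name, and showRecord in the usage lines is purely a stand-in; the sketch also assumes a single numeric argument, optionally quoted.

```python
import re

VIEW_URL = "http://deep.sas.upenn.edu/viewrecord.php?deep_id={}"

# Matches javascript:anyFunction(123) or javascript:anyFunction('123').
JS_ARG = re.compile(r"^javascript:\s*\w+\(\s*'?(\d+)'?\s*\)")

def record_url(href):
    """Return the viewrecord.php URL for a javascript: href, or None."""
    match = JS_ARG.match(href)
    return VIEW_URL.format(match.group(1)) if match else None

# 'showRecord' is a hypothetical name standing in for DEEP's real function.
print(record_url("javascript:showRecord(123);"))
print(record_url("http://example.org/"))
```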
