Sunday, July 26, 2015

Fetching a DEEP record, ESTC in hand, Part 2

Having been recommended the Database of Early English Playbooks (DEEP) as a source for descriptive metadata, and having mapped our ESTC citations to STC/Wing numbers, we can look at querying DEEP programmatically.

DEEP as currently released (the 2007 project) is a PHP search app in which a single resource (search.php) presents both the query interface and the results, switching modes according to the HTTP request semantics and form fields. That first point is important: as far as I can tell, DEEP requires the form fields to be POSTed as multipart data; issuing a GET or POST to a DEEP URL composed with query parameters will only return the search page.

The DEEP Search Interface

Looking under the hood of the DEEP search interface requires more than just viewing the source: some JavaScript manipulates the form based on user input (in large measure to steer queries on fields with a controlled list of values, like author), but you can get a picture of the effective source in a browser that supports DOM inspection. For example, in Chrome, you can control- or right-click in the search interface and select 'Inspect Element'. If we've selected 'STC / Wing Number' as the search type, we'll see something like this in the inspected source:

That tells us quite a bit about how search.php works, but for our purposes we are concerned with only two of the fields:

  1. terms[0][type], which we want to be 'stc_or_wing'
  2. terms[0][val], which we want to be the STC number we're searching for


The other fields pertain to adding a second query, how many results are returned, and how the results are sorted. We're assuming the simplest case (unique match between STC number and description), so the other fields are not relevant (and importantly, not required by the PHP script).

Once we've sorted out these basics, it's not very difficult to execute a DEEP query outside the browser. Here, for example, is some BASH calling cURL, executable from the terminal window or in a shell script:

BASE_URL="http://deep.sas.upenn.edu/search.php" 
FORM="" 
FORM="$FORM --form terms[0][type]=stc_or_wing"
FORM="$FORM --form terms[0][val]=$1"
curl $FORM $BASE_URL
... where $1 is replaced with the STC/Wing number we are searching for. With cURL, you might also use the --data option; that would require concatenating the fields into a single value separated by ampersands (&). If you were using Python and the requests library, something like this:
import requests

payload = {}
payload['terms[0][type]'] = 'stc_or_wing'
payload['terms[0][val]'] = 'some STC number'
r = requests.post("http://deep.sas.upenn.edu/search.php", data=payload)
... should work, too.

Parsing the DEEP Results

First, a note of thanks: the creators of DEEP encode its search results in XHTML, a variant of HTML that further requires documents to be valid XML. Producing HTML that is also valid XML isn't strictly necessary, so doing it is a sign that the creators care about the documents being parseable with less effort.

As you will be able to see from the output in your terminal, DEEP presents its search results in a TABLE element with the id 'searchresults'. This table has a row (TR) of column labels (id = 'headerrow') and then presents the search results as pairs of rows: the first with the class 'record', and the second (containing the details of the description), with no class, immediately following. If we were parsing this content with (for example) an XPath utility, we would iterate over:
//table[@id='searchresults']/tr[@class='record']
... and refer also to the next sibling of the TR element in our node handling.

The record rows contain the author (./td[@class='authorname']) and title (./td[@class='playname']). The description is a little more difficult to parse, since the nested div elements in that row present each datum adjacent to a span[@class='label'] whose content indicates the type of data (e.g. "Greg #"); the label span is followed by a text node containing the data itself.
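
For illustration, here is a minimal Python sketch of that parsing. It assumes the lxml library (nothing DEEP requires, just one convenient way to evaluate the XPaths above) and re-uses the requests query from earlier; 'some STC number' is a stand-in for a real value:

from lxml import html
import requests

payload = {'terms[0][type]': 'stc_or_wing', 'terms[0][val]': 'some STC number'}
r = requests.post("http://deep.sas.upenn.edu/search.php", data=payload)
doc = html.fromstring(r.text)
for record in doc.xpath("//table[@id='searchresults']/tr[@class='record']"):
    author = record.xpath("string(./td[@class='authorname'])").strip()
    title = record.xpath("string(./td[@class='playname'])").strip()
    # The description details live in the class-less row immediately following.
    details = record.xpath("./following-sibling::tr[1]")[0]
    description = {}
    for label in details.xpath(".//span[@class='label']"):
        # The datum is the text node immediately following the label span.
        description[label.text.strip().rstrip(':')] = (label.tail or '').strip()
    print(author, title, description)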

The Other DEEP Interface

One of the fields in that row of description is labelled 'DEEP #', and it is tempting to think that this number might be used to refer directly to a document via the unadvertised single-record view at URLs of the form:
http://deep.sas.upenn.edu/viewrecord.php?deep_id={deep_record_number}
Unfortunately, the DEEP citation number is not (yet) the linkable record number, which appears to be a surrogate key from the backing database. However, if the STC number you've searched for has "contained" descriptions, those descriptions are linked in the div labelled 'Collection contains:'. Each of these anchor elements (span[@class='label' and text()='Collection contains:']/../a) has, in its href attribute, a javascript: URI calling a function; parsing the number argument to that function will give you the id needed to fetch the contained descriptions with the viewrecord.php script, as in the sketch below. Conversely, from those descriptions you can mine the viewrecord.php id of the original collection: follow the same pattern as before, but look instead for the label 'In Collection:'.
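
A sketch of that id-mining, under the same assumptions as before (lxml, plus the guess that the record id is the first run of digits in the javascript: href), might look like:

import re
import requests
from lxml import html

payload = {'terms[0][type]': 'stc_or_wing', 'terms[0][val]': 'some STC number'}
r = requests.post("http://deep.sas.upenn.edu/search.php", data=payload)
doc = html.fromstring(r.text)
for a in doc.xpath("//span[@class='label' and text()='Collection contains:']/../a"):
    # Assume the viewrecord id is the first run of digits in the javascript: URI.
    match = re.search(r'\d+', a.get('href', ''))
    if match:
        detail = requests.get("http://deep.sas.upenn.edu/viewrecord.php",
                              params={'deep_id': match.group(0)})
        # detail.text now holds the contained description's record view.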

Sunday, July 12, 2015

Quick Observations on CAP and Graduate Student Loans

First, please take a look at Scott Weingart's piece on journalism, charts/visualizations, data, and viral texts.

Several articles (e.g. Washington Post: "These 20 schools are responsible for a fifth of all graduate school debt", July 9; Yahoo: "20 schools account for $6.6 billion of U.S. government grad student loans", July 10) were in circulation this week, apparently re-written from Elizabeth Baylor's piece for the Center for American Progress in the Chronicle of Higher Education (paywalled, "As Graduate-Student Debt Booms, Just a Few Colleges Are Largely Responsible", July 8). The source data is available online: it's from the Title IV Program Volume Reports, Loan Volume -> Direct Loan Program -> AY 2013-2014 Q4 (the second sheet is the award year summary; look at columns T and AD).

These data don't break the awards down by program; in this context, the downstream claims from WaPo:
What’s striking about the Center’s findings is that a majority of the debt taken to attend the 20 schools on its list is not for law or medical degrees that promise hefty paydays. Most graduate students at those schools are seeking master’s degrees in journalism, fine arts or government, according to CAP.
... look a little fishy. The CAP claim is more nuanced (emphases mine):
But it appears that a majority of debt taken on to attend those institutions is not for costly law or medical degrees, but for nonterminal degrees. Among the 20 institutions responsible for the most graduate debt, 81 percent of graduate degrees conferred in the most recent year were master’s degrees. ... As at other universities on the list, most graduate students at those institutions earn master’s degrees, in disciplines like journalism, fine arts, government, and the sciences.
This makes no claim about the relationship of Master's degrees overall to that student debt (though it implies something), nor about the proportion of Master's programs to debt. It is worth noting, in these STEM-reform times, that WaPo drops "the sciences" from its enumeration of implicitly blamed programs.

The question this raises for me, given how suspect the implications about degree programs and debt are, is what percentage of the debt carried by graduate students (especially at the private non-profits) is actually incurred in the more expensive professional schools and "executive" Master's programs.

Wednesday, June 17, 2015

Fetching a DEEP record, ESTC in hand, Part 1

Pretext

I recently began working with an Early Modern English corpus in which the bibliographic metadata was sparse, but all the items were identified with an ESTC number. I was recommended the Database of Early English Playbooks (DEEP) as a source of description, but ran into problems almost immediately...

Wait: Let me pause here and be clear: these are valuable online resources doing important work in support of scholarship. When I say "problems", I mean "inconveniences for someone trying to teach a script or program to process". As valuable as they are, these resources are a little long in the tooth: the DEEP web interface I describe is from 2007, and the ESTC interface is of unclear vintage. Applying the linkable-data, mashable-API aesthetics of the mid-00s to them is not fair. And yet, here I am with work to do. So:

Problems!

  1. DEEP is searchable by STC/Wing numbers, but not ESTC citation numbers
  2. DEEP is navigated through pop-ups and what not, and requires POSTed form data to search.

Getting the STC/Wing Numbers

To get the STC/Wing numbers I went to the presumptive online source of record: the ESTC online at http://estc.bl.uk/. This interface supports permalinks to identifiers based on the ESTC citation number, which is promising! These links are of the form http://estc.bl.uk/{CITATION}. Unfortunately, these links do not resolve to the items themselves, but redirect to a search page with a single result. Ignoring the session-tracking bits of the URLs (which appear to be removable), the search redirects look like so:
  1. http://estc.bl.uk/{CITATION}
  2. http://estc.bl.uk/F/?func=find-b&local_base=BLL06&request={CITATION}&find_code=ESTID
The resulting page communicates the context of a server-managed search, and so the page presumed to map to the permalink has to be parsed out of the response markup. The search result set is nested in tables, but the useful data is down in the result rows, whose data cells have the class "td1". Within those rows are some summary metadata (title, author) and a link to the full-record page we want. We can distinguish these URLs because, in addition to a set_number parameter, they have a set_entry parameter. So, presuming we have only XPath:

//td[@class='td1']//a[contains(@href,'&set_entry')]/@href

... which should (and we will cross our fingers that there's just the one entry in the result set) get us a URL of the form:
http://estc.bl.uk/F/?func=full-set-set&set_number=NNNNNN&set_entry=000001&format=999
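
As a sketch of those steps (assuming the requests and lxml libraries, and with a made-up citation number standing in for a real one), fetching the permalink and pulling that full-record URL out of the search page might look like:

import requests
from lxml import html

citation = "S111228"  # hypothetical ESTC citation number
# requests follows the redirect from the permalink to the search-result page.
resp = requests.get("http://estc.bl.uk/" + citation)
doc = html.fromstring(resp.text)
hrefs = doc.xpath("//td[@class='td1']//a[contains(@href,'&set_entry')]/@href")
full_record_url = hrefs[0]  # fingers crossed: just the one entry in the result set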

Result! But where is the STC/Wing number? The table in the full record page's HTML is structured for presentation to humans: readable, but hard to parse. You could iterate over the rows of the table with ID=estcdata, looking for rows whose second cell starts with Wing or STC and trying to parse IDs out of them. A more accurate approach is available if you notice that the full record is available in multiple formats, one of which is MARC. It's linked at the top of the full record page, but you can also get there by changing the format parameter in the URL from 999 to 001. In the MARC format, you can more precisely look for rows in the estcdata table whose first cell's data is '5104'...

Wait, 5104? Yes: that's the MARC 510 field (Citation/Reference Note) followed by its first indicator, 4 (location in source given).

... whose first cell's data is '5104', and parse the subfields out of the second cell. This will be easiest to do in some kind of scripting language: the subfield delimiter is a pipe ('|'), the subfield id is a single character (in this case 'a' or 'c'), and there's whitespace in between for legibility. We are interested in the value of the c subfield whenever the a subfield starts with Wing or STC (and we should be case-insensitive to be safe). Whew.
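
Continuing the sketch (with full_record_url carried over from the snippet above, and lxml again assumed), the MARC view can be fetched and the 5104 rows picked apart like so:

import requests
from lxml import html

# full_record_url comes from the previous sketch; swap the format parameter for MARC.
marc_url = full_record_url.replace("format=999", "format=001")
doc = html.fromstring(requests.get(marc_url).text)
stc_or_wing = None
for row in doc.xpath("//table[@id='estcdata']//tr"):
    cells = row.xpath("./td")
    if len(cells) >= 2 and cells[0].text_content().strip() == '5104':
        # Split the field body on the pipe delimiter; the first character of each
        # chunk is the subfield id, the remainder (trimmed) is its value.
        subfields = {}
        for chunk in cells[1].text_content().split('|'):
            chunk = chunk.strip()
            if chunk:
                subfields[chunk[0]] = chunk[1:].strip()
        if subfields.get('a', '').lower().startswith(('wing', 'stc')):
            stc_or_wing = subfields.get('c')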

Ok! With a Wing/STC identifier in hand, we can teach a computer to look up a DEEP entry, which at this point is another post.