Sunday, July 26, 2015

Fetching a DEEP record, ESTC in hand, Part 2

Having been recommended the Database of Early English Plays (DEEP) as a source for descriptive metadata, and mapped our ESTC citations to a STC/Wing number, we can look at programmatically querying DEEP.

DEEP as currently released (the 2007 project) is a PHP search app in which a single resource (search.php) presents both the query interface and the results, switching modes according to the HTTP request semantics and form fields. The first of these is important: as far as I can tell, DEEP requires the form fields to have been POSTed as multipart data; issuing a GET or POST to a DEEP URL composed with query parameters will only return the search page.

The DEEP Search Interface

Looking under the hood of the DEEP search interface requires more than just viewing the source: some JavaScript manipulates the form based on user input (in large measure to steer queries on fields with a controlled list of values, like author), but you can get a picture of the effective source in a browser that supports DOM inspection. For example, in Chrome, you can control- or right-click in the search interface, and select 'inspect element'. If we've selected 'STC / Wing Number' as the search type, we'll see something like this in the inspected source:

That tells us quite a bit about how search.php works, but for our purposes we are concerned about only 2 of the fields:

  1. terms[0][type], which we want to be 'stc_or_wing'
  2. terms[0][val], which we want to be the STC number we're searching for

The other fields pertain to adding a second query, how many results are returned, and how the results are sorted. We're assuming the simplest case (unique match between STC number and description), so the other fields are not relevant (and importantly, not required by the PHP script).

Once we've sorted out these basics, it's not very difficult to execute a DEEP query outside the browser. Here, for example, is some BASH calling cURL, executable from the terminal window or in a shell script:

FORM="$FORM --form terms[0][type]=stc_or_wing"
FORM="$FORM --form terms[0][val]=$1"
... where $1 is replaced with the STC/Wing number we are searching for. In the case of cURL, you might also use the --data parameter; this would require concatenating the fields into a single value separated by ampersands (&). Note that --data sends a URL-encoded body rather than multipart data, so test it against the behavior described above. If you were using Python and the requests library, something like:
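import requests
search_url = ""  # address of DEEP's search.php (missing from this listing)
payload = {}
payload['terms[0][type]'] = 'stc_or_wing'
payload['terms[0][val]'] = 'some STC number'
r = requests.post(search_url, data=payload)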
... should work, too.
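Since DEEP appears to want the fields as multipart data, here is a minimal sketch that builds the multipart body by hand with only the Python standard library. The helper names (encode_multipart, build_deep_request) and the example STC number are hypothetical, and the search URL is left for you to supply; this illustrates the request shape and is not a tested client for DEEP.

```python
import uuid
import urllib.request


def encode_multipart(fields):
    """Encode a dict of text fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append("")  # blank line separates part headers from the value
        lines.append(str(value))
    lines.append("--" + boundary + "--")  # closing boundary
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    content_type = "multipart/form-data; boundary=" + boundary
    return body, content_type


def build_deep_request(search_url, stc_number):
    """Build (but do not send) a POST carrying the two fields DEEP needs."""
    body, content_type = encode_multipart({
        "terms[0][type]": "stc_or_wing",
        "terms[0][val]": stc_number,
    })
    return urllib.request.Request(
        search_url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

To send it, pass the result to urllib.request.urlopen() and read the response, e.g. urllib.request.urlopen(build_deep_request(search_url, "12345")).read(), where search_url is the address of search.php and "12345" stands in for a real STC/Wing number.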

Parsing the DEEP Results

First, a note of thanks: The creators of DEEP encode its search results in XHTML, a variant of HTML that further requires documents to be valid XML. HTML doesn't have to be valid XML, so choosing to produce it that way is a sign that the creators care about the documents being parse-able with less effort.

As you will be able to see from the output in your terminal, DEEP presents its search results in a TABLE element with the id 'searchresults'. This table has a row (TR) of column labels (id = 'headerrow'), and then presents the search results as pairs of rows, the first with a class 'record', and the second (containing the details of the description) with no class immediately following. If we were parsing this content with (for example) an XPath utility, we would iterate over the record rows (an expression along the lines of //table[@id='searchresults']//tr[@class='record'])
... and refer also to the next sibling of the TR element in our node handling.

The record rows contain the author (./td[@class='authorname']) and title (./td[@class='playname']). The description is a little more difficult to parse: the nested div elements in that row each hold a span[@class='label'] whose content indicates the type of data (e.g. "Greg #"), and that span is followed by a text node containing the data itself.
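The row-pair walk described above can be sketched with Python's standard xml.etree.ElementTree. The SAMPLE fragment below is invented to match the structure described in this post (it is not real DEEP output), and parse_results is a hypothetical helper. Real XHTML will normally carry a namespace and a DOCTYPE, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Invented fragment shaped like the table described above.
SAMPLE = """
<table id="searchresults">
  <tr id="headerrow"><th>Author</th><th>Title</th></tr>
  <tr class="record">
    <td class="authorname">Example Author</td>
    <td class="playname">Example Play</td>
  </tr>
  <tr>
    <td colspan="2">
      <div><span class="label">DEEP #</span> 0000</div>
      <div><span class="label">Greg #</span> 000a</div>
    </td>
  </tr>
</table>
"""


def parse_results(xhtml):
    """Return one dict per 'record' row, with details from the row after it."""
    table = ET.fromstring(xhtml.strip())  # assumes the table is the root
    rows = table.findall("tr")
    records = []
    for i, row in enumerate(rows):
        if row.get("class") != "record":
            continue
        record = {
            "author": row.findtext("td[@class='authorname']", "").strip(),
            "title": row.findtext("td[@class='playname']", "").strip(),
            "details": {},
        }
        # The description lives in the next sibling row, which has no class.
        if i + 1 < len(rows) and rows[i + 1].get("class") is None:
            for span in rows[i + 1].iter("span"):
                if span.get("class") == "label":
                    # The data is the text node right after the label span.
                    label = (span.text or "").strip()
                    record["details"][label] = (span.tail or "").strip()
        records.append(record)
    return records
```

ElementTree has no next-sibling axis, so the sketch indexes into the list of rows instead; with a full XPath engine the following-sibling axis would do the same job.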

The Other DEEP Interface

One of the fields in that row of description is labelled 'DEEP #', and it is tempting to think that this number might be used to refer directly to a document via the unadvertised single-record view at URLs of the form:{deep_record_number}
Unfortunately, the DEEP citation number is not (yet) the linkable record number, which appears to be a surrogate key from the backing database. However, if the STC number you've searched for has "contained" descriptions, those descriptions are linked in the div labelled 'Collection contains:'. Each of these anchor elements (span[@class='label' and text()='Collection contains:']/../a) has a javascript URI calling a function in its href attribute, and parsing the number argument to that function will provide you with the id necessary to fetch the contained descriptions with the viewrecord.php script. Conversely, from those descriptions you can mine the viewrecord.php id of the original collection: Follow the same pattern as before, but look instead for the label 'In Collection:'.
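As a rough illustration of pulling the record id out of one of those javascript hrefs, a regular expression is probably enough. The function name in the example href (showRecord) is made up, since the actual name isn't given here; the pattern accepts any function name with a single numeric argument.

```python
import re

# Matches e.g. "javascript:someFunction(123)"; the function name is not
# assumed, only that its one argument is a number (optionally quoted).
JS_ID = re.compile(r"^javascript:\s*[\w.]+\(\s*'?(\d+)'?\s*\)")


def record_id_from_href(href):
    """Return the numeric argument of a javascript: href, or None."""
    match = JS_ID.match(href.strip())
    return int(match.group(1)) if match else None
```

For a hypothetical href such as "javascript:showRecord(123);" this yields 123, which is the id you would then hand to viewrecord.php; any href that doesn't fit the pattern yields None.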

Sunday, July 12, 2015

Quick Observations on CAP and Graduate Student Loans

First, please take a look at Scott Weingart's piece on journalism, charts/visualizations, data, and viral texts.

Several articles (e.g. Washington Post: "These 20 schools are responsible for a fifth of all graduate school debt", July 9; Yahoo: "20 schools account for $6.6 billion of U.S. government grad student loans", July 10) were in circulation this week, apparently re-written from Elizabeth Baylor's piece for the Center for American Progress in the Chronicle of Higher Education (paywalled, "As Graduate-Student Debt Booms, Just a Few Colleges Are Largely Responsible", July 8). The source data is available online: It's from the Title IV Program Volume Reports, Loan Volume -> Direct Loan Program -> AY 2013-2014 Q4 (second sheet is award year summary, look at columns T and AD).

These data don't break the awards down by program; in this context, the downstream claims from WaPo:
What’s striking about the Center’s findings is that a majority of the debt taken to attend the 20 schools on its list is not for law or medical degrees that promise hefty paydays. Most graduate students at those schools are seeking master’s degrees in journalism, fine arts or government, according to CAP.
... look a little fishy. The CAP claim is more nuanced (emphases mine):
But it appears that a majority of debt taken on to attend those institutions is not for costly law or medical degrees, but for nonterminal degrees. Among the 20 institutions responsible for the most graduate debt, 81 percent of graduate degrees conferred in the most recent year were master’s degrees. ... As at other universities on the list, most graduate students at those institutions earn master’s degrees, in disciplines like journalism, fine arts, government, and the sciences.
This makes no claim about the relationship of Master's degrees overall to that student debt (though it implies something), nor about the proportion of that debt attributable to Master's programs. It is worth noting, in these STEM reform times, that WaPo drops "the sciences" from its enumeration of implicitly-blamed programs.

The question raised for me, considering how suspicious the implications about degree programs and debt are, is what percentage of the debt carried by graduate students (especially at the private non-profits) is actually in the more expensive professional schools and "executive" Master's programs.