Thursday, July 13, 2017

Drifting

In a post reflecting on software development practice in the Hydra/Samvera community, Jonathan Rochkind begins a late pivot towards a more general complaint by framing Samvera this way:

And finally, a particularly touchy evaluation of all for the hydra/samvera project; but the hydra project is 5-7 years old, long enough to evaluate some basic premises. I’m talking about the twin closely related requirements which have been more or less assumed by the community for most of the project’s history:
1) That the stack has to be based on fedora/fcrepo, and
2) that the stack has to be based on native RDF/linked data, or even coupled to RDF/linked data at all.
I believe these were uncontroversial assumptions rather than entirely conscious decisions, but I think it’s time to look back and wonder how well they’ve served us, and I’m not sure it’s well.

This five-sentence history of Hydra/Samvera is a fabrication. The Hydra project began in 2008 as an attempt to combine a Blacklight discovery layer and a Fedora 3 repository, debatably with the goal of improving the notion of services/disseminators in the Fedora 3 CMA by making them contained applications. The Fedora Commons project was one of its original partners. It's strange to characterize that backend as an assumption rather than the motivating use case when the core library from the project's onset is ActiveFedora (published February 2009).

I'm more sympathetic to interrogating the relationship of Samvera to linked data, but casting that decision as an assumption (rather than the conscious development goal of Hydra/Samvera partners who were trying to focus their descriptions less on XML serializations and more on the description as data) is patronizing. I can agree that we should "look back and wonder how well they’ve served us", but it's always been the time to do that (as far as I can tell, the ActiveFedora::RDF package was introduced in 2013 as a reaction to frustration managing object descriptions as files). If I were an employee at Penn State, whose work prompted the accommodation of RDF as ActiveRecord-style properties rather than as serialized files, I'd be insulted to see my work characterized as the product of not "entirely conscious decisions" or "uncontroversial assumption[s]".

At more than one meeting now (in the interest of disclosure: I gave a talk on related topics at OR2016, participated in two panels touching on the issue at Hydra Connect 2016, and have been involved for some years with the Fedora Commons community and project), there's been open discussion of what the relationship of Samvera to Fedora ought to be going forward. It's clearly a question motivating the work on Valkyrie at Princeton. Over the years, more than one alternate backend has been written to mimic the Fedora APIs. There are analogous conversations in the world of Blacklight when a potential adopter wants to use a store other than Solr.

The critical question underlying those efforts and conversations is to what degree the software products of these various projects should be shareable, whether the surface of interaction is within a platform, across a shared index, across API abstractions, or at the achievement of consensus around use cases and functional requirements. Rochkind takes a different tack, suggesting that a hard pivot away from abstraction should be the baseline and arguing that we need to justify any commitment beyond Rails and Paperclip. This strikes me as reductive, and dismissive of my own experience: generalizing description in a database moves pretty quickly towards re-inventing RDF in tables, and storing blobs of serialized description leads to re-inventions of Fedora without the mediating APIs. If our response to the problems motivating Rochkind's post were to advocate interacting directly with the backing databases and file systems of Fedora, it might work (it might even be faster!), but we would certainly not be proposing it as a minimalist path towards more sustainable software.

We can scrutinize our approaches to the problem of managing assets and description in shareable ways without a fanciful and dismissive historical framing of the project and the use cases of its participating institutions. But we should also be cognizant of what *some kind* of abstraction yields: an Avalon or a Charon can function as a common tool to originate content subsequently repurposed for independent, locally developed publication platforms; I still think we'll inch towards shared practice, and thus shared content, with Islandora. Integrative projects like this require some kind of interface; the question is where to locate it.

Tuesday, December 20, 2016

Just a bunch of ESTC Library Names

Following Meaghan Brown on trying to match STC and ESTC library names, I threw together a quickie ruby script that parses all the library names from the ESTC library name browse list, then follows the "Next Page" links while they are present and grabs the next set.
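That script amounts to a fetch-parse-follow loop. Here is the same loop sketched in Python with standard-library pieces; the `libname` class and the "Next Page" anchor markup below are hypothetical stand-ins for whatever the real browse pages use, so the patterns would need adjusting against the actual HTML:

```python
import re
import urllib.request

# Hypothetical markup patterns: inspect the real ESTC browse page and
# adjust these to match its actual anchors.
NAME_RE = re.compile(r'<a class="libname"[^>]*>([^<]+)</a>')
NEXT_RE = re.compile(r'<a[^>]*href="([^"]+)"[^>]*>Next Page</a>')

def parse_page(html):
    """Return (library_names, next_page_url_or_None) from one browse page."""
    names = NAME_RE.findall(html)
    m = NEXT_RE.search(html)
    return names, (m.group(1) if m else None)

def harvest(start_url, fetch=lambda u: urllib.request.urlopen(u).read().decode()):
    """Follow 'Next Page' links while they are present, accumulating names."""
    names, url = [], start_url
    while url:
        page_names, url = parse_page(fetch(url))
        names.extend(page_names)
    return names
```

Injecting `fetch` keeps the pagination logic testable without touching the network.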

Sunday, July 26, 2015

Fetching a DEEP record, ESTC in hand, Part 2

Having been recommended the Database of Early English Playbooks (DEEP) as a source for descriptive metadata, and mapped our ESTC citations to STC/Wing numbers, we can look at programmatically querying DEEP.

DEEP as currently released (the 2007 project) is a PHP search app in which a single resource (search.php) presents both the query interface and the results, switching modes according to the HTTP request semantics and form fields. This is important: as far as I can tell, DEEP requires the form fields to be POSTed as multipart data; issuing a GET or POST to a DEEP URL composed with query parameters will only return the search page.

The DEEP Search Interface

Looking under the hood of the DEEP search interface requires more than just viewing the source: some javascript manipulates the form based on user input (in large measure to steer queries on fields with a controlled list of values, like author), but you can get a picture of the effective source in a browser that supports DOM inspection. For example, in Chrome, you can control- or right-click in the search interface and select 'Inspect element'. If we've selected 'STC / Wing Number' as the search type, we'll see something like this in the inspected source:

That tells us quite a bit about how search.php works, but for our purposes we are concerned about only 2 of the fields:

  1. terms[0][type], which we want to be 'stc_or_wing'
  2. terms[0][val], which we want to be the STC number we're searching for


The other fields pertain to adding a second query, how many results are returned, and how the results are sorted. We're assuming the simplest case (unique match between STC number and description), so the other fields are not relevant (and importantly, not required by the PHP script).

Once we've sorted out these basics, it's not very difficult to execute a DEEP query outside the browser. Here, for example, is some BASH calling cURL, executable from the terminal window or in a shell script:

BASE_URL="http://deep.sas.upenn.edu/search.php"
FORM="--form terms[0][type]=stc_or_wing"
FORM="$FORM --form terms[0][val]=$1"
curl $FORM "$BASE_URL"
... where $1 is replaced with the STC/Wing number we are searching for. With cURL, you might also use the --data parameter; this would require concatenating the fields into a single value separated by ampersands (&). If you were using Python and the requests library, something like:
import requests

payload = {
    'terms[0][type]': 'stc_or_wing',
    'terms[0][val]': 'some STC number',
}
r = requests.post("http://deep.sas.upenn.edu/search.php", data=payload)
... should work, too.

Parsing the DEEP Results

First, a note of thanks: the creators of DEEP encode its search results in XHTML, a variant of HTML that further requires documents to be valid XML. Producing HTML that's also valid XML isn't necessary, but it's a sign that the creators care about the documents being parseable with less effort.

As you will be able to see from the output in your terminal, DEEP presents its search results in a TABLE element with the id 'searchresults'. This table has a row (TR) of column labels (id = 'headerrow'), and then presents the search results as pairs of rows, the first with a class 'record', and the second (containing the details of the description) with no class immediately following. If we were parsing this content with (for example) an XPath utility, we would iterate over:
//table[@id='searchresults']/tr[@class='record']
... and refer also to the next sibling of the TR element in our node handling.

The record rows contain the author (./td[@class='authorname']) and title (./td[@class='playname']). The description is a little more difficult to parse, since the nested div elements in that row present data adjacent to a span[@class='label'] whose content indicates the type of data (e.g. "Greg #"), and are followed by a text node containing the data.
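Putting that together, here is a sketch using Python's standard-library ElementTree. The sample markup below is a hypothetical, stripped-down stand-in for the real results page (notably, it omits the XHTML namespace, which the live document would carry and which would change the tag names ElementTree sees):

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free stand-in for DEEP's XHTML results table.
SAMPLE = """
<table id="searchresults">
  <tr id="headerrow"><td>Author</td><td>Title</td></tr>
  <tr class="record">
    <td class="authorname">Marlowe, Christopher</td>
    <td class="playname">Doctor Faustus</td>
  </tr>
  <tr>
    <td><div><span class="label">Greg #</span> 205</div></td>
  </tr>
</table>
"""

def parse_results(xhtml):
    table = ET.fromstring(xhtml)
    # ElementTree has no following-sibling axis, so pair rows by index;
    # this assumes every 'record' row is immediately followed by its details row.
    rows = list(table)
    records = []
    for i, tr in enumerate(rows):
        if tr.get("class") != "record":
            continue
        rec = {
            "author": tr.findtext("td[@class='authorname']"),
            "title": tr.findtext("td[@class='playname']"),
        }
        # Each span.label names a datum; the value is the text node that
        # follows it, which ElementTree exposes as the span's .tail.
        for span in rows[i + 1].iter("span"):
            if span.get("class") == "label":
                rec[span.text.strip()] = (span.tail or "").strip()
        records.append(rec)
    return records
```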

The Other DEEP Interface

One of the fields in that row of description is labelled 'DEEP #', and it is tempting to think that this number might be used to refer directly to a document via the unadvertised single-record view at URLs of the form:
http://deep.sas.upenn.edu/viewrecord.php?deep_id={deep_record_number}
Unfortunately, the DEEP citation number is not (yet) the linkable record number, which appears to be a surrogate key from the backing database. However, if the STC number you've searched for has "contained" descriptions, those descriptions are linked in the div labelled 'Collection contains:'. Each of these anchor elements (span[@class='label' and text()='Collection contains:']/../a) has a javascript URI calling a function in its href attribute, and parsing the number argument to that function will provide you with the id necessary to fetch the contained descriptions with the viewrecord.php script. Conversely, from those descriptions you can mine the viewrecord.php id of the original collection: Follow the same pattern as before, but look instead for the label 'In Collection:'.

Sunday, July 12, 2015

Quick Observations on CAP and Graduate Student Loans

First, please take a look at Scott Weingart's piece on journalism, charts/visualizations, data, and viral texts.

Several articles (e.g. Washington Post: "These 20 schools are responsible for a fifth of all graduate school debt", July 9; Yahoo: "20 schools account for $6.6 billion of U.S. government grad student loans", July 10) were in circulation this week, apparently re-written from Elizabeth Baylor's piece for the Center for American Progress in the Chronicle of Higher Education (paywalled, "As Graduate-Student Debt Booms, Just a Few Colleges Are Largely Responsible", July 8). The source data is available online: it's from the Title IV Program Volume Reports, Loan Volume -> Direct Loan Program -> AY 2013-2014 Q4 (second sheet is the award year summary; look at columns T and AD).

These data don't break the awards down by program; in this context, the downstream claims from WaPo:
What’s striking about the Center’s findings is that a majority of the debt taken to attend the 20 schools on its list is not for law or medical degrees that promise hefty paydays. Most graduate students at those schools are seeking master’s degrees in journalism, fine arts or government, according to CAP.
... look a little fishy. The CAP claim is more nuanced (emphases mine):
But it appears that a majority of debt taken on to attend those institutions is not for costly law or medical degrees, but for nonterminal degrees. Among the 20 institutions responsible for the most graduate debt, 81 percent of graduate degrees conferred in the most recent year were master’s degrees. ... As at other universities on the list, most graduate students at those institutions earn master’s degrees, in disciplines like journalism, fine arts, government, and the sciences.
This makes no claim about the relationship of Master's degrees overall to that student debt (though it implies something), nor about the proportion of debt attributable to Master's programs. It is worth noting, in these STEM reform times, that WaPo drops "the sciences" from its enumeration of implicitly-blamed programs.

The question raised for me, considering how suspicious the implications about degree programs and debt are, is what percentage of the debt carried by graduate students (especially at the private non-profits) is actually in the more expensive professional schools and "executive" Master's programs.

Wednesday, June 17, 2015

Fetching a DEEP record, ESTC in hand, Part 1

Pretext

I recently began working with an Early Modern English corpus in which the bibliographic metadata was sparse, but all the items were identified with an ESTC number. I was recommended the Database of Early English Playbooks (DEEP) as a source of description, but ran into problems almost immediately...

Wait: let me pause here and be clear: these are valuable online resources doing important work in support of scholarship. When I say "problems", I mean "inconveniences for someone trying to teach a script or program to process". As valuable as they are, these resources are a little long in the tooth: the DEEP web interface I describe is from 2007, the ESTC interface of unclear vintage. Applying the linkable-data, mashable-API aesthetics of the mid-'00s to them is not fair. And yet, here I am with work to do. So:

Problems!

  1. DEEP is searchable by STC/Wing numbers, but not ESTC citation numbers
  2. DEEP is navigated through pop-ups and what not, and requires POSTed form data to search.

Getting the STC/Wing Numbers

To get the STC/Wing numbers I went to the presumptive online source of record: the ESTC online at http://estc.bl.uk/. This interface supports permalinks to identifiers based on the ESTC citation number (promising!). These links are of the form http://estc.bl.uk/{CITATION}. Unfortunately, these links do not resolve to the items themselves, but redirect to a search page with a single result. Ignoring the session tracking bits of the URLs (which appear to be removable), the search redirects are like so:
  1. http://estc.bl.uk/{CITATION}
  2. http://estc.bl.uk/F/?func=find-b&local_base=BLL06&request={CITATION}&find_code=ESTID
The resulting page communicates the context of a server-managed search, and so the full-record page the permalink presumably maps to has to be parsed out of the response markup. The search result set is nested in tables, but the useful data is down in the result rows, whose data cells have the class "td1". Within those rows are some summary metadata (title, author) and a link to the full-record page we have wanted. We can distinguish these URLs because, in addition to a set_number parameter, they have a set_entry parameter. So, presuming we have only XPath:

//td[@class='td1']//a[contains(@href,'&set_entry')]/@href

... which should (and we will cross our fingers that there's just the one entry in the result set) get a URL of the form:
http://estc.bl.uk/F/?func=full-set-set&set_number=NNNNNN&set_entry=000001&format=999
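Under the assumption that only the full-record anchors carry both a set_number and a set_entry parameter, the extraction can be sketched in Python (a real HTML parser is sturdier than a regex against a page like this, but this shows the test):

```python
import re

def full_record_links(html):
    """Return hrefs carrying both set_number and set_entry parameters,
    which is what distinguishes full-record links from the other anchors
    on the ESTC result page."""
    hrefs = re.findall(r'href="([^"]+)"', html)
    return [h for h in hrefs if "set_entry" in h and "set_number" in h]
```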

Result! But where is the STC/Wing number? The table in the full record page's HTML is structured for presentation to humans: readable, but hard to parse. You could iterate over the rows of the table with ID=estcdata, looking for rows whose second cell starts with Wing or STC and trying to parse IDs out of them. A more accurate approach is available if you notice that the full record is available in multiple formats, one of which is MARC. It's linked at the top of the full record page, but you can also get there by changing the format parameter in the URL from 999 to 001. In the MARC format, you can more precisely look for rows in the estcdata table whose first cell's data is '5104'...

Wait, 5104? Yes: this is the MARC 510 field (Citation/References Note) with its first indicator set to 4 (location in source given).

... whose first cell's data is '5104', and parse the subfields out. This will be easiest to do in some kind of scripting language: the subfield delimiter is a pipe '|', the subfield code is a single character (in this case 'a' or 'c'), and there's whitespace for legibility. We are interested in the value of the c subfield when the a subfield starts with Wing or STC (and we should be case-insensitive to be safe). Whew.
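That parsing step is easy to get subtly wrong, so here is a sketch in Python; the sample subfield strings in the docstring and tests are illustrative, not copied from a live ESTC record:

```python
def stc_from_510(subfield_text):
    """Parse the pipe-delimited subfields of an ESTC 510 row and return the
    $c value when $a names the STC or Wing catalogue (case-insensitively).
    Illustrative input: '|a Wing |c C1593.'
    """
    subfields = {}
    for chunk in subfield_text.split("|"):
        chunk = chunk.strip()
        if chunk:
            # First character is the subfield code; the rest is the value.
            subfields[chunk[0]] = chunk[1:].strip()
    source = subfields.get("a", "")
    if source.lower().startswith(("wing", "stc")):
        return subfields.get("c")
    return None
```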

Ok! With a Wing/STC identifier in hand, we can teach a computer to look up a DEEP entry, which at this point is another post.

Saturday, February 13, 2010

Link dump

http://www.ibm.com/developerworks/webservices/library/ws-restwsdl/
http://www.keith-chapman.org/2008/09/restfull-mashup-with-wsdl-20-wso2.html

Saturday, January 30, 2010

Jackrabbit, RMI, etc.

JCR Remote Repo

Should a variation on Server implement javax.jcr.Repository? Or a modularized wrapper? Will need access to either JAAS or internal machinery for authN, as well.