Saturday, May 23, 2020

Follow up on pulling Internet Archive ebooks data for reuse

Following up on a recent post - and making belated good on a promise to a colleague, sorry Alex!

  1. I touched on pagination in that post, but didn't mention sorting! IA's search API won't return results in a predictable order unless you specify some kind of sort.
  2. I put together a Python script that I think embodies what I wanted to document in that post; a rough sketch of the core request is below.
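
Here's that sketch - a minimal version of the sorted, paginated request (assuming the requests library; 'identifier asc' is just an example sort value, any indexed field works):

    import requests

    # Without an explicit sort, the same page number can return different docs
    # between requests; sorting keeps the page boundaries stable.
    params = {
        "q": "collection:muslim-world-manuscripts",
        "sort[]": "identifier asc",  # example sort value
        "rows": 50,
        "page": 1,
        "output": "json",
    }
    resp = requests.get("https://archive.org/advancedsearch.php", params=params)
    docs = resp.json()["response"]["docs"]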


I usually work in Ruby (or, you know - bash) these days, but I'm trying to knock the rust off with Python, for a few reasons:

  1. It's the preferred language of my code-friendly colleagues, and I prefer to be able to just share an annotated script for tasks of mutual interest.
  2. I've been playing around a little with rendering 3D models, and my harrowing experience with OpenCV and Ruby motivates me to just use the dang Python bindings for, say, Blender. Could you imagine, writing some FFI rig for Blender? No, let's just get on with it, thanks.
  3. I got rusty! You can't get rusty. I realize I've just summoned a Java project into my future.

Tuesday, May 19, 2020

Numpy Surprise: "Non-string object detected for the array ordering. Please pass in 'C', 'F', 'A', or 'K' instead"

You've got to hand it to the numpy contributors, it's a pretty forthright error message.
If you are working with some old python and encounter:
ValueError: Non-string object detected for the array ordering. Please pass in 'C', 'F', 'A', or 'K' instead

... then this might be helpful: Numpy's ndarray order is an enum in the C API, and prior to v1.4.0 (!!!) python clients passed the corresponding integer constants directly as arguments to methods like flatten (for example).

v1.4.0 introduced an argument parser that maps the strings from the error message to the enum values. It's pretty straightforward to do the mapping yourself if you know what changed and look up the enum, but for convenience's sake:

pre-1.4.0 argument    Order Enum          post-v1.4.0 argument
-1                    NPY_ANYORDER        "A"
 0                    NPY_CORDER          "C"
 1                    NPY_FORTRANORDER    "F"
 2                    NPY_KEEPORDER       "K"
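
In practice the fix is usually a one-line change at the call site; a quick sketch of the before and after:

    import numpy as np

    arr = np.arange(6).reshape(2, 3)

    # pre-1.4.0 style: the raw enum value (now raises the ValueError above)
    # flat = arr.flatten(0)

    # post-1.4.0 style: pass the corresponding order string instead
    flat = arr.flatten("C")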

I'm sure no one else out there is looking at decade-old python, but just in case.

Monday, May 11, 2020

Rough and Ready Guide to Pulling Columbia University Libraries eBooks from IA for Reuse

As the basis for the examples here, I am referring to Columbia University Libraries' Muslim World Manuscripts upload, browsable at https://archive.org/details/muslim-world-manuscripts; but you might also identify a collection in the "Collection" facet at https://archive.org/details/ColumbiaUniversityLibraries.

There is a search UI for archive.org, and a pretty nice python client, too; but I will shortcut here to the resulting search URL for the collection identified above by the URL segment after /details/ (which becomes the 'q' parameter value "collection:muslim-world-manuscripts"):

https://archive.org/advancedsearch.php?q=collection%3Amuslim-world-manuscripts&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=

(this example removes the default JSONP callback value so that the response comes back as plain JSON)

This response is paginated (in this example, 50 docs per page via the 'rows' parameter, with pages numbered from 1 via the 'page' parameter in the URL).

Parse the returned JSON; it is referred to from here on as parsed_json.

The total number of results is available at parsed_json['response']['numFound'] - use this to determine how many pages of content there are, or to try to fetch them all in one page (if it's a modest number).
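
A quick sketch of that arithmetic (assuming the requests library and the 50-row pages from the example URL above):

    import math
    import requests

    search_url = (
        "https://archive.org/advancedsearch.php"
        "?q=collection%3Amuslim-world-manuscripts&rows=50&page=1&output=json"
    )
    parsed_json = requests.get(search_url).json()

    rows = 50  # the 'rows' parameter in the URL
    num_pages = math.ceil(parsed_json["response"]["numFound"] / rows)
    # then request page=2 .. num_pages, or re-request with rows=numFound
    # (if it's a modest number) to pull everything in one page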

Iterate over docs at parsed_json['response']['docs'] -

Docs will generally have a link back to CLIO under the key 'stripped_tags'; if you can match the pattern /http:\/\/clio.columbia.edu\/catalog\/([0-9]+)/ then appending '.marcxml' will allow you to download more detailed metadata from CLIO.

If stripped_tags does not provide this information, many (but not all) CUL docs have an identifier format that indicates a catalog id, e.g. ldpd_14230809_000 - the middle part of the id, delimited by underscores ('_'), is a record identifier usable in the CLIO URL pattern above in place of the grouped match (that is, as the last segment before the added '.marcxml').
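
A sketch of both lookups together (clio_marcxml_url is a hypothetical helper name, and I'm guessing at whether stripped_tags arrives as a string or a list):

    import re

    def clio_marcxml_url(doc):
        """Build a CLIO MARCXML URL from an IA search doc, or return None."""
        # stripped_tags may come back as a single string or a list of strings
        tags = doc.get("stripped_tags", "")
        if isinstance(tags, list):
            tags = " ".join(tags)
        match = re.search(r"http://clio\.columbia\.edu/catalog/([0-9]+)", tags)
        if not match:
            # fall back to identifiers shaped like ldpd_14230809_000
            match = re.match(r"ldpd_([0-9]+)_", doc.get("identifier", ""))
        if match:
            return f"http://clio.columbia.edu/catalog/{match.group(1)}.marcxml"
        return None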

Absent that, item details are available at archive.org as detailed below. There's also some metadata included in the docs from IA, but outside of 'date' it is often aggregated in an array/list under the tag 'description'. Some of the list members may have a field-like prefix (e.g. "Shelfmark: "), but the data from CLIO (if available) will be more reliable.

Each doc will have a value under the key 'identifier', which can be used to build URLs for content from IA:

metadata details: "https://archive.org/metadata/#{identifier}" (see also the metadata API docs)
thumbnail: "https://archive.org/services/img/#{identifier}"
poster: "https://archive.org/download/#{identifier}/page/cover_medium.jpg"
details: "https://archive.org/details/#{identifier}"
embeddable viewer (iframe): "https://archive.org/stream/#{identifier}?ui=full&showNavbar=false"
download pdf: "https://archive.org/download/#{identifier}/#{identifier}.pdf"
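
The interpolations above are Ruby-style; here are the same patterns as Python f-strings, as a sketch (the identifier value is the example from above - in practice it comes from doc['identifier']):

    identifier = "ldpd_14230809_000"  # example; in practice doc["identifier"]

    ia_urls = {
        "metadata":  f"https://archive.org/metadata/{identifier}",
        "thumbnail": f"https://archive.org/services/img/{identifier}",
        "poster":    f"https://archive.org/download/{identifier}/page/cover_medium.jpg",
        "details":   f"https://archive.org/details/{identifier}",
        "viewer":    f"https://archive.org/stream/{identifier}?ui=full&showNavbar=false",
        "pdf":       f"https://archive.org/download/{identifier}/{identifier}.pdf",
    }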

Wednesday, February 5, 2020

A Job on a Team I Know Very Well

This post will be elaborated ASAP, but I'm excited to start describing a job on the team I manage:

https://opportunities.columbia.edu/en-us/job/506308/developer-for-digital-research-and-scholarship

The principal portfolio is in tooling to support DH projects; the predecessor, Marii Nyröp, is the developer behind the Wax static site generator: https://minicomp.github.io/wax/

It's a junior position with the most forgiving experience prerequisites we could manage, but I like to think our team has a track record of mentorship and professional development.

The incumbent would join a team with a lot of experience and an appetite for learning and improving.  We're in the midst of our first big push towards React/GraphQL over Rails and Solr. We use (and have imminent plans to elaborate our implementation of) IIIF.  There's a Software Carpentries program with certification opportunities. More soon!

Friday, August 23, 2019

The Island and the Archipelago

This post does not begin with an orthogonal observation about NYC as archipelago, but: Close Rikers.

On 22 August 2019 I sat in on a meeting of the Archipelago Advisory Board at METRO.

METRO developers describe Archipelago as:
... an evolving Open Source Digital Objects Repository / DAM Server Architecture based on the popular CMS Drupal8/9 and ... a mix of deeply integrated custom-coded Drupal8 modules (made with care by us) and a curated and well-configured Drupal8 instance, running under a discrete and well-planned set of service containers. All of this driven by a clear and concise but thoughtfully planned technical roadmap.
Archipelago was dreamt as a multi-tenant, distributed, capable system (as its name suggests!) and can live isolated or in flocks of similar deployments, sharing storage, services, or -- even better -- just the discovery layer. Learn more about the different Software Services used by Archipelago.
Archipelago's primary focus is to serve the GLAM community by providing a flexible, consistent, and unified way of describing, storing, linking, exposing metadata and media assets. We respect identities and existing workflows.
All of this operates under a different concept than the one we all have become used to in recent times.

I think this is a compelling project, but I want to push on that last sentence a bit.

Archipelago might be summarized as collapsing several parts of the early-2010s-style repository application system - certainly the management application and the repository itself, possibly the reader/researcher-facing publications - into a Drupal management application. Part of this is accomplished by recognizing that at least some of the nuts-and-bolts work of the repository has been subsumed into services - for example, an S3 bucket as storage API or a DOI source as an identification API.

Archipelago eschews a system-determined schema for objects in favor of json (I think json-ld, actually) as an object storage format, and it leverages the object interpretation of the stored json to expose the objects to Twig templates.

In the subsumption of the repository into a management and publication tool, Archipelago tracks with Princeton's Figgy (and, I might add, the work our team does at Columbia on Hyacinth, although we still publish to a Fedora installation). It is also not an alien trajectory to the one Stanford's SDR has been on - or Duke's Digital Repository, which was also recently redesigned more completely around Drupal.

I gave a talk at Open Repositories in 2016 (the Dublin one, where the video disappeared) about the future of Fedora Commons and APIs, and in it I briefly digressed into the virtues of CDL's curation microservices (to the surprise, I think, of the CDL delegation). But I think it's clear that even if Merritt didn't per se change the way we all go about this work, the footprints (feetprint?) of S3 and Datacite across digital libraries suggest that the disarticulation of the repository into a process of services has happened - a trend that continues in Archipelago, which disarticulates a storage service, a description service, and an index service/API (Solr in particular, but the particulars are not necessary or even especially interesting to me this morning).

Listening to the METRO folks (that is, the inimitable Diego Pino) discuss Archipelago's templating system, I found myself reconsidering along these lines the Fedora Commons 3 Content Model Architecture. No, seriously!

The CMA was an elaboration of Sandy Payette and Carl Lagoze's Flexible and Extensible Digital Object Repository Architecture (that's right, F E D O R A) disseminators into a quasi-SOAP, aspirationally object-oriented set of behaviors specified as best they could be in other repository objects, and linked with RDF assertions between the content-bearing objects and their linked type-defining objects. In a frictionless world, this is an excellent model.

Unfortunately, the Fedora 3 CMA was deployed in the frictional world of J2EE. The practical constraints of Fedora-side implementation meant that the syntax was arcane, fragile, and expensive to run (as the linked services, hidden behind REST-fully accessed object "property" URLs, made calls back to the repository to get the information they needed from the object for which they were building a response while the client waited and so on). Like the Handle architecture (that's right, I said it) things weren't necessarily this way - but the social, organizational and platform considerations of the day determined them.

Not very long after, the Hydra project (staffed by Fedora Commons committers, growing out of the Blacklight project at Virginia, and aiming to manage and index Fedora content) would begin developing what might be understood as an overlay approach to disseminators in Rails apps. While the mixins and gems of the resulting Rails framework might themselves not seem to track towards Twig templates, moving the environment of development into a platform (see also Islandora on Drupal and Emory University's analogous work on Django) that has more front-end concerns and a diversity of dynamically evaluated template options strikes me as a necessary conceptual step towards them. Or, if not necessary, supported by being less horrifying than storing JSP and recompiling them to operate against your object exposed as JAXB somehow. A chill runs up my spine.

This is all to say that I see the Archipelago project as being on a vector of repository work (and as noted above, not alone on it) that intersects previous work. It's not the only vector that does - we remain in a holding pattern about the distinct repository at my place of work, and I think there's service preservation/sustainability arguments that can be made for it, to say nothing of performance considerations - but I think it makes interesting observations about where to locate and value the labor of running, managing, and publishing digital collections. Its design approach also makes a claim about what the optimal balance of abstraction and community of practice is. I'm interested to see where it goes from here.


Thursday, July 13, 2017

Drifting

In a post reflecting on software development practice in the Hydra/Samvera community, Jonathan Rochkind begins a late pivot towards a more general complaint with this framing of Samvera:

And finally, a particularly touchy evaluation of all for the hydra/samvera project; but the hydra project is 5-7 years old, long enough to evaluate some basic premises. I’m talking about the twin closely related requirements which have been more or less assumed by the community for most of the project’s history:
1) That the stack has to be based on fedora/fcrepo, and
2) that the stack has to be based on native RDF/linked data, or even coupled to RDF/linked data at all.
I believe these were uncontroversial assumptions rather than entirely conscious decisions, but I think it’s time to look back and wonder how well they’ve served us, and I’m not sure it’s well.

This 5-sentence history of Hydra/Samvera is a fabrication. The Hydra project began in 2008 as an attempt to combine a Blacklight discovery layer and a Fedora 3 repository, debatably with the goal of improving the notion of services/disseminators in the Fedora 3 CMA by making them contained applications. The Fedora Commons project was one of its original partners. It's strange to characterize that backend as an assumption rather than the motivating use case when the core library from the project's outset is ActiveFedora (published February 2009).

I'm more sympathetic to interrogating the relationship of Samvera to linked data, but casting that decision as an assumption - rather than the conscious development goals of Hydra/Samvera partners who were trying to focus their descriptions less on XML serializations and more on the description as data - is patronizing. I can agree that we should "look back and wonder how well they’ve served us", but it's always been the time to do that (as far as I can tell, the ActiveFedora::RDF package was introduced in 2013 as a reaction to frustration managing object descriptions as files). If I were an employee at Penn State, whose work prompted the accommodation of RDF as ActiveRecord-style properties rather than as serialized files, I'd be insulted to see my work characterized as the product of not "entirely conscious decisions" or "uncontroversial assumption[s]".

At more than one meeting now (in the interest of disclosure: I gave a talk on related topics at OR2016, participated in two panels touching on the issue at Hydra Connect 2016, and have been involved for some years with the Fedora Commons community and project), there's been open discussion of what the relationship of Samvera to Fedora ought to be going forward. It's clearly a question motivating the work on Valkyrie at Princeton. Over the years there's been more than one alternate backend written to mimic the Fedora APIs. There are analogous conversations in the world of Blacklight when a potential installer wants to use a different NoSQL store than Solr.

The critical question underlying those efforts and conversations is to what degree the software products of these various projects should be shareable, whether the surface of interaction is within a platform, across a shared index, across API abstractions, or at the achievement of consensus around use case and functional requirements. Rochkind takes a different tack, suggesting that a hard pivot away from abstraction should be the baseline and arguing that we need to justify any commitment beyond Rails and Paperclip. This strikes me as reductive and dismissive of my own experience: that generalizing description in a database moves pretty quickly towards re-inventing RDF in tables, and that storing blobs of serialized description leads to re-inventions of Fedora without the mediating APIs. If our response to the problems motivating Rochkind's post were to advocate interacting directly with the backing databases and file systems of Fedora, it might work - it might be faster! - but we would certainly not be proposing it as a minimalist path towards more sustainable software approaches.

We can scrutinize our approaches to the problem of managing assets and description in shareable ways without a fabulous and dismissive historical framing of the project and the use cases of its participating institutions. But we should also be cognizant of what *some kind* of abstraction yields: An Avalon or a Charon can function as a common tool to originate content subsequently repurposed for independent, locally developed publication platforms; I still think we'll inch towards shared practice, and thus shared content, with Islandora. Integrative projects like this require some kind of interface - the question is where to locate it.

Tuesday, December 20, 2016

Just a bunch of ESTC Library Names

Following Meaghan Brown on trying to match STC and ESTC library names, I threw together a quickie ruby script that parses all the library names from the ESTC library name browse list, then follows the "Next Page" links while they are present and grabs the next set.
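
Since the other recent posts here have been in Python, here is roughly the same approach sketched in Python rather than Ruby - the start URL, the selector for the name elements, and the exact "Next Page" link text are all placeholders/assumptions, not the real ESTC markup:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    # Placeholder; substitute the actual ESTC library name browse list URL.
    url = "https://example.org/estc-library-name-browse"
    names = []

    while url:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # Assumption: each library name sits in an element we can select.
        names.extend(el.get_text(strip=True) for el in soup.select(".library-name"))
        # Follow the "Next Page" link while one is present.
        next_link = soup.find("a", string="Next Page")
        url = urljoin(url, next_link["href"]) if next_link else None

    print(len(names))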