Saturday, May 23, 2020

Follow up on pulling Internet Archive ebooks data for reuse

Following up on a recent post - and making belated good on a promise to a colleague, sorry Alex!

  1. I touched on pagination in that post, but didn't mention sorting! IA's search API does not return results in a predictable order unless you specify some kind of sort.
  2. I put together a Python script that I think embodies what I wanted to document in that post.
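
To make the sorting point concrete, here is a minimal sketch of building the search URL with an explicit sort. It is not the script mentioned above; the function name build_search_url and the choice of "identifier asc" as the sort value are my assumptions for illustration, and any stable field should serve the same purpose.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(collection, page=1, rows=50, sort="identifier asc"):
    """Build a paginated search URL with an explicit sort.

    The default sort value is an assumption; the point is only that
    some sort is always sent, so pages do not shuffle between requests.
    """
    params = [
        ("q", f"collection:{collection}"),
        ("sort[]", sort),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    # urlencode percent-encodes the colon and the brackets for us.
    return f"{SEARCH_URL}?{urlencode(params)}"
```

Calling build_search_url("muslim-world-manuscripts", page=2) gives the second page of 50 docs under the same ordering as the first.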


I usually work in Ruby (or, you know - bash) these days, but I'm trying to knock the rust off with Python, for a few reasons:

  1. It's the preferred language of my code-friendly colleagues, and I prefer to be able to just share an annotated script for tasks of mutual interest.
  2. I've been playing around a little with rendering 3D models, and my harrowing experience with OpenCV and Ruby motivates me to just use the dang Python bindings for, say, Blender. Could you imagine, writing some FFI rig for Blender? No, let's just get on with it, thanks.
  3. I got rusty! You can't get rusty. I realize I've just summoned a Java project into my future.

Tuesday, May 19, 2020

Numpy Surprise: "Non-string object detected for the array ordering. Please pass in 'C', 'F', 'A', or 'K' instead"

You've got to hand it to the numpy contributors, it's a pretty forthright error message.
If you are working with some old Python code and encounter:
ValueError: Non-string object detected for the array ordering. Please pass in 'C', 'F', 'A', or 'K' instead

... then this might be helpful: Numpy's ndarray order is an enum in the C API, and prior to v1.4.0 (!!!) python clients passed the corresponding value constants as arguments to methods like flatten (for example) directly.

In v1.4.0 there's an argument parser introduced that maps the strings from the error messages to the enum values. It's pretty straightforward to do the mapping if you know what changed and look up the enum, but for convenience's sake:

pre-1.4.0 argument    Order enum          post-1.4.0 argument
-1                    NPY_ANYORDER        "A"
0                     NPY_CORDER          "C"
1                     NPY_FORTRANORDER    "F"
2                     NPY_KEEPORDER       "K"
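
If you would rather patch the old call sites than memorize the table, the mapping can be sketched as a small shim. The names LEGACY_ORDER_TO_STRING and modern_order are hypothetical helpers of mine, not part of numpy; the values come straight from the table above.

```python
# Legacy integer order constants mapped to the string arguments
# that the error message asks for.
LEGACY_ORDER_TO_STRING = {-1: "A", 0: "C", 1: "F", 2: "K"}


def modern_order(order):
    """Return the string form of an order argument.

    Strings pass through unchanged; legacy integers are translated.
    """
    if isinstance(order, str):
        return order
    try:
        return LEGACY_ORDER_TO_STRING[order]
    except KeyError:
        raise ValueError(f"unrecognized array order: {order!r}") from None
```

An old call like arr.flatten(1) would then become arr.flatten(order=modern_order(1)), or just arr.flatten(order="F") once you know the value.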

I'm sure no one else out there is looking at decade-old python, but just in case.

Monday, May 11, 2020

Rough and Ready Guide to Pulling Columbia University Libraries eBooks from IA for Reuse

As the basis for the examples here, I am referring to Columbia University Libraries' Muslim World Manuscripts upload, browsable at https://archive.org/details/muslim-world-manuscripts; but you might also identify a collection in the "Collection" facet at https://archive.org/details/ColumbiaUniversityLibraries.

There is a search UI for archive.org, and a pretty nice python client, too; but I will skip straight to the search URL they produce. The collection is identified by the URL segment after /details/ above, which becomes a 'q' parameter value of "collection:muslim-world-manuscripts":

https://archive.org/advancedsearch.php?q=collection%3Amuslim-world-manuscripts&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=

(This example clears the default JSONP callback value so that the response is plain JSON.)

This response is paginated (in this example 50 docs per page from the 'rows' parameter, pages numbered from 1 in the URL by the 'page' parameter).

Parse the returned JSON; the result is referred to from here on as parsed_json.

The total number of rows is available at parsed_json['response']['numFound'] - use this to determine how many pages of content there are, or to try to fetch them in one page (if it's a modest number).
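
The page arithmetic is simple but easy to get off by one, so here is a rough sketch. The function name page_numbers is hypothetical; num_found stands in for parsed_json['response']['numFound'] and rows for the 'rows' parameter sent with the request.

```python
import math


def page_numbers(num_found, rows=50):
    """Return the 1-based page numbers needed to cover num_found docs."""
    if rows < 1:
        raise ValueError("rows must be at least 1")
    # Round up so a partial final page is still fetched.
    return list(range(1, math.ceil(num_found / rows) + 1))
```

For example, 120 docs at 50 rows per page need pages 1 through 3, and the last page carries only 20 docs.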

Iterate over the docs at parsed_json['response']['docs'].

Docs will generally have a link back to CLIO under the key 'stripped_tags'; if you can match the pattern /http:\/\/clio\.columbia\.edu\/catalog\/([0-9]+)/ then appending '.marcxml' to the matched URL will allow you to download more detailed metadata from CLIO.

If stripped_tags does not provide this information, many (but not all) CUL docs have an identifier format that indicates a catalog id, e.g. ldpd_14230809_000 - the middle part of the id, delineated by underscores ('_'), is a record identifier usable in the CLIO url patterns above in place of the grouped match (the last segment before the added '.marcxml').
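
The two lookups above can be sketched together as follows. This is a sketch under assumptions: the function name clio_marcxml_url is hypothetical, I am guessing that 'stripped_tags' may arrive as either a string or a list, and the identifier pattern only covers the ldpd_ form shown in the example.

```python
import re

# Pattern from the text, with the dots escaped.
CLIO_URL = re.compile(r"http://clio\.columbia\.edu/catalog/([0-9]+)")
# Assumed identifier shape, based on the single example ldpd_14230809_000.
LDPD_ID = re.compile(r"^ldpd_([0-9]+)_")


def clio_marcxml_url(doc):
    """Return a CLIO MARCXML URL for a search doc, or None if no id is found."""
    tags = doc.get("stripped_tags") or ""
    if isinstance(tags, list):
        tags = " ".join(str(tag) for tag in tags)
    # Prefer the explicit CLIO link; fall back to the identifier format.
    match = CLIO_URL.search(tags) or LDPD_ID.match(doc.get("identifier") or "")
    if match is None:
        return None
    return f"http://clio.columbia.edu/catalog/{match.group(1)}.marcxml"
```

A None return means neither route produced a catalog id, and the doc has to be described from IA's own metadata instead.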

Absent that, item details are available at archive.org as detailed below. There's also some metadata included in the docs from IA, but outside of 'date' it is often aggregated in an array/list under the tag 'description'. Some of the list members may have a field-like prefix (e.g., "Shelfmark: "), but the data from CLIO (if available) will be more certain.

Each doc will have a value under the key 'identifier' which can be used for downloading content from IA:

metadata details: "https://archive.org/metadata/#{identifier}" (see also the metadata API docs)
thumbnail: "https://archive.org/services/img/#{identifier}"
poster: "https://archive.org/download/#{identifier}/page/cover_medium.jpg"
details: "https://archive.org/details/#{identifier}"
embeddable viewer (iframe): "https://archive.org/stream/#{identifier}?ui=full&showNavbar=false"
download pdf: "https://archive.org/download/#{identifier}/#{identifier}.pdf"
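
Since the patterns above are written with Ruby-style interpolation, here is the same list as a small Python sketch. The function name ia_urls and the dictionary keys are my own labels, not anything IA defines; the URL shapes are copied from the list.

```python
def ia_urls(identifier):
    """Return the archive.org URLs listed above for one item identifier."""
    base = "https://archive.org"
    return {
        "metadata": f"{base}/metadata/{identifier}",
        "thumbnail": f"{base}/services/img/{identifier}",
        "poster": f"{base}/download/{identifier}/page/cover_medium.jpg",
        "details": f"{base}/details/{identifier}",
        "viewer": f"{base}/stream/{identifier}?ui=full&showNavbar=false",
        "pdf": f"{base}/download/{identifier}/{identifier}.pdf",
    }
```

For instance, ia_urls(doc['identifier'])['pdf'] gives the download link for a doc from the search results.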