Monday, May 11, 2020

Rough and Ready Guide to Pulling Columbia University Libraries eBooks from IA for Reuse

As the basis for the examples here, I am referring to Columbia University Libraries' Muslim World Manuscripts upload, browsable at https://archive.org/details/muslim-world-manuscripts; but you might also identify a collection in the "Collection" facet at https://archive.org/details/ColumbiaUniversityLibraries.

There is a search UI for archive.org, and a pretty nice Python client, too; but I will skip straight to the search URL that results. The collection is identified by the URL segment after /details/, which gives a 'q' parameter value of "collection:muslim-world-manuscripts":

https://archive.org/advancedsearch.php?q=collection%3Amuslim-world-manuscripts&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=

(This example empties the default JSONP callback value so that the response is plain JSON.)

This response is paginated (in this example 50 docs per page from the 'rows' parameter, pages numbered from 1 in the URL by the 'page' parameter).
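As a rough sketch, the search URL above can be rebuilt for any collection and page with Python's standard library. The function name search_url and the constant SEARCH_ENDPOINT are made up for illustration; the parameters simply mirror the ones in the example URL, including the three empty sort slots.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def search_url(collection, page=1, rows=50):
    """Build the advanced-search URL for one page of a collection's docs."""
    params = [
        ("q", f"collection:{collection}"),
        # Three empty sort slots, kept only to mirror the example URL.
        ("sort[]", ""),
        ("sort[]", ""),
        ("sort[]", ""),
        ("rows", rows),        # docs per page
        ("page", page),        # pages are numbered from 1
        ("output", "json"),
        ("callback", ""),      # empty callback: plain JSON rather than JSONP
    ]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(search_url("muslim-world-manuscripts"))
```

Calling search_url("muslim-world-manuscripts", page=2) should give the same URL with page=2, which is all that changes between requests.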

Parse the returned JSON; it is referred to from here on as parsed_json.

The total number of matching docs is available at parsed_json['response']['numFound']. Use this to determine how many pages of content there are, or to fetch everything in one page by raising 'rows' (if it's a modest number).
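One way to walk every page, sketched in Python under a few assumptions: page_count, fetch_json and all_docs are hypothetical names, and url_for_page is any callable you supply that maps a page number to a search URL. The fetch function is passed in so that the paging logic can be tried without touching the network.

```python
import json
import math
from urllib.request import urlopen


def page_count(num_found, rows=50):
    """Number of result pages needed to cover num_found docs."""
    return math.ceil(num_found / rows)


def fetch_json(url):
    """Fetch a URL and parse the body as JSON (performs a network request)."""
    with urlopen(url) as response:
        return json.load(response)


def all_docs(url_for_page, rows=50, fetch=fetch_json):
    """Yield every doc, requesting pages 1..N in turn.

    url_for_page is a caller-supplied function: page number -> search URL.
    """
    first = fetch(url_for_page(1))["response"]
    yield from first["docs"]
    # numFound from the first page tells us how many more pages to request.
    for page in range(2, page_count(first["numFound"], rows) + 1):
        yield from fetch(url_for_page(page))["response"]["docs"]
```

With 50 rows per page, 120 matching docs would take three requests; the rows value passed to all_docs should match the 'rows' parameter in the URLs that url_for_page produces.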

Iterate over the docs at parsed_json['response']['docs'].

Docs will generally have a link back to CLIO under the key 'stripped_tags'. If you can match the pattern /http:\/\/clio\.columbia\.edu\/catalog\/([0-9]+)/, then appending '.marcxml' to the matched URL will allow you to download more detailed metadata from CLIO.

If stripped_tags does not provide this information, many (but not all) CUL docs have an identifier format that indicates a catalog id, e.g. ldpd_14230809_000. The middle part of the identifier, delimited by underscores ('_'), is a record identifier usable in the CLIO URL pattern above in place of the grouped match (the last segment before the added '.marcxml').
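The two lookups just described might be combined as in the following Python sketch. The names clio_id and marcxml_url are hypothetical. Two things here are assumptions rather than anything documented: that 'stripped_tags' may arrive as either a single string or a list of strings, and that a usable identifier has exactly three underscore-separated parts with a numeric middle.

```python
import re

CLIO_LINK = re.compile(r"http://clio\.columbia\.edu/catalog/([0-9]+)")


def clio_id(doc):
    """Return the CLIO record id for a doc, or None if it cannot be determined."""
    tags = doc.get("stripped_tags") or []
    if isinstance(tags, str):  # assumption: value may be a string or a list
        tags = [tags]
    for tag in tags:
        match = CLIO_LINK.search(tag)
        if match:
            return match.group(1)
    # Fallback: identifiers shaped like ldpd_14230809_000 carry the id in the middle.
    parts = doc.get("identifier", "").split("_")
    if len(parts) == 3 and re.fullmatch(r"[0-9]+", parts[1]):
        return parts[1]
    return None


def marcxml_url(record_id):
    """CLIO catalog URL with '.marcxml' appended."""
    return f"http://clio.columbia.edu/catalog/{record_id}.marcxml"
```

A doc for which clio_id returns None has to fall back on whatever metadata archive.org itself holds.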

Absent that, item details are available at archive.org as detailed below. There's also some metadata included in the docs from IA, but apart from 'date' it is often aggregated in an array/list under the key 'description'. Some of the list members may have a field-like prefix (e.g. "Shelfmark: "), but the data from CLIO (if available) will be more reliable.
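If you do want to pick the prefixed entries out of 'description', a loose Python sketch follows. The function name labelled_fields is hypothetical, and the rule used to recognise a prefix (a short label of at most three words followed by ": ") is only a guess at what counts as field-like; expect to adjust it against real docs.

```python
def labelled_fields(description):
    """Split 'Label: value' entries out of a doc's description.

    Returns (fields, unlabelled): a dict of recognised labels and a list of
    the entries that did not look like a labelled field.
    """
    if isinstance(description, str):  # assumption: may be a string or a list
        description = [description]
    fields, unlabelled = {}, []
    for entry in description or []:
        label, sep, value = entry.partition(": ")
        # Heuristic: treat short leading phrases as field labels.
        if sep and label and len(label.split()) <= 3:
            fields[label] = value.strip()
        else:
            unlabelled.append(entry)
    return fields, unlabelled
```

If the same label appears twice, this sketch keeps only the last value; collect values into lists instead if that matters for your collection.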

Each doc will have a value under the key 'identifier', which can be used for downloading content from IA:

metadata details: "https://archive.org/metadata/#{identifier}" (see also the metadata API docs)
thumbnail: "https://archive.org/services/img/#{identifier}"
poster: "https://archive.org/download/#{identifier}/page/cover_medium.jpg"
details: "https://archive.org/details/#{identifier}"
embeddable viewer (iframe): "https://archive.org/stream/#{identifier}?ui=full&showNavbar=false"
download pdf: "https://archive.org/download/#{identifier}/#{identifier}.pdf"
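For convenience, the URL patterns listed above can be gathered into a single helper. This is a small Python sketch; the function name ia_urls and the dictionary keys are made up for illustration, and the identifier is percent-encoded as a precaution even though typical identifiers should not need it.

```python
from urllib.parse import quote


def ia_urls(identifier):
    """Return the archive.org URLs for one item, keyed by purpose."""
    ident = quote(identifier, safe="")
    base = "https://archive.org"
    return {
        "metadata": f"{base}/metadata/{ident}",
        "thumbnail": f"{base}/services/img/{ident}",
        "poster": f"{base}/download/{ident}/page/cover_medium.jpg",
        "details": f"{base}/details/{ident}",
        "viewer": f"{base}/stream/{ident}?ui=full&showNavbar=false",
        "pdf": f"{base}/download/{ident}/{ident}.pdf",
    }


for name, url in ia_urls("ldpd_14230809_000").items():
    print(f"{name}: {url}")
```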
