Thursday, May 5, 2022

Spring Check-In

In February, when we realized this year's Ruby and Rails EOLs had significant implications for our shop, I tilted a lot of effort towards getting blacklight 7.x "in order" for our migration.

I am almost seeing the daylight of "and now we can migrate our remaining apps onto Ruby 3/Rails 6+/Blacklight 7+", but this effort, combined with the drumbeat of project development work and team management... it has been a really intense spring.

Along the way I've become a maintainer of code4lib/ruby-oai, where there are now a tiny handful of inherited PRs (3, with unclear interest) and tickets (6, one of which is Ruby 3 compatibility and a major rev) - after hacking through 12 unattended PRs running back to 2015 or so. This also took me on a brief detour into configuring shared-cache SQLite and ActiveRecord, though the tests that initiated it turned out to be unnecessary. 

We've cut a release of Blacklight's OAI provider making use of that ruby-oai release (now down to a single ticket for cutting an additional release under different Ruby/linter requirements).

We're plowing through a bunch of Blacklight tickets - there's a pretty dramatic dent in the open PRs, and I think the tickets that apply to 7.x with consensus support are all merged.

That's leaving the release of a Blacklight 7-friendly refactor of the blacklight_range_limit plugin, which is nearly done.

I'm fairly certain that the "rampage of feckless merge commits" (thanks code4lib#blacklight) around all this has annoyed my co-committers to no end; I can verify that more than one direct message has essentially said "I cannot remember why I opened that PR". I think, though, that cleaning it all out makes the project(s) more approachable in addition to being required work for our team.

Trey Pendragon has made a stalwart effort towards getting the Blacklight committers on a regular meeting schedule again, and I'm slowly trying to re-articulate myself to some form of the Samvera conversation. Somehow in all this I've also been volunteering on a DLF planning committee and recruiting, but fingers crossed that link will be dead soon and I'll be able to rejuvenate our efforts towards the other vacancy on our team.

I am hopeful that all this also indicates a mutual, if unvoiced, recognition and rejection of the malaise of the last two (or five!) years. On the other hand: At the end of last summer, my boss of 15 years announced his retirement at year's end. I intended to apply for that job, but it's gone unposted. I have to admit the possibility, then, that a part of all this is a sublimation of the anxiety and frustration of that situation - and with a backdrop of pandemics, wars, and dismaying politics, it's very easy to slip into thinking that work is the air around you.

I've maintained some resolve towards weaning myself off Twitter as the way to stay on top of both professional developments and distant friendships - cleaned up the feed, and just not posted. Hilariously, thanks to the aforementioned recruiting, I've had a taste of the LinkedIn experience and... it is not for me. I've experimented with mastodon, but I want to make an effort towards the blog - and frankly, towards slower interactions. The summer will tell how successful that is.

Wednesday, April 27, 2022

ActiveRecord, SQLite, URI Database name tokens

Blogging an answer I posted at Stack Overflow:

I ran into an issue recently with a library I was testing - the in-memory SQLite database wasn't shared between threads in the testing process. This can be accommodated in SQLite with a shared cache, and this should be usable in ActiveRecord by configuring the database connection with a URI filename... but it wasn't working.

If your SQLite build did not set the SQLITE_USE_URI compile-time flag, then after SQLite v3.38.2 it defaults to false and URI filenames are not interpreted; as the folks at S/O observe, you will instead see a file created on disk with the literal name of the URI token.

You can work around this by passing the appropriate bit switches to SQLite via the flags parameter on your ActiveRecord connection.

In particular, you will want:

  • SQLite3::Constants::Open::READWRITE (0x02)
  • SQLite3::Constants::Open::CREATE (0x04)
  • SQLite3::Constants::Open::URI (0x40)
... which is to say:

ActiveRecord::Base.establish_connection(
  adapter: "sqlite3",
  database: "file::memory:?cache=shared",
  flags: SQLite3::Constants::Open::READWRITE |
         SQLite3::Constants::Open::CREATE |
         SQLite3::Constants::Open::URI # 0x02 | 0x04 | 0x40 == 70
)

 

Monday, April 5, 2021

Featured Searches in a Blacklight/Solr Web Application

We've been working on a project to allow a Solr-backed application (an institutional repository built on Blacklight) to display configured "search features" - something akin to the breakout information boxes in a search engine - when a search is strongly affiliated with an organizational partner/journal/etc.

Rather than trying to predict the actual searches that should be associated with a feature, we decided to leverage the facets in a given result set - a Search Feature is associated with a faceted field and one or more of its values. When a search is executed and there are facet values whose counts exceed a 'component threshold' (for example, 16% of the result set), we query a database for matching Features and compare each Feature's aggregated tally against a display threshold (pretty high - 80% or more in testing). This might be obvious to Blacklight veterans, but sorting the facets in the result set by count rather than by value is what makes this all possible.
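
In rough Ruby, the selection logic looks something like the sketch below - the names (Feature, facet_values, the specific thresholds) are illustrative rather than the application's actual code, and the facet items are assumed to respond to value and hits, as Blacklight's facet items do:

COMPONENT_THRESHOLD = 0.16
DISPLAY_THRESHOLD = 0.80

def featured_for(facet_field, facet_items, total_hits)
  # only facet values that individually clear the component threshold are considered
  candidates = facet_items.select { |item| item.hits >= COMPONENT_THRESHOLD * total_hits }
  return [] if candidates.empty?

  # a Feature is displayable when the summed hits for its configured values clear the display threshold
  Feature.where(category: facet_field).select do |feature|
    tally = candidates.select { |item| feature.facet_values.include?(item.value) }.sum(&:hits)
    tally >= DISPLAY_THRESHOLD * total_hits
  end
end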

Working against the analyzed result set rather than a predetermined set of queries permits the display of a Feature to be more emergent (for example, catching common acronyms for an academic unit or journal), but still accommodates a 'stable' link to a Feature that redirects to a search using the associated facet values as a filter query.

The data model for a Feature is pretty simple in its initial iteration - a slug identifier, a category (which maps to a faceted field), a description, links to a logo and external web site, and the associated facet values. We use the Feature data in two contexts - in the search results (with a compact display that can be expanded to show the description), and in "explore" pages presenting all the features for some categories. The application in question already has some authorization-restricted pages, so we were able to stand up a simple CRUD user interface for the features allowing us to delegate content management.
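
To make that concrete, the model boils down to something like the migration below - the column names here are assumptions for illustration, not necessarily the application's actual schema:

class CreateFeatures < ActiveRecord::Migration[6.1]
  def change
    create_table :features do |t|
      t.string :slug, null: false, index: { unique: true } # stable identifier used in links
      t.string :category, null: false                      # maps to a faceted Solr field
      t.text   :description
      t.string :logo_url
      t.string :site_url
      t.text   :facet_values                               # serialized list of associated facet values
      t.timestamps
    end
  end
end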

The stable links for features take advantage of duck-typing and Blacklight's deep-hash configuration to allow establishment of a filtered search context without precluding further filtering on the facet associated with a Feature's category: We define a query facet, but configure it not to display. Rather than an explicit query hash (which would be used to write out user-selectable values in a displayed facet), we have a "lazy" query proxy that implements the bracket method and builds named filters based on the configured facet values for a Feature, retrieved by the slug.
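
Sketched in Ruby, with names of my own invention, the proxy is just a hash-like object that only builds a query-facet entry when asked for a slug:

class FeatureQueryFacets
  # Blacklight query facets are normally a literal hash of key => { label:, fq: };
  # implementing the bracket method lets us build those entries on demand.
  def [](slug)
    feature = Feature.find_by(slug: slug)
    return nil unless feature

    # OR the Feature's configured values together as a filter on its category field
    quoted = feature.facet_values.map { |v| %("#{v}") }.join(' OR ')
    { label: feature.slug, fq: "#{feature.category}:(#{quoted})" }
  end
end

In the Blacklight configuration that gets wired up along the lines of config.add_facet_field 'features', query: FeatureQueryFacets.new, show: false - which is roughly where the duck-typing mentioned above comes in.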

We anticipate ongoing work in scoring the Features - the limited data to begin with means very simple rules like "the top feature from each category" are sufficient to get us started - and in content management - making the descriptions Markdown seems likely in the foreseeable future. I'm interested to see how this develops in use, particularly in the context of some important counterpart efforts: We're also developing a reusable search "widget" to surface content associations in the university's centralized web content management system, and we are leveraging OJS's SWORD plugins to deposit articles from hosted journals immediately on publication (the hosted journals are all Features). Together these efforts suggest an intriguing capacity for our institutional repository to function as a partner platform.

Our IR is developed in a public source repository, so if you're interested in tracking this effort as a Blacklight developer you can find us on GitHub: https://github.com/cul/ac-academiccommons
 

Saturday, May 23, 2020

Follow up on pulling Internet Archive ebooks data for reuse

Following up on a recent post - and making belated good on a promise to a colleague, sorry Alex!

  1. I touched on pagination in that post, but didn't mention sorting! IA's search API won't have a predictable response order without specifying some kind of sort (see the example after this list).
  2. I put together a Python script that I think embodies what I wanted to document in that post.
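
For the record, that can be as simple as filling in one of the otherwise-empty sort[] parameters in the advancedsearch URL from the earlier post - sorting by identifier, for instance:

https://archive.org/advancedsearch.php?q=collection%3Amuslim-world-manuscripts&sort%5B%5D=identifier+asc&rows=50&page=1&output=json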


I usually work in Ruby (or, you know - bash) these days, but I'm trying to knock the rust off with Python, for a few reasons:

  1. It's the preferred language of my code-friendly colleagues, and I prefer to be able to just share an annotated script for tasks of mutual interest.
  2. I've been playing around a little with rendering 3D models, and my harrowing experience with OpenCV and Ruby motivates me to just use the dang Python bindings for, say, Blender. Could you imagine, writing some FFI rig for Blender? No, let's just get on with it, thanks.
  3. I got rusty! You can't get rusty. I realize I've just summoned a Java project into my future.

Tuesday, May 19, 2020

Numpy Surprise: "Non-string object detected for the array ordering. Please pass in 'C', 'F', 'A', or 'K' instead"

You've got to hand it to the numpy contributors, it's a pretty forthright error message.
If you are working with some old Python and encounter:
ValueError: Non-string object detected for the array ordering. Please pass in 'C', 'F', 'A', or 'K' instead

... then this might be helpful: Numpy's ndarray order is an enum in the C API, and prior to v1.4.0 (!!!) Python clients passed the corresponding integer constants directly as arguments to methods like flatten (for example).

In v1.4.0 there's an argument parser introduced that maps the strings from the error messages to the enum values. It's pretty straightforward to do the mapping if you know what changed and look up the enum, but for convenience's sake:

pre-1.4.0 argument    Order Enum          post-v1.4.0 argument
-1                    NPY_ANYORDER        "A"
 0                    NPY_CORDER          "C"
 1                    NPY_FORTRANORDER    "F"
 2                    NPY_KEEPORDER       "K"

I'm sure no one else out there is looking at decade-old python, but just in case.

Monday, May 11, 2020

Rough and Ready Guide to Pulling Columbia University Libraries eBooks from IA for Reuse

As the basis for the examples here, I am referring to Columbia University Libraries' Muslim World Manuscripts upload, browsable at https://archive.org/details/muslim-world-manuscripts; but you might also identify a collection in the "Collection" facet at https://archive.org/details/ColumbiaUniversityLibraries.

There is a search UI for archive.org, and a pretty nice python client, too; but I will shortcut here to the resulting search URL for a collection identified above in the URL segment after /details/ (which produces a 'q' parameter value of "collection:muslim-world-manuscripts"):

https://archive.org/advancedsearch.php?q=collection%3Amuslim-world-manuscripts&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=

(this example removes the default JSONP callback parameter so the output is plain JSON)

This response is paginated (in this example 50 docs per page from the 'rows' parameter, pages numbered from 1 in the URL by the 'page' parameter).

Parse the returned JSON; it is referred to from here on as parsed_json.

The total number of rows is available at parsed_json['response']['numFound'] - use this to determine how many pages of content there are, or to try to fetch them in one page (if it's a modest number).
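
A minimal Ruby sketch of that paging loop (the endpoint and parameters are the ones above; the method name and everything else is illustrative):

require 'json'
require 'net/http'
require 'uri'

def fetch_all_docs(collection, rows: 50)
  docs = []
  page = 1
  loop do
    uri = URI('https://archive.org/advancedsearch.php')
    uri.query = URI.encode_www_form(
      'q' => "collection:#{collection}",
      'sort[]' => 'identifier asc', # a sort keeps the page order stable (see the follow-up post above)
      'rows' => rows, 'page' => page, 'output' => 'json'
    )
    parsed_json = JSON.parse(Net::HTTP.get(uri))
    batch = parsed_json['response']['docs']
    docs.concat(batch)
    break if batch.empty? || docs.length >= parsed_json['response']['numFound']
    page += 1
  end
  docs
end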

Iterate over docs at parsed_json['response']['docs'] -

Docs will generally have a link back to CLIO under the key 'stripped_tags'; if you can match the pattern /http:\/\/clio.columbia.edu\/catalog\/([0-9]+)/ then appending '.marcxml' to the matched URL will allow you to download more detailed metadata from CLIO.

If stripped_tags does not provide this information, many (but not all) CUL docs have an identifier format that indicates a catalog id, e.g. ldpd_14230809_000 - the middle part of the id, delimited by underscores ('_'), is a record identifier usable in the CLIO URL pattern above in place of the grouped match (that is, as the last segment before the added '.marcxml').

Absent that, item details are available at archive.org as detailed below. There's also some metadata included in the docs from IA, but outside of 'date' it is often aggregated in an array/list under the tag 'description'. Some of the list members may have a field-like prefix (e.g., "Shelfmark: "), but the data from CLIO (if available) will be more certain.
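
A hedged sketch of that CLIO lookup, with method and constant names of my own:

CLIO_PATTERN = %r{http://clio\.columbia\.edu/catalog/([0-9]+)}

def clio_marcxml_url(doc)
  # prefer an explicit CLIO link in stripped_tags
  tags = Array(doc['stripped_tags']).join(' ')
  if (match = tags.match(CLIO_PATTERN))
    return "http://clio.columbia.edu/catalog/#{match[1]}.marcxml"
  end

  # fall back to an identifier like "ldpd_14230809_000": the middle,
  # underscore-delimited segment is a CLIO record id for many (not all) docs
  parts = doc['identifier'].to_s.split('_')
  if parts.length == 3 && parts[1].match?(/\A[0-9]+\z/)
    return "http://clio.columbia.edu/catalog/#{parts[1]}.marcxml"
  end

  nil # fall back to IA's own metadata, as described below
end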

Each doc will have a value under the key 'identifier' which can be used for downloading content from IA:

metadata details: "https://archive.org/metadata/#{identifier}" (see also the metadata API docs)
thumbnail: "https://archive.org/services/img/#{identifier}"
poster: "https://archive.org/download/#{identifier}/page/cover_medium.jpg"
details: "https://archive.org/details/#{identifier}"
embeddable viewer (iframe): "https://archive.org/stream/#{identifier}?ui=full&showNavbar=false"
download pdf: "https://archive.org/download/#{identifier}/#{identifier}.pdf"

Wednesday, February 5, 2020

A Job on a Team I Know Very Well

This post will be elaborated ASAP, but I'm excited to start describing a job on the team I manage:

https://opportunities.columbia.edu/en-us/job/506308/developer-for-digital-research-and-scholarship

The principal portfolio is in tooling to support DH projects; the predecessor, Marii Nyröp, is the developer behind the Wax static site generator: https://minicomp.github.io/wax/

It's a junior position with the most forgiving experience prerequisites we could manage, but I like to think our team has a track record of mentorship and professional development.

The incumbent would join a team with a lot of experience and an appetite for learning and improving.  We're in the midst of our first big push towards React/GraphQL over Rails and Solr. We use (and have imminent plans to elaborate our implementation of) IIIF.  There's a Software Carpentries program with certification opportunities. More soon!