Tuesday, December 20, 2016

Just a bunch of ESTC Library Names

Following Meaghan Brown's attempt to match STC and ESTC library names, I threw together a quick Ruby script that parses the library names from the ESTC library name browse list, then follows the "Next Page" link for as long as one is present and grabs the next set.



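#!/usr/bin/env ruby
require 'open-uri'
require 'nokogiri'

# Starting point: a page of the ESTC library name browse list.
url = 'http://estc.bl.uk/F/1K33ABJEHVSDU7X7HFCTYFGQB3S7UMUTKG5U84RR5SY7GJYM89-10241?func=scan&scan_start=000209485&scan_code=INT&find_scan_code=INT&scan_op=PREV'

# Append every library name to titles.txt, one per line.
File.open('titles.txt', 'a') do |titles|
  while url
    # URI.open replaces the bare open(url), which Ruby 3.0+ no longer accepts for URLs.
    URI.open(url) do |blob|
      page = Nokogiri::HTML(blob)

      # The "Next Page" button is an image wrapped in a link; no image means this is the last page.
      next_page = page.css("img[alt='Next Page']").first
      url = next_page ? next_page.parent['href'] : nil

      # Each library name is the text of a link directly inside a td.td1 cell.
      page.css('td.td1>a').each do |link|
        titles.write "#{link.text}\n"
      end
    end
  end
end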
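The loop above takes two things on faith: that the href on the "Next Page" link is an absolute URL, and that the links never circle back to a page already visited. Here is a minimal sketch of a guard for both, using only the standard library. The helper names next_page_url and each_page_url are made up for illustration and are not part of the script above, and whether the ESTC pages actually need either guard is an assumption, not something checked.

```ruby
require 'uri'
require 'set'

# Hypothetical helper: turn the href taken from the "Next Page" link into an
# absolute URL, resolving it against the page it came from. Returns nil when
# there is no link to follow.
def next_page_url(current_url, href)
  return nil if href.nil? || href.strip.empty?
  URI.join(current_url, href.strip).to_s
end

# Hypothetical helper: walk the pages until there is no next link or a URL
# comes up a second time. The block is handed each URL and must return the raw
# href of that page's "Next Page" link (or nil). Returns the URLs visited, in order.
def each_page_url(start_url)
  seen = Set.new
  url = start_url
  # Set#add? returns nil when the URL is already in the set, which ends the loop.
  while url && seen.add?(url)
    href = yield(url)
    url = next_page_url(url, href)
  end
  seen.to_a
end
```

In the script, the block passed to each_page_url would do the fetching and parsing for one page, write out the names, and return next_page && next_page.parent['href'].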
Pretty dodgy, and as before, the ESTC doesn't do you any favors with the markup, but here's what I got.
