Saturday, November 22, 2008

Mapping note

Mapping structured citations: If there is a successful sided match, add the unsided cite to the matched list to preclude false recto/verso matches.
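
A rough sketch of that rule, assuming citations are plain objects with a hypothetical side field ('r', 'v', or absent) and that the lookup table is a simple keyed object. The names mapCitations and sidedKey and the data shapes are illustrative only, not taken from the mapping code.

```javascript
// Hypothetical shape: a citation is { id: 'P.Example 1', side: 'r' | 'v' | undefined }.
function sidedKey(cite) {
  return cite.side ? cite.id + ' ' + cite.side : cite.id;
}

function has(obj, key) {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// targets maps a sided or unsided citation key to a record identifier (assumed).
function mapCitations(cites, targets) {
  var matched = {}; // citation keys that are already accounted for
  var result = [];
  var i, c, key;
  // Pass 1: sided citations. A hit also marks the unsided form as matched.
  for (i = 0; i < cites.length; i++) {
    c = cites[i];
    if (!c.side) continue;
    key = sidedKey(c);
    if (has(targets, key)) {
      result.push({ cite: key, target: targets[key] });
      matched[key] = true;
      matched[c.id] = true; // precludes a later unsided match for the same cite
    }
  }
  // Pass 2: unsided citations, skipping anything already covered in pass 1.
  for (i = 0; i < cites.length; i++) {
    c = cites[i];
    if (c.side || matched[c.id]) continue;
    if (has(targets, c.id)) {
      result.push({ cite: c.id, target: targets[c.id] });
      matched[c.id] = true;
    }
  }
  return result;
}
```

Without the extra entry in the matched list, the unsided form of an already matched recto could still pair with whatever record carries the bare citation.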

Friday, November 21, 2008

Concordia, naming, sparse relationships
Numbers server already resolves (ultimately) to RDF
low-hanging fruit: adding some known metadata properties to the leaves
high-hanging fruit: disentangling the many sparse relationships
What if, rather than having any organizational center for the index, we reorganized things around a more abstract graph relating Objects (inventory numbers), Texts (citations), and CatalogEntries (metadata records; these might be Editions)?

Say we have two more relationships: METADATA ore:describes TEXT/OBJECT, and METADATA ore:similarTo METADATA

Number Server lets you drill down through identifier hierarchies as aggregates, OR lets you see a graph centered on a particular URI.
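
A small sketch of the "graph centered on a particular URI" view, assuming the relationships are held as plain subject/predicate/object triples. The URIs (meta:1, text:1, object:1) and the centeredOn helper are made up for illustration; only the two ore: predicates come from the note above.

```javascript
// Hypothetical triples; the identifiers are placeholders, not real Number Server URIs.
var triples = [
  { s: 'meta:1', p: 'ore:describes', o: 'text:1' },
  { s: 'meta:1', p: 'ore:describes', o: 'object:1' },
  { s: 'meta:2', p: 'ore:describes', o: 'text:1' },
  { s: 'meta:1', p: 'ore:similarTo', o: 'meta:2' }
];

// Return every triple in which the given URI appears as subject or object.
function centeredOn(uri, graph) {
  var out = [];
  for (var i = 0; i < graph.length; i++) {
    if (graph[i].s === uri || graph[i].o === uri) {
      out.push(graph[i]);
    }
  }
  return out;
}
```

With no privileged center, the same lookup works whether the starting point is an Object, a Text, or a CatalogEntry.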

More at some indeterminate point in the future.

Wednesday, November 12, 2008

Encoding JavaScript UTF-16 characters for URLs

Dug this up from the bowels of the ddbdp webapp code. I wrote it a long time ago, but didn't end up needing it. Seems like it might be useful down the line...

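// Lead-byte marks indexed by UTF-8 sequence length (index 0 is unused; length 1 needs no mark).
var firstByteMark = [ 0x00,0x00,0xC0,0xE0,0xF0,0xF8,0xFC ];
var byteMask = 0xBF; // keeps the low 8 bits and clears bit 6
var byteMark = 0x80; // sets bit 7, marking a continuation byte

// Convert one UTF-16 code unit to an array of UTF-8 byte values.
// Surrogate pairs are not combined, so characters outside the BMP are not handled.
function UTF16toUTF8Bytes(u16){
  var bytes = new Array();
  if (u16 < 128){
    bytes.length = 1;
  } else if (u16 < 2048){
    bytes.length = 2;
  } else { // presuming max js charCode of 65535
    bytes.length = 3;
  }
  // Fill from the last byte backwards; each case intentionally falls through.
  switch (bytes.length){
    case 3:
      bytes[2] = ((u16 | byteMark) & byteMask);
      u16 >>= 6;
    case 2:
      bytes[1] = ((u16 | byteMark) & byteMask);
      u16 >>= 6;
    case 1:
      bytes[0] = (u16 | firstByteMark[bytes.length]);
  }
  return bytes;
}

// Split the input on whitespace and percent-encode the non-ASCII characters of each term.
function encode(input){
  var output = new Array();
  var inputArray = input.split(/\s+/);
  for(var i=0;i<inputArray.length;i++){
    var term = '';
    for(var j=0;j<inputArray[i].length;j++){
      var u16 = inputArray[i].charCodeAt(j);
      if (u16 < 128){
        // ASCII passes through unchanged
        term += inputArray[i].charAt(j);
      } else {
        var utf8bytes = UTF16toUTF8Bytes(u16);
        for(var k=0;k<utf8bytes.length;k++){
          if(utf8bytes[k] < 16){
            term += "%0";
          } else {
            term += "%";
          }
          term += utf8bytes[k].toString(16);
        }
      }
    }
    output[i] = term;
  }
  return output;
}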
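
For comparison, a sketch of the same per-term encoding using the built-in encodeURIComponent, which also percent-encodes as UTF-8. The function name encodeTerms is made up here. Two differences from the hand-rolled version are worth noting: the built-in emits uppercase hex digits, and it also escapes some ASCII characters (spaces, "&", "=", and so on) that the code above passes through unchanged. It does combine surrogate pairs, so characters outside the BMP come out as proper four-byte sequences.

```javascript
// Percent-encode each whitespace-separated term as UTF-8 using the built-in.
function encodeTerms(input) {
  var terms = input.split(/\s+/);
  var output = [];
  for (var i = 0; i < terms.length; i++) {
    output[i] = encodeURIComponent(terms[i]);
  }
  return output;
}

// Greek kappa, alpha, iota-with-tonos, written as escapes to avoid encoding surprises.
var example = encodeTerms('\u03ba\u03b1\u03af abc');
```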

Monday, November 10, 2008

Lexington Prep

  • wiki outline

  • fix the tests to operate against more properly contained data

  • continued refactoring/cleanup

  • get the project components into the new repository - in progress:

Friday, October 24, 2008

Lucene Customization Performance

Hardware: Athlon 64 X2 Dual 3800; 2GB RAM
Data: Lexicon of 273365 terms

Coarse Response Time (50 iterations on a single substring query)
Query Type            Rough System Timing (System.current…)
indexOf query         8078
Wildcard query        11078
Bigrams Span query    2657

HPROF Results (Single terms and phrases)
Metric         Wildcard (custom)    Bigram Spans (custom)    %difference
Cpu (total)    34233203             12130249                 -64.5%

               Wildcard (custom)    Bigrams (custom)         %difference
Cpu (total)    34233203             6229156                  -81.8%

Wednesday, September 3, 2008


* new interface templates
* update standalone servlet with metadata filters
* document logging
* check OAI refreshes
* ZA XSLT for PN
* APIS images (eRez) for qualifying hgv views
* fix highlighting of HGV pub in metadata search results
* fix image/trans float-to-top in ddb
** correct hasImage index
** propagate sort flags on paging links
** filter "keine" from bibl.illustration

Monday, July 21, 2008

Unbound, Bound again

Moving from CPU-bound to memory-bound. I think the large number of docs in the secondary index is making the bit vectors unwieldy. Maybe I'll test performance using the scorer directly, as a doc id iterator, for the bigram term resolution.

The bigram resolution is fast, and also (embarrassingly) more accurate: It looks like Shanghai was a more imperfect algorithm than I thought. Oh well.
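
For anyone unfamiliar with the idea, here is a toy sketch of resolving a substring against a lexicon through bigrams: every term is indexed by its two-character windows and their offsets, and a substring matches a term when its own bigrams occur at consecutive offsets. This is an in-memory illustration under assumed names (buildBigramIndex, resolveSubstring), not the Lucene span-based implementation discussed here.

```javascript
// Index each term of the lexicon by its bigrams: bigram -> { termIndex: [offsets] }.
function buildBigramIndex(lexicon) {
  var index = Object.create(null);
  for (var t = 0; t < lexicon.length; t++) {
    var term = lexicon[t];
    for (var p = 0; p + 2 <= term.length; p++) {
      var gram = term.substring(p, p + 2);
      if (!index[gram]) index[gram] = {};
      if (!index[gram][t]) index[gram][t] = [];
      index[gram][t].push(p);
    }
  }
  return index;
}

// Return the lexicon terms containing `sub`, by requiring its bigrams at consecutive offsets.
function resolveSubstring(sub, lexicon, index) {
  if (sub.length < 2) {
    // Too short for a bigram; fall back to a plain scan.
    return lexicon.filter(function (term) { return term.indexOf(sub) >= 0; });
  }
  var first = index[sub.substring(0, 2)];
  if (!first) return [];
  var hits = [];
  for (var t in first) {
    var starts = first[t];
    for (var s = 0; s < starts.length; s++) {
      var ok = true;
      for (var k = 1; k + 2 <= sub.length; k++) {
        var posting = index[sub.substring(k, k + 2)];
        if (!posting || !posting[t] || posting[t].indexOf(starts[s] + k) < 0) {
          ok = false;
          break;
        }
      }
      if (ok) {
        hits.push(lexicon[t]);
        break; // one occurrence per term is enough
      }
    }
  }
  return hits;
}

var lexicon = ['kai', 'dikaios', 'dike', 'upoqhkhs'];
var bigramIndex = buildBigramIndex(lexicon);
```

The consecutive-offset check is the part that corresponds to an in-order, zero-slop span match.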

Result documents are trickier: Maybe I'll change the return type for docs() to send back a scorer as well?

If that goes well, I may also try out the fast bitset implementation hiding in the trunk of lucene.util.

Thursday, July 17, 2008

Spans are slow

... but accurate. Since most queries include an exact phrase match as a subspan, I'm looking into an optimized SpanQuery for inorder slop=0 queries. This should just require a derivative of NearSpansOrdered that has matching logic tuned for the lack of slop and ordered subspans.

A highlighting note: Much better results when I changed the merging logic to test that the merged fragment would have an improved (higher) score. But a performance hit. Reusing the scorer helped a bit. I whipped up a vectorized CachingTokenFilter implementation to cut down on memory: fewer objects, lazier loading. Needs more testing. Anyway, getting it all to work meant adding some start- and end-Token fields to the TextFragment implementation, so that the cached token stream could be used to score the hypothetical new fragment. Re-tokenizing and creating a new scorer was way too slow.

One more thing: The .end() of a span is the position following the span, so in an ordered sequence span-n.end() should be equal to span-n+1.start(). I think. I'll look more into subspan overlap and see.
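
A tiny illustration of that adjacency test, assuming each subspan is represented as a plain object covering the half-open range [start, end). The helpers span and isExactSequence are made-up names; this only models the position arithmetic, not Lucene's Spans classes.

```javascript
// A span covers token positions start .. end - 1; end is the position after the span.
function span(start, end) {
  return { start: start, end: end };
}

// True when the spans form an in-order sequence with no gaps (slop = 0):
// each span must begin exactly where the previous one ended.
function isExactSequence(spans) {
  for (var i = 1; i < spans.length; i++) {
    if (spans[i - 1].end !== spans[i].start) {
      return false;
    }
  }
  return true;
}

var contiguous = isExactSequence([span(3, 4), span(4, 6), span(6, 7)]);
var gapped = isExactSequence([span(3, 4), span(5, 6)]);
```

Overlapping subspans would fail this check as well, which is the case still to be looked into.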

Tuesday, July 15, 2008

For Pete's Sake

Lucene 1.4.3

Guh. Have to make do with what's there, have less than 2 weeks.

Edit: Also Tomcat 1.4. ffs.

Friday, July 11, 2008

on to the next bad idea?

Mini-dsl is looking pretty good. I think it's time to revisit the "foreign keys" I'm using to associate records in the 2 Lucene indices.

Lucene is slow when your application tries to effect joins, building up a big BooleanQuery follow-up to a query on one index to get related documents from another. I got around the performance hit to a large extent by storing the related foreign doc ids as binary fields in each index. Then I just scooped the values up with a bit vector, and it was like I had executed the search.
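
The shape of that trick, as a sketch: plain arrays stand in for the stored binary field, and a small hand-written bit vector stands in for the real one. BitVector, collectRelated, and the sample ids are all assumptions made for illustration.

```javascript
// Minimal fixed-size bit vector over doc ids 0 .. size - 1.
function BitVector(size) {
  this.words = new Uint32Array(Math.ceil(size / 32));
}
BitVector.prototype.set = function (i) {
  this.words[i >>> 5] |= (1 << (i & 31));
};
BitVector.prototype.get = function (i) {
  return (this.words[i >>> 5] & (1 << (i & 31))) !== 0;
};

// hits: doc ids matched in the first index.
// foreignIds: per-doc list of related doc ids in the second index
// (a stand-in for the stored binary field).
function collectRelated(hits, foreignIds, secondIndexSize) {
  var bits = new BitVector(secondIndexSize);
  for (var i = 0; i < hits.length; i++) {
    var related = foreignIds[hits[i]] || [];
    for (var j = 0; j < related.length; j++) {
      bits.set(related[j]);
    }
  }
  return bits;
}

var foreignIds = { 0: [2, 5], 1: [5, 40], 2: [7] };
var related = collectRelated([0, 1], foreignIds, 64);
```

The resulting bits play the role of a result set from the second index without a second query, which is also why stale ids silently produce wrong results when the indices drift apart.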

Except that it's very easy to knock the indices out of sync. Also to be determined is the number of places the crosswalk data will be stored, and where the ORE feed will draw its data from.

Wednesday, July 9, 2008

Why not CQL? Why not XTF?

As the good folks at XTF already know, the semantics of text searching are slightly different from those for searching structured metadata.

I had been trying to encode my queries as CQL, but there are some problems:
  • It's painfully verbose, resulting in ridiculous URLs
  • ((A prox B) not ((A prox B) prox C)) is not equivalent to (A near B) not near C
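
One way to read the second point, offered as an assumption about what is meant: the CQL form is a boolean over whole documents, while the text-search form filters individual matches. The sketch below models both readings over a token array; all function names are made up, and "near" is simplified to an unordered distance test.

```javascript
// Positions at which a term occurs in the token array.
function positions(tokens, term) {
  var out = [];
  for (var i = 0; i < tokens.length; i++) {
    if (tokens[i] === term) out.push(i);
  }
  return out;
}

// All pairings of a and b within dist words of each other, as { min, max } position ranges.
function nearMatches(tokens, a, b, dist) {
  var pa = positions(tokens, a), pb = positions(tokens, b), out = [];
  for (var i = 0; i < pa.length; i++) {
    for (var j = 0; j < pb.length; j++) {
      if (Math.abs(pa[i] - pb[j]) <= dist) {
        out.push({ min: Math.min(pa[i], pb[j]), max: Math.max(pa[i], pb[j]) });
      }
    }
  }
  return out;
}

// True when any of the given positions lies within dist words of the match.
function matchNearAny(match, pos, dist) {
  return pos.some(function (p) {
    return p >= match.min - dist && p <= match.max + dist;
  });
}

// Document-level reading: the document has A near B, and has no A-near-B that is near C.
function documentLevel(tokens, dist) {
  var ab = nearMatches(tokens, 'A', 'B', dist), pc = positions(tokens, 'C');
  var anyNearC = ab.some(function (m) { return matchNearAny(m, pc, dist); });
  return ab.length > 0 && !anyNearC;
}

// Match-level reading: at least one A-near-B match is itself not near C.
function matchLevel(tokens, dist) {
  var ab = nearMatches(tokens, 'A', 'B', dist), pc = positions(tokens, 'C');
  return ab.some(function (m) { return !matchNearAny(m, pc, dist); });
}

// Two A B pairs; only the first is followed by C.
var tokens = ['A', 'B', 'C', 'x', 'x', 'x', 'A', 'B'];
```

On these tokens the document-level reading rejects the document because one A B pair sits next to C, while the match-level reading accepts it on the strength of the second pair.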

XTF would be nice, except:
  • The tokenization and display requirements of the DDb project appear to be beyond the XTF customization options
  • Substring searching
  • Lemmatized forms
  • Honestly, I'm still a little unsatisfied with the way queries are encoded in URLs

If I slog along with my current collection of tokenizers and indexers, I still need a way to make the URL query encoding both more transparent and more flexible. My thought of the week (which is showing some progress) is embedding a stripped-down JavaScript parser, limiting it to the JS native types (Strings, Numbers, maybe functions), and creating basically a little DSL to express the queries.

So far, things look pretty promising. I think I can capture the text searching requirements pretty well with 8 or 9 defined functions and a few "barewords" to indicate mode of sensitivity to case, etc.

An ugly CQL example:
(((cql.keywords=/locale=grc.beta/ignoreCapitals/ignoreAccents "^kai^"
prox/unit=word/distance<=1 cql.keywords=/locale=grc.beta/ignoreCapitals/ignoreAccents "^upoqhkhs^") prox/unit=word/distance<=2 (cql.keywords=/locale=grc.beta/ignoreCapitals/ignoreAccents "^dik")))

versus something like:
then( beta("^kai^ ^upoqhkhs^",IA), beta("^dik",IA), 2 )
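
To show how an expression like that could evaluate to something a query builder can walk, here is a guess at the plumbing: each DSL function just returns a plain object. Only the names then, beta, and IA come from the example; the object shapes, the treatment of IA as an opaque flag, and the whitespace splitting are assumptions.

```javascript
// "Bareword" flag from the example; treated here as an opaque mode marker (meaning assumed).
var IA = 'IA';

// beta(): terms written in Beta Code; whitespace-separated terms are kept in order as a phrase.
function beta(text, mode) {
  return {
    type: 'phrase',
    locale: 'grc.beta',
    mode: mode || null,
    terms: text.split(/\s+/)
  };
}

// then(): the first clause followed by the second within `distance` words.
function then(first, second, distance) {
  return { type: 'then', distance: distance, clauses: [first, second] };
}

var query = then( beta("^kai^ ^upoqhkhs^", IA), beta("^dik", IA), 2 );
```

Because the result is ordinary nested data, the server side can translate it to whatever query classes it likes without parsing anything beyond the JavaScript expression itself.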