Monday, July 21, 2008

Unbound, Bound again

Moving from CPU-bound to memory-bound. The large number of docs in the secondary index is making the bit vectors ungainly, I think. Maybe I'll try testing performance using the scorer directly, like a doc id iterator, for the bigram term resolution.
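
A minimal sketch of the iterator idea, assuming each term's matching docs arrive as ascending doc ids. Plain int arrays stand in for Lucene scorers here, and the class and method names are made up for illustration. The point is that walking two sorted streams needs no per-index bit vector at all:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: intersect two ascending doc id streams without bit vectors. */
final class DocIdIntersection {

    /** Both inputs must be sorted ascending with no duplicates. */
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<Integer>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]);   // doc id present in both streams
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;              // advance whichever stream is behind
            } else {
                j++;
            }
        }
        return hits;
    }
}
```

Memory use here is proportional to the number of hits rather than to the number of docs in the index, which is the trade being weighed against the bit vectors.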

The bigram resolution is fast, and also (embarrassingly) more accurate: It looks like Shanghai was a more imperfect algorithm than I thought. Oh well.

Result documents are trickier: Maybe I'll change the return type for docs() to send back a scorer as well?

If that goes well, I may also try out the fast bitset implementation hiding in the trunk of lucene.util.

Thursday, July 17, 2008

Spans are slow

... but accurate. Since most queries include an exact phrase match as a subspan, I'm looking into an optimized SpanQuery for inorder slop=0 queries. This should just require a derivative of NearSpansOrdered that has matching logic tuned for the lack of slop and ordered subspans.

A highlighting note: Much better results when I changed the merging logic to test that the merged fragment would have an improved (higher) score. But a performance hit. Reusing the scorer helped a bit. I whipped up a vectorized CachingTokenFilter implementation to cut down on memory: fewer objects, lazier loading. Needs more testing. Anyway, getting it all to work meant adding some start- and end-Token fields to the TextFragment implementation, so that the cached token stream could be used to score the hypothetical new fragment. Re-tokenizing and creating a new scorer was way too slow.

One more thing: The .end() of a span is the position following the span, so in an ordered sequence span-n.end() should be equal to span-n+1.start(). I think. I'll look more into subspan overlap and see.
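
A rough sketch of the slop=0, in-order matching rule, assuming the end-is-exclusive convention described above actually holds. This is not NearSpansOrdered or any Lucene class; Span, terms() and match() are hypothetical names, and the spans for each clause are plain in-memory lists for a single document:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: ordered, zero-slop matching over per-clause span lists. */
final class ExactOrderedSpans {

    /** Covers positions [start, end); end is the position following the span. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Helper: single-token spans for a term seen at the given ascending positions. */
    static List<Span> terms(int... positions) {
        List<Span> spans = new ArrayList<Span>();
        for (int p : positions) spans.add(new Span(p, p + 1));
        return spans;
    }

    /**
     * Each inner list holds one clause's spans, sorted by start.
     * A match requires clause i+1 to start exactly where clause i ended.
     * Takes the first adjacent candidate per clause; no backtracking.
     */
    static List<Span> match(List<List<Span>> clauses) {
        List<Span> out = new ArrayList<Span>();
        if (clauses.isEmpty()) return out;
        for (Span first : clauses.get(0)) {
            int end = first.end;
            boolean ok = true;
            for (int i = 1; i < clauses.size() && ok; i++) {
                ok = false;
                for (Span s : clauses.get(i)) {
                    if (s.start == end) { end = s.end; ok = true; break; }
                    if (s.start > end) break; // sorted by start, so nothing later can fit
                }
            }
            if (ok) out.add(new Span(first.start, end));
        }
        return out;
    }
}
```

With no slop and a fixed order there is nothing to minimize or reorder, so the check collapses to one equality test per clause. Subspans of varying length that start at the same position would need backtracking, which this sketch leaves out.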

Tuesday, July 15, 2008

For Pete's Sake

Lucene 1.4.3

Guh. Have to make do with what's there, have less than 2 weeks.

Edit: Also Tomcat 1.4. ffs.

Friday, July 11, 2008

on to the next bad idea?

Mini-dsl is looking pretty good. I think it's time to revisit the "foreign keys" I'm using to associate records in the 2 Lucene indices.

Lucene is slow when your application tries to effect joins, building up a big BooleanQuery as a follow-up to a query on one index to get related documents from another. I got around the performance hit to a large extent by storing the related foreign doc ids as binary fields in each index. Then I just scooped the values up with a bit vector, and it was like I had executed the search.

Except that it's very easy to knock the indices out of synch. Also to be determined is the number of places the crosswalk data will be stored, and where the ORE feed will draw its data from.
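
A minimal sketch of the binary-field trick, with the Lucene parts left out. It assumes the foreign doc ids are packed as 4-byte big-endian ints, which is an illustration and not necessarily the layout actually used; ForeignDocIds, encode() and collect() are hypothetical names:

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

/** Hypothetical sketch: pack related doc ids into a byte[] and scoop them into a bit vector. */
final class ForeignDocIds {

    /** Packs doc ids as consecutive 4-byte big-endian ints (an assumed layout). */
    static byte[] encode(int[] docIds) {
        ByteBuffer buf = ByteBuffer.allocate(docIds.length * 4);
        for (int id : docIds) buf.putInt(id);
        return buf.array();
    }

    /** Reads one stored field value and sets a bit for every doc id it contains. */
    static void collect(byte[] fieldValue, BitSet related) {
        ByteBuffer buf = ByteBuffer.wrap(fieldValue);
        while (buf.remaining() >= 4) related.set(buf.getInt());
    }
}
```

Calling collect() once per hit in the first index leaves a bit vector that looks like the result of a search on the second. The stored values are raw doc ids, though, so anything that renumbers docs in the other index silently invalidates them.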

Wednesday, July 9, 2008

Why not CQL? Why not XTF?

As the good folks at XTF already know, the semantics of text searching are slightly different from those for searching structured metadata.

I had been trying to encode my queries as CQL, but there are some problems:
  • It's painfully verbose, resulting in ridiculous URLs
  • (( A prox B) not (( A prox B) prox C)) is not equivalent to (A near B) not near C

XTF would be nice, except:
  • The tokenization and display requirements of the DDb project appear to be beyond the XTF customization options
  • Substring searching
  • Lemmatized forms
  • Honestly, I'm still a little unsatisfied with the way queries are encoded in URLs

If I slog along with my current collection of tokenizers and indexers, I still need a way to make the URL query encoding both more transparent and more flexible. My thought of the week (which is showing some progress) is embedding a stripped-down JavaScript parser, limiting it to the JS native types (Strings, Numbers, maybe functions), and creating basically a little DSL to express the queries.

So far, things look pretty promising. I think I can capture the text searching requirements pretty well with 8 or 9 defined functions and a few "barewords" to indicate mode of sensitivity to case, etc.

An ugly CQL example:
(((cql.keywords=/locale=grc.beta/ignoreCapitals/ignoreAccents "^kai^"
prox/unit=word/distance<=1 cql.keywords=/locale=grc.beta/ignoreCapitals/ignoreAccents "^upoqhkhs^") prox/unit=word/distance<=2 (cql.keywords=/locale=grc.beta/ignoreCapitals/ignoreAccents "^dik")))

versus something like:
then( beta("^kai^ ^upoqhkhs^",IA), beta("^dik",IA), 2 )
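
To show the shape rather than the parser, here is a sketch in Java of the kind of tree that then(...) and beta(...) might build once the expression has been evaluated. Everything in it is an assumption made for illustration: the node classes, the Mode enum, splitting the beta string on whitespace, and reading the third argument of then() as a word distance. The "IA" bareword is copied from the example and its exact meaning is not defined here.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: DSL functions as plain static factories that build a query tree. */
final class QueryDsl {

    /** Bareword mode flag; only the name comes from the example above. */
    enum Mode { IA }

    interface Node {}

    /** One or more beta-code terms, split on whitespace (an assumption). */
    static final class Beta implements Node {
        final List<String> terms;
        final Mode mode;
        Beta(String text, Mode mode) {
            this.terms = Arrays.asList(text.trim().split("\\s+"));
            this.mode = mode;
        }
    }

    /** "first, then second", with distance assumed to be a maximum gap in words. */
    static final class Then implements Node {
        final Node first;
        final Node second;
        final int distance;
        Then(Node first, Node second, int distance) {
            this.first = first;
            this.second = second;
            this.distance = distance;
        }
    }

    static Node beta(String text, Mode mode) { return new Beta(text, mode); }

    static Node then(Node first, Node second, int distance) {
        return new Then(first, second, distance);
    }
}
```

With static imports, the Java call reads almost the same as the URL form: then(beta("^kai^ ^upoqhkhs^", Mode.IA), beta("^dik", Mode.IA), 2). Turning the resulting tree into span queries would be a separate step.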