Thursday, July 17, 2008

Spans are slow

... but accurate. Since most queries include an exact phrase match as a subspan, I'm looking into an optimized SpanQuery for inorder slop=0 queries. This should just require a derivative of NearSpansOrdered that has matching logic tuned for the lack of slop and ordered subspans.

A highlighting note: Much better results when I changed the merging logic to test that the merged fragment would have an improved (higher) score. But a performance hit. Reusing the scorer helped a bit. I whipped up a vectorized CachingTokenFilter implementation to cut down on memory- Fewer objects, lazier loading. Needs more testing. Anyway, getting it all to work meant adding some start- and end-Token fields to the TextFragment implementation, so that the cached token stream could be used to score the hypothetical new fragment. Re-tokenizing and creating a new scorer was way too slow.

One more thing: The .end() of a span is the position following the span, so in an ordered sequence span-n.end() should be equal to span-n+1.start(). I think. I'll look more into subspan overlap and see.

No comments: