Moving from CPU-bound to memory-bound. The large number of docs in the secondary index is making the bit vectors ungainly, I think. Maybe I'll try testing performance by using the scorer directly, as a doc-id iterator, for the bigram term resolution.
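To make that concrete, here's a minimal sketch of what "scorer as doc-id iterator" buys for bigram resolution: leapfrogging two sorted postings instead of materializing bit vectors. The `NO_MORE_DOCS` sentinel and the `nextDoc()`/`advance()` contract mirror Lucene's iterator style, but the class and method names here are my own, for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Toy doc-id iterator over one term's sorted postings list.
class DocIdIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;  // sorted doc ids for one term
    private int pos = -1;

    DocIdIterator(int[] docs) { this.docs = docs; }

    // Advance to the next doc, or NO_MORE_DOCS when exhausted.
    int nextDoc() {
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    // Advance to the first doc >= target; the sentinel guarantees termination.
    int advance(int target) {
        int d;
        do { d = nextDoc(); } while (d < target);
        return d;
    }
}

class Bigram {
    // Leapfrog the two postings: only docs containing both terms survive,
    // and neither side ever touches a doc the other has already skipped past.
    static List<Integer> intersect(DocIdIterator a, DocIdIterator b) {
        List<Integer> hits = new ArrayList<>();
        int da = a.nextDoc(), db = b.nextDoc();
        while (da != DocIdIterator.NO_MORE_DOCS && db != DocIdIterator.NO_MORE_DOCS) {
            if (da == db) {
                hits.add(da);
                da = a.nextDoc();
                db = b.nextDoc();
            } else if (da < db) {
                da = a.advance(db);
            } else {
                db = b.advance(da);
            }
        }
        return hits;
    }
}
```

The point of the iterator shape is that memory traffic scales with the shorter postings list rather than with the total doc count, which is exactly where the bit vectors were hurting.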
The bigram resolution is fast, and also (embarrassingly) more accurate: it turns out Shanghai was a less accurate algorithm than I'd thought. Oh well.
Result documents are trickier: maybe I'll change the return type of docs() so it sends back a scorer as well?
If that goes well, I may also try out the fast bitset implementation hiding in lucene.util on trunk.
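For reference, the appeal of that style of bitset is word-aligned `long[]` storage with unchecked "fast" accessors on the hot path. A toy version, not the actual lucene.util class, looks something like:

```java
// Toy word-aligned bitset in the spirit of Lucene's faster bitsets:
// backed by a long[], one bit per doc id. Not the real lucene.util class.
class FastBitSet {
    private final long[] words;

    FastBitSet(int numBits) {
        words = new long[(numBits + 63) >> 6];  // 64 bits per word
    }

    // "fast" variants skip bounds checks, trusting the caller's doc id.
    void fastSet(int doc) {
        words[doc >> 6] |= 1L << (doc & 63);
    }

    boolean fastGet(int doc) {
        return (words[doc >> 6] & (1L << (doc & 63))) != 0;
    }

    // A popcount over the backing words gives the cardinality cheaply.
    long cardinality() {
        long sum = 0;
        for (long w : words) sum += Long.bitCount(w);
        return sum;
    }
}
```

If the bit vectors stay around at all, this layout at least keeps the per-doc cost to a shift, a mask, and one array write.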