Wednesday, July 9, 2008

Why not CQL? Why not XTF?

As the good folks at XTF already know, the semantics of text searching are slightly different from those for searching structured metadata.

I had been trying to encode my queries as CQL, but there are some problems:
  • It's painfully verbose, resulting in ridiculous URLs
  • ((A prox B) not ((A prox B) prox C)) is not equivalent to "(A near B) not near C"
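
One way to read that second point, assuming CQL's boolean "not" throws out whole records while a "not near" operator would only throw out individual matches. The sketch below is a toy illustration over token positions; the helper names (termSpans, nearSpans, notNearSpans) and the sample text are made up for the example and are not part of CQL or any library.

```javascript
// Toy document: one token per word, positions are array indexes.
const tokens = "A B C x x x A B".split(" ");

// Every occurrence of a term, as [start, end] spans.
function termSpans(toks, term) {
  const spans = [];
  toks.forEach((t, i) => { if (t === term) spans.push([i, i]); });
  return spans;
}

// Spans where a `right` span starts within `distance` words after a `left` span ends.
function nearSpans(left, right, distance) {
  const out = [];
  for (const [ls, le] of left) {
    for (const [rs, re] of right) {
      if (rs > le && rs - le <= distance) out.push([ls, re]);
    }
  }
  return out;
}

// Match-level "not near": keep only the left spans with no right span close behind.
function notNearSpans(left, right, distance) {
  return left.filter(
    ([, le]) => !right.some(([rs]) => rs > le && rs - le <= distance)
  );
}

const ab = nearSpans(termSpans(tokens, "A"), termSpans(tokens, "B"), 1);
const c = termSpans(tokens, "C");

// Record-level reading: the record is dropped as soon as ANY A-B pair sits near C.
const recordLevel = ab.length > 0 && nearSpans(ab, c, 1).length === 0;

// Match-level reading: the second A-B pair is not near C, so it survives.
const matchLevel = notNearSpans(ab, c, 1);

console.log(recordLevel, JSON.stringify(matchLevel));
```

Under these assumptions the record-level form rejects the document outright, while the match-level form still reports the second A B pair as a hit.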

XTF would be nice, except:
  • The tokenization and display requirements of the DDb project appear to be beyond the XTF customization options
  • Substring searching
  • Lemmatized forms
  • Honestly, I'm still a little unsatisfied with the way queries are encoded in URLs

If I slog along with my current collection of tokenizers and indexers, I still need a way to make the URL query encoding both more transparent and more flexible. My thought of the week (which is showing some progress) is to embed a stripped-down JavaScript parser, limit it to the native JS types (Strings, Numbers, maybe functions), and create what is basically a little DSL to express the queries.

So far, things look pretty promising. I think I can capture the text searching requirements pretty well with 8 or 9 defined functions and a few "barewords" to indicate mode of sensitivity to case, etc.

An ugly CQL example:
(((cql.keywords=/locale=grc.beta/ignoreCapitals/ignoreAccents "^kai^"
prox/unit=word/distance<=1 cql.keywords=/locale=grc.beta/ignoreCapitals/ignoreAccents "^upoqhkhs^") prox/unit=word/distance<=2 (cql.keywords=/locale=grc.beta/ignoreCapitals/ignoreAccents "^dik")))


versus something like:
then( beta("^kai^ ^upoqhkhs^",IA), beta("^dik",IA), 2 )
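
To make the comparison concrete, here is a rough sketch of how that expression could be backed by ordinary JavaScript functions that build a query tree. Everything in it is an assumption made for illustration: the tree shape, the toCql helper, and the guess that the bareword IA expands to the ignoreCapitals and ignoreAccents modifiers from the CQL example. It also calls the functions directly instead of going through a restricted parser.

```javascript
// Hypothetical bareword: which modifiers it stands for is an assumption.
const IA = ["ignoreCapitals", "ignoreAccents"];

// beta(text, mode): whitespace-separated Beta Code tokens, treated as adjacent words.
function beta(text, mode = []) {
  return { op: "phrase", locale: "grc.beta", terms: text.trim().split(/\s+/), mode };
}

// then(left, right, distance): `right` within `distance` words of `left`.
function then(left, right, distance = 1) {
  return { op: "then", left, right, distance };
}

// Render the tree as CQL in the style of the verbose example (illustrative only).
function toCql(node) {
  if (node.op === "phrase") {
    const index = ["cql.keywords=", "locale=" + node.locale, ...node.mode].join("/");
    const clauses = node.terms.map((t) => `${index} "${t}"`);
    const joined = clauses.join(" prox/unit=word/distance<=1 ");
    // Only parenthesize when there is more than one term to group.
    return clauses.length > 1 ? `(${joined})` : joined;
  }
  const left = toCql(node.left);
  const right = toCql(node.right);
  return `(${left} prox/unit=word/distance<=${node.distance} ${right})`;
}

// The short query from above is itself a valid JavaScript expression.
const query = then(beta("^kai^ ^upoqhkhs^", IA), beta("^dik", IA), 2);

console.log(JSON.stringify(query, null, 2));
console.log(toCql(query));
```

Because the short form is plain function calls over strings and numbers, the same tree could be handed to whatever tokenizers and indexers sit behind it, and the long CQL string becomes just one possible rendering.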
