Text Processing: Why Servers Choke : Beyond Search
If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to suggest that Lucene is a very slow horse in a slow race.
Are we reading the same paper? This hardly says Lucene is a slow horse in the race. What it says is that Lucene’s StandardTokenizer is slow in comparison to the paper’s approach for this one particular piece. It is quite a leap to say that Lucene overall is slow, which just doesn’t hold water with most people’s experience. The paper also doesn’t compare Lucene to other search engines. Most notably, the authors fully admit that their comparison “is not strictly an apples to apples comparison” because Lucene’s StandardTokenizer does other work to produce tokens that are actually useful to the user further down the stream, like identifying email and web addresses. Don’t get me wrong, I’m not saying SpeedyFX isn’t interesting and worthwhile. I’m just saying people shouldn’t infer something from a paper that isn’t there.
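To make the apples-to-apples point concrete, here is a rough sketch in Python (not Lucene code, and not the paper's method). The two functions, `simple_tokens` and `typed_tokens`, and their regular expressions are made up for illustration; the patterns are deliberately crude and are not StandardTokenizer's grammar. The first only splits on non-alphanumeric characters, while the second also tries to recognize email and web addresses, which is extra work per token.

```python
import re

# Plain splitter: every run of letters or digits is a token.
SIMPLE = re.compile(r"[A-Za-z0-9]+")

# Richer tokenizer: try email and URL patterns before falling back to words.
# Hypothetical, simplified patterns; NOT Lucene's StandardTokenizer grammar.
TYPED = re.compile(
    r"(?P<EMAIL>[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
    r"|(?P<URL>https?://[^\s<>\"]+)"
    r"|(?P<WORD>[A-Za-z0-9]+)"
)


def simple_tokens(text):
    """Return lowercased alphanumeric runs, with no notion of token type."""
    return [m.group(0).lower() for m in SIMPLE.finditer(text)]


def typed_tokens(text):
    """Return (type, lowercased token) pairs, keeping emails and URLs whole."""
    return [(m.lastgroup, m.group(0).lower()) for m in TYPED.finditer(text)]


sample = "Mail grant@example.org or see http://example.org/docs/ today"

print(simple_tokens(sample))
print(typed_tokens(sample))
```

The plain splitter breaks the address and the URL into fragments like `grant`, `example`, and `org`, while the typed version keeps them whole and labels them. Timing the two against each other would tell you little about which one serves a search application better, since they are not producing the same output.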
Note also that this paper was written against Lucene 2.2, and Lucene 2.3 has much improved indexing speed. Many of the speedups focus on tokenization during the indexing process (e.g. less object creation, more object reuse). We also upgraded our grammar to use JFlex, which we found to be much faster than JavaCC. I can’t say what the numbers are in relation to this paper, but it would be interesting to see. Perhaps the SpeedyFX people can share their code so we can all see. I know, I know, researchers don’t like to do that, but to me it’s always a big gaping hole in these kinds of papers.
FWIW, StandardTokenizer is just one of many approaches to tokenization that Lucene provides. Furthermore, it is often not the long pole in the tent when it comes to indexing speed.
Still, the ideas are worth looking into. Lucene’s always open to improvements, and all can benefit from them, as is the beauty of Open Source.






[...] Article about Text Processing performance http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/ (Beyond Search) Response from Grant Ingersoll: http://lucene.grantingersoll.com/2008/09/07/text-processing-why-servers-choke-beyond-search/ [...]