Open Source Search Engine Comparison
There is an interesting comparison of open source search engines available at http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf. While it reflects only OK on Lucene (hey, we can’t be perfect at everything), I am interested in finding out more details about what settings were used for indexing. If they just used the out-of-the-box settings, then I would argue that they need to spend a little time finding the right settings in order to optimize indexing speed. Furthermore, it is pretty hard to make apples-to-apples comparisons of search engines given the different analysis techniques each one uses. Each search engine will almost certainly have its own way of tokenizing, for example. So, while an out-of-the-box comparison is a starting point, it should not be the end point. Finally, Lucene 2.3 is shaping up to speed up indexing significantly. I, personally, have benchmarked it at ~5 times faster than 2.2 thanks to the new indexing code from Michael McCandless. It also contains better controls for memory usage during indexing. So, it would be really interesting to re-run the experiments in this paper for Lucene 2.3 once it is released.
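To make the tokenization point concrete, here is a toy sketch in Python. It is not taken from the paper or from any of the engines it compares: both tokenizers, the stopword list, and the sample sentence are hypothetical stand-ins, chosen only to show how two reasonable analysis choices can index the same text differently.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "is", "in"}


def whitespace_tokenize(text):
    """Split on whitespace only; case and punctuation are kept as-is."""
    return text.split()


def normalizing_tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


# Made-up sample document.
doc = "The state-of-the-art in Open Source search is Lucene."

print(whitespace_tokenize(doc))
print(normalizing_tokenize(doc))
```

The first tokenizer produces 8 terms, including "state-of-the-art" and "Lucene." with its trailing period; the second produces 6 lowercase terms with the hyphenated phrase split apart and the stopwords gone. Two engines making these different choices would build indexes of different sizes and match different queries, before any difference in indexing or search speed comes into play.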
I would also be interested in hearing an analysis of the communities surrounding each of the projects. How much support is there for fixing bugs, answering user questions, etc.? How active is the development? How many active contributors, committers, etc., are there? Of course, this is not the job of this paper; I just think it would be interesting.



I should also add, as a follow-up, that I think it is the duty of people publishing work like this to also publish the code used to do the evaluation. What good is research if you can’t replicate it?
I know several others who have done large-scale evaluations of Lucene and have not run into some of the issues mentioned in the article.
I was also surprised to see they neglected to mention any assumptions, such as the default language for all engines being English. That is a very poor assumption for a comparison of this sort, especially with globalization on all our minds. I grabbed the Zettair package to take a peek. At first glance it appears to be English only… and it's possible that Romance languages could be added with stoplists or stemmers. But that is a serious limitation.
I wouldn’t even touch a search engine toolkit these days if it was limited to one language or just Romance languages.
-marc