Why Lucene Isn’t That Good | Javalobby


Patches welcome…  I know that is an old saw, but that is the only way it’s going to get better.

There are some good points in here, and some stuff that is a bit dramatic.

We do try to keep adapting Lucene and making it better, but in some respects we are damned if we do, damned if we don’t.  The whole abstract class vs. interface debate has been going on in Lucene for a long time.  If we switched to interfaces, then people would complain constantly that we break their code every time a release introduces new methods.  If we leave things as abstract classes, then people like Cedric complain that Lucene is hard to extend.
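To make the trade-off concrete, here is a minimal sketch in Java. Every name in it (TokenTransform, LowerCaser, TokenTransformApi, UpperCaser) is hypothetical and not part of Lucene's API. It assumes a Java version without default methods on interfaces (they arrived in Java 8), which is the situation the debate is about: an abstract class can gain a new method with a default body without breaking subclasses, while adding a method to an interface forces every implementor to change.

```java
import java.util.Locale;

// Hypothetical extension point, modeled as an abstract class.
abstract class TokenTransform {
    abstract String transform(String token);

    // Imagine this method was added in a later release. Because it ships
    // with a body, subclasses written against the earlier release still
    // compile and run unchanged.
    String describe() {
        return getClass().getSimpleName();
    }
}

// User code written before describe() existed.
class LowerCaser extends TokenTransform {
    @Override
    String transform(String token) {
        return token.toLowerCase(Locale.ROOT);
    }
}

// The same extension point, modeled as an interface.
interface TokenTransformApi {
    String transform(String token);
    // Adding "String describe();" here in a later release would stop every
    // existing implementor (such as UpperCaser below) from compiling until
    // it added the method itself.
}

class UpperCaser implements TokenTransformApi {
    @Override
    public String transform(String token) {
        return token.toUpperCase(Locale.ROOT);
    }
}

class ExtensionPointDemo {
    public static void main(String[] args) {
        TokenTransform lower = new LowerCaser();
        System.out.println(lower.transform("LUCENE"));
        System.out.println(lower.describe()); // inherited without any change to LowerCaser
        System.out.println(new UpperCaser().transform("lucene"));
    }
}
```

The price of the abstract-class route is the one raised in the complaint: a subclass can extend only one class, so a type that already has a superclass cannot plug in.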

As for the “final” declarations, the reason it works that way is that we can’t see the future.  Oftentimes, the things that are final now are a legacy of the early days, when we couldn’t imagine some of the ways Lucene is being used now, or, as another commenter on the thread said, they are final to avoid unintended consequences.  That’s why people submit patches and things get improved.  Sorry we can’t all work on Lucene nonstop all day.  If Lingway wants to hire me to do that, you know how to reach me! (At least as a contractor: I’m not interested in leaving my current employment, but I do offer consulting.)

As for SpanQueries, yes, they can be slower.  I’d love to have a discussion with someone like you who is a heavy user to see how to improve them.  Please send your profiling info ASAP to the java-dev mailing list!  Please don’t let all those hours you spent go to waste.  Even a half-baked patch is a starting point.

I do agree about scoring being pluggable, but I gotta tell ya’, scoring is hard and not for the faint of heart.  It’s a whole other layer: doing it right means being fast and accurate, and doing it wrong means deep, dark, scary rabbit holes where you don’t see light for days.  One of the simplest ways, however, to improve scoring is to change the length normalization.
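As a rough illustration of what changing the length normalization can mean, here is a self-contained Java sketch. It does not use Lucene's Similarity class; LengthNorm, ClassicLengthNorm, and FlooredLengthNorm are hypothetical names. The classic variant uses the familiar 1/sqrt(numTerms) shape, and the floored variant is just one assumed alternative, in which fields shorter than a chosen minimum get no extra boost for being short.

```java
// Hypothetical stand-in for a scoring hook; not Lucene's actual API.
abstract class LengthNorm {
    // Returns a multiplier applied to matches in a field with numTerms terms.
    abstract float lengthNorm(int numTerms);
}

// Classic shape: shorter fields score higher, proportional to 1/sqrt(length).
class ClassicLengthNorm extends LengthNorm {
    @Override
    float lengthNorm(int numTerms) {
        int safeTerms = Math.max(1, numTerms); // guard against division by zero
        return (float) (1.0 / Math.sqrt(safeTerms));
    }
}

// Assumed alternative: treat every field shorter than minTerms as if it had
// minTerms terms, so very short fields are not rewarded for their brevity.
class FlooredLengthNorm extends LengthNorm {
    private final int minTerms;

    FlooredLengthNorm(int minTerms) {
        if (minTerms < 1) {
            throw new IllegalArgumentException("minTerms must be >= 1");
        }
        this.minTerms = minTerms;
    }

    @Override
    float lengthNorm(int numTerms) {
        int effectiveTerms = Math.max(minTerms, numTerms);
        return (float) (1.0 / Math.sqrt(effectiveTerms));
    }
}

class LengthNormDemo {
    public static void main(String[] args) {
        LengthNorm classic = new ClassicLengthNorm();
        LengthNorm floored = new FlooredLengthNorm(25);
        for (int numTerms : new int[] {4, 25, 100}) {
            System.out.println(numTerms + " terms: classic="
                    + classic.lengthNorm(numTerms)
                    + " floored=" + floored.lengthNorm(numTerms));
        }
    }
}
```

With a floor of 25, a 4-term field and a 25-term field get the same norm, while longer fields are still penalized as before. Whether that helps depends entirely on the collection, which is the kind of thing worth measuring before committing to it.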

As for some of the other “higher” features, like crawling/clustering, those are nice, but they don’t belong in the core of Lucene because not everyone needs them, although the number of people who do is increasing.  How many people have collections that go beyond 10-20M documents?  What would be nice, however, is a contrib module or a layer above Lucene that provides all those nice things you want (you know, you can embed Solr in no time, by the way).  Lucene is meant to be really fast on one machine and to also play nice when you put in the appropriate distributed pieces.  It’s unfortunate that no one has donated the distributed piece yet (although Solr does now have it, thanks to Yonik!).

At any rate, thanks for the ideas.  Hope to see your patches soon!
