Assumptions (in Apache Lucene and Solr and pretty much everything else) Considered Harmful

I had a football coach (American football, that is, not soccer) who used to drill into our heads what happens when you assume something about that week's opponent. He'd get all worked up, hoist up his coaching shorts (you know the ones, they should be banned…), puff out his chest, give you a look that was part wink and part anger, and say something to the effect of: “You know what happens when you assume? You make an ass out of ‘u’ and ‘me’.”

That saying comes back to me often these days when I hear, again and again, why some programmer chose to do something a certain way in Apache Lucene (and to some extent Solr, but less so, since it takes care of most of the details of Lucene), despite the documentation and the community all saying not to do it that way. The usual case involves paging deeper into the results, but I've seen it in many other areas as well, such as:

  1. Loading everything into RAMDirectory instead of just relying on O/S caching because “it’s faster”, even though they don’t quantify it
  2. Faceting implementations
  3. Overriding defaults without testing
  4. Blindly using defaults without testing (stemming in particular!)
  5. Using a very large JVM Heap, thus choking off memory for the O/S, because “more memory is better”

In the paging case, the programmer thinks it is too expensive to execute the query a second or third time, so they retrieve 10 or 20 pages' worth of results, if not much, much more, stuff them in a cache, and then return the top ten. There are many problems with this. First, most people don't go beyond page one or two, so all that work is wasted anyway. Second, Lucene is super fast at executing the search, never mind the fact that the operating system has probably cached everything it needs to do the search anyway; by caching all that info yourself, you may have forced some of those O/S caches out of memory, thus slowing down subsequent searches. Third, it creates excessive garbage, which can lead to major collections, thus grinding the app to a halt. Fourth, materializing the actual documents from disk for the search results is often expensive, because doing so usually involves random seeks on disk.

In this case, the developer “assumes” Lucene would be slow at something, but didn't bother to actually measure it to really know. Often what comes out of all this “premature optimization” is a whole slew of code that now needs to be maintained, further complicating the application and making it harder for new developers to participate and make the application better, all while costing the company time and money. Thus, assumptions in Lucene and Solr (and elsewhere) are Considered Harmful.

See Lucene's Best Practices page and Solr's Performance Factors, amongst other resources like “Lucene in Action” (2nd edition), for more info on how to do it right (or to challenge our assumptions!).
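To make the paging alternative concrete, here is a minimal sketch in Java of the "just re-run the query" approach: for each page request, ask for only as many top hits as that page needs and display the last slice. The class and method names (PagingSketch, hitsToRequest, pageBounds) are hypothetical and written for illustration only. The sketch does not call Lucene; in a real application the computed count would be handed to something like IndexSearcher.search(Query, int), and the bounds applied to the scoreDocs of the returned TopDocs.

```java
final class PagingSketch {

    // How many top hits to request so that the given zero-based page
    // is fully covered. Nothing beyond that page is fetched or cached.
    static int hitsToRequest(int page, int pageSize) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and pageSize > 0");
        }
        return Math.multiplyExact(page + 1, pageSize);
    }

    // Returns {start, end} (end exclusive) into the ranked hits for the page,
    // clamped to the number of hits the search actually returned.
    static int[] pageBounds(int page, int pageSize, int availableHits) {
        if (availableHits < 0) {
            throw new IllegalArgumentException("availableHits must be >= 0");
        }
        int needed = hitsToRequest(page, pageSize);
        int start = Math.min(needed - pageSize, availableHits);
        int end = Math.min(needed, availableHits);
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int page = 2;       // third page, zero-based
        int pageSize = 10;

        int n = hitsToRequest(page, pageSize);

        // Assumption: this is where the query would be re-executed,
        // requesting only the top n hits from the search engine.
        int availableHits = 25; // pretend only 25 documents matched

        int[] bounds = pageBounds(page, pageSize, availableHits);
        System.out.println("request top " + n + " hits, display ranks ["
                + bounds[0] + ", " + bounds[1] + ")");
    }
}
```

Only the documents inside the returned bounds need to be loaded from disk, so the costly materialization step is limited to the ten or so results the user actually sees. Whether this beats a precomputed cache in a given application is exactly the kind of thing to measure rather than assume.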

This isn't just a Lucene/Solr phenomenon, and of course, I am not without sin: I often catch myself, or get caught by others, too. Writing this is as much a reminder to me to be pragmatic and to test my assumptions as it is to anyone else. Of course, one of the best things about a community like Lucene and Solr is having people who can objectively challenge your assumptions. In more traditional development models, it is often the case that the most experienced or most senior developers rule the roost and developers are shackled by the “one right way” to do things.
