Understanding the theory behind Lucene

Much has been written on how to use Lucene, and much has been written on the theory of information retrieval (IR). Lately, I have been brushing up on my theoretical understanding of IR in the context of Lucene. Although Lucene can seem quite complicated (inverted files; nested boolean, prefix, range, proximity, and span queries; term vectors; and so on), in the end it rests solidly on a standard inverted index model, with some really nice insights added to make it run very fast.

I say this because I have been (re)reading Modern Information Retrieval (MIR) by Baeza-Yates and Ribeiro-Neto, and I am constantly saying to myself, “that’s why that was done” or “so that’s how it is done,” whether in the context of index creation, phrase searching, or some other aspect of Lucene. For instance, chapter 8 (particularly section 8.2) covers the creation of the inverted index and how to search it. It describes how to do the merges and gives some tips on how to make searches perform faster. Chapter 7 covers document preprocessing, which in Lucene terms corresponds to building an Analyzer and its associated Filters. Chapter 4 describes many of the different types of queries a system should offer (Lucene implements most of those suggested). And to truly understand the underpinnings, chapter 2 describes the vector space model (as well as other models) for search.
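
To make the inverted index idea concrete, here is a minimal sketch in Python. It is not Lucene's code and not taken from MIR; it is only a toy with the same overall shape: an analysis step, an in-memory index mapping terms to sorted lists of document IDs, and an AND query answered by intersecting those lists. The names analyze, build_index, intersect and search_and, the stop word list, and the sample documents are all made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop word list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}


def analyze(text):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Walk two sorted postings lists in step, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_and(index, query):
    """Return IDs of documents that contain every term of the query."""
    terms = analyze(query)
    if not terms:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result


docs = [
    "Lucene is a search library",
    "The inverted index maps terms to documents",
    "Lucene builds an inverted index",
]
index = build_index(docs)
hits = search_and(index, "inverted index")  # [1, 2]
```

Keeping each postings list sorted is what lets the intersection run in a single pass over both lists. A real engine adds a great deal on top of this (positions for phrase queries, on-disk segments, scoring), none of which this sketch attempts.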

So, if you want to understand Lucene at a deeper level, I very much recommend you read Modern Information Retrieval or some other book on the basics of IR.
