Payloads
Michael Busch recently committed code that enables Lucene to store payloads at the term level, that is, a small byte array attached to each occurrence of a term (see https://issues.apache.org/jira/browse/LUCENE-755), and I have started working on incorporating these payloads into search and scoring (see http://wiki.apache.org/lucene-java/Payload_Planning and https://issues.apache.org/jira/browse/LUCENE-834).
So, you might be asking yourself, what exactly are payloads good for? Naturally, the answer is: a lot! For example, see section 4.2.5 (Hit Lists) of “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Brin and Page (discussed here), where they describe storing information about each term occurrence in the index, such as font and capitalization, which is then factored into the scoring algorithm. You could also store things like part of speech or per-term weights, and then score noun matches higher than verb matches or use the encoded weight as part of the score calculation. Another option would be to store synonyms of the words, or the synset from WordNet. In NLP applications, payloads could store co-references or other types of linkage, which could then feed a graph ranking strategy such as PageRank or TextRank (discussed here). Yet another option is to store XPath information or other metadata, which is often kept in separate fields and requires stitching the information back together.
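To make the part-of-speech and per-term-weight idea a bit more concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API, since that is still experimental; every name in it (PayloadScoringSketch, Posting, posBoost, the 5-byte layout, the 2x noun boost) is a hypothetical choice made for illustration, not something taken from Lucene or from the patches above.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Lucene-free illustration: every name here is hypothetical, not a Lucene API.
final class PayloadScoringSketch {

    // Assumed tag set that an analyzer might attach to each token.
    enum PartOfSpeech { NOUN, VERB, OTHER }

    // One term occurrence together with its opaque payload bytes.
    static final class Posting {
        final String term;
        final byte[] payload;

        Posting(String term, PartOfSpeech pos, float weight) {
            this.term = term;
            this.payload = encode(pos, weight);
        }
    }

    // Payload layout (an assumption): 1 byte for the tag, 4 bytes for the weight.
    static byte[] encode(PartOfSpeech pos, float weight) {
        return ByteBuffer.allocate(5).put((byte) pos.ordinal()).putFloat(weight).array();
    }

    static PartOfSpeech decodePos(byte[] payload) {
        return PartOfSpeech.values()[payload[0]];
    }

    static float decodeWeight(byte[] payload) {
        // Absolute read of the float that starts at byte offset 1.
        return ByteBuffer.wrap(payload).getFloat(1);
    }

    // Arbitrary illustrative boost: noun matches count double.
    static float posBoost(PartOfSpeech pos) {
        return pos == PartOfSpeech.NOUN ? 2.0f : 1.0f;
    }

    // Sum boost * weight over every occurrence of the query term.
    static float score(String queryTerm, List<Posting> postings) {
        float total = 0f;
        for (Posting p : postings) {
            if (p.term.equals(queryTerm)) {
                total += posBoost(decodePos(p.payload)) * decodeWeight(p.payload);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Posting> doc = Arrays.asList(
            new Posting("run", PartOfSpeech.NOUN, 1.5f),
            new Posting("run", PartOfSpeech.VERB, 1.0f),
            new Posting("fast", PartOfSpeech.OTHER, 1.0f));
        // Noun occurrence: 2.0 * 1.5, verb occurrence: 1.0 * 1.0.
        System.out.println(score("run", doc));
    }
}
```

The point of the sketch is only that the index stores opaque bytes per occurrence and the scorer decides what they mean. A real integration would read the bytes through whatever the payload API ends up exposing and would fold the decoded value into Lucene's own scoring rather than a hand-rolled sum.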
In a sense, payloads open up a lot of new avenues for search in Lucene. Open questions remain as to how much data can be stored while still getting good performance. Also note that the current API is still considered experimental and may change, although I doubt it will change drastically.
What ideas do you have for payloads that I missed? Let me know or update the planning page on the Lucene wiki.
