Tao and the Art of Search: Yin Yang and TF-IDF
I often explain search and relevance at talks and training classes for Lucene and Solr. In doing so, I usually discuss search term weighting and its typical instantiation as term frequency and inverse document frequency (abbreviated as TF-IDF), either in light of the vector space model or in terms of determining relevance.
The basic concept is that term frequency (TF) is the number of times a term occurs in a document, while inverse document frequency (IDF) is the inverse of the number of documents in which that term occurs. Thus, the more often a term appears in a document, the more important that document is for the term. The IDF then acts as a counterbalance to the term frequency: the more documents a term appears in, the less that term counts in determining the importance of the containing document. Hence, I usually explain TF-IDF as the “Yin and Yang of Search”, and this seems to resonate well with my students, as it pretty clearly demonstrates how the opposing forces work together to create meaningful results for end users. Of course, as sometimes happens with opposing forces, one can outweigh the other, leading to bad results.
For more on yin and yang, see the Wikipedia article “Yin and yang”.
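To make the balance between the two forces concrete, here is a minimal Python sketch of the textbook weighting: raw term count multiplied by the log of (number of documents / number of documents containing the term). This is an illustration only and not the formula Lucene or Solr actually uses for scoring; the function name `tf_idf`, the whitespace tokenizer, and the three sample documents are all assumptions made up for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF sketch)."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical sample documents, invented for illustration.
docs = [
    "the tao of search",
    "the art of the search engine",
    "the yin and the yang",
]
scores = tf_idf(docs)
```

In this toy corpus "the" occurs in every document, so its IDF is log(3/3) = 0 and its weight is zero no matter how often it repeats. "tao" occurs in only one document, so in the first document it outweighs "search", which occurs in two.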




Interesting, I might put that out there the next time I explain tf*idf. But, in my experience, people quickly grasp the basic information theory idea that the significance of a signal is a combination of its strength and its distinctiveness. The nice thing about explaining tf*idf this way is that it helps you further explain where it breaks down, e.g., that term frequency is a very crude proxy for signal strength and that inverse document frequency conflates true distinctiveness with noise.