Some New Features in Solr
I’ve recently had a chance to work on a few things in Solr that I think can, in the right circumstances, really enhance it.
First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist is that Solr can now serve up term vectors from Lucene. For the uninitiated, term vectors store the term, term frequency and, optionally, position and offset information in a document-centric way in Lucene (as opposed to the inverted index storage used for searching). Term Vectors are often useful for things besides search, like highlighting, machine learning and document-document similarity. This component can provide:
- Term
- Term Frequency
- Position (based on analysis)
- Offset (character based)
- IDF – Inverse Document Frequency
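To make those fields concrete, here is a minimal sketch in plain Python of what a term vector holds for a tiny, made-up corpus. It does not call Solr or Lucene at all: the function names (`term_vector`, `inverse_document_frequency`, `cosine`), the whitespace-and-lowercase "analysis" and the simple log(N/df) IDF are all assumptions for illustration, and Lucene's real analysis chain and scoring formula are not reproduced here. The last function hints at how such vectors could feed document-document similarity.

```python
import math
import re
from collections import defaultdict


def term_vector(text):
    """Build a toy term vector: term -> tf, token positions, character offsets."""
    vector = {}
    # Stand-in for analysis: lowercase the text and split on word characters.
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        entry = vector.setdefault(
            match.group(), {"tf": 0, "positions": [], "offsets": []}
        )
        entry["tf"] += 1
        entry["positions"].append(position)
        entry["offsets"].append((match.start(), match.end()))
    return vector


def inverse_document_frequency(vectors):
    """Toy IDF, log(N / df), computed over the given term vectors (an assumption)."""
    doc_freq = defaultdict(int)
    for vector in vectors:
        for term in vector:
            doc_freq[term] += 1
    total = len(vectors)
    return {term: math.log(total / df) for term, df in doc_freq.items()}


def cosine(vec_a, vec_b, idf):
    """Cosine similarity between two term vectors, weighted by tf * idf."""
    weights_a = {t: e["tf"] * idf[t] for t, e in vec_a.items()}
    weights_b = {t: e["tf"] * idf[t] for t, e in vec_b.items()}
    dot = sum(w * weights_b.get(t, 0.0) for t, w in weights_a.items())
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical three-document corpus.
docs = [
    "Solr serves term vectors",
    "Lucene stores term vectors",
    "Carrot2 clusters search results",
]
vectors = [term_vector(doc) for doc in docs]
idf = inverse_document_frequency(vectors)

print(vectors[0]["term"])                 # tf, positions and offsets for "term"
print(cosine(vectors[0], vectors[1], idf))  # documents sharing terms
print(cosine(vectors[0], vectors[2], idf))  # documents sharing nothing
```

The point of the sketch is only the shape of the data: everything is keyed by document first and term second, which is what makes per-document work like highlighting or similarity cheap compared with walking the inverted index.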
Combining all of these things, plus a couple of other features, can, I think, really enable Solr to act as a more general text server (which is what Taming Text is going to show). For instance, the Analysis Request Handler can act as a document analyzer server, and the Luke Request Handler can provide all kinds of corpus statistics. And I haven’t even mentioned search, faceting and spell checking yet.
Nor have I mentioned the other thing I am working on: adding search-result and document clustering to Solr. This is taking place on SOLR-769. The basic implementation I have now does search result clustering using the Carrot2 open source project. After that, I plan on adding Mahout for document-based clustering. I also know that Tom Morton, for Taming Text, has added OpenNLP’s Named Entity Recognition to Solr. At some point in the near future, I’ll put up a link to that code.
Bottom line: Solr ain’t just for search anymore!






Hi,
I want to create a simple document-document similarity matrix for a set of documents (using Lucene or any of its subprojects).
It seems I have to create a feature vector for each document that contains all the terms belonging to it, weighted using TF-IDF. The cosine similarity measure is then applied to compute the similarity between documents.
Every tutorial and example on the web just shows how to index and retrieve!
Can you help me?