Tika and Solr
As I mentioned in Congrats to Tika and Welcome to the Lucene Stack, I’ve been working on adding Tika support to Solr. Well, I finally committed it today, with special thanks to Chris Harris and Eric Pugh for helping see it through with me.
What does this mean? It is now possible to send a document in any of Tika’s supported types (MS Office, PDF, XML, HTML, etc.) to Solr and have the content extracted and then indexed, all within Solr.
For more information on how to use it, see the Solr Wiki entry.
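To give a rough idea of what "sending a document" looks like, here is a minimal Python sketch. It assumes the extracting handler is mounted at /update/extract and accepts the literal.id and commit parameters, as described in the Solr documentation; your version or configuration may differ, so check the wiki. The core URL, the document id, the file path, and both function names are hypothetical and only there for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, commit=True):
    """Build the URL for the extracting handler.

    Assumptions: the handler lives at /update/extract and takes
    literal.id (the unique key for the document) and commit.
    solr_base is a hypothetical URL such as
    "http://localhost:8983/solr/mycore".
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def index_rich_document(solr_base, doc_id, path,
                        content_type="application/octet-stream"):
    """POST the raw file bytes to Solr, which hands them to Tika.

    Returns the HTTP status code of the response.
    """
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(solr_base, doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status
```

With a running Solr instance, a call might look like index_rich_document("http://localhost:8983/solr/mycore", "doc1", "report.pdf", "application/pdf"). Splitting out the URL builder keeps the request parameters easy to inspect without touching the network.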




A natural enhancement / extension to a metadata extraction and identification toolkit would be to layer a content analysis framework on top. There is value in extracting named entities from the content. These named entities can then be used to slice and dice information by people, companies, places, etc.
Grant, is someone already working on it? Any plans in the pipeline?
Hi Sameer,
Tom Morton and I have written on this in “Taming Text” (http://www.manning.com/ingersoll). The associated code has integration between Solr and OpenNLP, which can do Named Entity Recognition. That’s a starting point. You could also easily plug in other algorithms, I think, but I don’t know if anyone is currently offering that in Solr.
Thanks! I’ll check it out.
Grant,
Just been digging through Tika, and it looks like it’s come a long way since I first heard about it! And the patches you’ve made to Solr to support rich documents are great!
Eric
Thanks, Eric! Your patch was the catalyst, without a doubt.