Archive for the 'Taming Text' Category

Mahout Update

It’s been a while since I reported anything on Mahout (here’s why), but I thought I would give an update. I know it’s been promised before, but the committers have been diligently working on a 0.1 release, which should be out very soon. I think I have all the Maven release stuff in place and am [...]

Congrats to Tika and Welcome to the Lucene Stack!

Congratulations to Apache Tika (never mind the incubator address; it’s still in the process of migrating) for graduating from Incubation! And welcome to the Lucene project! Tika is a content extraction framework that wraps many other content extraction libraries, such as PDFBox and POI, into a single, easy-to-use framework that makes it easy [...]

Charlotte JUG » October Slides Available – Search & Analysis

Had a lot of fun at my recent talk at the Charlotte JUG. They’ve got a good core of people and there was a lot of good discussion about the topic. Even managed to give away some free eBooks of “Taming Text”. Wish I would have [...]

Some New Features in Solr

I’ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr.
First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist of it is that Solr can now serve up term vectors from Lucene. For those [...]

Charlotte JUG » OCT 15TH – 6PM – Search and Text Analysis

I will be speaking at the Charlotte Java Users Group on Oct. 15th, covering things like Lucene, Solr, OpenNLP and Mahout, amongst other things. Basically, a high-level talk on my book.

Opening up Academic Research on IR and Machine Learning

Kudos to Dr. Ted Pedersen for finally saying out loud (in the latest issue of Computational Linguistics, thanks to Bob Carpenter for the pointer) what I’ve long thought about academic publications on topics like information retrieval and machine learning: namely, that publishing empirical results from software systems without publishing the software is a disservice to [...]

Manning: Taming Text

Scary…  I guess it is real!