Archive for the 'machine learning' Category

Speeding up K-means Clustering with Algebra and Sparse Vectors « LingPipe Blog

“k-means and other EM-like algorithms are trivial to parallelize because all the heavy computations in the inner loops are independent.” (via Speeding up K-means Clustering with Algebra and Sparse Vectors « LingPipe Blog). This is exactly what Apache Mahout does. We have parallelized versions of a bunch of clustering algorithms, including k-means.
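To make the "independent inner loops" point concrete, here is a minimal sketch of one k-means iteration written in a map/reduce style. It is not Mahout's code: the function names (`assign_chunk`, `kmeans_step`) and the `n_chunks` parameter are made up for illustration, and it uses a thread pool only to show the structure. Each chunk of points is processed on its own, producing per-cluster sums and counts, and the partial results are then added together to get the new centroids.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_chunk(chunk, centroids):
    """Map step (hypothetical helper): per-cluster sums and counts for one chunk.

    This only reads the shared centroids, so chunks do not depend on each other.
    """
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        nearest = min(
            range(len(centroids)),
            key=lambda k: squared_distance(point, centroids[k]),
        )
        counts[nearest] += 1
        for d, value in enumerate(point):
            sums[nearest][d] += value
    return sums, counts


def kmeans_step(points, centroids, n_chunks=4):
    """One k-means iteration: map over chunks in parallel, then reduce."""
    chunks = [points[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(lambda c: assign_chunk(c, centroids), chunks))

    # Reduce step: add up the partial sums and counts from every chunk.
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for sums, counts in partials:
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for d in range(dim):
                total_sums[k][d] += sums[k][d]

    # New centroid = mean of assigned points; keep the old one if a cluster is empty.
    new_centroids = []
    for k in range(len(centroids)):
        if total_counts[k] == 0:
            new_centroids.append(list(centroids[k]))
        else:
            new_centroids.append([s / total_counts[k] for s in total_sums[k]])
    return new_centroids
```

In CPython the threads here will not give a real speedup for pure-Python arithmetic; the point is only that the map step shares nothing but the read-only centroids, which is what lets the same structure be spread across processes or a Hadoop cluster.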

Hadoop, Analytical Software, Finds Uses Beyond Search – NYTimes.com

Nice writeup on Hadoop in the NYT today. Of course, Hadoop is often used to power machine learning, too, which is the premise behind using it in Apache Mahout.

Lucid Imagination » Add our Lucene Ecosystem Search Engine to Firefox

Mark Miller shows how to add Lucid’s Lucene ecosystem search as a Firefox plugin. Now you can search all the Lucene project (and subproject) archives, websites, and wikis from the comfort of your browser plugin.

GSOC 2009 at the ASF: Looking for students interested in Lucene

SummerOfCode2009 – General Wiki: It’s that time of year again. Time for students to sign up for Google Summer of Code. Gist of it: get paid to work in open source for the summer. I’ve signed up to mentor for Apache Mahout. We are looking for students interested in implementing cutting-edge machine learning algorithms, optionally [...]

Congrats to Tika and Welcome to the Lucene Stack!

Congratulations to Apache Tika (never mind the incubator address, it’s still in the process of migrating) for graduating from Incubation! And welcome to the Lucene project! Tika is a content extraction framework that wraps many other content extraction libraries, such as PDFBox and POI, into a single, easy-to-use framework that makes it easy [...]

Intro to Mahout slides available

My intro to Mahout slides are available here.

Charlotte JUG » October Slides Available – Search & Analysis

Had a lot of fun at my recent talk at the Charlotte JUG. They’ve got a good core of people and there was a lot of good discussion about the topic. Even managed to give away some free eBooks of “Taming Text”. Wish I would [...]

Some New Features in Solr

I’ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr. First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist of it is that Solr can now serve up term vectors from Lucene. For [...]

Opening up Academic Research on IR and Machine Learning

Kudos to Dr. Ted Pedersen for finally saying out loud (in the latest issue of Computational Linguistics, thanks to Bob Carpenter for the pointer) what I’ve long thought about academic publications on topics like information retrieval and machine learning: namely, that publishing empirical results from software systems without publishing the software is a disservice to [...]

BarCamp wiki / BarCampRDU

I’ll be at BarCampRDU tomorrow. I proposed two sessions, one on Hadoop and Mahout and one on Lucene and Solr. I don’t think I really want to do both, but I would like to do at least one, so we’ll see what other people are interested in. If you’re around and [...]