Archive for the 'clustering' Category

Speeding up K-means Clustering with Algebra and Sparse Vectors « LingPipe Blog

k-means and other EM-like algorithms are trivial to parallelize because all the heavy computations in the inner loops are independent.
via Speeding up K-means Clustering with Algebra and Sparse Vectors « LingPipe Blog.
This is exactly what Apache Mahout does. We have parallelized versions of several clustering algorithms, including k-means.
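To make the independence claim concrete, here is a minimal sketch of one k-means iteration in plain Python. It is not Mahout's code or LingPipe's; the function names (nearest_centroid, kmeans_step) and the workers parameter are made up for illustration. The point is that each point's assignment reads only the current centroids, so the assignments can be handed to any number of workers, and only the averaging step needs the results gathered together.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((p - q) ** 2 for p, q in zip(point, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def kmeans_step(points, centroids, workers=4):
    """One k-means iteration (hypothetical helper): assign points, then average clusters."""
    # Assignment step: each call depends only on one point and the shared,
    # read-only centroids, so the calls can run in any order on any worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        labels = list(pool.map(lambda p: nearest_centroid(p, centroids), points))

    # Update step: group the points by assigned cluster and average each group.
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            # Assumption for this sketch: an empty cluster keeps its old centroid.
            new_centroids.append(old)
        else:
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
            )
    return labels, new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
labels, new_centroids = kmeans_step(points, centroids)
```

A thread pool is used here only to show that the per-point work is independent; in CPython, threads will not speed up pure-Python arithmetic like this. In a distributed setting the same split maps naturally onto a map phase (assign each point) and a reduce phase (average each cluster).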

Mahout Update

It’s been a while since I reported anything on Mahout (here’s why), but I thought I would give an update. I know it’s been promised before, but the committers have been diligently working on a 0.1 release, which should be out very soon. I think I have all the Maven release stuff in place and am [...]

Congrats to Tika and Welcome to the Lucene Stack!

Congratulations to Apache Tika (never mind the incubator address, it’s still in the process of migrating) for graduating from Incubation! And welcome to the Lucene project! Tika is a content extraction framework that wraps many other content extraction libraries, such as PDFBox and POI, into a single, easy-to-use framework that makes it easy [...]

Some New Features in Solr

I’ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr.
First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist of it is that Solr can now serve up term vectors from Lucene. For those [...]

Jeff Eastman’s Marvelous Cloud Computing Adventure

Mahout’s newest committer, Jeff Eastman, has a new blog on Mahout and Hadoop…

Mahout: k-means Clustering

I committed a first crack at k-means clustering to Mahout last night, thanks again to Jeff Eastman’s excellent work. Mahout now has two clustering algorithms designed to run on Hadoop’s MapReduce framework, so it should be able to scale up to very large data sets.
To learn more about k-means, see the Mahout [...]

Mahout’s First Commit

I have committed Mahout’s first Hadoop-based machine learning code: https://issues.apache.org/jira/browse/MAHOUT-3
The code is an initial implementation of Canopy clustering. It is a start and it is great to see others jump right in and start adding code!  Great work, Jeff Eastman, who contributed the initial implementation!
Now, we can start building more goodness in order to [...]