Archive for the 'clustering' Category
“k-means and other EM-like algorithms are trivial to parallelize because all the heavy computations in the inner loops are independent.” (via Speeding up K-means Clustering with Algebra and Sparse Vectors « LingPipe Blog) This is exactly what Apache Mahout does. We have parallelized versions of a bunch of clustering algorithms, including k-means.
March 18th, 2009 | Posted in clustering, kMeans clustering, machine learning, Mahout | 2 Comments
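To make the quoted point about independent inner loops concrete, here is a minimal sketch of one k-means iteration in Python. It is not Mahout's code, and the names `nearest`, `partial_stats`, and `kmeans_step` are made up for illustration. Each chunk of points is assigned to its nearest centroids without looking at any other chunk, and only the small per-centroid sums and counts are merged at the end. The thread pool is only there to show the structure; in CPython, threads will not speed up this CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def partial_stats(chunk, centroids):
    """Independent per-chunk work: per-centroid coordinate sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        k = nearest(point, centroids)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    return sums, counts


def kmeans_step(points, centroids, workers=2):
    """One k-means iteration: chunked assignment, then a merge of the partials."""
    chunks = [points[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(partial_stats, chunks, [centroids] * workers))

    dim = len(centroids[0])
    new_centroids = []
    for k, old in enumerate(centroids):
        total = sum(counts[k] for _, counts in results)
        if total == 0:
            # No points were assigned; keep the old centroid (an assumption).
            new_centroids.append(list(old))
            continue
        new_centroids.append(
            [sum(sums[k][d] for sums, _ in results) / total for d in range(dim)]
        )
    return new_centroids


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans_step(pts, [(1, 1), (9, 9)]))
```

The per-chunk function never shares state with other chunks, which is why the same shape maps naturally onto a map step (assign and sum) followed by a reduce step (merge and divide).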
It’s been a while since I reported anything on Mahout (here’s why), but I thought I would give an update. I know it’s been promised before, but the committers have been diligently working on a 0.1 release, which should be out very soon. I think I have all the Maven release stuff in place and am [...]
February 9th, 2009 | Posted in Apache, clustering, Java, Mahout, Solr, Taming Text | No Comments
Congratulations to Apache Tika (never mind the incubator address, it’s still in the process of migrating) for graduating from Incubation! And welcome to the Lucene project! Tika is a content extraction framework that wraps many other content extraction libraries such as PDFBox, POI, and others into a single, easy-to-use framework that makes it easy [...]
November 13th, 2008 | Posted in Apache, clustering, Java, Lucene, machine learning, Mahout, Manning, OpenNLP, Search, Solr, Taming Text, Tika | 3 Comments
I’ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr. First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist of it is that Solr can now serve up term vectors from Lucene. For [...]
October 23rd, 2008 | Posted in Apache, clustering, Java, Lucene, machine learning, Mahout, Manning, Search, Solr, spell checking, Taming Text, term vectors, tokenization | 1 Comment
Jeff Eastman’s Marvelous Cloud Computing Adventure: Mahout’s newest committer, Jeff Eastman, has a new blog on Mahout and Hadoop…
March 28th, 2008 | Posted in Apache, clustering, Hadoop, Java, Lucene, machine learning, Mahout, Map Reduce | No Comments
I committed a first crack at k-means clustering to Mahout last night, thanks again to Jeff Eastman’s excellent work. Mahout now has two clustering algorithms designed to run on Hadoop’s MapReduce framework, so it should be able to scale up to very large data sets. To learn more about k-means, see the [...]
March 1st, 2008 | Posted in Apache, clustering, Hadoop, Java, kMeans clustering, machine learning, Mahout, Map Reduce | 1 Comment
I have committed Mahout’s first Hadoop-based machine learning code (https://issues.apache.org/jira/browse/MAHOUT-3). The code is an initial implementation of Canopy clustering. It is a start, and it is great to see others jump right in and start adding code! Great work by Jeff Eastman, who contributed the initial implementation! Now we can start building more goodness in [...]
February 19th, 2008 | Posted in Apache, canopy clustering, clustering, Hadoop, Java, machine learning, Mahout, Map Reduce | 2 Comments
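For readers who have not met Canopy clustering, mentioned in the post above, here is a minimal single-machine sketch in Python of the algorithm as it is usually described. It uses a cheap distance measure and two thresholds, a loose one (`t1`) and a tight one (`t2`, with `t1 > t2`). It is not the MAHOUT-3 code, which runs on Hadoop; the function name `canopy_clusters` and its parameters are assumptions made for illustration.

```python
def canopy_clusters(points, t1, t2, distance):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold and t2 the tight one; t1 must exceed t2.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")

    remaining = list(points)
    canopies = []
    while remaining:
        # Take the next unclaimed point as a new canopy center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for point in remaining:
            d = distance(center, point)
            if d < t1:
                # Within the loose threshold: joins this canopy.
                members.append(point)
            if d >= t2:
                # Not within the tight threshold: may still seed or join others.
                still_remaining.append(point)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    data = [0.0, 1.0, 2.5, 10.0]
    for center, members in canopy_clusters(data, 3.0, 1.5, lambda a, b: abs(a - b)):
        print(center, members)
```

Points that fall between the two thresholds can belong to more than one canopy, which is why the result is a set of overlapping groups rather than a strict partition. Canopies are commonly used as a cheap first pass before a more expensive algorithm such as k-means.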