<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; clustering</title>
	<atom:link href="http://lucene.grantingersoll.com/category/clustering/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Mon, 06 Feb 2012 12:07:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Speeding up K-means Clustering with Algebra and Sparse Vectors « LingPipe Blog</title>
		<link>http://lucene.grantingersoll.com/2009/03/18/speeding-up-k-means-clustering-with-algebra-and-sparse-vectors-%c2%ab-lingpipe-blog/</link>
		<comments>http://lucene.grantingersoll.com/2009/03/18/speeding-up-k-means-clustering-with-algebra-and-sparse-vectors-%c2%ab-lingpipe-blog/#comments</comments>
		<pubDate>Wed, 18 Mar 2009 14:40:38 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[clustering]]></category>
		<category><![CDATA[kMeans clustering]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=168</guid>
		<description><![CDATA[k-means and other EM-like algorithms are trivial to parallelize because all the heavy computations in the inner loops are independent. via Speeding up K-means Clustering with Algebra and Sparse Vectors « LingPipe Blog. This is exactly what Apache Mahout does.  We have parallelized versions of a bunch of clustering algorithms, including k-means.]]></description>
			<content:encoded><![CDATA[<p>k-means and other EM-like algorithms are trivial to parallelize because all the heavy computations in the inner loops are independent.</p>
<p>via <a href="http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/">Speeding up K-means Clustering with Algebra and Sparse Vectors « LingPipe Blog</a>.</p>
<p>This is exactly what Apache Mahout does.  We have parallelized versions of a bunch of clustering algorithms, including k-means.</p>
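<p>To make the independence claim concrete, here is a minimal sketch in Python. It is not Mahout code: the function names and the toy data are made up for illustration. It shows only the assignment step of k-means, where each point's nearest centroid can be computed without looking at any other point.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def assign_sequential(points, centroids):
    """Baseline: assign the points one after another."""
    return [nearest(p, centroids) for p in points]


def assign_parallel(points, centroids, workers=4):
    """Assign the points concurrently.

    Each assignment depends only on its own point and the read-only centroids,
    so the calls can run in any order. pool.map keeps the results in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: nearest(p, centroids), points))


# Hypothetical toy data: two obvious groups.
points = [(0.0, 0.0), (0.5, 0.2), (9.0, 9.5), (10.0, 10.0), (0.1, 0.9)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
labels = assign_parallel(points, centroids)
```

<p>The threads are there only to show that evaluation order does not matter. In CPython they are unlikely to speed up pure-Python arithmetic like this, and a real system would spread the work across processes or machines.</p>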
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/03/18/speeding-up-k-means-clustering-with-algebra-and-sparse-vectors-%c2%ab-lingpipe-blog/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Mahout Update</title>
		<link>http://lucene.grantingersoll.com/2009/02/09/mahout-update/</link>
		<comments>http://lucene.grantingersoll.com/2009/02/09/mahout-update/#comments</comments>
		<pubDate>Mon, 09 Feb 2009 14:42:25 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=148</guid>
		<description><![CDATA[It&#8217;s been a while since I reported anything on Mahout (here&#8217;s why), but I thought I would give an update.  I know it&#8217;s been promised before, but the committers have been diligently working on a 0.1 release, which should be out very soon.  I think I have all the Maven release stuff in place and am [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a while since I reported anything on Mahout (here&#8217;s <a href="http://lucene.grantingersoll.com/2009/01/26/lucid-imagination/">why</a>), but I thought I would give an update.  I know it&#8217;s been promised before, but the committers have been diligently working on a 0.1 release, which should be out very soon.  I think I have all the Maven release stuff in place and am now testing and verifying the release candidate.  Once that&#8217;s done, I&#8217;ll post an RC for vote and then we should be able to release.</p>
<p>Going forward, there should be several new algorithms going in post 0.1, as Isabel Drost has added some code for Winnow and Perceptron implementations and I believe Karl Wettin has some work on hierarchical clustering in place.  I also believe Ted Dunning and Jeff Eastman are working on Dirichlet clustering.  Finally, Sean Owen is always rocking on Taste&#8217;s collaborative filtering capabilities, so there will no doubt be more goodness in that regard.  As for me, I&#8217;m working on integrating clustering (Carrot2 and Mahout) into Solr and will be writing a chapter on doing text clustering in <a href="http://www.manning.com/ingersoll">Taming Text</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/02/09/mahout-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Congrats to Tika and Welcome to the Lucene Stack!</title>
		<link>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/</link>
		<comments>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/#comments</comments>
		<pubDate>Thu, 13 Nov 2008 15:43:35 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[OpenNLP]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=130</guid>
		<description><![CDATA[Congratulations to Apache Tika (never mind the incubator address, it&#8217;s still in the process of migrating) for graduating from Incubation!   And welcome to the Lucene project!  Tika is a content extraction framework that wraps many other content extraction libraries such as PDFBox, POI, and others into a single, easy-to-use framework that makes it easy [...]]]></description>
			<content:encoded><![CDATA[<p>Congratulations to <a href="http://incubator.apache.org/tika">Apache Tika</a> (never mind the incubator address, it&#8217;s still in the process of migrating) for graduating from Incubation!   And welcome to the Lucene project!  Tika is a content extraction framework that wraps many other content extraction libraries such as <a href="http://incubator.apache.org/projects/pdfbox.html">PDFBox</a>, <a href="http://poi.apache.org">POI</a>, and others into a single, easy-to-use framework that makes it easy to add extracted content to Lucene, Solr and any other text application.  This is something many of us do whenever we work with file formats, and it also relates to one of the most frequently asked questions on the user mailing lists (that is, &#8220;how do I get text from Word/PDF/Excel?&#8221;).</p>
<p>Tika&#8217;s interface is very similar to SAX, so it is easy to think about extraction in terms of receiving SAX events, just like you do with XML.  That is also nice because it means extraction is streaming and doesn&#8217;t require you to load a whole document into memory before dealing with it.</p>
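<p>For readers who have not used SAX, the event-driven style looks roughly like the sketch below. This uses Python's standard <code>xml.sax</code> module on a made-up XML snippet, not Tika itself; the <code>TextCollector</code> class and the sample document are assumptions for illustration only.</p>

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects element names and character data as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self.chunks = []

    def startElement(self, name, attrs):
        # Called once for every opening tag, in document order.
        self.elements.append(name)

    def characters(self, content):
        # Called with pieces of text; the parser may split text into several calls.
        if content.strip():
            self.chunks.append(content.strip())


# Hypothetical input standing in for the structured output of an extractor.
sample = b"<doc><heading>Title</heading><para>Some extracted text.</para></doc>"

handler = TextCollector()
xml.sax.parseString(sample, handler)
text = " ".join(handler.chunks)
```

<p>The handler never sees the whole document at once. It just reacts to events as they arrive, which is what keeps memory use flat for large inputs.</p>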
<p>I&#8217;m now in the process of incorporating Tika into Solr (see <a href="https://issues.apache.org/jira/browse/SOLR-284">SOLR-284</a>) and I think we&#8217;ll eventually see it hooked into the Data Import Handler (DIH) in Solr, too, such that one could easily get content from a DB or a URL with the DIH, and then extract it if it is a binary object.</p>
<p>Ultimately, Tika is one more piece of the puzzle when it comes to dealing with content, and it fits well with my <strong>personal</strong> vision (i.e. removing my Lucene PMC hat) of what Lucene is and should become.  Namely, as we move forward beyond just search (since search is a commodity these days, thanks to Lucene), it is important to have a whole suite of tools to bring to bear on the problem of dealing with structured and unstructured data.  Thus, things like Lucene, Solr, Carrot2, UIMA, Mahout, Tika, OpenNLP and other tools should all be easily usable by text tamers (riffing on my &#8220;<a href="http://www.manning.com/ingersoll">Taming Text</a>&#8221; theme&#8230;)  in creating intelligent applications.  As Lucene continues to develop and grow, it should become easier and easier to build things using the Lucene Stack, which should spur a new wave of ideas and opportunities for those paying attention.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Some New Features in Solr</title>
		<link>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 12:41:08 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[spell checking]]></category>
		<category><![CDATA[Taming Text]]></category>
		<category><![CDATA[term vectors]]></category>
		<category><![CDATA[tokenization]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=116</guid>
		<description><![CDATA[I&#8217;ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr. First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist of it is that Solr can now serve up term vectors from Lucene.  For [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr.</p>
<p>First off is <a href="https://issues.apache.org/jira/browse/SOLR-651">SOLR-651</a>, which implements what I am calling a <a href="http://wiki.apache.org/solr/TermVectorComponent">Term Vector Component</a>. The basic gist of it is that Solr can now serve up term vectors from Lucene.  For the uninitiated, term vectors store the term, term frequency and, optionally, position and offset information in a document-centric way in Lucene (as opposed to the inverted index storage used for searching).  Term vectors are often useful for doing things besides search, like highlighting, machine learning and document-document similarity.  This component can provide:</p>
<ol>
<li>Term</li>
<li>Term Frequency</li>
<li>Position (based on analysis)</li>
<li>Offset (character based)</li>
<li>IDF &#8211; Inverse Document Frequency</li>
</ol>
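<p>As a rough illustration of what those five pieces of information look like for a single document, here is a small Python sketch. It is not the Solr component or Lucene's storage format: the tokenizer is a naive regular expression, and the IDF formula is one common smoothed variant chosen as an assumption.</p>

```python
import math
import re


def term_vector(text):
    """Map each term to its frequency, token positions and character offsets."""
    vector = {}
    # Lowercasing keeps offsets valid for plain ASCII text like the samples below.
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        entry = vector.setdefault(
            match.group(), {"tf": 0, "positions": [], "offsets": []}
        )
        entry["tf"] += 1
        entry["positions"].append(position)
        entry["offsets"].append((match.start(), match.end()))
    return vector


def idf(term, vectors):
    """Smoothed inverse document frequency: log(N / (1 + df)) + 1 (an assumed variant)."""
    df = sum(1 for v in vectors if term in v)
    return math.log(len(vectors) / (1 + df)) + 1


# Hypothetical two-document corpus.
docs = ["Solr serves term vectors", "Term vectors store term frequency"]
vectors = [term_vector(d) for d in docs]
```

<p>Note that the data is keyed by document first and term second, which is the document-centric view described above, the opposite of an inverted index.</p>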
<p>Combining all of these things, plus a couple of other features, can, I think, really enable Solr to act as a more general text server (which is what <a href="http://www.manning.com/ingersoll">Taming Text</a> is going to show).  For instance, the Analysis Request Handler can act as a Document Analyzer server, and the Luke Request Handler can provide all kinds of corpus statistics.  And I haven&#8217;t even mentioned search, faceting and spell checking yet.  Nor have I mentioned the other thing I am working on:  adding search-result and document clustering to Solr.  This is taking place on <a href="https://issues.apache.org/jira/browse/SOLR-769">SOLR-769</a>.  The basic implementation I have now does search result clustering using the <a href="http://project.carrot2.org/">Carrot2</a> open source project.  After that, I plan on adding in Mahout for document-based clustering.  I also know that Tom Morton, for Taming Text, has added <a href="http://opennlp.sourceforge.net/">OpenNLP</a>&#8217;s Named Entity Recognition to Solr.  At some point in the near future, I&#8217;ll put up a link to that code.</p>
<p>Bottom line: Solr ain&#8217;t just for search anymore!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Jeff Eastman&#8217;s Marvelous Cloud Computing Adventure</title>
		<link>http://lucene.grantingersoll.com/2008/03/28/jeff-eastmans-marvelous-cloud-computing-adventure/</link>
		<comments>http://lucene.grantingersoll.com/2008/03/28/jeff-eastmans-marvelous-cloud-computing-adventure/#comments</comments>
		<pubDate>Fri, 28 Mar 2008 11:51:22 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Map Reduce]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/03/28/jeff-eastmans-marvelous-cloud-computing-adventure/</guid>
		<description><![CDATA[Jeff Eastman&#8217;s Marvelous Cloud Computing Adventure Mahout&#8217;s newest committer, Jeff Eastman, has a new blog on Mahout and Hadoop&#8230;]]></description>
			<content:encoded><![CDATA[<p><a href="http://jeffeastman.blogspot.com/">Jeff Eastman&#8217;s Marvelous Cloud Computing Adventure</a></p>
<p>Mahout&#8217;s newest committer, Jeff Eastman, has a new blog on Mahout and Hadoop&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/03/28/jeff-eastmans-marvelous-cloud-computing-adventure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mahout: k-means Clustering</title>
		<link>http://lucene.grantingersoll.com/2008/03/01/mahout-k-means-clustering/</link>
		<comments>http://lucene.grantingersoll.com/2008/03/01/mahout-k-means-clustering/#comments</comments>
		<pubDate>Sat, 01 Mar 2008 13:00:18 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[kMeans clustering]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Map Reduce]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/03/01/mahout-k-means-clustering/</guid>
		<description><![CDATA[I committed a first crack at k-means clustering to Mahout last night, thanks again to Jeff Eastman&#8217;s excellent work.  This means Mahout now has two clustering algorithms designed to run on Hadoop&#8217;s MapReduce framework, so it should be able to scale up to very large data sets. To learn more about k-means, see the [...]]]></description>
			<content:encoded><![CDATA[<p>I committed a first crack at k-means clustering to <a href="http://lucene.apache.org/mahout">Mahout</a> last night, thanks again to Jeff Eastman&#8217;s excellent <a href="https://issues.apache.org/jira/browse/MAHOUT-5">work</a>.  This means Mahout now has two clustering algorithms designed to run on <a href="http://hadoop.apache.org">Hadoop</a>&#8217;s MapReduce framework, so it should be able to scale up to very large data sets.</p>
<p>To learn more about k-means, see the Mahout <a href="http://cwiki.apache.org/MAHOUT">wiki</a>, specifically our page on <a href="http://cwiki.apache.org/MAHOUT/k-means.html">k-means</a>.</p>
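<p>If you are wondering how k-means fits the map/reduce mold at all, one iteration can be sketched as below. This is plain single-process Python written for illustration, not the Mahout or Hadoop code, and all names in it are hypothetical. The map step emits each point keyed by its nearest centroid, and the reduce step averages the points that share a key.</p>

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_point(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: average all points that share a key into a new centroid."""
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map, group and reduce pass and return the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for key, point in (map_point(p, centroids) for p in points):
        groups[key].append(point)
    # A centroid that attracted no points is left where it was.
    return [
        reduce_cluster(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


# Hypothetical toy data.
points = [(1.0, 1.0), (2.0, 2.0), (9.0, 9.0), (11.0, 11.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
updated = kmeans_iteration(points, centroids)
```

<p>Repeating the iteration until the centroids stop moving gives the full algorithm. In a distributed setting, the map calls and the per-cluster reduce calls are the parts that would be spread across machines.</p>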
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/03/01/mahout-k-means-clustering/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Mahout&#8217;s First Commit</title>
		<link>http://lucene.grantingersoll.com/2008/02/19/mahouts-first-commit/</link>
		<comments>http://lucene.grantingersoll.com/2008/02/19/mahouts-first-commit/#comments</comments>
		<pubDate>Wed, 20 Feb 2008 04:39:26 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[canopy clustering]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Map Reduce]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/02/19/mahouts-first-commit/</guid>
		<description><![CDATA[I have committed Mahout&#8217;s first Hadoop-based machine learning code: https://issues.apache.org/jira/browse/MAHOUT-3 The code is an initial implementation of Canopy clustering. It is a start and it is great to see others jump right in and start adding code!  Great work, Jeff Eastman, who contributed the initial implementation! Now, we can start building more goodness in [...]]]></description>
			<content:encoded><![CDATA[<p>I have committed Mahout&#8217;s first Hadoop-based machine learning code: <a href="https://issues.apache.org/jira/browse/MAHOUT-3">https://issues.apache.org/jira/browse/MAHOUT-3</a><br />
The code is an initial implementation of <a href="http://cwiki.apache.org/MAHOUT/canopy-clustering.html">Canopy clustering</a>. It is a start and it is great to see others jump right in and start adding code!  Great work, Jeff Eastman, who contributed the initial implementation!</p>
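<p>For anyone unfamiliar with canopy clustering, here is a minimal single-machine sketch in Python of the algorithm as it is commonly described, with a loose threshold T1 and a tight threshold T2. It is not the committed Mahout implementation; the function names, the distance measure and the sample points are assumptions for illustration.</p>

```python
def canopy_clusters(points, t1, t2, distance):
    """Group points into possibly overlapping canopies; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 (loose threshold) must be greater than t2 (tight threshold)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # any remaining point can seed a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)  # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly close: stays available
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


def manhattan(a, b):
    """A cheap distance measure, which is the kind canopy clustering favors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical points on a line.
points = [(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)]
canopies = canopy_clusters(points, t1=4, t2=2, distance=manhattan)
```

<p>In this toy run the point (3, 0) lands in the first canopy and also seeds its own, which shows that canopies are allowed to overlap.</p>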
<p>Now, we can start building more goodness in order to make Mahout world class when it comes to machine learning.  Now people actually have some real code they can check out and compile!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/02/19/mahouts-first-commit/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

