<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; Taming Text</title>
	<atom:link href="http://lucene.grantingersoll.com/category/taming-text/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Wed, 18 Jan 2012 13:33:40 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Taming Text Update</title>
		<link>http://lucene.grantingersoll.com/2011/12/27/taming-text-update/</link>
		<comments>http://lucene.grantingersoll.com/2011/12/27/taming-text-update/#comments</comments>
		<pubDate>Tue, 27 Dec 2011 13:45:39 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[OpenNLP]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=460</guid>
		<description><![CDATA[Drew, Tom and I are feverishly working away on finishing up Taming Text.  We are currently in the process of addressing the feedback we got from our final review and should have updates up soon.  I have also posted all of the book&#8217;s source code up on GitHub under the Taming Text user.  The source includes, [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft" title="Taming Text book cover" src="http://manning.com/ingersoll/ingersoll_cover150.jpg" alt="" width="150" height="188" /></p>
<p>Drew, Tom and I are feverishly working away on finishing up <a href="http://www.manning.com/affiliate/idevaffiliate.php?id=1069_148">Taming Text</a>.  We are currently in the process of addressing the feedback we got from our final review and should have updates up soon.  I have also posted all of the book&#8217;s source code up on GitHub under the <a href="http://www.github.com/tamingtext">Taming Text user</a>.  The source includes, amongst other things, a simple Question Answering system using Solr and OpenNLP, as well as analyzers for Lucene that use OpenNLP for sentence detection, part-of-speech tagging and Named Entity Recognition.  As with most books, these examples are meant to be just that, examples.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2011/12/27/taming-text-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mahout Update</title>
		<link>http://lucene.grantingersoll.com/2009/02/09/mahout-update/</link>
		<comments>http://lucene.grantingersoll.com/2009/02/09/mahout-update/#comments</comments>
		<pubDate>Mon, 09 Feb 2009 14:42:25 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=148</guid>
		<description><![CDATA[It&#8217;s been a while since I reported anything on Mahout (here&#8217;s why), but thought I would give an update.  I know it&#8217;s been promised before, but the committers have been diligently working on a 0.1 release, which should be out very soon.  I think I have all the Maven release stuff in place and am [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a while since I reported anything on Mahout (here&#8217;s <a href="http://lucene.grantingersoll.com/2009/01/26/lucid-imagination/">why</a>), but thought I would give an update.  I know it&#8217;s been promised before, but the committers have been diligently working on a 0.1 release, which should be out very soon.  I think I have all the Maven release stuff in place and am now testing and verifying the release candidate.  Once that&#8217;s done, I&#8217;ll post an RC for vote and then we should be able to release.</p>
<p>Going forward, there should be several new algorithms going in post 0.1, as Isabel Drost has added some code for Winnow and Perceptron implementations and I believe Karl Wettin has some work on hierarchical clustering in place.  I also believe Ted Dunning and Jeff Eastman are working on Dirichlet clustering.  Finally, Sean Owen is always rocking on Taste&#8217;s collaborative filtering capabilities, so there will no doubt be more goodness in that regard.  As for me, I&#8217;m working on integrating clustering (Carrot2 and Mahout) into Solr and will be writing a chapter on doing text clustering in <a href="http://www.manning.com/ingersoll">Taming Text</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/02/09/mahout-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Congrats to Tika and Welcome to the Lucene Stack!</title>
		<link>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/</link>
		<comments>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/#comments</comments>
		<pubDate>Thu, 13 Nov 2008 15:43:35 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[OpenNLP]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=130</guid>
		<description><![CDATA[Congratulations to Apache Tika (never mind the incubator address, it&#8217;s still in the process of migrating) for graduating from Incubation!  And welcome to the Lucene project!  Tika is a content extraction framework that wraps many other content extraction libraries such as PDFBox, POI, and others into a single, easy to use framework that makes it easy [...]]]></description>
			<content:encoded><![CDATA[<p>Congratulations to <a href="http://incubator.apache.org/tika">Apache Tika</a> (never mind the incubator address, it&#8217;s still in the process of migrating) for graduating from Incubation!  And welcome to the Lucene project!  Tika is a content extraction framework that wraps many other content extraction libraries, such as <a href="http://incubator.apache.org/projects/pdfbox.html">PDFBox</a>, <a href="http://poi.apache.org">POI</a>, and others, into a single, easy-to-use framework that makes it simple to add extracted content to Lucene, Solr and any other text application.  This is something many of us do whenever we work with file formats, and it also relates to one of the most frequently asked questions on the user mailing lists (that is, &#8220;how do I get text from Word/PDF/Excel?&#8221;).</p>
<p>Tika&#8217;s interface is very similar to SAX, so it is easy to think about extraction in terms of receiving SAX events, just as you do with XML.  That is also nice because extraction is then streaming and doesn&#8217;t require you to load a whole document into memory before dealing with it.</p>
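<p>To make the SAX comparison concrete, here is a minimal sketch that uses only the plain JDK SAX API, not Tika itself.  The <code>TextCollector</code> class and its <code>extract</code> helper are names made up for this illustration; the point is only that a handler receives character events as the input streams by and can accumulate text without ever holding a full document tree in memory.</p>

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class: collects text from SAX character events.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data; a single text
    // node may arrive in several chunks, so we simply append each one.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Hypothetical helper: parses an XML string and returns its text content.
    static String extract(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance()
            .newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        // prints "Hello world"
        System.out.println(extract("<doc><p>Hello</p><p> world</p></doc>"));
    }
}
```

<p>A real extraction setup would presumably hand a handler like this to a parser for Word, PDF or Excel rather than to an XML parser, but the event-driven shape of the code stays the same.</p>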
<p>I&#8217;m now in the process of incorporating Tika into Solr (see <a href="https://issues.apache.org/jira/browse/SOLR-284">SOLR-284</a>) and I think we&#8217;ll eventually see it hooked into the Data Import Handler (DIH) in Solr, too, such that one could easily get content from a DB or a URL with the DIH, and then extract it if it is a binary object.</p>
<p>Ultimately, Tika is one more piece to the puzzle when it comes to dealing with content and it fits well with my <strong>personal</strong> vision (i.e. removing my Lucene PMC hat) of what Lucene is and should become.  Namely, as we move forward beyond just search (since search is a commodity these days, thanks to Lucene), it is important to have a whole suite of tools to bring to bear on the problem of dealing with structured and unstructured data.  Thus, things like Lucene, Solr, Carrot2, UIMA, Mahout, Tika, OpenNLP and other tools all should be easily usable by text tamers (riffing on my &#8220;<a href="http://www.manning.com/ingersoll">Taming Text</a>&#8221; theme&#8230;)  in creating intelligent applications.  As Lucene continues to develop and grow, it should become easier and easier to build things using the Lucene Stack which should spur a new wave of ideas and opportunities for those paying attention.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Charlotte JUG » October Slides Available &#8211; Search &amp; Analysis</title>
		<link>http://lucene.grantingersoll.com/2008/10/24/charlotte-jug-%c2%bb-october-slides-available-search-analysis/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/24/charlotte-jug-%c2%bb-october-slides-available-search-analysis/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 14:30:46 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Charlotte]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[Taming Text]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=120</guid>
		<description><![CDATA[Charlotte JUG » October Slides Available &#8211; Search &#38; Analysis Had a lot of fun at my recent talk at the Charlotte JUG.  They&#8217;ve got a good core of people and there was a lot of good discussion about the topic. Even managed to give away some free eBooks of &#8220;Taming Text&#8221;.  Wish I would [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.charlottejug.org/2008/10/19/october-meeting-recap/">Charlotte JUG » October Slides Available &#8211; Search &amp; Analysis</a></p>
<p>Had a lot of fun at my recent talk at the Charlotte JUG.  They&#8217;ve got a good core of people and there was a lot of good discussion about the topic. Even managed to give away some free eBooks of &#8220;<a href="http://www.manning.com/ingersoll">Taming Text</a>&#8221;.  Wish I had had time for demos, but as it was there was a lot to cover.  At any rate, the link has a copy of the slides I presented.</p>
<p>I would also be remiss if I did not mention the food we had at <a href="http://www.cajunqueen.net/">Cajun Queen</a> beforehand&#8230;  Could have stayed there longer and enjoyed the music and a few frothy brews&#8230;  Next time&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/24/charlotte-jug-%c2%bb-october-slides-available-search-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Some New Features in Solr</title>
		<link>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 12:41:08 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[spell checking]]></category>
		<category><![CDATA[Taming Text]]></category>
		<category><![CDATA[term vectors]]></category>
		<category><![CDATA[tokenization]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=116</guid>
		<description><![CDATA[I&#8217;ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr. First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist of it is that Solr can now serve up term vectors from Lucene.  For [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr.</p>
<p>First off is <a href="https://issues.apache.org/jira/browse/SOLR-651">SOLR-651</a>, which implements what I am calling a <a href="http://wiki.apache.org/solr/TermVectorComponent">Term Vector Component</a>. The basic gist of it is that Solr can now serve up term vectors from Lucene.  For those not initiated, term vectors store the term, term frequency and, optionally, position and offset information in a document-centric way in Lucene (as opposed to the inverted index storage used for searching).  Term vectors are often useful for doing things besides search, such as highlighting, machine learning and document-document similarity.  This component can provide:</p>
<ol>
<li>Term</li>
<li>Term Frequency</li>
<li>Position (based on analysis)</li>
<li>Offset (character based)</li>
<li>IDF &#8211; Inverse Document Frequency</li>
</ol>
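<p>For those who want to see what that per-document information looks like, here is a rough sketch in plain Java.  It is not Lucene or Solr code: <code>TermVectorSketch</code>, <code>TermInfo</code> and <code>build</code> are hypothetical names, and the tokenization (lowercased runs of word characters) is an assumption standing in for a real analyzer.  IDF is left out because it needs corpus-wide document counts rather than a single document.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class: builds a term vector for one document.
class TermVectorSketch {

    // Per-term statistics for a single document.
    static final class TermInfo {
        int frequency;                                   // term frequency
        final List<Integer> positions = new ArrayList<>(); // token positions
        final List<int[]> offsets = new ArrayList<>();     // {start, end} character offsets
    }

    // Assumed analysis: each run of word characters is a token, lowercased.
    static Map<String, TermInfo> build(String text) {
        Map<String, TermInfo> vector = new LinkedHashMap<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        int position = 0;
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            TermInfo info = vector.computeIfAbsent(term, t -> new TermInfo());
            info.frequency++;
            info.positions.add(position++);
            info.offsets.add(new int[] {m.start(), m.end()});
        }
        return vector;
    }

    public static void main(String[] args) {
        build("Solr serves term vectors; term vectors help").forEach((term, info) ->
            System.out.println(term + " tf=" + info.frequency + " positions=" + info.positions));
    }
}
```

<p>Note that the map is keyed by term within one document, which is the document-centric view; an inverted index would instead map each term to the documents containing it.</p>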
<p>Combining all of these things, plus a couple of other features, I think, can really enable Solr to act as a more general Text server (which is what <a href="http://www.manning.com/ingersoll">Taming Text</a> is going to show).  For instance, the Analysis Request Handler can act as a Document Analyzer server, and the Luke Request Handler can provide all kinds of corpus statistics.  And I haven&#8217;t even mentioned search, faceting and spell checking yet.  Nor have I mentioned the other thing I am working on:  adding search-result and document clustering to Solr.  This is taking place on <a href="https://issues.apache.org/jira/browse/SOLR-769">SOLR-769</a>.  The basic implementation I have now does search result clustering using the <a href="http://project.carrot2.org/">Carrot2</a> open source project.  After that, I plan on adding in Mahout for document-based clustering.  I also know that Tom Morton, for Taming Text, has added <a href="http://opennlp.sourceforge.net/">OpenNLP</a>&#8217;s Named Entity Recognition into Solr.  At some point in the near future, I&#8217;ll put up a link to that code.</p>
<p>Bottom line: Solr ain&#8217;t just for search anymore!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Charlotte JUG » OCT 15TH &#8211; 6PM &#8211; Search and Text Analysis</title>
		<link>http://lucene.grantingersoll.com/2008/10/01/charlotte-jug-%c2%bb-oct-15th-6pm-search-and-text-analysis/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/01/charlotte-jug-%c2%bb-oct-15th-6pm-search-and-text-analysis/#comments</comments>
		<pubDate>Wed, 01 Oct 2008 12:36:02 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Charlotte]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[North Carolina]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=112</guid>
		<description><![CDATA[Charlotte JUG » OCT 15TH &#8211; 6PM &#8211; Search and Text Analysis I will be speaking at the Charlotte Java Users Group on Oct. 15th, covering things like Lucene, Solr, OpenNLP and Mahout, amongst other things.  Basically, a high-level talk on my book.]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.charlottejug.org/2008/09/30/oct-15th-6pm-search-and-text-analysis/">Charlotte JUG » OCT 15TH &#8211; 6PM &#8211; Search and Text Analysis</a></p>
<p>I will be speaking at the Charlotte Java Users Group on Oct. 15th, covering things like Lucene, Solr, OpenNLP and Mahout, amongst other things.  Basically, a high-level talk on my book.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/01/charlotte-jug-%c2%bb-oct-15th-6pm-search-and-text-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Opening up Academic Research on IR and Machine Learning</title>
		<link>http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/</link>
		<comments>http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/#comments</comments>
		<pubDate>Thu, 18 Sep 2008 17:09:04 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Taming Text]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=110</guid>
		<description><![CDATA[Kudos to Dr. Ted Pedersen for finally saying out loud (in the latest issue of Computational Linguistics, thanks to Bob Carpenter for the pointer) what I&#8217;ve long thought about academic publications on topics like information retrieval and machine learning:  namely, publishing empirical results about software systems without publishing the software is a disservice to [...]]]></description>
			<content:encoded><![CDATA[<p>Kudos to <a href="http://www.d.umn.edu/~tpederse/">Dr. Ted Pedersen</a> for finally saying <a href="http://www.d.umn.edu/~tpederse/Pubs/pedersen-last-word-2008.pdf">out loud</a> (in the latest issue of <em>Computational Linguistics</em>, thanks to <a href="http://lingpipe-blog.com/2008/09/15/dolores-labs-text-entailment-data-from-amazon-mechanical-turk/">Bob Carpenter</a> for the pointer) what I&#8217;ve long thought about academic publications on topics like information retrieval and machine learning:  namely, publishing empirical results about software systems without publishing the software is a disservice to the community at best, and pointless at worst.  It hinders learning and it hinders the furthering of the field.  It accounts for a good chunk of the reason I started the <a href="http://lucene.apache.org/mahout">Mahout</a> project (now we can say &#8220;download Mahout and run X&#8221;), partially explains why I&#8217;m writing <a href="http://manning.com/ingersoll">Taming Text</a> and <a href="http://www.paperoftheweek.com">Paper of the Week</a> (now we can say &#8220;here&#8217;s how this stuff really works in practice&#8221;) and also why I wanted Lucene to have a built-in benchmarking tool where people can publish their configurations so others can try them out and repeat them.</p>
<p>If you ever read papers in this field, you quickly notice they all have this nice theory and they make all these nice (grand?) claims, but at the end of the day, 99% of the interested population can&#8217;t reproduce them because they don&#8217;t have the software to do it.  As Dr. Pedersen also points out, oftentimes the creator can&#8217;t even reproduce it.  Even worse, sometimes they do publish the software, but it is abandoned or undocumented, and who knows what settings are required: the Professor has moved on (&#8220;I got a new grant!&#8221;), the Grad students have moved on (&#8220;I got my PhD.&#8221;) and most importantly, the funding has moved on (Gov&#8217;t Program Manager: &#8220;I had another successful project, time for a bigger budget!&#8221;).  Sorry for the cynicism&#8230;</p>
<p>In fact, I think one thing Mahout can really offer the likes of Researchers is that you can focus on the big ideas, and we&#8217;ll take care of making sure your prototype is scalable, documented and maintained!  Besides, wouldn&#8217;t you rather be a part of seeing people actually use your work instead of it just living on some piece of paper or locked away in your hard drive collecting cosmic dust?</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Manning: Taming Text</title>
		<link>http://lucene.grantingersoll.com/2008/04/28/manning-taming-text/</link>
		<comments>http://lucene.grantingersoll.com/2008/04/28/manning-taming-text/#comments</comments>
		<pubDate>Mon, 28 Apr 2008 16:03:17 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=77</guid>
		<description><![CDATA[Manning: Taming Text Scary&#8230;  I guess it is real!]]></description>
			<content:encoded><![CDATA[<p><a href="http://manning.com/ingersoll/">Manning: Taming Text</a></p>
<p>Scary&#8230;  I guess it is real!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/04/28/manning-taming-text/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

