<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; Manning</title>
	<atom:link href="http://lucene.grantingersoll.com/category/manning/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Mon, 06 Feb 2012 12:07:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Manning: Mahout in Action</title>
		<link>http://lucene.grantingersoll.com/2009/12/29/manning-mahout-in-action/</link>
		<comments>http://lucene.grantingersoll.com/2009/12/29/manning-mahout-in-action/#comments</comments>
		<pubDate>Tue, 29 Dec 2009 20:33:44 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=310</guid>
		<description><![CDATA[Very cool, Manning has already put up the first 6 chapters of Mahout in Action.]]></description>
			<content:encoded><![CDATA[<p>Very cool, Manning has already put up the first 6 chapters of <a href="http://www.manning.com/affiliate/idevaffiliate.php?id=1069_219">Mahout in Action</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/12/29/manning-mahout-in-action/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Congrats to Tika and Welcome to the Lucene Stack!</title>
		<link>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/</link>
		<comments>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/#comments</comments>
		<pubDate>Thu, 13 Nov 2008 15:43:35 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[OpenNLP]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=130</guid>
		<description><![CDATA[Congratulations to Apache Tika (never mind the incubator address, it&#8217;s still in the process of migrating) for graduating from Incubation!   And welcome to the Lucene project!  Tika is a content extraction framework that wraps many other content extraction libraries such as PDFBox, POI, and others into a single, easy-to-use framework that makes it easy [...]]]></description>
			<content:encoded><![CDATA[<p>Congratulations to <a href="http://incubator.apache.org/tika">Apache Tika</a> (never mind the incubator address, it&#8217;s still in the process of migrating) for graduating from Incubation!   And welcome to the Lucene project!  Tika is a content extraction framework that wraps many other content extraction libraries such as <a href="http://incubator.apache.org/projects/pdfbox.html">PDFBox</a>, <a href="http://poi.apache.org">POI</a>, and others into a single, easy-to-use framework that makes it easy to add extracted content to Lucene, Solr and any other text application.  This is something many of us do whenever we work with file formats, and it also relates to one of the most frequently asked questions on the user mailing lists (that is, &#8220;how do I get text from Word/PDF/Excel?&#8221;).</p>
<p>Tika&#8217;s interface is very similar to SAX, so it is easy to think about extraction in terms of receiving SAX events, just as you do with XML.  This is also nice because extraction is therefore streaming and doesn&#8217;t require you to load a whole document into memory before dealing with it.</p>
<p>I&#8217;m now in the process of incorporating Tika into Solr (see <a href="https://issues.apache.org/jira/browse/SOLR-284">SOLR-284</a>) and I think we&#8217;ll eventually see it hooked into the Data Import Handler (DIH) in Solr, too, such that one could easily get content from a DB or a URL with the DIH, and then extract it if it is a binary object.</p>
<p>Ultimately, Tika is one more piece of the puzzle when it comes to dealing with content, and it fits well with my <strong>personal</strong> vision (i.e. removing my Lucene PMC hat) of what Lucene is and should become.  Namely, as we move forward beyond just search (since search is a commodity these days, thanks to Lucene), it is important to have a whole suite of tools to bring to bear on the problem of dealing with structured and unstructured data.  Thus, things like Lucene, Solr, Carrot2, UIMA, Mahout, Tika, OpenNLP and other tools should all be easily usable by text tamers (riffing on my &#8220;<a href="http://www.manning.com/ingersoll">Taming Text</a>&#8221; theme&#8230;) in creating intelligent applications.  As Lucene continues to develop and grow, it should become easier and easier to build things using the Lucene Stack, which should spur a new wave of ideas and opportunities for those paying attention.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Charlotte JUG » October Slides Available &#8211; Search &amp; Analysis</title>
		<link>http://lucene.grantingersoll.com/2008/10/24/charlotte-jug-%c2%bb-october-slides-available-search-analysis/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/24/charlotte-jug-%c2%bb-october-slides-available-search-analysis/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 14:30:46 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Charlotte]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[Taming Text]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=120</guid>
		<description><![CDATA[Charlotte JUG » October Slides Available &#8211; Search &#38; Analysis Had a lot of fun at my recent talk at the Charlotte JUG.  They&#8217;ve got a good core of people and there was a lot of good discussion about the topic. Even managed to give away some free eBooks of &#8220;Taming Text&#8221;.  Wish I would [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.charlottejug.org/2008/10/19/october-meeting-recap/">Charlotte JUG » October Slides Available &#8211; Search &amp; Analysis</a></p>
<p>Had a lot of fun at my recent talk at the Charlotte JUG.  They&#8217;ve got a good core of people and there was a lot of good discussion about the topic. Even managed to give away some free eBooks of &#8220;<a href="http://www.manning.com/ingersoll">Taming Text</a>&#8221;.  Wish I would have had time for demos, but as it was there was a lot to cover.  At any rate, the link has a copy of the slides I presented.</p>
<p>I would also be remiss if I did not mention the food we had at <a href="http://www.cajunqueen.net/">Cajun Queen</a> beforehand&#8230;  Could have stayed there longer and enjoyed the music and a few frothy brews&#8230;  Next time&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/24/charlotte-jug-%c2%bb-october-slides-available-search-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Some New Features in Solr</title>
		<link>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 12:41:08 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[spell checking]]></category>
		<category><![CDATA[Taming Text]]></category>
		<category><![CDATA[term vectors]]></category>
		<category><![CDATA[tokenization]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=116</guid>
		<description><![CDATA[I&#8217;ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr. First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist of it is that Solr can now serve up term vectors from Lucene.  For [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr.</p>
<p>First off is <a href="https://issues.apache.org/jira/browse/SOLR-651">SOLR-651</a>, which implements what I am calling a <a href="http://wiki.apache.org/solr/TermVectorComponent">Term Vector Component</a>. The basic gist of it is that Solr can now serve up term vectors from Lucene.  For the uninitiated, term vectors store the term, term frequency and, optionally, position and offset information in a document-centric way in Lucene (as opposed to the inverted index storage used for searching).  Term vectors are often useful for doing things besides search, such as highlighting, machine learning, and document-document similarity.  This component can provide:</p>
<ol>
<li>Term</li>
<li>Term Frequency</li>
<li>Position (based on analysis)</li>
<li>Offset (character based)</li>
<li>IDF &#8211; Inverse Document Frequency</li>
</ol>
<p>Combining all of these things, plus a couple of other features, I think, can really enable Solr to act as a more general text server (which is what <a href="http://www.manning.com/ingersoll">Taming Text</a> is going to show).  For instance, the Analysis Request Handler can act as a Document Analyzer server, and the Luke Request Handler can provide all kinds of corpus statistics.  And I haven&#8217;t even mentioned search, faceting and spell checking yet.  Nor have I mentioned the other thing I am working on:  adding search-result and document clustering to Solr.  This is taking place on <a href="https://issues.apache.org/jira/browse/SOLR-769">SOLR-769</a>.  The basic implementation I have now does search result clustering using the <a href="http://project.carrot2.org/">Carrot2</a> open source project.  After that, I plan on adding in Mahout for document-based clustering.  I also know that Tom Morton, for Taming Text, has added <a href="http://opennlp.sourceforge.net/">OpenNLP</a>&#8217;s Named Entity Recognition to Solr.  At some point in the near future, I&#8217;ll put up a link to that code.</p>
<p>Bottom line: Solr ain&#8217;t just for search anymore!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Charlotte JUG » OCT 15TH &#8211; 6PM &#8211; Search and Text Analysis</title>
		<link>http://lucene.grantingersoll.com/2008/10/01/charlotte-jug-%c2%bb-oct-15th-6pm-search-and-text-analysis/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/01/charlotte-jug-%c2%bb-oct-15th-6pm-search-and-text-analysis/#comments</comments>
		<pubDate>Wed, 01 Oct 2008 12:36:02 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Charlotte]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[North Carolina]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=112</guid>
		<description><![CDATA[Charlotte JUG » OCT 15TH &#8211; 6PM &#8211; Search and Text Analysis I will be speaking at the Charlotte Java Users Group on Oct. 15th, covering things like Lucene, Solr, OpenNLP and Mahout, amongst other things.  Basically, a high-level talk on my book.]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.charlottejug.org/2008/09/30/oct-15th-6pm-search-and-text-analysis/">Charlotte JUG » OCT 15TH &#8211; 6PM &#8211; Search and Text Analysis</a></p>
<p>I will be speaking at the Charlotte Java Users Group on Oct. 15th, covering things like Lucene, Solr, OpenNLP and Mahout, amongst other things.  Basically, a high-level talk on my book.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/01/charlotte-jug-%c2%bb-oct-15th-6pm-search-and-text-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Manning: Taming Text</title>
		<link>http://lucene.grantingersoll.com/2008/04/28/manning-taming-text/</link>
		<comments>http://lucene.grantingersoll.com/2008/04/28/manning-taming-text/#comments</comments>
		<pubDate>Mon, 28 Apr 2008 16:03:17 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=77</guid>
		<description><![CDATA[Manning: Taming Text Scary&#8230;  I guess it is real!]]></description>
			<content:encoded><![CDATA[<p><a href="http://manning.com/ingersoll/">Manning: Taming Text</a></p>
<p>Scary&#8230;  I guess it is real!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/04/28/manning-taming-text/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

