<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; OpenNLP</title>
	<atom:link href="http://lucene.grantingersoll.com/category/opennlp/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Tue, 31 Aug 2010 14:45:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Congrats to Tika and Welcome to the Lucene Stack!</title>
		<link>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/</link>
		<comments>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/#comments</comments>
		<pubDate>Thu, 13 Nov 2008 15:43:35 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[OpenNLP]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=130</guid>
		<description><![CDATA[Congratulations to Apache Tika (never mind the incubator address; it&#8217;s still in the process of migrating) for graduating from Incubation! And welcome to the Lucene project! Tika is a content extraction framework that wraps many other content extraction libraries such as PDFBox, POI, and others into a single, easy to use framework that makes it easy [...]]]></description>
			<content:encoded><![CDATA[<p>Congratulations to <a href="http://incubator.apache.org/tika">Apache Tika</a> (never mind the incubator address; it&#8217;s still in the process of migrating) for graduating from Incubation! And welcome to the Lucene project! Tika is a content extraction framework that wraps many other content extraction libraries, such as <a href="http://incubator.apache.org/projects/pdfbox.html">PDFBox</a>, <a href="http://poi.apache.org">POI</a> and others, into a single, easy-to-use framework that simplifies adding extracted content to Lucene, Solr and any other text application. This is something many of us do whenever we work with file formats, and it also relates to one of the most frequently asked questions on the user mailing lists (that is, &#8220;how do I get text from Word/PDF/Excel?&#8221;).</p>
<p>Tika&#8217;s interface is very similar to SAX, so it is easy to think about extraction in terms of receiving SAX events, just as you do with XML. That is also nice because extraction is streaming and doesn&#8217;t require you to load a whole document into memory before dealing with it.</p>
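<p>To make the SAX analogy concrete, here is a minimal sketch of the event-driven style using only the JDK&#8217;s own SAX parser on a small XML string. It is not Tika code: the class and method names (StreamingTextDemo, TextCollector, extractText) are made up for illustration, and the only connection to Tika assumed here is that its parsers report content to an org.xml.sax.ContentHandler in the same callback fashion.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: plain JDK SAX, no Tika dependency.
public class StreamingTextDemo {

    /** Collects character data as the parser reports it, one event at a time. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Called for each chunk of text; no document tree is ever built.
            text.append(ch, start, length);
        }

        String text() {
            return text.toString();
        }
    }

    /** Streams the given XML through a SAX parser and returns its text content. */
    static String extractText(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText("<doc><p>Hello</p> <p>Tika</p></doc>"));
    }
}
```

<p>The handler only ever sees one chunk of characters at a time, which is what keeps memory use flat on large inputs. A handler written this way should carry over to content extraction with little change, since only the source of the events differs.</p>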
<p>I&#8217;m now in the process of incorporating Tika into Solr (see <a href="https://issues.apache.org/jira/browse/SOLR-284">SOLR-284</a>), and I think we&#8217;ll eventually see it hooked into the Data Import Handler (DIH) in Solr, too, so that one could easily get content from a DB or a URL with the DIH and then extract it if it is a binary object.</p>
<p>Ultimately, Tika is one more piece of the puzzle when it comes to dealing with content, and it fits well with my <strong>personal</strong> vision (i.e., with my Lucene PMC hat off) of what Lucene is and should become. Namely, as we move beyond just search (since search is a commodity these days, thanks to Lucene), it is important to have a whole suite of tools to bring to bear on the problem of dealing with structured and unstructured data. Thus, things like Lucene, Solr, Carrot2, UIMA, Mahout, Tika, OpenNLP and other tools should all be easily usable by text tamers (riffing on my &#8220;<a href="http://www.manning.com/ingersoll">Taming Text</a>&#8221; theme&#8230;) in creating intelligent applications. As Lucene continues to develop and grow, it should become easier and easier to build things using the Lucene Stack, which should spur a new wave of ideas and opportunities for those paying attention.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
