<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; Nutch</title>
	<atom:link href="http://lucene.grantingersoll.com/category/nutch/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Wed, 18 Jan 2012 13:33:40 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Join the Lucene Revolution in Boston October 2010 &#124; www.lucenerevolution.org</title>
		<link>http://lucene.grantingersoll.com/2010/05/17/join-the-lucene-revolution-in-boston-october-2010-www-lucenerevolution-org/</link>
		<comments>http://lucene.grantingersoll.com/2010/05/17/join-the-lucene-revolution-in-boston-october-2010-www-lucenerevolution-org/#comments</comments>
		<pubDate>Mon, 17 May 2010 12:45:38 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Open Relevance]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Real Time Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[spatial]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=368</guid>
		<description><![CDATA[Join the Lucene Revolution in Boston October 2010 &#124; www.lucenerevolution.org. Hope to see you in Boston!]]></description>
			<content:encoded><![CDATA[<p><a href="http://lucenerevolution.com/">Join the Lucene Revolution in Boston October 2010 | www.lucenerevolution.org</a>.</p>
<p>Hope to see you in Boston!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2010/05/17/join-the-lucene-revolution-in-boston-october-2010-www-lucenerevolution-org/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RedMonk Podcasts: Search as a database &#8211; Grant Ingersoll on Solr &amp; Lucene</title>
		<link>http://lucene.grantingersoll.com/2010/04/29/redmonk-podcasts-search-as-a-database-grant-ingersoll-on-solr-lucene/</link>
		<comments>http://lucene.grantingersoll.com/2010/04/29/redmonk-podcasts-search-as-a-database-grant-ingersoll-on-solr-lucene/#comments</comments>
		<pubDate>Thu, 29 Apr 2010 12:53:19 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=364</guid>
		<description><![CDATA[Really enjoyed my conversation with Michael Coté earlier in the week on NoSQL, Lucene/Solr as a persistence engine and a variety of other topics.  You can hear the conversation at: RedMonk Podcasts.]]></description>
			<content:encoded><![CDATA[<p>Really enjoyed my conversation with Michael Coté earlier in the week on NoSQL, Lucene/Solr as a persistence engine and a variety of other topics.  You can hear the conversation at: <a href="http://redmonk.libsyn.com/index.php?post_id=609599">RedMonk Podcasts</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2010/04/29/redmonk-podcasts-search-as-a-database-grant-ingersoll-on-solr-lucene/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache Lucene EuroCon: Prague May 18-21 2010</title>
		<link>http://lucene.grantingersoll.com/2010/03/24/apache-lucene-eurocon-prague-may-18-21-2010/</link>
		<comments>http://lucene.grantingersoll.com/2010/03/24/apache-lucene-eurocon-prague-may-18-21-2010/#comments</comments>
		<pubDate>Thu, 25 Mar 2010 00:09:53 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Europe]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=352</guid>
		<description><![CDATA[Just announced: Apache Lucene EuroCon &#8211; May 18-21 2010. CFP is now live for talks in Prague this May. I&#8217;m looking forward to my first visit to Prague and also looking forward to talking with all the great Lucene users over in Europe. Hope to see you there, Grant]]></description>
			<content:encoded><![CDATA[<p>Just announced: <a href="http://lucene-eurocon.org/">Apache Lucene EuroCon &#8211; May 18-21 2010</a>.</p>
<p>CFP is now live for talks in Prague this May.</p>
<p>I&#8217;m looking forward to my first visit to Prague and also looking forward to talking with all the great Lucene users over in Europe.</p>
<p>Hope to see you there,</p>
<p>Grant</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2010/03/24/apache-lucene-eurocon-prague-may-18-21-2010/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SFBay Apache Lucene/Solr Meetup Jan 21st.</title>
		<link>http://lucene.grantingersoll.com/2010/01/12/sfbay-apache-lucenesolr-meetup-jan-21st/</link>
		<comments>http://lucene.grantingersoll.com/2010/01/12/sfbay-apache-lucenesolr-meetup-jan-21st/#comments</comments>
		<pubDate>Tue, 12 Jan 2010 18:44:02 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Open Relevance]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=317</guid>
		<description><![CDATA[Details and RSVP at: SFBay Apache Lucene/Solr Meetup San Mateo, CA &#8211; Meetup.com.]]></description>
			<content:encoded><![CDATA[<p>Details and RSVP at: <a href="http://www.meetup.com/SFBay-Lucene-Solr-Meetup/">SFBay Apache Lucene/Solr Meetup San Mateo, CA &#8211; Meetup.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2010/01/12/sfbay-apache-lucenesolr-meetup-jan-21st/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SFBay Apache Lucene/Solr Meetup San Mateo, CA &#8211; Meetup.com</title>
		<link>http://lucene.grantingersoll.com/2009/05/23/sfbay-apache-lucenesolr-meetup-san-mateo-ca-meetupcom/</link>
		<comments>http://lucene.grantingersoll.com/2009/05/23/sfbay-apache-lucenesolr-meetup-san-mateo-ca-meetupcom/#comments</comments>
		<pubDate>Sat, 23 May 2009 11:29:19 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=193</guid>
		<description><![CDATA[SFBay Apache Lucene/Solr Meetup San Mateo, CA &#8211; Meetup.com. Lucene/Solr Meetup, June 3 http://www.meetup.com/SFBay-Lucene-Solr-Meetup/ Join us for an evening of presentations and discussion on Lucene/Solr/Nutch/Mahout (and the rest of the Lucene ecosystem), the Apache Open Source Search Engine/Platform, featuring: -Erik Hatcher, Apache Lucene/Solr PMC: Solr power your data: How to get up and running in [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.meetup.com/SFBay-Lucene-Solr-Meetup/">SFBay Apache Lucene/Solr Meetup San Mateo, CA &#8211; Meetup.com</a>.</p>
<p>Lucene/Solr Meetup, June 3</p>
<p>http://www.meetup.com/SFBay-Lucene-Solr-Meetup/</p>
<p>Join us for an evening of presentations and discussion on<br />
Lucene/Solr/Nutch/Mahout (and the rest of the Lucene ecosystem), the Apache Open Source Search Engine/Platform, featuring:</p>
<p>-Erik Hatcher, Apache Lucene/Solr PMC: Solr power<br />
your data: How to get up and running in 20 minutes or less<br />
-Grant Ingersoll, Apache Lucene/Solr PMC: New in Apache Solr 1.4 &#8212; faster performance, better replication, and more<br />
-Additional topics to be posted at the URL shortly</p>
<p>We&#8217;d also like to have 15-minute lightning talks where people present their uses of Lucene/Solr/Tika/Mahout/Nutch/Droids.</p>
<p>We&#8217;ll have some food and beverages.</p>
<p>RSVP &#8212; seats are limited &#8212; at http://www.meetup.com/SFBay-Lucene-Solr-Meetup/</p>
<p>Sponsored by: Lucid Imagination</p>
<p>Please email questions off list to talks@lucidimagination.com</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/05/23/sfbay-apache-lucenesolr-meetup-san-mateo-ca-meetupcom/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BarCamp wiki / BarCampRDU</title>
		<link>http://lucene.grantingersoll.com/2008/08/01/barcamp-wiki-barcamprdu/</link>
		<comments>http://lucene.grantingersoll.com/2008/08/01/barcamp-wiki-barcamprdu/#comments</comments>
		<pubDate>Fri, 01 Aug 2008 16:22:54 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[BarCampRDU]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Map Reduce]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Raleigh]]></category>
		<category><![CDATA[Triangle]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=91</guid>
		<description><![CDATA[BarCamp wiki / BarCampRDU I&#8217;ll be at BarCampRDU tomorrow.  I proposed two sessions, one on Hadoop and Mahout and one on Lucene and Solr.  I don&#8217;t think I really want to do both, but I would like to do at least one, so we&#8217;ll see what other people are interested in. If you&#8217;re around and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://barcamp.org/BarCampRDU">BarCamp wiki / BarCampRDU</a></p>
<p>I&#8217;ll be at BarCampRDU tomorrow.  I proposed two sessions, one on Hadoop and Mahout and one on Lucene and Solr.  I don&#8217;t think I really want to do both, but I would like to do at least one, so we&#8217;ll see what other people are interested in.</p>
<p>If you&#8217;re around and you want to talk about any of these things, track me down.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/08/01/barcamp-wiki-barcamprdu/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Open Source Search Engine Relevance</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/</link>
		<comments>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comments</comments>
		<pubDate>Sun, 18 May 2008 11:23:11 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[TREC]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81</guid>
		<description><![CDATA[For a while now, I have been trying to get my hands on TREC data for the Lucene project.  For those who aren&#8217;t familiar, TREC is an annual competition for search engines that provides a common set of documents to index, queries to execute and judgments to check your answers to see how well an [...]]]></description>
			<content:encoded><![CDATA[<p>For a while now, I have been trying to get my hands on <a href="http://trec.nist.gov/">TREC</a> data for the Lucene project.  For those who aren&#8217;t familiar, TREC is an annual competition for search engines that provides a common set of documents to index, queries to execute and judgments to check your answers to see how well an engine performs.  While it isn&#8217;t the be-all and end-all for relevance, it is a pretty good sanity check on how you are doing.  For instance, many search engines do OK out of the box on it, but once you tune them, they can do much better.  Of course, you risk overtuning to TREC as well.</p>
<p>In TREC, the queries and the judgments are provided for free, but one has to pay for the data, or at least most of it, since it is usually owned by Reuters or some other organization.  It isn&#8217;t expensive or anything, but it is a barrier nonetheless, especially for an open source project.  Furthermore, the whole notion of paying for data in this day and age of open source and Creative Commons just doesn&#8217;t sit right with me.   Don&#8217;t get me wrong: I&#8217;m a big fan of TREC, having participated in the past, and it provides a valuable service to the proprietary/academic IR community.</p>
<p>So, what does this have to do with Lucene?  When I say I am trying to get my hands on TREC data, I don&#8217;t mean just for me, I literally mean obtaining TREC data for Lucene.  That is, I want the data to be made available, ideally, for all Lucene (and, for that matter, all open source search engine) users to use and run experiments on so as to spur on innovation in Lucene&#8217;s scoring algorithms, etc.  Now, I know the copyright owners will never allow this, as I have asked.  So, my next thought was let&#8217;s just get it for internal use by committers at Apache.  So, I went back to TREC and we have an agreement to do this, more or less.  The problem, however, is that they say we can only use the data on ASF (Apache) machines.  Not a big deal, right?  Kind of.  The ASF doesn&#8217;t really have the hardware to run TREC-style experiments.  We pretty much have one Solaris &#8220;zone&#8221; allotted to us (a &#8220;zone&#8221; is a running virtual machine guest image).  Furthermore, the ASF is pretty much an all-volunteer, worldwide distributed organization.  We do almost all of our work on our own machines as VOLUNTEERS.   Practically speaking, the best way for any of us to take advantage of the data is to have it locally, which, I am told, isn&#8217;t going to happen.</p>
<p>So, what&#8217;s the point?  I think it is time the open source search community (and I don&#8217;t mean just Lucene) develop and publish a set of TREC-style relevance judgments for freely available data that is easily obtained from the Internet.  Simply put, I am wondering if there are volunteers out there who would be willing to develop a practical set of queries and judgments for datasets like Wikipedia, iBiblio, the Internet Archive, etc.  We wouldn&#8217;t host these datasets, we would just provide the queries and judgments, as well as the info on how to obtain the data.  Then, it is easy enough to provide simple scripts that do things like run Lucene&#8217;s contrib/benchmark Quality tasks against said data.</p>
<p>Practically speaking, I don&#8217;t think we even need to go as deep as TREC.  I think we would find the most use in making judgments on the top 10 or 20 results for any given query.</p>
<p>So, what do others think?  Am I off my rocker?  Are there any volunteers out there?  I think we could do this pretty simply through some scripts, and the effective use of a wiki.  I don&#8217;t think our goal is, in the short run, to be scientifically rigorous, but it should be over time.  Instead, I think our goal is to run a practical relevance test like any organization should when deploying search: take 50 (top) queries and judge them, as well as 20 or so random queries and judge them.  (I wonder if Wikipedia would give us their top 50 queries, or maybe they are already available.)  Over time, we can add queries, and refine judgments using the web 2.0 mentality of the wisdom of crowds.</p>
<p>FWIW, there is probably some alignment with the <a href="http://search.wikia.com/wiki/Search_Wikia">Wikia</a> search project.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>FeatherCast » Blog Archive » Episode 43: Lucene</title>
		<link>http://lucene.grantingersoll.com/2008/02/21/feathercast-%c2%bb-blog-archive-%c2%bb-episode-43-lucene/</link>
		<comments>http://lucene.grantingersoll.com/2008/02/21/feathercast-%c2%bb-blog-archive-%c2%bb-episode-43-lucene/#comments</comments>
		<pubDate>Fri, 22 Feb 2008 02:57:52 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[feathercast]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/02/21/feathercast-%c2%bb-blog-archive-%c2%bb-episode-43-lucene/</guid>
		<description><![CDATA[FeatherCast » Blog Archive » Episode 43: Lucene I did a FeatherCast today with Rich Bowen.  Dang, he is quick at editing&#8230;]]></description>
			<content:encoded><![CDATA[<p><a href="http://feathercast.org/?p=61">FeatherCast » Blog Archive » Episode 43: Lucene</a></p>
<p>I did a FeatherCast today with Rich Bowen.  Dang, he is quick at editing&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/02/21/feathercast-%c2%bb-blog-archive-%c2%bb-episode-43-lucene/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Coderspiel / The right tool for the slob</title>
		<link>http://lucene.grantingersoll.com/2008/01/19/coderspiel-the-right-tool-for-the-slob/</link>
		<comments>http://lucene.grantingersoll.com/2008/01/19/coderspiel-the-right-tool-for-the-slob/#comments</comments>
		<pubDate>Sat, 19 Jan 2008 22:16:27 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/01/19/coderspiel-the-right-tool-for-the-slob/</guid>
		<description><![CDATA[Coderspiel / The right tool for the slob This guy&#8217;s comment system wasn&#8217;t working at the moment, so I will leave my comment here. This won&#8217;t make much sense without reading the post first: It&#8217;s funny you mention Wikipedia as an example, since they are running Lucene. As are Technorati and the Internet Archive. As [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://technically.us/code/x/the-right-tool-for-the-slob">Coderspiel / The right tool for the slob</a></p>
<p><strong>This guy&#8217;s comment system wasn&#8217;t working at the moment, so I will leave my comment here.  This won&#8217;t make much sense without reading the post first:</strong></p>
<p>It&#8217;s funny you mention Wikipedia as an example, since they are running Lucene.  As are Technorati and the Internet Archive.  As is IBM Omnifind Yahoo! Edition.  Are those big enough for you?  If not, then choose any of them at <a href="http://wiki.apache.org/lucene-java/PoweredBy">http://wiki.apache.org/lucene-java/PoweredBy</a><br />
And that list is just the companies who are public about it.</p>
<p>Speaking for myself (and not the ASF), as a Lucene developer, I would love to see us using it at Apache.  It is something we are well aware of and have discussed.  However, Lucene, like all Apache projects, is VOLUNTEER, and our volunteer infrastructure team is already loaded providing support to the actual products in terms of Subversion, JIRA, Confluence/MoinMoin, countless mailing lists, guarding against security attacks, creating new projects, etc.  Simply put, it requires resources and time.  Perhaps I can find some time between my day job and the volunteer work I do actually making the code better, supporting the community and occasionally administering the nightly builds on our virtual servers, etc. to deploy and maintain Nutch (which, mind you, would do just fine for the job, just ask, aw never mind, we&#8217;ve been down that road) in a 24/7 high volume website.  Even Google or your ISP has people working in operations to make sure even the most stable things are running and not being attacked/spammed/you name it, so Apache would be no different.</p>
<p>And, just so we are clear, every developer of Lucene &#8220;eats the Lucene/Nutch/Solr dog food&#8221;, we just don&#8217;t necessarily do it at Apache.org.  I use it in my day job.  I use it in pet projects, I recommend it to clients, etc.  I even use it in things that 5 years ago I would never have thought I would use it for (object stores, etc.)  If that isn&#8217;t eating my own dog food, then I don&#8217;t know what dog food tastes like.</p>
<p>Finally, I don&#8217;t think our priority is to be squeaky clean.  My personal one is to make sure Lucene is as good as it can be within my personal limitations.  Just go look at our JIRA  installation or our mailing lists to see all of the dirt.  We aren&#8217;t hiding it.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/01/19/coderspiel-the-right-tool-for-the-slob/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

