<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; Indexing</title>
	<atom:link href="http://lucene.grantingersoll.com/category/lucene/indexing/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Thu, 08 Jul 2010 17:23:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Lucid Imagination » Getting Started with Payloads</title>
		<link>http://lucene.grantingersoll.com/2009/08/05/lucid-imagination-%c2%bb-getting-started-with-payloads/</link>
		<comments>http://lucene.grantingersoll.com/2009/08/05/lucid-imagination-%c2%bb-getting-started-with-payloads/#comments</comments>
		<pubDate>Wed, 05 Aug 2009 14:25:12 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=239</guid>
		<description><![CDATA[I just posted a brief intro on getting started with Apache Lucene payloads on Lucid&#8217;s blog for those who are interested.  Here&#8217;s the teaser: Like Spans, payloads involve the position of terms, but go one step further. Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a [...]]]></description>
			<content:encoded><![CDATA[<p>I just posted a brief intro on getting started with Apache Lucene payloads on Lucid&#8217;s blog for those who are interested.  Here&#8217;s the teaser:</p>
<blockquote><p>Like Spans, payloads involve the position of terms, but go one step further.  Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a specific token/term) in the index.  A payload can be used to store weights for specific terms or things like part of speech tags or other semantic information.</p></blockquote>
<p>via <a href="http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/">Lucid Imagination » Getting Started with Payloads</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/08/05/lucid-imagination-%c2%bb-getting-started-with-payloads/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;What&#8217;s new with Apache Solr&#8221; now available at IBM developerWorks</title>
		<link>http://lucene.grantingersoll.com/2008/11/05/whats-new-with-apache-solr-now-available-at-ibm-developerworks/</link>
		<comments>http://lucene.grantingersoll.com/2008/11/05/whats-new-with-apache-solr-now-available-at-ibm-developerworks/#comments</comments>
		<pubDate>Wed, 05 Nov 2008 16:28:26 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[spell checking]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/11/05/whats-new-with-apache-solr-now-available-at-ibm-developerworks/</guid>
		<description><![CDATA[What&#8217;s new with Apache Solr. My latest article on Apache Solr, titled &#8220;What&#8217;s New with Apache Solr&#8221;, is now available over at IBM developerWorks.  It covers some of the new features like spell checking, Data Import Handler, distributed search, editorial results placement (a.k.a. &#8220;paid placement&#8221;), SolrJ and a variety of other pieces. Hope it is [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.ibm.com/developerworks/java/library/j-solr-update/?S_TACT=105AGX01&amp;S_CMP=HP">What&#8217;s new with Apache Solr</a>.</p>
<p>My latest article on Apache Solr, titled &#8220;What&#8217;s New with Apache Solr&#8221;, is now available over at IBM developerWorks.  It covers some of the new features like spell checking, Data Import Handler, distributed search, editorial results placement (a.k.a. &#8220;paid placement&#8221;), SolrJ and a variety of other pieces.</p>
<p>Hope it is helpful&#8230;  Feel free to give me any feedback.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/11/05/whats-new-with-apache-solr-now-available-at-ibm-developerworks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Lucene Boot Camp at ApacheCon US 2008</title>
		<link>http://lucene.grantingersoll.com/2008/10/23/lucene-boot-camp-at-apachecon-us-2008/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/23/lucene-boot-camp-at-apachecon-us-2008/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 18:39:14 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Boot Camp]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=118</guid>
		<description><![CDATA[Just a quick reminder that there is just over one week left before Lucene Boot Camp at this year&#8217;s ApacheCon. This year, it is a two-day training, but those who want to can sign up for the first day of Lucene Boot Camp, and then attend Solr Boot Camp on the second [...]]]></description>
			<content:encoded><![CDATA[<p>Just a quick reminder that there is just over one week left before <a href="http://www.lucenebootcamp.com">Lucene Boot Camp</a> at this year&#8217;s <a href="http://us.apachecon.com/">ApacheCon</a>.</p>
<p>This year, it is a two-day training, but those who want to can sign up for the first day of Lucene Boot Camp, and then attend <a href="http://www.solrbootcamp.com">Solr Boot Camp</a> on the second day.  I am going to structure the class such that the first day is more high-level stuff (but still with some coding) while the second day will be much more hands-on.  Thus, managers, CTOs and non-programmers who just want to understand Lucene and Solr can easily plan on attending one day of each.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/23/lucene-boot-camp-at-apachecon-us-2008/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Lucene Boot Camp at ApacheCon US</title>
		<link>http://lucene.grantingersoll.com/2008/08/20/lucene-boot-camp-at-apachecon-us/</link>
		<comments>http://lucene.grantingersoll.com/2008/08/20/lucene-boot-camp-at-apachecon-us/#comments</comments>
		<pubDate>Wed, 20 Aug 2008 12:25:31 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Boot Camp]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=95</guid>
		<description><![CDATA[Lucene Boot Camp (ApacheCon site) Lucene Boot Camp (http://www.lucenebootcamp.com) is scheduled this year for ApacheCon US on November 3rd and 4th in New Orleans.  This year, I am doing a two-day event, as I felt the one-day event was just not enough time to get in all the goodness that is Lucene (not [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://us.apachecon.com/c/acus2008/sessions/69">Lucene Boot Camp (ApacheCon site)</a></p>
<p>Lucene Boot Camp (<a href="http://www.lucenebootcamp.com">http://www.lucenebootcamp.com</a>) is scheduled this year for ApacheCon US on November 3rd and 4th in New Orleans.  This year, I am doing a two-day event, as I felt the one-day event was just not enough time to get in all the goodness that is Lucene (not that two days is either, but&#8230;)  Additionally, with two days, we will have more time to explore and develop the examples as a class, plus more one-on-one time.  There is also a fair amount of new material in Lucene these days, so I am going to be updating the examples, etc. on the website and changing things up a bit.</p>
<p>This class has traditionally filled up pretty fast.  In Amsterdam this past April, there was even a waiting list, so I recommend signing up early.</p>
<p>Also, for those coming from the US, note that the second day of the class falls on Tuesday, Nov. 4, which is Election Day.  Not sure who decided that one, but if you&#8217;re planning on voting and attending my class, make sure you get an absentee ballot.  Which reminds me&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/08/20/lucene-boot-camp-at-apachecon-us/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>wpSearch &#8211; Lucene search for WordPress</title>
		<link>http://lucene.grantingersoll.com/2008/08/07/wpsearch-lucene-search-for-wordpress/</link>
		<comments>http://lucene.grantingersoll.com/2008/08/07/wpsearch-lucene-search-for-wordpress/#comments</comments>
		<pubDate>Thu, 07 Aug 2008 12:36:46 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[spell checking]]></category>
		<category><![CDATA[wpSearch]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=93</guid>
		<description><![CDATA[Code Fury The author of this nice plugin for WordPress contacted me today about his Lucene-based WordPress plugin, so I thought I would give it a try, as I&#8217;m obviously a big fan of Lucene and also never much cared for MySQL&#8217;s search (in)capabilities. The plugin is easy enough to install, only thing that [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://codefury.net/">Code Fury</a></p>
<p>The author of this nice plugin for WordPress contacted me today about his Lucene-based WordPress plugin, so I thought I would give it a try, as I&#8217;m obviously a big fan of Lucene and also never much cared for MySQL&#8217;s search (in)capabilities.</p>
<p>The plugin is easy enough to install; the only thing that struck me as a little odd was the need to set 777 on the permissions.  Presumably, this is so it can write the index, but perhaps it would be better to store the index outside of the plugin infrastructure.  Of course, I don&#8217;t know what&#8217;s involved with writing plugins, etc., so I&#8217;m not sure if that makes sense or not.</p>
<p>Indexing and enabling search was a snap, and I do think the results are good, even if my sites don&#8217;t have a ton of posts.  Indexing on my &#8220;main&#8221; site (<a href="http://www.grantingersoll.com">http://www.grantingersoll.com</a>) took slightly longer than here, but I do have more posts and comments there.</p>
<p>The only minor suggestion I would have is that the default boosts for title and content aren&#8217;t all that great.  I think the title boost was 1.8 and the content boost was 1.3.   I changed mine to be title: 5 and content: 2.  Because boosting at indexing time has only 8 bits of granularity, there isn&#8217;t too much difference between 1.8 and 1.3, and I tend to think title matches are much more important.  Thus, I made them greater.  Still, very cool that the author has hooked field boosting in to begin with.</p>
<p>Things I would love to see:</p>
<ol>
<li>Highlighting</li>
<li>Spell checking</li>
</ol>
<p>All in all, seems to be a great little plugin.  And now, I can &#8220;eat my own dogfood&#8221; too!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/08/07/wpsearch-lucene-search-for-wordpress/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MySQL, Solr and &#8220;Communications link failure&#8221;</title>
		<link>http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/</link>
		<comments>http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/#comments</comments>
		<pubDate>Wed, 16 Jul 2008 20:02:19 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Indexing]]></category>
		<category><![CDATA[JDBC]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=87</guid>
		<description><![CDATA[So, I was indexing 10+ million records in MySQL into Solr and kept coming across the following odd MySQL exception: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure Last packet sent to the server was 4467745 ms ago ... com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1074) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2985) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2871) at In my code, I loop over a JDBC ResultSet and add the records [...]]]></description>
			<content:encoded><![CDATA[<p>So, I was indexing 10+ million records in MySQL into Solr and kept coming across the following odd MySQL exception:</p>
<pre>com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
Last packet sent to the server was 4467745 ms ago
...
com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1074)
	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2985)
	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2871)
	at</pre>
<p>In my code, I loop over a JDBC ResultSet and add the records to Solr per the Solr field schema, mapping columns to fields, etc.  The exception would show up after getting through something like 9M+ records.  After some tracking down, hypothesizing and talking with others, we came to the conclusion that the issue was a combination of having the autocommit value in Solr set and MySQL timing out the ResultSet: when Lucene had to do a large merge (even in the background), Solr had to wait for said merge to finish, thus keeping the ResultSet open too long w/o activity.  Now, these large merges can take some time.  They can happen in the background, but Solr can&#8217;t refresh its IndexReader until the merge finishes, AIUI.  Thus, we&#8217;re stuck in the middle of a ResultSet loop, holding the cursor open past MySQL&#8217;s default setting (600 seconds, more on that later), causing MySQL to kill the connection, and rightfully so.  On the MySQL side of things, we are streaming the results, since its JDBC driver does not support setFetchSize() (ugh!).  As it turns out, MySQL&#8217;s JDBC driver has a streaming timeout property named <strong>netTimeoutForStreamingResults</strong> (see <a href="http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html">http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html</a>), which defaults to 600 seconds.</p>
<p>Long story short, I have at least two options:</p>
<ol>
<li>Turn off autocommit, meaning users won&#8217;t be able to see documents as soon as they may like</li>
<li>Increase the netTimeoutForStreamingResults value.  This is great for MySQL and I have verified it works, but it is not a generic setting for the other DBs that our code supports</li>
</ol>
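<p>As a rough sketch of what option 2 looks like, the property can be appended to the JDBC URL.  The host, port, database name and the 3600 value below are placeholders for illustration only; the property name is the one from the Connector/J docs linked above, and its value is in seconds:</p>
<pre>jdbc:mysql://localhost:3306/mydb?netTimeoutForStreamingResults=3600</pre>
<p>Option 1, by contrast, lives on the Solr side.  Assuming a stock solrconfig.xml, it amounts to leaving the autoCommit block commented out and committing explicitly once the ResultSet loop is done (the maxDocs and maxTime numbers here are again just placeholders):</p>
<pre>&lt;updateHandler class="solr.DirectUpdateHandler2"&gt;
  &lt;!-- autoCommit disabled: commit explicitly once the ResultSet loop finishes
  &lt;autoCommit&gt;
    &lt;maxDocs&gt;10000&lt;/maxDocs&gt;
    &lt;maxTime&gt;60000&lt;/maxTime&gt;
  &lt;/autoCommit&gt;
  --&gt;
&lt;/updateHandler&gt;</pre>
<p>Whatever timeout you choose should be comfortably longer than the longest pause you expect a large merge to cause.</p>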
<p>I am still deciding what to do, and am also thinking about other options that would decouple DB retrieval from the indexing process.  At any rate, I wanted to post the cause of this exception because I did not find anyone else reporting it as the result of a timeout during ResultSet processing; hopefully this will save someone some time.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Why Lucene Isn&#8217;t That Good &#124; Javalobby</title>
		<link>http://lucene.grantingersoll.com/2008/03/28/why-lucene-isnt-that-good-javalobby/</link>
		<comments>http://lucene.grantingersoll.com/2008/03/28/why-lucene-isnt-that-good-javalobby/#comments</comments>
		<pubDate>Sat, 29 Mar 2008 01:22:11 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/03/28/why-lucene-isnt-that-good-javalobby/</guid>
		<description><![CDATA[Why Lucene Isn&#8217;t That Good &#124; Javalobby Patches welcome&#8230;  I know that is an old saw, but that is the only way it&#8217;s going to get better. There are some good points in here, and some stuff that is a bit dramatic. We do try to keep adapting Lucene and make it better, but in [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://java.dzone.com/news/why-lucene-isnt-good?page=0%2C1">Why Lucene Isn&#8217;t That Good | Javalobby</a></p>
<p>Patches welcome&#8230;  I know that is an old saw, but that is the only way it&#8217;s going to get better.</p>
<p>There are some good points in here, and some stuff that is a bit dramatic.</p>
<p>We do try to keep adapting Lucene and make it better, but in some respects we are damned if we do, damned if we don&#8217;t.  The whole <a href="http://lucene.markmail.org/message/77qs2pjy3inzfddj?q=Fieldable">abstract vs. interface debate</a> has been going on for a long time on Lucene.  If we switched to interfaces, then people would be complaining constantly about how we break their code every time a release introduces new methods.  If we leave things as abstract classes, then people like Cedric complain that Lucene is hard to extend.</p>
<p>As for the &#8220;final&#8221; declarations, the reason it works that way is that we can&#8217;t see the future.  Oftentimes, the things that are final now are legacy from the early days, when we couldn&#8217;t imagine some of the ways Lucene is now being used, or, as another commenter on the thread said, they are there to avoid unintended consequences.  That&#8217;s why people submit patches and things get improved.  Sorry we can&#8217;t all work on Lucene nonstop all day.  If Lingway wants to hire me to do that, you know how to reach me! (at least, as a contractor, I&#8217;m not interested in leaving my current employment, but I do offer consulting.)</p>
<p>As for SpanQueries, yes they can be slower.  I&#8217;d love to have a discussion with someone like you who is a heavy user to see how to improve them.  Please send your profiling info ASAP to the java-dev mailing list!  Please don&#8217;t let all those hours you spent go to waste.  Even a half baked patch is a starting point.</p>
<p>I do agree about scoring being pluggable, but I gotta tell ya&#8217;, scoring is hard and not for the faint of heart.  It&#8217;s a whole other layer and doing it right means being fast and accurate and doing it wrong means deep, dark, scary rabbit holes where you don&#8217;t see light for days.  One of the simplest ways, however, to improve scoring is to change the length normalization.</p>
<p>As for some of the other &#8220;higher&#8221; features, like crawling/clustering, those are nice, but they don&#8217;t belong in the core of Lucene b/c not everyone needs them, although the number who do is increasing.  How many people have collections that go beyond 10-20M documents?  What would be nice, however, is a contrib module or a layer above Lucene that provides all those nice things you want (you know, you can embed Solr in no time, by the way).  Lucene is meant to be really fast on one machine and to also play nice when you put in the appropriate distributed pieces.  It&#8217;s unfortunate that no one has donated the distributed piece yet (although Solr does now have it, thanks to Yonik!)</p>
<p>At any rate, thanks for the ideas.  Hope to see your patches soon!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/03/28/why-lucene-isnt-that-good-javalobby/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data &#124; High Scalability</title>
		<link>http://lucene.grantingersoll.com/2008/02/01/how-rackspace-now-uses-mapreduce-and-hadoop-to-query-terabytes-of-data-high-scalability/</link>
		<comments>http://lucene.grantingersoll.com/2008/02/01/how-rackspace-now-uses-mapreduce-and-hadoop-to-query-terabytes-of-data-high-scalability/#comments</comments>
		<pubDate>Fri, 01 Feb 2008 21:15:37 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[scalability]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/02/01/how-rackspace-now-uses-mapreduce-and-hadoop-to-query-terabytes-of-data-high-scalability/</guid>
		<description><![CDATA[How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data &#124; High Scalability Nice article on how the Lucene/Hadoop/Solr stack was used to solve a really big problem.  Someday, I hope (when we have actual code),  they can add Mahout to the equation and do even more interesting things with the data.]]></description>
			<content:encoded><![CDATA[<p><a href="http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data">How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data | High Scalability</a></p>
<p>Nice article on how the Lucene/Hadoop/Solr stack was used to solve a really big problem.  Someday, I hope (when we have actual code),  they can add <a href="http://lucene.apache.org/mahout">Mahout</a> to the equation and do even more interesting things with the data.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/02/01/how-rackspace-now-uses-mapreduce-and-hadoop-to-query-terabytes-of-data-high-scalability/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Coderspiel / January 2008</title>
		<link>http://lucene.grantingersoll.com/2008/01/21/coderspiel-january-2008/</link>
		<comments>http://lucene.grantingersoll.com/2008/01/21/coderspiel-january-2008/#comments</comments>
		<pubDate>Mon, 21 Jan 2008 15:47:00 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/01/21/coderspiel-january-2008/</guid>
		<description><![CDATA[Coderspiel / January 2008 I hardly think Lucene is creating an isolationist culture, nor do we think our project is perfect.  What we do agree on is that our time is better spent on figuring out how to make Lucene better, not how to spend our time doing UNIX administration in a virtual server environment.  [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://technically.us/code/archive/2008/1#item-5014">Coderspiel / January 2008</a></p>
<p>I hardly think Lucene is creating an isolationist culture, nor do we think our project is perfect.  What we do agree on is that our time is better spent on figuring out how to make Lucene better, not how to spend our time doing UNIX administration in a virtual server environment.  As I said before, we would love to do it, we just need a volunteer.  Since n8han is so concerned about us and our use of  Google in our little template search box that comes with <a href="http://forrest.apache.org">Forrest</a> and is turned on by default, maybe he would like to volunteer his time to set it up to use Nutch.  And don&#8217;t give the cop-out that you work on other projects, you don&#8217;t buy our &#8220;we have other things to do&#8221; argument, so why should we buy yours?</p>
<p>I&#8217;m all for criticism of Lucene.  I dish it out a fair amount, just read the java-dev mailing list (I try to do it in a constructive way), but I would rather it be about something that matters to Lucene, i.e., how well it indexes and finds results or how we can improve _______, not about the fact that we don&#8217;t have a system administrator volunteering on the project.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/01/21/coderspiel-january-2008/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Coderspiel / The right tool for the slob</title>
		<link>http://lucene.grantingersoll.com/2008/01/19/coderspiel-the-right-tool-for-the-slob/</link>
		<comments>http://lucene.grantingersoll.com/2008/01/19/coderspiel-the-right-tool-for-the-slob/#comments</comments>
		<pubDate>Sat, 19 Jan 2008 22:16:27 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/01/19/coderspiel-the-right-tool-for-the-slob/</guid>
		<description><![CDATA[Coderspiel / The right tool for the slob This guy&#8217;s comment system wasn&#8217;t working at the moment, so I will leave my comment here. This won&#8217;t make much sense without reading the post first: It&#8217;s funny you mention Wikipedia as an example, since they are running Lucene. As are Technorati and the Internet Archive. As [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://technically.us/code/x/the-right-tool-for-the-slob">Coderspiel / The right tool for the slob</a></p>
<p><strong>This guy&#8217;s comment system wasn&#8217;t working at the moment, so I will leave my comment here.  This won&#8217;t make much sense without reading the post first:</strong></p>
<p>It&#8217;s funny you mention Wikipedia as an example, since they are running Lucene.  As are Technorati and the Internet Archive.  As is IBM Omnifind Yahoo! Edition.  Are those big enough for you?  If not, then choose any of them at <a href="http://wiki.apache.org/lucene-java/PoweredBy">http://wiki.apache.org/lucene-java/PoweredBy</a><br />
And that list is just the companies who are public about it.</p>
<p>Speaking for myself (and not the ASF), as a Lucene developer, I would love to see us using it at Apache.  It is something we are well aware of and have discussed.  However, Lucene, like all Apache projects, is VOLUNTEER, and our volunteer infrastructure team is already loaded providing support to the actual products in terms of Subversion, JIRA, Confluence/MoinMoin, countless mailing lists, guarding against security attacks, creating new projects, etc.  Simply put, it requires resources and time.  Perhaps, between my day job and the volunteer work I do actually making the code better, supporting the community and occasionally administering the nightly builds on our virtual servers, etc., I can find time to deploy and maintain Nutch (which, mind you, would do just fine for the job, just ask, aw never mind, we&#8217;ve been down that road) in a 24/7 high volume website.  Even Google or your ISP has people working in operations to make sure even the most stable things are running and not being attacked/spammed/you name it, so Apache would be no different.</p>
<p>And, just so we are clear, every developer of Lucene &#8220;eats the Lucene/Nutch/Solr dog food&#8221;; we just don&#8217;t necessarily do it at Apache.org.  I use it in my day job.  I use it in pet projects, I recommend it to clients, etc.  I even use it in things that 5 years ago I would never have thought I would use it for (object stores, etc.)  If that isn&#8217;t eating my own dog food, then I don&#8217;t know what dog food tastes like.</p>
<p>Finally, I don&#8217;t think our priority is to be squeaky clean.  My personal one is to make sure Lucene is as good as it can be within my personal limitations.  Just go look at our JIRA  installation or our mailing lists to see all of the dirt.  We aren&#8217;t hiding it.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/01/19/coderspiel-the-right-tool-for-the-slob/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
