<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition</title>
	<atom:link href="http://lucene.grantingersoll.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<pubDate>Wed, 27 Aug 2008 14:24:59 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.1</generator>
	<language>en</language>
			<item>
		<title>Solr Logo Contest</title>
		<link>http://lucene.grantingersoll.com/2008/08/27/solr-logo-contest/</link>
		<comments>http://lucene.grantingersoll.com/2008/08/27/solr-logo-contest/#comments</comments>
		<pubDate>Wed, 27 Aug 2008 14:24:59 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
		
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=99</guid>
		<description><![CDATA[LogoContest - Solr Wiki
Solr, an open source search server, is looking for a new logo.  See the link above for details.
]]></description>
			<content:encoded><![CDATA[<p><a href="http://wiki.apache.org/solr/LogoContest">LogoContest - Solr Wiki</a></p>
<p>Solr, an open source search server, is looking for a new logo.  See the link above for details.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/08/27/solr-logo-contest/feed/</wfw:commentRss>
		</item>
		<item>
		<title>FYI: Solar  Solr</title>
		<link>http://lucene.grantingersoll.com/2008/08/21/fyi-solar-solr/</link>
		<comments>http://lucene.grantingersoll.com/2008/08/21/fyi-solar-solr/#comments</comments>
		<pubDate>Thu, 21 Aug 2008 12:48:56 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
		
		<category><![CDATA[Apache]]></category>

		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=97</guid>
		<description><![CDATA[Just in case you were wondering, solar does not equal solr.  :-)  My wife and I are building a solar home and I emailed the contractor the other day with a subject of: &#8220;Solr contractor&#8221; and the content said:
Just a reminder that I would appreciate it if I could talk w/ the Solr  person just [...]]]></description>
			<content:encoded><![CDATA[<p>Just in case you were wondering, solar does not equal solr.  :-)  My wife and I are building a <a href="http://mcfbuilders.com/">solar home</a> and I emailed the contractor the other day with a subject of: &#8220;Solr contractor&#8221; and the content said:</p>
<blockquote><p>Just a reminder that I would appreciate it if I could talk w/ the Solr  person just to hear more about the system and possibly plan for future  upgrades.</p></blockquote>
<p>Aargh, curse these modern spelling variations!  Curse these fingers that type faster than the brain can think!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/08/21/fyi-solar-solr/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Lucene Boot Camp at ApacheCon US</title>
		<link>http://lucene.grantingersoll.com/2008/08/20/lucene-boot-camp-at-apachecon-us/</link>
		<comments>http://lucene.grantingersoll.com/2008/08/20/lucene-boot-camp-at-apachecon-us/#comments</comments>
		<pubDate>Wed, 20 Aug 2008 12:25:31 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
		
		<category><![CDATA[Apache]]></category>

		<category><![CDATA[ApacheCon]]></category>

		<category><![CDATA[Indexing]]></category>

		<category><![CDATA[Java]]></category>

		<category><![CDATA[Lucene]]></category>

		<category><![CDATA[Lucene Boot Camp]]></category>

		<category><![CDATA[Lucid Imagination]]></category>

		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=95</guid>
		<description><![CDATA[Lucene Boot Camp (ApacheCon site)

Lucene Boot Camp (http://www.lucenebootcamp.com) is scheduled this year for ApacheCon US on November 3 and 4th in New Orleans.  This year, I am doing a two day event, as I felt the one day event was just not enough time to get in all the goodness that is Lucene (not that [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://us.apachecon.com/c/acus2008/sessions/69">Lucene Boot Camp (ApacheCon site)<br />
</a></p>
<p>Lucene Boot Camp (<a href="http://www.lucenebootcamp.com">http://www.lucenebootcamp.com</a>) is scheduled this year for ApacheCon US on November 3rd and 4th in New Orleans.  This year, I am doing a two day event, as I felt the one day event was just not enough time to get in all the goodness that is Lucene (not that two days is either, but&#8230;).  Additionally, with two days, the class will have more time to explore and develop the examples, and there will be more one on one time.  There is also a fair amount of new material in Lucene these days, so I am going to be updating the examples, etc. on the website and changing things up a bit.</p>
<p>This class has traditionally filled up pretty fast.  In Amsterdam this past April, there was even a waiting list, so I recommend signing up early.</p>
<p>Also, for those coming from the US, note that the class runs through Tuesday, Nov. 4, which is Election Day.  Not sure who decided that one, but, if you&#8217;re planning on voting and attending my class, make sure you get an absentee ballot.  Which reminds me&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/08/20/lucene-boot-camp-at-apachecon-us/feed/</wfw:commentRss>
		</item>
		<item>
		<title>wpSearch - Lucene search for WordPress</title>
		<link>http://lucene.grantingersoll.com/2008/08/07/wpsearch-lucene-search-for-wordpress/</link>
		<comments>http://lucene.grantingersoll.com/2008/08/07/wpsearch-lucene-search-for-wordpress/#comments</comments>
		<pubDate>Thu, 07 Aug 2008 12:36:46 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
		
		<category><![CDATA[Indexing]]></category>

		<category><![CDATA[Lucene]]></category>

		<category><![CDATA[MySQL]]></category>

		<category><![CDATA[Search]]></category>

		<category><![CDATA[spell checking]]></category>

		<category><![CDATA[wpSearch]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=93</guid>
		<description><![CDATA[Code Fury
The author of this nice plugin for WordPress contacted me today about his Lucene based WordPress plugin, so I thought I would give it a try, as I&#8217;m obviously a big fan of Lucene and also never much cared for MySql&#8217;s search (in)capabilities.
The plugin is easy enough to install, only thing that struck me [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://codefury.net/">Code Fury</a></p>
<p>The author of this nice plugin for WordPress contacted me today about his Lucene based WordPress plugin, so I thought I would give it a try, as I&#8217;m obviously a big fan of Lucene and also never much cared for MySql&#8217;s search (in)capabilities.</p>
<p>The plugin is easy enough to install; the only thing that struck me as a little odd was the need to set 777 on the permissions.  Presumably, this is so it can write the index, but perhaps it would be better to store the index outside of the plugin infrastructure.  Of course, I don&#8217;t know what&#8217;s involved with writing plugins, etc. so not sure if that makes sense or not.</p>
<p>Indexing and enabling search was a snap, and I do think the results are good, even if my sites don&#8217;t have a ton of posts.  Indexing on my &#8220;main&#8221; site (<a href="http://www.grantingersoll.com">http://www.grantingersoll.com</a>) took slightly longer than here, but I do have more posts and comments there.</p>
<p>The only minor suggestion I would have is that the default boosts for title and content aren&#8217;t all that great.  I think the title boost was 1.8 and the content boost was 1.3.   I changed mine to be title: 5 and content: 2.  Since boosting at indexing time has only 8 bits of granularity, there isn&#8217;t too much difference between 1.8 and 1.3, and I tend to think title matches are much more important.  Thus, I made them greater.  Still, very cool that the author has hooked field boosting in to begin with.</p>
<p>Things I would love to see:</p>
<ol>
<li>Highlighting</li>
<li>Spell checking</li>
</ol>
<p>All in all, seems to be a great little plugin.  And now, I can &#8220;eat my own dogfood&#8221; too!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/08/07/wpsearch-lucene-search-for-wordpress/feed/</wfw:commentRss>
		</item>
		<item>
		<title>BarCamp wiki / BarCampRDU</title>
		<link>http://lucene.grantingersoll.com/2008/08/01/barcamp-wiki-barcamprdu/</link>
		<comments>http://lucene.grantingersoll.com/2008/08/01/barcamp-wiki-barcamprdu/#comments</comments>
		<pubDate>Fri, 01 Aug 2008 16:22:54 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
		
		<category><![CDATA[Apache]]></category>

		<category><![CDATA[BarCampRDU]]></category>

		<category><![CDATA[Hadoop]]></category>

		<category><![CDATA[Java]]></category>

		<category><![CDATA[Lucene]]></category>

		<category><![CDATA[Mahout]]></category>

		<category><![CDATA[Map Reduce]]></category>

		<category><![CDATA[Nutch]]></category>

		<category><![CDATA[Raleigh]]></category>

		<category><![CDATA[Triangle]]></category>

		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=91</guid>
		<description><![CDATA[BarCamp wiki / BarCampRDU
I&#8217;ll be at BarCampRDU tomorrow.  I proposed two sessions, one on Hadoop and Mahout and one on Lucene and Solr.  I don&#8217;t think I really want to do both, but I would like to do at least one, so we&#8217;ll see what other people are interested in.
If you&#8217;re around and you want [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://barcamp.org/BarCampRDU">BarCamp wiki / BarCampRDU</a></p>
<p>I&#8217;ll be at BarCampRDU tomorrow.  I proposed two sessions, one on Hadoop and Mahout and one on Lucene and Solr.  I don&#8217;t think I really want to do both, but I would like to do at least one, so we&#8217;ll see what other people are interested in.</p>
<p>If you&#8217;re around and you want to talk about any of these things, track me down.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/08/01/barcamp-wiki-barcamprdu/feed/</wfw:commentRss>
		</item>
		<item>
		<title>HP, Intel and Yahoo To Research Cloud Computing - Yahoo News</title>
		<link>http://lucene.grantingersoll.com/2008/07/30/hp-intel-and-yahoo-to-research-cloud-computing-yahoo-news/</link>
		<comments>http://lucene.grantingersoll.com/2008/07/30/hp-intel-and-yahoo-to-research-cloud-computing-yahoo-news/#comments</comments>
		<pubDate>Wed, 30 Jul 2008 20:58:10 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
		
		<category><![CDATA[Apache]]></category>

		<category><![CDATA[Hadoop]]></category>

		<category><![CDATA[Java]]></category>

		<category><![CDATA[Lucene]]></category>

		<category><![CDATA[Mahout]]></category>

		<category><![CDATA[Map Reduce]]></category>

		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=89</guid>
		<description><![CDATA[HP, Intel and Yahoo To Research Cloud Computing - Yahoo News
Boy, this could really come in handy in Open Source, especially projects like Mahout, Nutch and distributed Solr.  I find my biggest personal challenge on Mahout is access to computing resources.  I personally don&#8217;t have the financial backing to buy much time on Amazon EC2.  [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://news.yahoo.com/s/nf/20080729/bs_nf/61031">HP, Intel and Yahoo To Research Cloud Computing - Yahoo News</a></p>
<p>Boy, this could really come in handy in Open Source, especially projects like Mahout, Nutch and distributed Solr.  I find my biggest personal challenge on Mahout is access to computing resources.  I personally don&#8217;t have the financial backing to buy much time on Amazon EC2.  I have been scraping by, here and there, but find myself constantly wanting access to more capabilities.</p>
<p>Sigh.  Maybe I should put more ads on this site and use the funds for buying EC2 time.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/07/30/hp-intel-and-yahoo-to-research-cloud-computing-yahoo-news/feed/</wfw:commentRss>
		</item>
		<item>
		<title>MySQL, Solr and &#8220;Communications link failure&#8221;</title>
		<link>http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/</link>
		<comments>http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/#comments</comments>
		<pubDate>Wed, 16 Jul 2008 20:02:19 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
		
		<category><![CDATA[Indexing]]></category>

		<category><![CDATA[JDBC]]></category>

		<category><![CDATA[Java]]></category>

		<category><![CDATA[Lucene]]></category>

		<category><![CDATA[MySQL]]></category>

		<category><![CDATA[Solr]]></category>

		<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=87</guid>
		<description><![CDATA[So, I was indexing 10+ million records in MySQL into Solr and kept coming across the following odd MySQL exception:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications
link failure
Last packet sent to the server was 4467745 ms ago
...

com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1074)
	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2985) 	at
com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2871) 	at

In my code, I loop over a JDBC ResultSet and add the records to Solr per the Solr field schema, mapping [...]]]></description>
			<content:encoded><![CDATA[<p>So, I was indexing 10+ million records in MySQL into Solr and kept coming across the following odd MySQL exception:</p>
<pre>com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
Last packet sent to the server was 4467745 ms ago
...

com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1074)
	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2985)
	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2871)
	at</pre>
<p>In my code, I loop over a JDBC ResultSet and add the records to Solr per the Solr field schema, mapping columns to fields, etc.  This would happen after getting through something like 9M+ records.  After some tracking down, hypothesizing, and talking with others, we came to the conclusion that the issue was a combination of having the autocommit value in Solr set and MySQL timing out the ResultSet, such that when Lucene had to do a large merge (even in the background), Solr had to wait for said merge to finish, thus keeping the ResultSet open too long w/o activity.  Now, these large merges can take some time.  They can happen in the background, but Solr can&#8217;t refresh its IndexReader until the merge finishes, AIUI.  Thus, we&#8217;re stuck in the middle of a ResultSet loop, holding the cursor open past MySQL&#8217;s default setting (600 seconds, more on that later), causing MySQL to kill the connection, and rightfully so.  On the MySQL side of things, we are streaming the results, since its JDBC driver does not support setFetchSize() (ugh!).  As it turns out, MySQL has a streaming timeout value named <strong>netTimeoutForStreamingResults</strong> (see <a href="http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html">http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html</a>) which defaults to 600 seconds.</p>
<p>Long story short, I have at least two options:</p>
<ol>
<li>Turn off autocommit, meaning users won&#8217;t be able to see documents as soon as they may like</li>
<li>Increase the netTimeoutForStreamingResults value.  This is great for MySQL and I have verified it works, but is not a generic value for other DBs, which our code supports</li>
</ol>
<p>I am still deciding on what to do, and also thinking of some other options that can decouple DB retrieval from the indexing process.  At any rate, I wanted to post the cause of my seeing this exception, because I did not find anyone else reporting this exception with a timeout during ResultSet processing as the cause, and hopefully this will save someone some time.</p>
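<p>A minimal sketch of option 2, for illustration only and not the actual indexing code: the class name, user, password, and the 7200 second value are hypothetical, and only the <strong>netTimeoutForStreamingResults</strong> property name comes from the Connector/J docs linked above.</p>

```java
import java.util.Properties;

public class StreamingTimeoutSketch {

    // Builds Connector/J connection properties with a longer streaming timeout.
    // The property name comes from the Connector/J configuration docs; the
    // user, password, and timeout values passed in are hypothetical.
    static Properties streamingProps(String user, String password, int timeoutSeconds) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // The default is 600 seconds; raise it so that a long pause between
        // reads (e.g. while waiting on a large merge) does not end the stream.
        props.setProperty("netTimeoutForStreamingResults", Integer.toString(timeoutSeconds));
        return props;
    }

    public static void main(String[] args) {
        Properties props = streamingProps("indexer", "secret", 7200);
        System.out.println("netTimeoutForStreamingResults="
                + props.getProperty("netTimeoutForStreamingResults"));
        // With the MySQL driver on the classpath, the properties would be used as:
        // Connection conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", props);
    }
}
```

<p>Keeping the value in a Properties object rather than in the JDBC URL makes it easier to apply it only when the driver is MySQL, since other databases will not recognize the setting.</p>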
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Apache Hadoop Wins Terabyte Sort Benchmark (Hadoop and Distributed Computing at Yahoo!)</title>
		<link>http://lucene.grantingersoll.com/2008/07/03/apache-hadoop-wins-terabyte-sort-benchmark-hadoop-and-distributed-computing-at-yahoo/</link>
		<comments>http://lucene.grantingersoll.com/2008/07/03/apache-hadoop-wins-terabyte-sort-benchmark-hadoop-and-distributed-computing-at-yahoo/#comments</comments>
		<pubDate>Thu, 03 Jul 2008 12:57:55 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
		
		<category><![CDATA[Apache]]></category>

		<category><![CDATA[Hadoop]]></category>

		<category><![CDATA[Java]]></category>

		<category><![CDATA[Map Reduce]]></category>

		<category><![CDATA[Performance]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/07/03/apache-hadoop-wins-terabyte-sort-benchmark-hadoop-and-distributed-computing-at-yahoo/</guid>
		<description><![CDATA[Apache Hadoop Wins Terabyte Sort Benchmark (Hadoop and Distributed Computing at Yahoo!)
Congrats to the Hadoop team!  Score one for Open Source!
]]></description>
			<content:encoded><![CDATA[<p><a href="http://developer.yahoo.com/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html">Apache Hadoop Wins Terabyte Sort Benchmark (Hadoop and Distributed Computing at Yahoo!)</a></p>
<p>Congrats to the Hadoop team!  Score one for Open Source!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/07/03/apache-hadoop-wins-terabyte-sort-benchmark-hadoop-and-distributed-computing-at-yahoo/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Realtime Search for Lucene</title>
		<link>http://lucene.grantingersoll.com/2008/06/24/realtime-search-for-lucene/</link>
		<comments>http://lucene.grantingersoll.com/2008/06/24/realtime-search-for-lucene/#comments</comments>
		<pubDate>Tue, 24 Jun 2008 12:44:38 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
		
		<category><![CDATA[Apache]]></category>

		<category><![CDATA[Lucene]]></category>

		<category><![CDATA[Real Time Search]]></category>

		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=85</guid>
		<description><![CDATA[[#LUCENE-1313] Ocean Realtime Search - ASF JIRA
Jason Rutherglen has been up to some interesting things with Lucene lately concerning real time search.  This has always been one of those parts of Lucene that has been needed over time by some people, but has never reached the critical mass whereby someone tackles it.  Looks like Jason [...]]]></description>
			<content:encoded><![CDATA[<p><a href="https://issues.apache.org/jira/browse/LUCENE-1313">[#LUCENE-1313] Ocean Realtime Search - ASF JIRA</a></p>
<p>Jason Rutherglen has been up to some interesting things with Lucene lately concerning real time search.  This has always been one of those parts of Lucene that has been needed over time by some people, but has never reached the critical mass whereby someone tackles it.  Looks like Jason finally has, and I am glad to see it.  I have, on occasion, met with clients or users who truly have needed it, but most have usually been able to live with a little bit of a delay and the associated workarounds.  Now, it seems, they won&#8217;t have to.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/06/24/realtime-search-for-lucene/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Solr Spell Checking Addition</title>
		<link>http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/</link>
		<comments>http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/#comments</comments>
		<pubDate>Sat, 21 Jun 2008 16:16:07 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
		
		<category><![CDATA[Lucene]]></category>

		<category><![CDATA[Solr]]></category>

		<category><![CDATA[spell checking]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=84</guid>
		<description><![CDATA[Just committed SOLR-572 yesterday, which adds a spell checking component to Solr.  Now, Solr had a spell checking request handler before, but a component is slightly different.  Request Handlers require separate calls, whereas a component can be inlined in a request.  Essentially, a Request Handler can be made up of one or more SearchComponents.
What this [...]]]></description>
			<content:encoded><![CDATA[<p>Just committed <a href="https://issues.apache.org/jira/browse/SOLR-572">SOLR-572</a> yesterday, which adds a spell checking component to Solr.  Now, Solr had a spell checking <a href="http://wiki.apache.org/solr/SolrRequestHandler">request handler</a> before, but a component is slightly different.  Request Handlers require separate calls, whereas a component can be inlined in a request.  Essentially, a Request Handler can be made up of one or more <a href="http://wiki.apache.org/solr/SearchComponent">SearchComponents</a>.</p>
<p>What this means is that one can now get back search results for the given query and get spelling suggestions at the same time, pretty much like Google&#8217;s &#8220;Did You Mean&#8221; functionality (but probably not the same quality, as they have a much bigger corpus and probably use user feedback as well).</p>
<p>For details on how to use it, try out the Solr example in the source distribution and see the <a href="http://wiki.apache.org/solr/SpellCheckComponent">Wiki docs</a>.</p>
<p>Also note that it allows you to plug in your own spell checker (or a commercial one) or use the Lucene spell checker.</p>
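<p>As a rough sketch of what an inlined request might look like, the Java snippet below builds a single query URL that asks for results and suggestions together.  The host, port, handler path, and query text are assumptions for illustration (the component has to be registered on whichever handler you call), and the spellcheck parameters shown are the ones described on the Wiki page above, so check there for what your setup supports.</p>

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SpellCheckRequestSketch {

    // Builds one request URL that returns search results and spelling
    // suggestions together. The base URL is a hypothetical local Solr install.
    // URLEncoder.encode(String, Charset) requires Java 10 or later.
    static String buildUrl(String baseUrl, String query, int suggestionCount) {
        String encodedQuery = URLEncoder.encode(query, StandardCharsets.UTF_8);
        String params = String.join("&",
                "q=" + encodedQuery,
                "spellcheck=true",                         // turn the component on for this request
                "spellcheck.count=" + suggestionCount,     // max suggestions per term
                "spellcheck.collate=true");                // ask for a rewritten query string
        return baseUrl + "?" + params;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983/solr/select", "lucene serch", 5));
    }
}
```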
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>

