<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; database</title>
	<atom:link href="http://lucene.grantingersoll.com/category/database/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Mon, 06 Feb 2012 12:07:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>MySQL, Solr and &#8220;Communications link failure&#8221;</title>
		<link>http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/</link>
		<comments>http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/#comments</comments>
		<pubDate>Wed, 16 Jul 2008 20:02:19 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[database]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[JDBC]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=87</guid>
		<description><![CDATA[So, I was indexing 10+ million records from MySQL into Solr and kept coming across the following odd MySQL exception: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure Last packet sent to the server was 4467745 ms ago ... com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1074) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2985) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2871) at In my code, I loop over a JDBC ResultSet and add the records [...]]]></description>
			<content:encoded><![CDATA[<p>So, I was indexing 10+ million records from MySQL into Solr and kept coming across the following odd MySQL exception:</p>
<pre>com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
Last packet sent to the server was 4467745 ms ago
...
com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1074)
	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2985)
	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2871)
	at</pre>
<p>In my code, I loop over a JDBC ResultSet and add the records to Solr per the Solr field schema, mapping columns to fields, etc.  The exception would occur after getting through something like 9M+ records.  After some tracking down, hypothesizing, and talking with others, we concluded that the issue was a combination of having the autocommit value set in Solr and MySQL timing out the ResultSet: when Lucene had to do a large merge (even in the background), Solr had to wait for that merge to finish, which kept the ResultSet open too long without activity.  These large merges can take some time.  They can happen in the background, but Solr can&#8217;t refresh its IndexReader until the merge finishes, AIUI.  Thus, we&#8217;re stuck in the middle of a ResultSet loop, holding the cursor open past MySQL&#8217;s default timeout (600 seconds, more on that below), causing MySQL to kill the connection, and rightfully so.</p>
<p>On the MySQL side of things, we are streaming the results, since its JDBC driver does not honor setFetchSize() as a normal batching hint (ugh!).  As it turns out, the MySQL JDBC driver has a streaming timeout property named <strong>netTimeoutForStreamingResults</strong> (see <a href="http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html">http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html</a>), which defaults to 600 seconds.</p>
<p>Long story short, I have at least two options:</p>
<ol>
<li>Turn off autocommit, meaning users won&#8217;t be able to see documents as soon as they might like</li>
<li>Increase the netTimeoutForStreamingResults value.  This is great for MySQL and I have verified it works, but it is not a generic setting for the other DBs that our code supports</li>
</ol>
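<p>For what it&#8217;s worth, option 2 might look roughly like the sketch below, assuming MySQL Connector/J.  The class name <tt>MySqlStreaming</tt>, both helper names, and the example URL and timeout are made up for illustration; only the <tt>netTimeoutForStreamingResults</tt> property and the forward-only, read-only statement with a fetch size of <tt>Integer.MIN_VALUE</tt> (which is how Connector/J is asked to stream rows) come from the driver itself.</p>

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper class; the names here are illustrative, not from the original code.
class MySqlStreaming {

    // Appends the Connector/J netTimeoutForStreamingResults property (in seconds)
    // to a JDBC URL, using '?' or '&' depending on whether the URL already has parameters.
    static String withStreamingTimeout(String jdbcUrl, int seconds) {
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "netTimeoutForStreamingResults=" + seconds;
    }

    // Creates a statement that Connector/J streams row by row:
    // forward-only, read-only, with the fetch size set to Integer.MIN_VALUE.
    static Statement openStreamingStatement(Connection conn) throws SQLException {
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);
        return stmt;
    }

    public static void main(String[] args) {
        // Example URL and a one-hour timeout; both values are assumptions.
        System.out.println(withStreamingTimeout("jdbc:mysql://localhost:3306/mydb", 3600));
    }
}
```

<p>The URL returned by <tt>withStreamingTimeout</tt> would then be handed to <tt>DriverManager.getConnection</tt> as usual.  Since the property is specific to MySQL, it probably belongs in per-database configuration rather than in shared indexing code.</p>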
<p>I am still deciding what to do, and am also thinking about other options that can decouple DB retrieval from the indexing process.  At any rate, I wanted to post the cause of this exception, because I did not find anyone else whose Communications link failure was caused by a timeout during ResultSet processing, and hopefully this will save someone some time.</p>
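<p>One way the decoupling idea could be sketched is a producer/consumer hand-off through a bounded queue, so the loop that drains the ResultSet never waits directly on the indexer.  This is only an illustration under assumptions: <tt>DecoupledIndexer</tt> is a hypothetical name, rows are plain strings, and adding to a list stands in for sending a document to Solr.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: the reader thread plays the ResultSet loop,
// the indexer thread plays the code that adds documents to Solr.
class DecoupledIndexer {

    // Unique end-of-data marker, compared by reference so it can never
    // collide with a real row value.
    private static final String END = new String("END");

    static List<String> run(List<String> rows, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        List<String> indexed = new ArrayList<>();

        Thread indexer = new Thread(() -> {
            try {
                for (String row = queue.take(); row != END; row = queue.take()) {
                    indexed.add(row); // stand-in for the (possibly slow) Solr add
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();

        try {
            for (String row : rows) {
                queue.put(row); // blocks only when the queue is full
            }
            queue.put(END);
            indexer.join(); // join makes the indexer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return indexed;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("row-1", "row-2", "row-3"), 2));
    }
}
```

<p>Note that a bounded queue still pushes back on the reader once it fills up, so the capacity would have to cover the longest merge stall, or the rows would need to be spooled to disk instead.</p>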
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data &#124; High Scalability</title>
		<link>http://lucene.grantingersoll.com/2008/02/01/how-rackspace-now-uses-mapreduce-and-hadoop-to-query-terabytes-of-data-high-scalability/</link>
		<comments>http://lucene.grantingersoll.com/2008/02/01/how-rackspace-now-uses-mapreduce-and-hadoop-to-query-terabytes-of-data-high-scalability/#comments</comments>
		<pubDate>Fri, 01 Feb 2008 21:15:37 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[scalability]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/02/01/how-rackspace-now-uses-mapreduce-and-hadoop-to-query-terabytes-of-data-high-scalability/</guid>
		<description><![CDATA[How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data &#124; High Scalability Nice article on how the Lucene/Hadoop/Solr stack was used to solve a really big problem.  Someday, I hope (when we have actual code),  they can add Mahout to the equation and do even more interesting things with the data.]]></description>
			<content:encoded><![CDATA[<p><a href="http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data">How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data | High Scalability</a></p>
<p>Nice article on how the Lucene/Hadoop/Solr stack was used to solve a really big problem.  Someday, I hope (when we have actual code),  they can add <a href="http://lucene.apache.org/mahout">Mahout</a> to the equation and do even more interesting things with the data.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/02/01/how-rackspace-now-uses-mapreduce-and-hadoop-to-query-terabytes-of-data-high-scalability/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Good Math, Bad Math : Databases are hammers; MapReduce is a screwdriver.</title>
		<link>http://lucene.grantingersoll.com/2008/01/26/good-math-bad-math-databases-are-hammers-mapreduce-is-a-screwdriver/</link>
		<comments>http://lucene.grantingersoll.com/2008/01/26/good-math-bad-math-databases-are-hammers-mapreduce-is-a-screwdriver/#comments</comments>
		<pubDate>Sat, 26 Jan 2008 13:28:04 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Map Reduce]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/01/26/good-math-bad-math-databases-are-hammers-mapreduce-is-a-screwdriver/</guid>
		<description><![CDATA[Good Math, Bad Math : Databases are hammers; MapReduce is a screwdriver. Well-stated response to a criticism of Map Reduce.  Adding my own two cents, I once used Hadoop, a free open source implementation of Map Reduce (M/R) in a proof of concept implementation, to automatically translate (as in machine translation) a large (in [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php">Good Math, Bad Math : Databases are hammers; MapReduce is a screwdriver.</a></p>
<p>Well-stated response to a criticism of Map Reduce.  Adding my own two cents, I once used <a href="http://hadoop.apache.org">Hadoop</a>, a free open source implementation of Map Reduce (M/R), in a proof-of-concept implementation to automatically translate (as in machine translation) a large (in my terms) collection of documents from one language (Arabic) to another (English).  It&#8217;s something that would be really hard to do in a database.  Besides, I had a bunch of dumb old machines lying around, while I didn&#8217;t have a $1 million plus license of Oracle lying around.</p>
<p>Other things M/R is nice for: crawling (see Nutch), parallel indexing for search engines, log analysis, machine learning, etc.</p>
<p>Finally, I first started doing parallel programming a fair number of years ago (remember the CM-5 from Thinking Machines?) and we used the Message Passing Interface APIs (MPI) amongst others.  As the author of the article above stresses, M/R is good for SOME large scale programs (see the new <a href="http://lucene.apache.org/mahout">Mahout</a> project at Apache, for some examples).  There are some problems that are really large and just don&#8217;t fit in the M/R model.  As with anything you do in life, take the time to figure out which one is right for you.  You may have to rise above yourself and learn something new.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/01/26/good-math-bad-math-databases-are-hammers-mapreduce-is-a-screwdriver/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

