<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; Performance</title>
	<atom:link href="http://lucene.grantingersoll.com/category/performance/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Mon, 06 Feb 2012 12:07:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Join the Lucene Revolution in Boston October 2010 &#124; www.lucenerevolution.org</title>
		<link>http://lucene.grantingersoll.com/2010/05/17/join-the-lucene-revolution-in-boston-october-2010-www-lucenerevolution-org/</link>
		<comments>http://lucene.grantingersoll.com/2010/05/17/join-the-lucene-revolution-in-boston-october-2010-www-lucenerevolution-org/#comments</comments>
		<pubDate>Mon, 17 May 2010 12:45:38 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Connector Framework]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Open Relevance]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Real Time Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[spatial]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=368</guid>
		<description><![CDATA[Join the Lucene Revolution in Boston October 2010 &#124; www.lucenerevolution.org. Hope to see you in Boston!]]></description>
			<content:encoded><![CDATA[<p><a href="http://lucenerevolution.com/">Join the Lucene Revolution in Boston October 2010 | www.lucenerevolution.org</a>.</p>
<p>Hope to see you in Boston!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2010/05/17/join-the-lucene-revolution-in-boston-october-2010-www-lucenerevolution-org/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Assumptions (in Apache Lucene and Solr and pretty much everything else) Considered Harmful</title>
		<link>http://lucene.grantingersoll.com/2009/09/22/assumptions-in-apache-lucene-and-solr-and-pretty-much-everything-else-considered-harmful/</link>
		<comments>http://lucene.grantingersoll.com/2009/09/22/assumptions-in-apache-lucene-and-solr-and-pretty-much-everything-else-considered-harmful/#comments</comments>
		<pubDate>Tue, 22 Sep 2009 17:04:51 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=268</guid>
		<description><![CDATA[I had a Football (American Football, that is, not soccer) coach who always used to drill into our heads what happens when one assumes something about our opponent for that week; he&#8217;d get all worked up, hoist up his coaching shorts (you know the ones, they should be banned&#8230;), puff out his chest, give you [...]]]></description>
			<content:encoded><![CDATA[<p>I had a Football (American Football, that is, not soccer) coach who always used to drill into our heads what happens when one assumes something about our opponent for that week; he&#8217;d get all worked up, hoist up his coaching shorts (you know the ones, they should be banned&#8230;), puff out his chest, give you a look that was part wink and part anger and say something to the effect of:  &#8220;You know what happens when you assume?  You make an ass out of &#8216;u&#8217; and &#8216;me&#8217;.&#8221;</p>
<p>That saying comes back to me often these days when I hear again and again why some programmer chose to do something a certain way in Apache Lucene (and to some extent Solr, but less so since it takes care of most of the details of Lucene), despite the documentation and the community saying don&#8217;t do it that way.  The usual case for this involves paging deeper into the results, but I&#8217;ve seen it in many other areas as well, such as:</p>
<ol>
<li>Loading everything into RAMDirectory instead of just relying on O/S caching because &#8220;it&#8217;s faster&#8221;, even though they don&#8217;t quantify it</li>
<li>Faceting implementations</li>
<li>Overriding defaults without testing</li>
<li>Blindly using defaults without testing (stemming in particular!)</li>
<li>Using a very large JVM Heap, thus choking off memory for the O/S, because &#8220;more memory is better&#8221;</li>
</ol>
<p>In the paging case, the programmer thinks it is too expensive to execute the query a second or third time, so they go and retrieve 10 or 20 pages&#8217; worth of results, if not much, much more, stuff them in a cache and then return the top ten.  There are many problems with this, the first being that most people don&#8217;t go beyond page one or two so all that work is wasted anyway.  The second is simply that Lucene is super fast at executing the search, never mind the fact that the Operating System probably cached everything it needs to do the search anyway, and by caching all that info yourself you may have forced some of those O/S caches out of memory, thus slowing down subsequent searches.  Third, it creates excessive garbage, which can lead to major garbage collections, thus grinding the app to a halt.  Fourth, materializing the actual documents from disk for the search results is often expensive because doing so usually involves random seeks on disk.  In this case, the developer &#8220;assumes&#8221; Lucene would be slow at something, but didn&#8217;t bother to actually measure it to really know.  Oftentimes, what comes out of all this &#8220;premature optimization&#8221; is a whole slew of code that now needs to be maintained, thus further complicating the application and making it harder for new developers to participate and make the application better, all while costing the company time and money.  Thus, assumptions in Lucene and Solr (and elsewhere) are <a href="http://en.wikipedia.org/wiki/Considered_harmful">Considered Harmful</a>.  See Lucene&#8217;s <a href="http://wiki.apache.org/lucene-java/BestPractices">Best Practices</a> page and Solr&#8217;s <a href="http://wiki.apache.org/solr/SolrPerformanceFactors">Performance Factors</a>, amongst other resources like &#8220;Lucene in Action&#8221; (2nd edition), for more info on how to do it right (or to challenge our assumptions!)</p>
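<p>To make the paging point concrete, here is a minimal sketch of the alternative: re-run the query on every request and keep only the slice for the page being shown.  The <code>TopHitsSearcher</code> interface and the <code>page</code> method are hypothetical names made up for this illustration, not Lucene APIs; in a real application the search call would be something like Lucene&#8217;s <code>IndexSearcher.search(Query, int)</code>, and whether this beats a cache for your application is exactly the kind of thing to measure rather than assume.</p>

```java
import java.util.ArrayList;
import java.util.List;

final class PagingSketch {

    /** Hypothetical stand-in for a search call: ids of the top n hits, best first. */
    interface TopHitsSearcher {
        List<String> topHits(String query, int n);
    }

    /**
     * Re-executes the query and returns only the ids for the requested page.
     * Pages are zero-based. Nothing is cached between requests.
     */
    static List<String> page(TopHitsSearcher searcher, String query, int page, int pageSize) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and pageSize must be > 0");
        }
        int start = page * pageSize;
        // Ask only for as many hits as are needed to reach the end of this page.
        List<String> hits = searcher.topHits(query, start + pageSize);
        if (start >= hits.size()) {
            return List.of(); // paged past the last hit
        }
        int end = Math.min(hits.size(), start + pageSize);
        return List.copyOf(hits.subList(start, end));
    }

    public static void main(String[] args) {
        // Fake index with 25 document ids, used only to exercise the sketch.
        List<String> index = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            index.add("doc" + i);
        }
        TopHitsSearcher fake = (q, n) -> index.subList(0, Math.min(n, index.size()));

        System.out.println(page(fake, "lucene", 0, 10)); // first page
        System.out.println(page(fake, "lucene", 2, 10)); // last, partial page
    }
}
```

<p>The sketch deals only in document ids; the stored documents would be loaded just for the ten or so hits actually displayed, which keeps the disk seeks mentioned above to a minimum.</p>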
<p>This isn&#8217;t just a Lucene/Solr phenomenon, and of course, I am not without sin, as I often catch myself or get caught by others, too.  Writing this is as much a reminder to me to be pragmatic and to test my assumptions as it is to anyone else.  Of course, one of the best things about a community like Lucene and Solr is having people who can objectively challenge your assumptions as opposed to  more traditional development models where it is often the case that the most experienced or most senior developers rule the roost and developers are shackled by the &#8220;one right way&#8221; to do things.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/09/22/assumptions-in-apache-lucene-and-solr-and-pretty-much-everything-else-considered-harmful/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lucid Imagination » Understanding Lucene Performance – Free online workshop</title>
		<link>http://lucene.grantingersoll.com/2009/09/01/lucid-imagination-%c2%bb-understanding-lucene-performance-%e2%80%93-free-online-workshop/</link>
		<comments>http://lucene.grantingersoll.com/2009/09/01/lucid-imagination-%c2%bb-understanding-lucene-performance-%e2%80%93-free-online-workshop/#comments</comments>
		<pubDate>Tue, 01 Sep 2009 13:06:07 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Performance]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=252</guid>
		<description><![CDATA[Andrzej Bialecki is giving a webinar for Lucid on Apache Lucene performance on Thursday.  More info is available at: Lucid Imagination » Understanding Lucene Performance – Free online workshop.]]></description>
			<content:encoded><![CDATA[<p>Andrzej Bialecki is giving a webinar for Lucid on Apache Lucene performance on Thursday.  More info is available at:</p>
<p><a href="http://www.lucidimagination.com/blog/2009/08/27/understanding-lucene-performance-free-online-workshop/">Lucid Imagination » Understanding Lucene Performance – Free online workshop</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/09/01/lucid-imagination-%c2%bb-understanding-lucene-performance-%e2%80%93-free-online-workshop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Copying TREC is the Wrong Track for the Enterprise &#124; The Noisy Channel</title>
		<link>http://lucene.grantingersoll.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise-the-noisy-channel/</link>
		<comments>http://lucene.grantingersoll.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise-the-noisy-channel/#comments</comments>
		<pubDate>Tue, 19 May 2009 03:22:50 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Open Relevance]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=191</guid>
		<description><![CDATA[Copying TREC is the Wrong Track for the Enterprise &#124; The Noisy Channel. Daniel Tunkelang has written up an interesting post on the new Open Relevance Project that a few other Lucene people and I are starting up and I thought I would respond here: Little late to the conversation, but I think maybe we [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://thenoisychannel.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise/">Copying TREC is the Wrong Track for the Enterprise | The Noisy Channel</a>.</p>
<p>Daniel Tunkelang has written up an interesting post on the new <a href="http://wiki.apache.org/lucene-java/OpenRelevance">Open Relevance Project</a> that a few other Lucene people and I are starting up, and I thought I would respond here:</p>
<blockquote><p>Little late to the conversation, but I think maybe we should back up a little bit.   I like a lot of the comments and wish they were actually made on general@lucene.apache.org where we are discussing the merits of the undertaking (see <a href="http://www.lucidimagination.com/search/document/76d7cdeed4882397">http://www.lucidimagination.com/search/document/76d7cdeed4882397</a>)  not that I expect that to happen given the way blogs work. At any rate, I&#8217;d like to add my two cents as the one who started the thread on general@lucene.apache.org.</p>
<p>First off, the ORP is <span style="text-decoration: underline;"><strong>VERY</strong></span> early stage brainstorming.  ORP really doesn&#8217;t warrant much attention at this point and it is premature to even speculate about how it relates to TREC, Google, Yahoo!, Microsoft or anything else.   I&#8217;m not even sure it has enough support to be a viable Lucene subproject! For now, I think most of us who are actually working on the genesis of the project are merely looking for a means to improve Lucene (and also Solr, Nutch and Mahout), despite what Otis says in his blog post about having grander notions for comparing across engines.</p>
<p>So, to the background&#8230;</p>
<p>This (ORP) is something I&#8217;ve been thinking about for a long time now and have discussed with a number of people in the past.    The motivation comes from my frustration over the years in not being able to obtain data that everyone on Lucene can use without limitations, since I&#8217;ve almost always worked in places that had little money to spend on this kind of thing.<br />
See <a href="http://www.lucidimagination.com/search/document/656d5ca50c8c9242">http://www.lucidimagination.com/search/document/656d5ca50c8c9242</a>, <a href="http://lucene.grantingersoll.com/category/trec/">http://lucene.grantingersoll.com/category/trec/</a> and <a href="http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/">http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/</a> for background.  The second motivation is simply to have practical, real world data driven by actual users.</p>
<p>In the past, I have talked with both NIST and Sheffield to try to work out terms by which the Lucene community could obtain TREC resources, but the licensing terms simply prevent a totally free redistribution.  (BTW, this is not NIST/Sheffield&#8217;s fault, but the company that allows them to use the data.  NIST/Sheffield are doing the best they can given their constraints.)  I have also talked with a few commercial companies that redistribute data (blogs, etc.) all to no avail (it&#8217;s usually the copyright that kills it.)  If the ASF were to buy the dataset, we could distribute to the committers on Lucene according to the licensing terms, but not to the broader community and we&#8217;d have to maintain a list of who has it, etc.  See the terms <a href="http://ir.dcs.gla.ac.uk/test_collections/">here</a>, for instance.  Since many of the best ideas come from the community in Open Source and you never know when and where they come from, I deemed this unacceptable and decided not to pursue it even though the ASF authorized me to go forward with it (i.e. spend the money) if I wanted to.  After all, it is only a few hundred bucks.</p>
<p>To me, it is vital that there be an open and <span style="text-decoration: underline;"><strong>FREE</strong></span> means for doing relevance tests that the Lucene community can use to improve itself.  If others can benefit, so be it.  Much like Lucene developed a benchmarking tool for people to share performance tests (both speed and relevance) in a straightforward way (see the contrib/benchmark section of the Lucene distribution), so too is there a need for us (speaking, unofficially, for Lucene) to talk about relevance in a public way so we can compare notes just as any two researchers buried in the bowels of a commercial company might compare notes.   Many, many people have used Lucene to do TREC (in fact, I have), but it is a showstopper when the other person you are discussing relevance with can&#8217;t just pick up the exact same bits (corpus, queries, judgments) and run the exact same tests.  In other words, the goal is not to compare competing offerings, IMO (although it will likely happen b/c that is human nature); it is to give Lucene users a common way of evaluating and talking about relevance.</p>
<p>As anyone familiar with Lucene knows, ORP will be driven by the people that show up and volunteer to contribute to it, as are all Apache projects.  Thus, the slate really is clean.  If anyone (and I truly mean anyone, not just Lucene users, even though that is the preliminary focus) is interested, please show up and discuss over at general@lucene.apache.org.   We&#8217;d welcome the ideas and, moreover, any efforts.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise-the-noisy-channel/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Solr 1.3.0 Released</title>
		<link>http://lucene.grantingersoll.com/2008/09/17/solr-130-released/</link>
		<comments>http://lucene.grantingersoll.com/2008/09/17/solr-130-released/#comments</comments>
		<pubDate>Wed, 17 Sep 2008 12:45:48 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=106</guid>
		<description><![CDATA[Apache Solr 1.3.0 has been released.  This version contains many, many improvements and bug fixes.  High on my list are things like a good first step on distributed search support, integrated spell checking, support for Lucene&#8217;s &#8220;More Like This&#8221;, and the much needed Data Import Handler.  Of course, one can&#8217;t forget about the numerous performance [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://lucene.apache.org/solr">Apache Solr</a> 1.3.0 has been released.  This version contains many, many improvements and bug fixes.  High on my list are things like a good first step on distributed search support, integrated spell checking, support for Lucene&#8217;s &#8220;More Like This&#8221;, and the much needed Data Import Handler.  Of course, one can&#8217;t forget about the numerous performance improvements.  Since this release is using a very recent version of Lucene, it takes advantage of all of the improvements in Lucene 2.3 and above, which are quite significant.</p>
<p>With the Data Import Handler, for example, one can now easily hook Solr up to any database that supports JDBC and have it index the database for search.  Since most databases aren&#8217;t particularly good at text search, this will be a really attractive option for a lot of people who want to index a database.</p>
<p>The new distributed search support is also quite nice in that it allows developers to work with even larger collections without having to roll their own complex solution in low level Lucene.  While it currently requires some operations support to be bullet proof (think load balancers, etc.), it is a great first step and I know it is already used in production by several large installations.</p>
<p>As for spell checking, it used to be that one had to issue a separate query to get spelling results through the SpellcheckRequestHandler.  Now, thanks to Solr&#8217;s pluggable SearchComponent architecture, Google-style &#8220;Did you mean&#8221; results are available in the same response as the query results (assuming you turn the component on).</p>
<p>We also are now publishing maven artifacts (but note, I forgot to do it w/ the original distribution process, so they may not show up yet).</p>
<p>For all the features, fixes, etc. see the announcement on the Solr homepage and the link to the release notes.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/09/17/solr-130-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Text Processing: Why Servers Choke : Beyond Search</title>
		<link>http://lucene.grantingersoll.com/2008/09/07/text-processing-why-servers-choke-beyond-search/</link>
		<comments>http://lucene.grantingersoll.com/2008/09/07/text-processing-why-servers-choke-beyond-search/#comments</comments>
		<pubDate>Sun, 07 Sep 2008 12:35:57 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Performance]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=101</guid>
		<description><![CDATA[Text Processing: Why Servers Choke : Beyond Search If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to suggest that Lucene is a very slow horse in a slow race. Are we reading the same paper?  This hardly says Lucene is a slow horse in the race.  [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p><a href="http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/">Text Processing: Why Servers Choke : Beyond Search</a><br />
If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to suggest that Lucene is a very slow horse in a slow race.</p></blockquote>
<p>Are we reading the same paper?  This hardly says Lucene is a slow horse in the race.  What it says is that Lucene&#8217;s StandardTokenizer is slow in comparison to the paper&#8217;s approach for this one particular piece.  It is quite a leap to say that Lucene overall is slow, which just doesn&#8217;t hold water with most people&#8217;s experience.  The paper also doesn&#8217;t compare Lucene to other search engines.  Most notably, the authors fully admit that their comparison &#8220;is not strictly an apples to apples comparison&#8221; because Lucene&#8217;s StandardTokenizer does other things to produce tokens that are actually useful for the user further down the stream, like identifying email and web addresses, etc.  Don&#8217;t get me wrong, I&#8217;m not saying SpeedyFX isn&#8217;t interesting and worthwhile, just saying people shouldn&#8217;t infer something from a paper that isn&#8217;t there.</p>
<p>Note also that Lucene 2.3 has much improved indexing speed and this paper was written against 2.2, and many of the speedups focus on tokenization during the indexing process (i.e. object creation, object reuse, etc.).  We also upgraded our grammar to use JFlex, which we found to be much faster than JavaCC.  Can&#8217;t say what the numbers are in relation to this paper, but it would be interesting to see.  Perhaps the SpeedyFX people can share their code so we can all see.  I know, I know, researchers don&#8217;t like to do that, but to me it&#8217;s always a big gaping hole in these kinds of papers.</p>
<p>FWIW, StandardTokenizer is just one of many approaches to tokenization that Lucene provides. Furthermore, it is often not the long pole in the tent when it comes to indexing speed.</p>
<p>Still, the ideas are worth looking into.  Lucene&#8217;s always open to improvements, and all can benefit from them, as is the beauty of Open Source.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/09/07/text-processing-why-servers-choke-beyond-search/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Apache Hadoop Wins Terabyte Sort Benchmark (Hadoop and Distributed Computing at Yahoo!)</title>
		<link>http://lucene.grantingersoll.com/2008/07/03/apache-hadoop-wins-terabyte-sort-benchmark-hadoop-and-distributed-computing-at-yahoo/</link>
		<comments>http://lucene.grantingersoll.com/2008/07/03/apache-hadoop-wins-terabyte-sort-benchmark-hadoop-and-distributed-computing-at-yahoo/#comments</comments>
		<pubDate>Thu, 03 Jul 2008 12:57:55 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Map Reduce]]></category>
		<category><![CDATA[Performance]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/07/03/apache-hadoop-wins-terabyte-sort-benchmark-hadoop-and-distributed-computing-at-yahoo/</guid>
		<description><![CDATA[Apache Hadoop Wins Terabyte Sort Benchmark (Hadoop and Distributed Computing at Yahoo!) Congrats to the Hadoop team!  Score one for Open Source!]]></description>
			<content:encoded><![CDATA[<p><a href="http://developer.yahoo.com/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html">Apache Hadoop Wins Terabyte Sort Benchmark (Hadoop and Distributed Computing at Yahoo!)</a></p>
<p>Congrats to the Hadoop team!  Score one for Open Source!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/07/03/apache-hadoop-wins-terabyte-sort-benchmark-hadoop-and-distributed-computing-at-yahoo/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Open Source Search Relevance Follow Up</title>
		<link>http://lucene.grantingersoll.com/2008/05/22/open-source-search-relevance-follow-up/</link>
		<comments>http://lucene.grantingersoll.com/2008/05/22/open-source-search-relevance-follow-up/#comments</comments>
		<pubDate>Thu, 22 May 2008 10:49:39 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[queries]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[TREC]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=82</guid>
		<description><![CDATA[Jeff&#8217;s Search Engine Caffè Copyright and distribution issues Let&#8217;s say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that evolve over time. GOV2 is a static [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p><a href="http://www.searchenginecaffe.com/">Jeff&#8217;s Search Engine Caffè</a><br />
Copyright and distribution issues<br />
Let&#8217;s say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that evolve over time. GOV2 is a static crawl of 25 million government documents and can therefore be distributed without too much hassle. Not to mention there is little to no spam. However, there&#8217;s a problem: commercial documents are copyrighted! Is it possible to create a large-scale test collection of web documents that can be shared freely? I don&#8217;t know the answer to that question. Could that volume of data even be distributed?</p></blockquote>
<p>Right, we are not going to get into the distribution/copyright game.  We are going to focus on <strong>using</strong> collections that are freely available.  Each user would just be told what to download.</p>
<p>For example, we could do something like:</p>
<p>Have the user download a static version of Wikipedia from a specific date, index them however they see fit, then run a set of queries we develop and then rate the top 10 or 20 and post their results, including their actual implementation, which is always lacking other than the usual hand waving of saying &#8220;we did stemming and relevance feedback&#8221;.  We have the advantage in that we can say EXACTLY what we did, no question on implementation, so, gasp, others can repeat the exact experiments, like any good scientist does, before going on to improve it.   Then, when the next person comes along, they do the same thing.  If they disagree about the judgments for the same run, we have a discussion and one person convinces the other and we move on.   Next, someone will come along with a scoring improvement and post those results, and now people will know the current &#8220;best&#8221; algorithm for this set of data.</p>
<p>Lather, rinse, repeat for other collections, developed over time.  Any engine can submit, anybody can participate.  Open source at its best!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/05/22/open-source-search-relevance-follow-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open Source Search Engine Relevance</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/</link>
		<comments>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comments</comments>
		<pubDate>Sun, 18 May 2008 11:23:11 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[TREC]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81</guid>
		<description><![CDATA[For a while now, I have been trying to get my hands on TREC data for the Lucene project.  For those who aren&#8217;t familiar, TREC is an annual competition for search engines that provides a common set of documents to index, queries to execute and judgments to check your answers to see how well an [...]]]></description>
			<content:encoded><![CDATA[<p>For a while now, I have been trying to get my hands on <a href="http://trec.nist.gov/">TREC</a> data for the Lucene project.  For those who aren&#8217;t familiar, TREC is an annual competition for search engines that provides a common set of documents to index, queries to execute and judgments to check your answers to see how well an engine performs.  While it isn&#8217;t the be-all and end-all for relevance, it is a pretty good sanity check on how you are doing.  For instance, many search engines do OK out of the box on it, but once you tune them, they can do much better.  Of course, you risk overtuning to TREC as well.</p>
<p>In TREC, the queries and the judgments are provided for free, but one has to pay for the data, or at least most of it, since it is usually owned by Reuters or some other organization.  It isn&#8217;t expensive or anything, but it is a barrier nonetheless, especially for an open source project.  Furthermore, the whole notion of paying for data in this day and age of open source and Creative Commons just doesn&#8217;t sit right with me.   Don&#8217;t get me wrong: I&#8217;m a big fan of TREC, having participated in the past, and it provides a valuable service to the proprietary/academic IR community.</p>
<p>So, what does this have to do with Lucene?  When I say I am trying to get my hands on TREC data, I don&#8217;t mean just for me; I literally mean obtaining TREC data for Lucene.  That is, I want the data to be made available, ideally, for all Lucene (and, for that matter, all open source search engine) users to use and run experiments on so as to spur on innovation in Lucene&#8217;s scoring algorithms, etc.  Now, I know the copyright owners will never allow this, as I have asked.  So, my next thought was let&#8217;s just get it for internal use by committers at Apache.  So, I went back to TREC and we have an agreement to do this, more or less.  The problem, however, is that they say we can only use the data on ASF (Apache) machines.  Not a big deal, right?  Kind of.  The ASF doesn&#8217;t really have the hardware to run TREC-style experiments.  We pretty much have one Solaris &#8220;zone&#8221; allotted to us (a &#8220;zone&#8221; is a running virtual machine guest image.)  Furthermore, the ASF is pretty much an all-volunteer, worldwide distributed organization.  We do almost all of our work on our own machines as VOLUNTEERS.   Practically speaking, the best way for any of us to take advantage of the data is to have it locally, which, I am told, isn&#8217;t going to happen.</p>
<p>So, what&#8217;s the point?  I think it is time the open source search community (and I don&#8217;t mean just Lucene) develop and publish a set of TREC-style relevance judgments for freely available data that is easily obtained from the Internet.  Simply put, I am wondering if there are volunteers out there who would be willing to develop a practical set of queries and judgments for datasets like Wikipedia, iBiblio, the Internet Archive, etc.  We wouldn&#8217;t host these datasets, we would just provide the queries and judgments, as well as the info on how to obtain the data.  Then, it is easy enough to provide simple scripts that do things like run Lucene&#8217;s contrib/benchmark Quality tasks against said data.</p>
<p>Practically speaking, I don&#8217;t think we even need to go as deep as TREC.  I think we would find the most use in making judgments on the top 10 or 20 results for any given query.</p>
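<p>As an illustration of what judging only the top 10 or 20 results buys you, here is a minimal sketch that scores one query with precision at k, i.e. the fraction of the first k results a judge marked relevant.  Precision at k is just one common measure picked for this example, and the class name, method name and document ids are all made up for illustration.</p>

```java
import java.util.List;
import java.util.Set;

final class PrecisionAtK {

    /**
     * Fraction of the first k ranked results that were judged relevant.
     * If fewer than k results came back, the missing slots count as misses.
     */
    static double precisionAtK(List<String> rankedDocIds, Set<String> relevantDocIds, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        int limit = Math.min(k, rankedDocIds.size());
        int relevantSeen = 0;
        for (int i = 0; i < limit; i++) {
            if (relevantDocIds.contains(rankedDocIds.get(i))) {
                relevantSeen++;
            }
        }
        return (double) relevantSeen / k;
    }

    public static void main(String[] args) {
        // Hypothetical run: what the engine returned, best first.
        List<String> ranked = List.of("d3", "d7", "d1", "d9", "d4");
        // Hypothetical judgments: documents a volunteer marked relevant for this query.
        Set<String> relevant = Set.of("d1", "d3", "d4");

        // Three of the top five results were judged relevant.
        System.out.println(precisionAtK(ranked, relevant, 5));
    }
}
```

<p>Averaging this number over the whole query set gives a single figure that two people running the same data and the same judgments can compare directly.</p>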
<p>So, what do others think?  Am I off my rocker?  Are there any volunteers out there?  I think we could do this pretty simply through some scripts and the effective use of a wiki.  I don&#8217;t think our goal is, in the short run, to be scientifically rigorous, but it should be over time.  Instead, I think our goal is to run a practical relevance test like any organization should when deploying search: take 50 (top) queries and judge them, as well as 20 or so random queries and judge them.  (I wonder if Wikipedia would give us their top 50 queries, or maybe they are already available.)  Over time, we can add queries and refine judgments using the web 2.0 mentality of the wisdom of crowds.</p>
<p>FWIW, there is probably some alignment with the <a href="http://search.wikia.com/wiki/Search_Wikia">Wikia</a> search project.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>FeatherCast » Blog Archive » Episode 43: Lucene</title>
		<link>http://lucene.grantingersoll.com/2008/02/21/feathercast-%c2%bb-blog-archive-%c2%bb-episode-43-lucene/</link>
		<comments>http://lucene.grantingersoll.com/2008/02/21/feathercast-%c2%bb-blog-archive-%c2%bb-episode-43-lucene/#comments</comments>
		<pubDate>Fri, 22 Feb 2008 02:57:52 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[feathercast]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/02/21/feathercast-%c2%bb-blog-archive-%c2%bb-episode-43-lucene/</guid>
		<description><![CDATA[FeatherCast » Blog Archive » Episode 43: Lucene I did a FeatherCast today with Rich Bowen.  Dang, he is quick at editing&#8230;]]></description>
			<content:encoded><![CDATA[<p><a href="http://feathercast.org/?p=61">FeatherCast » Blog Archive » Episode 43: Lucene</a></p>
<p>I did a FeatherCast today with Rich Bowen.  Dang, he is quick at editing&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/02/21/feathercast-%c2%bb-blog-archive-%c2%bb-episode-43-lucene/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

