<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; relevance</title>
	<atom:link href="http://lucene.grantingersoll.com/category/relevance/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Mon, 06 Feb 2012 12:07:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>SF Bay Area Lucene/Solr Meetup</title>
		<link>http://lucene.grantingersoll.com/2009/06/04/sf-bay-area-lucenesolr-meetup/</link>
		<comments>http://lucene.grantingersoll.com/2009/06/04/sf-bay-area-lucenesolr-meetup/#comments</comments>
		<pubDate>Thu, 04 Jun 2009 17:49:35 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[canopy clustering]]></category>
		<category><![CDATA[Droids]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Latent Dirichlet Allocation]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Open Relevance]]></category>
		<category><![CDATA[Real Time Search]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tika]]></category>
		<category><![CDATA[Meetup]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=197</guid>
		<description><![CDATA[Just wanted to follow up on last night&#8217;s Lucene/Solr Meetup in San Francisco. First off, special thanks to all the speakers (Jason Rutherglen, Michael Busch, Erik Hatcher and all the lightning talks.)  We had a lot of excellent talks ranging from low level Lucene details on payloads and real time search to high level discussions [...]]]></description>
			<content:encoded><![CDATA[<p>Just wanted to follow up on last night&#8217;s Lucene/Solr <a href="http://www.meetup.com/SFBay-Lucene-Solr-Meetup/">Meetup</a> in San Francisco.</p>
<p>First off, special thanks to all the speakers (Jason Rutherglen, Michael Busch, Erik Hatcher and all the lightning talks.)  We had a lot of excellent talks ranging from low level Lucene details on payloads and real time search to high level discussions on new features in Solr and best practices for working on stopwords and relevance.  We also had intros to <a href="http://lucene.apache.org/mahout">Mahout</a>, <a href="http://lucene.apache.org/tika">Tika</a> and the new <a href="http://www.lucidimagination.com/search/document/84205d273f3753c2/open_relevance_project_kickoff">Open Relevance</a> project at Lucene.  I&#8217;ll post the slides on the Meetup site when they are available (I am still waiting to get them from the speakers.)</p>
<p>Second, I really enjoyed engaging with so many people about what they are working on in Lucene/Solr.  It is always fun to hear all the different ways people are (ab)using Lucene/Solr to do cool things.   It was especially good to meet some fellow Mahout committers (Ted Dunning and Jeff Eastman) for the first time, as well as one of Mahout&#8217;s Google Summer of Code students, David Hall, who is working on adding <a href="http://www.lucidimagination.com/search/?q=Latent+Dirichlet">Latent Dirichlet Allocation</a>.</p>
<p>Finally, I look forward to doing more of these.  Right now, I&#8217;m looking for interest in Raleigh, NC, but I know we&#8217;ll likely have another one in the Bay Area again soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/06/04/sf-bay-area-lucenesolr-meetup/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Copying TREC is the Wrong Track for the Enterprise &#124; The Noisy Channel</title>
		<link>http://lucene.grantingersoll.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise-the-noisy-channel/</link>
		<comments>http://lucene.grantingersoll.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise-the-noisy-channel/#comments</comments>
		<pubDate>Tue, 19 May 2009 03:22:50 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Open Relevance]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=191</guid>
		<description><![CDATA[Copying TREC is the Wrong Track for the Enterprise &#124; The Noisy Channel. Daniel Tunkelang has written up an interesting post on the new Open Relevance Project that a few other Lucene people and I are starting up, and I thought I would respond here: Little late to the conversation, but I think maybe we [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://thenoisychannel.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise/">Copying TREC is the Wrong Track for the Enterprise | The Noisy Channel</a>.</p>
<p>Daniel Tunkelang has written up an interesting post on the new <a href="http://wiki.apache.org/lucene-java/OpenRelevance">Open Relevance Project</a> that a few other Lucene people and I are starting up, and I thought I would respond here:</p>
<blockquote><p>Little late to the conversation, but I think maybe we should back up a little bit.   I like a lot of the comments and wish they were actually made on general@lucene.apache.org where we are discussing the merits of the undertaking (see <a href="http://www.lucidimagination.com/search/document/76d7cdeed4882397">http://www.lucidimagination.com/search/document/76d7cdeed4882397</a>)  not that I expect that to happen given the way blogs work. At any rate, I&#8217;d like to add my two cents as the one who started the thread on general@lucene.apache.org.</p>
<p>First off, the ORP is <span style="text-decoration: underline;"><strong>VERY</strong></span> early stage brainstorming.  ORP really doesn&#8217;t warrant much attention at this point and it is premature to even speculate about how it relates to TREC, Google, Yahoo!, Microsoft or anything else.   I&#8217;m not even sure it has enough support to be a viable Lucene subproject! For now, I think most of us who are actually working on the genesis of the project are merely looking for a means to improve Lucene (and also Solr, Nutch and Mahout), despite what Otis says in his blog post about having grander notions for comparing across engines.</p>
<p>So, to the background&#8230;</p>
<p>This (ORP) is something I&#8217;ve been thinking about for a long time now and have discussed with a number of people in the past.    The motivation comes from my frustration over the years in not being able to obtain data that everyone on Lucene can use without limitations, since I&#8217;ve almost always worked in places that had little money to spend on this kind of thing.<br />
See <a href="http://www.lucidimagination.com/search/document/656d5ca50c8c9242">http://www.lucidimagination.com/search/document/656d5ca50c8c9242</a>, <a href="http://lucene.grantingersoll.com/category/trec/">http://lucene.grantingersoll.com/category/trec/</a> and <a href="http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/">http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/</a> for background.  The second motivation is simply to have practical, real world data driven by actual users.</p>
<p>In the past, I have talked with both NIST and Sheffield to try to work out terms by which the Lucene community could obtain TREC resources, but the licensing terms simply prevent a totally free redistribution.  (BTW, this is not NIST/Sheffield&#8217;s fault, but the company that allows them to use the data.  NIST/Sheffield are doing the best they can given their constraints.)  I have also talked with a few commercial companies that redistribute data (blogs, etc.) all to no avail (it&#8217;s usually the copyright that kills it.)  If the ASF were to buy the dataset, we could distribute to the committers on Lucene according to the licensing terms, but not to the broader community and we&#8217;d have to maintain a list of who has it, etc.  See the terms <a href="http://ir.dcs.gla.ac.uk/test_collections/">here</a>, for instance.  Since many of the best ideas come from the community in Open Source and you never know when and where they come from, I deemed this unacceptable and decided not to pursue it even though the ASF authorized me to go forward with it (i.e. spend the money) if I wanted to.  After all, it is only a few hundred bucks.</p>
<p>To me, it is vital that there be an open and <span style="text-decoration: underline;"><strong>FREE</strong></span> means for doing relevance tests that the Lucene community can use to improve itself.  If others can benefit, so be it.  Much like Lucene developed a benchmarking tool for people to share performance tests (both speed and relevance) in a straightforward way (see the contrib/benchmark section of the Lucene distribution), so too is there a need for us (speaking, unofficially, for Lucene) to talk about relevance in a public way so we can compare notes just as any two researchers buried in the bowels of a commercial company might compare notes.   Many, many people have used Lucene to do TREC (in fact, I have), but it is a showstopper when the other person you are discussing relevance with can&#8217;t just pick up the exact same bits (corpus, queries, judgments) and run the exact same tests.  In other words, the goal is not to compare competing offerings, IMO (although it will likely happen b/c that is human nature); it is to give Lucene users a common way of evaluating and talking about relevance.</p>
<p>As anyone familiar with Lucene knows, ORP will be driven by the people that show up and volunteer to contribute to it, as are all Apache projects.  Thus, the slate really is clean.  If anyone (and I truly mean anyone, not just Lucene users, even though that is the preliminary focus) is interested, please show up and discuss over at general@lucene.apache.org.   We&#8217;d welcome the ideas and, moreover, any efforts.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise-the-noisy-channel/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Tao and the Art of Search: Yin Yang and TF-IDF</title>
		<link>http://lucene.grantingersoll.com/2008/11/08/tao-and-the-art-of-search-yin-yang-and-tf-idf/</link>
		<comments>http://lucene.grantingersoll.com/2008/11/08/tao-and-the-art-of-search-yin-yang-and-tf-idf/#comments</comments>
		<pubDate>Sat, 08 Nov 2008 14:21:06 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[yin yang]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/11/08/tao-and-the-art-of-search-yin-yang-and-tf-idf/</guid>
		<description><![CDATA[I often explain search and relevance at talks and training classes for Lucene and Solr.  In doing so, I often discuss the concepts of search term weighting and their typical instantiations via term frequency and inverse document frequency (abbreviated as TF-IDF) in light of either the vector space model or in terms of determining relevance. [...]]]></description>
			<content:encoded><![CDATA[<p>I often explain search and relevance at talks and <a href="http://www.lucenebootcamp.com">training classes</a> for Lucene and Solr.  In doing so, I often discuss the concepts of search term weighting and their typical instantiations via term frequency and inverse document frequency (abbreviated as <a href="http://en.wikipedia.org/wiki/Tf-idf">TF-IDF</a>) in light of either the vector space model or in terms of determining relevance.</p>
<p>The basic concept is that term frequency (TF) is the number of times a term occurs in a document, while the inverse document frequency (IDF) is the inverse of the number of documents in which the term occurs.  Thus, the more often a term appears in a document, the more important that term is to the document.  The IDF then acts as a counterbalance to the term frequency by saying that the more documents the term appears in, the less important it is overall in determining the importance of the term and the containing document.  Hence, I usually explain TF-IDF as the &#8220;Yin and Yang of Search&#8221;, and this seems to resonate well with my students, as it pretty clearly demonstrates how the opposing forces work to create meaningful results for end users.  Of course, as sometimes happens with opposing forces, one outweighs the other, leading to bad results.</p>
<p>For more on the yin yang, see <a href="http://en.wikipedia.org/wiki/Yin_and_yang">Yin and yang &#8211; Wikipedia, the free encyclopedia</a>.</p>
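<p>To make the counterbalance concrete, here is a minimal sketch of a textbook TF-IDF weighting (raw term count times log(N / df)). It is an illustration only, not Lucene's actual scoring formula; the function name <code>tf_idf</code> and the three toy documents are made up for this example.</p>

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = raw count of the term in the document
    idf = log(N / df), where df is the number of documents containing the term
    (a textbook variant, assumed here for illustration)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus
docs = [
    "lucene search search engine",
    "solr search server",
    "yin yang balance",
]
weights = tf_idf(docs)
print(weights[0])
```

<p>In the first document, "search" occurs twice but also shows up in two of the three documents, so under these assumptions it ends up with a lower weight than "lucene", which occurs once but in only one document.</p>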
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/11/08/tao-and-the-art-of-search-yin-yang-and-tf-idf/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Open Source Search Relevance Follow Up</title>
		<link>http://lucene.grantingersoll.com/2008/05/22/open-source-search-relevance-follow-up/</link>
		<comments>http://lucene.grantingersoll.com/2008/05/22/open-source-search-relevance-follow-up/#comments</comments>
		<pubDate>Thu, 22 May 2008 10:49:39 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[queries]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[TREC]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=82</guid>
		<description><![CDATA[Jeff&#8217;s Search Engine Caffè Copyright and distribution issues Let&#8217;s say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that evolve over time. GOV2 is a static [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p><a href="http://www.searchenginecaffe.com/">Jeff&#8217;s Search Engine Caffè</a><br />
Copyright and distribution issues<br />
Let&#8217;s say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that evolve over time. GOV2 is a static crawl of 25 million government documents and can therefore be distributed without too much hassle. Not to mention there is little to no spam. However, there&#8217;s a problem: commercial documents are copyrighted! Is it possible to create a large-scale test collection of web documents that can be shared freely? I don&#8217;t know the answer to that question. Could that volume of data even be distributed?</p></blockquote>
<p>Right, we are not going to get into the distribution/copyright game.  We are going to focus on <strong>using</strong> collections that are freely available.  Each user would just be told what to download.</p>
<p>For example, we could do something like:</p>
<p>Have the user download a static version of Wikipedia from a specific date, index it however they see fit, run a set of queries we develop, rate the top 10 or 20 results, and then post their results, including their actual implementation, a detail that is always lacking beyond the usual hand waving of saying &#8220;we did stemming and relevance feedback&#8221;.  We have the advantage in that we can say EXACTLY what we did, no question on implementation, so, gasp, others can repeat the exact experiments, like any good scientist does, before going on to improve them.   Then, when the next person comes along, they do the same thing.  If they disagree about the judgments for the same run, we have a discussion, one person convinces the other, and we move on.   Next, someone will come along with a scoring improvement and post those results, and now people will know the current &#8220;best&#8221; algorithm for this set of data.</p>
<p>Lather, rinse, repeat for other collections, developed over time.  Any engine can submit, anybody can participate.  Open source at its best!</p>
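<p>As a rough sketch of how two posted runs could be compared once the top results are judged, here is precision at k averaged over queries. Everything in it is an assumption for illustration: the function names, the query and document ids, and the choice of precision at k as the measure (nothing above commits to a particular metric).</p>

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k positions filled by documents judged relevant."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(run, judgments, k=10):
    """Average precision-at-k over every query in a run."""
    scores = [precision_at_k(run[q], judgments.get(q, set()), k) for q in run]
    return sum(scores) / len(scores)


# Hypothetical shared judgments: query id -> set of relevant document ids
judgments = {"q1": {"d1", "d3"}, "q2": {"d7"}}

# Hypothetical runs: query id -> ranked list of document ids
baseline = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6", "d7", "d8"]}
improved = {"q1": ["d1", "d3", "d2", "d4"], "q2": ["d7", "d5", "d6", "d8"]}

baseline_score = mean_precision_at_k(baseline, judgments, k=2)
improved_score = mean_precision_at_k(improved, judgments, k=2)
print(baseline_score, improved_score)  # 0.25 0.75
```

<p>Because both runs are scored against the same judgments, anyone who downloads the same data and queries can reproduce the two numbers and argue about the judgments rather than the setup.</p>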
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/05/22/open-source-search-relevance-follow-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open Source Search Engine Relevance</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/</link>
		<comments>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comments</comments>
		<pubDate>Sun, 18 May 2008 11:23:11 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[TREC]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81</guid>
		<description><![CDATA[For a while now, I have been trying to get my hands on TREC data for the Lucene project.  For those who aren&#8217;t familiar, TREC is an annual competition for search engines that provides a common set of documents to index, queries to execute and judgments to check your answers to see how good an [...]]]></description>
			<content:encoded><![CDATA[<p>For a while now, I have been trying to get my hands on <a href="http://trec.nist.gov/">TREC</a> data for the Lucene project.  For those who aren&#8217;t familiar, TREC is an annual competition for search engines that provides a common set of documents to index, queries to execute and judgments to check your answers to see how well an engine performs.  While it isn&#8217;t the be all, end all for relevance, it is a pretty good sanity check on how you are doing.  For instance, many search engines do OK out of the box on it, but once you tune them, they can do much better.  Of course, you risk overtuning to TREC as well.</p>
<p>In TREC, the queries and the judgments are provided for free, but one has to pay for the data, or at least most of it, since it is usually owned by Reuters or some other organization.  It isn&#8217;t expensive or anything, but it is a barrier nonetheless, especially for an open source project.  Furthermore, the whole notion of paying for data in this day and age of open source and Creative Commons just doesn&#8217;t sit right with me.   Don&#8217;t get me wrong, I&#8217;m a big fan of TREC, having participated in the past; it provides a valuable service to the proprietary/academic IR community.</p>
<p>So, what does this have to do with Lucene?  When I say I am trying to get my hands on TREC data, I don&#8217;t mean just for me, I literally mean obtaining TREC data for Lucene.  That is, I want the data to be made available, ideally, for all Lucene (and, for that matter, all open source search engine) users to use and run experiments on so as to spur on innovation in Lucene&#8217;s scoring algorithms, etc.  Now, I know the copyright owners will never allow this, as I have asked.  So, my next thought was let&#8217;s just get it for internal use by committers at Apache.  So, I went back to TREC and we have an agreement to do this, more or less.  The problem, however, is that they say we can only use the data on ASF (Apache) machines.  Not a big deal, right?  Kind of.  The ASF doesn&#8217;t really have the hardware to run TREC style experiments.  We pretty much have one Solaris &#8220;zone&#8221; allotted to us (a &#8220;zone&#8221; is a running virtual machine guest image.)  Furthermore, the ASF is pretty much an all volunteer, worldwide distributed organization.  We do almost all of our work on our own machines as VOLUNTEERS.   Practically speaking, the best way for any of us to take advantage of the data is to have it locally, which, I am told, isn&#8217;t going to happen.</p>
<p>So, what&#8217;s the point?  I think it is time the open source search community (and I don&#8217;t mean just Lucene) develop and publish a set of TREC-style relevance judgments for freely available data that is easily obtained from the Internet.  Simply put, I am wondering if there are volunteers out there who would be willing to develop a practical set of queries and judgments for datasets like Wikipedia, iBiblio, the Internet Archive, etc.  We wouldn&#8217;t host these datasets, we would just provide the queries and judgments, as well as the info on how to obtain the data.  Then, it is easy enough to provide simple scripts that do things like run Lucene&#8217;s contrib/benchmark Quality tasks against said data.</p>
<p>Practically speaking, I don&#8217;t think we even need to go as deep as TREC.  I think we would find the most use in making judgments on the top 10 or 20 results for any given query.</p>
<p>So, what do others think?  Am I off my rocker?  Are there any volunteers out there?  I think we could do this pretty simply through some scripts, and the effective use of a wiki.  I don&#8217;t think our goal is, in the short run, to be scientifically rigorous, but it should be over time.  Instead, I think our goal is to run a practical relevance test like any organization should when deploying search: take 50 (top) queries and judge them, as well as 20 or so random queries and judge them.  (I wonder if Wikipedia would give us their top 50 queries, or maybe they are already available.)  Over time, we can add queries, and refine judgments using the web 2.0 mentality of the wisdom of crowds.</p>
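<p>For what it's worth, picking the test queries is the kind of thing a small script can do. The sketch below assumes a query log is available as a plain list of query strings (an assumption, since I don't know that any such log exists); the function name <code>pick_test_queries</code>, its parameters and the toy log are all hypothetical.</p>

```python
import random
from collections import Counter


def pick_test_queries(query_log, n_top=50, n_random=20, seed=0):
    """Return (top queries by frequency, random sample of the remaining queries)."""
    # Normalize lightly so "Lucene " and "lucene" count as the same query
    counts = Counter(q.strip().lower() for q in query_log if q.strip())
    top = [q for q, _ in counts.most_common(n_top)]

    # Sample the random queries only from those not already in the top set
    rest = sorted(set(counts) - set(top))
    rng = random.Random(seed)  # fixed seed so others can repeat the selection
    sampled = rng.sample(rest, min(n_random, len(rest)))
    return top, sampled


# Hypothetical toy log
log = ["lucene"] * 5 + ["solr"] * 3 + ["mahout", "tika", "nutch"]
top, sampled = pick_test_queries(log, n_top=2, n_random=2)
print(top, sampled)
```

<p>Fixing the random seed is a deliberate choice here: it lets two people start from the same log and end up judging the same set of queries.</p>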
<p>FWIW, there is probably some alignment with the <a href="http://search.wikia.com/wiki/Search_Wikia">Wikia</a> search project.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

