<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Open Source Search Engine Relevance</title>
	<atom:link href="http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<pubDate>Thu, 04 Dec 2008 00:01:39 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.3</generator>
		<item>
		<title>By: Jeff</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comment-6057</link>
		<dc:creator>Jeff</dc:creator>
		<pubDate>Thu, 22 May 2008 04:21:27 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81#comment-6057</guid>
		<description>Grant - Interesting ideas!  

What use cases should the collections be designed for?  Things that come to mind: enterprise search, web search, product search, etc...

Also, what about interactive retrieval scenarios?  This is one of the major drawbacks to TREC-like evaluations.  Pooling also has its limitations (see the recent Terabyte Track summaries).

Could a possible solution be a platform like Alexa's Web Search Platform?  The documents would be hosted on the cluster and not distributed (getting around copyright and distribution problems), but could be processed to create search indexes.  You could even create a working service and collect real queries.  Who knows, you could even perform A/B relevance tests on the best systems using live traffic.  

See my blog for more thoughts.</description>
		<content:encoded><![CDATA[<p>Grant - Interesting ideas!  </p>
<p>What use cases should the collections be designed for?  Things that come to mind: enterprise search, web search, product search, etc&#8230;</p>
<p>Also, what about interactive retrieval scenarios?  This is one of the major drawbacks to TREC-like evaluations.  Pooling also has its limitations (see the recent Terabyte Track summaries).</p>
<p>Could a possible solution be a platform like Alexa&#8217;s Web Search Platform?  The documents would be hosted on the cluster and not distributed (getting around copyright and distribution problems), but could be processed to create search indexes.  You could even create a working service and collect real queries.  Who knows, you could even perform A/B relevance tests on the best systems using live traffic.  </p>
<p>See my blog for more thoughts.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: grant_ingersoll</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comment-6054</link>
		<dc:creator>grant_ingersoll</dc:creator>
		<pubDate>Tue, 20 May 2008 10:57:28 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81#comment-6054</guid>
		<description>I agree, Ian, but again, not sure if we need that data.  As I understand TREC, they have analysts who read documents, etc. and come up with queries knowing somewhat that there are answers available.

I think we could have people come up with queries and then we make human judgments just as any organization would do in house.  

The bigger question to me now, is this something people are interested in doing?  That is, would people be interested in starting/joining a "Lucene Relevance" project as a subproject of Lucene?  We can figure out the details from there.</description>
		<content:encoded><![CDATA[<p>I agree, Ian, but again, not sure if we need that data.  As I understand TREC, they have analysts who read documents, etc. and come up with queries knowing somewhat that there are answers available.</p>
<p>I think we could have people come up with queries and then we make human judgments just as any organization would do in house.  </p>
<p>The bigger question to me now, is this something people are interested in doing?  That is, would people be interested in starting/joining a &#8220;Lucene Relevance&#8221; project as a subproject of Lucene?  We can figure out the details from there.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ian</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comment-6053</link>
		<dc:creator>Ian</dc:creator>
		<pubDate>Tue, 20 May 2008 04:24:19 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81#comment-6053</guid>
		<description>I don't think it is the 'data' component that makes this difficult, but the analysis of which results are better.

Ideally you would need to use some kind of crowdsourcing to judge which result is more relevant (i.e., 95% of users click on X when they search for 'moo').

The problem then becomes one of privacy. You need to approach an organization to produce an anonymous version of their query logs (which is hard).

If you could convince Wikipedia to just release the search query and the result clicked, you could then use this as a basis for judgment.</description>
		<content:encoded><![CDATA[<p>I don&#8217;t think it is the &#8216;data&#8217; component that makes this difficult, but the analysis of which results are better.</p>
<p>Ideally you would need to use some kind of crowdsourcing to judge which result is more relevant (i.e., 95% of users click on X when they search for &#8216;moo&#8217;).</p>
<p>The problem then becomes one of privacy. You need to approach an organization to produce an anonymous version of their query logs (which is hard).</p>
<p>If you could convince Wikipedia to just release the search query and the result clicked, you could then use this as a basis for judgment.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: grant_ingersoll</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comment-6051</link>
		<dc:creator>grant_ingersoll</dc:creator>
		<pubDate>Mon, 19 May 2008 19:53:34 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81#comment-6051</guid>
		<description>I don't know that we need to use TREC data after the eval.  I think I am proposing something that is completely open and repeatable by anyone using freely available data.   Download X corpus, get queries from the project site, run them and make judgments on top 10-20 and post them/edit a wiki, lather, rinse, repeat. 

It isn't fully thought out as to how it all works.  I'm just saying there is a need for some type of open source relevance project.  It goes beyond Lucene, in my mind, but I'm happy to start it here.

I have no doubt there are smarter people out there than me who can figure out the details of how to make this rigorous.  I just want to see if I can kickstart the discussion and get something going.</description>
		<content:encoded><![CDATA[<p>I don&#8217;t know that we need to use TREC data after the eval.  I think I am proposing something that is completely open and repeatable by anyone using freely available data.   Download X corpus, get queries from the project site, run them and make judgments on top 10-20 and post them/edit a wiki, lather, rinse, repeat. </p>
<p>It isn&#8217;t fully thought out as to how it all works.  I&#8217;m just saying there is a need for some type of open source relevance project.  It goes beyond Lucene, in my mind, but I&#8217;m happy to start it here.</p>
<p>I have no doubt there are smarter people out there than me who can figure out the details of how to make this rigorous.  I just want to see if I can kickstart the discussion and get something going.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bob Carpenter</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comment-6050</link>
		<dc:creator>Bob Carpenter</dc:creator>
		<pubDate>Mon, 19 May 2008 17:27:08 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81#comment-6050</guid>
		<description>How do you plan to use TREC data after the eval?  The data consists of evaluations of relevant/irrelevant/maybe-relevant for the top K (usually K = 100 or 200) results for each of the participants.  This gives you exact precision-at-K for submitted systems, but doesn't let you evaluate recall or even precision-at-K for systems developed after the eval.</description>
		<content:encoded><![CDATA[<p>How do you plan to use TREC data after the eval?  The data consists of evaluations of relevant/irrelevant/maybe-relevant for the top K (usually K = 100 or 200) results for each of the participants.  This gives you exact precision-at-K for submitted systems, but doesn&#8217;t let you evaluate recall or even precision-at-K for systems developed after the eval.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: grant_ingersoll</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comment-6048</link>
		<dc:creator>grant_ingersoll</dc:creator>
		<pubDate>Mon, 19 May 2008 13:45:39 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81#comment-6048</guid>
		<description>The restriction, as I understand it, is one of distribution.  Taking the docs off of an "ASF machine" implies distribution and is a violation of the terms of the agreement.  We can copy it to machines where the ASF is responsible for setting up accounts, or the accounts are based on the ASF credentials.</description>
		<content:encoded><![CDATA[<p>The restriction, as I understand it, is one of distribution.  Taking the docs off of an &#8220;ASF machine&#8221; implies distribution and is a violation of the terms of the agreement.  We can copy it to machines where the ASF is responsible for setting up accounts, or the accounts are based on the ASF credentials.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ken Krugler</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comment-6047</link>
		<dc:creator>Ken Krugler</dc:creator>
		<pubDate>Mon, 19 May 2008 13:25:08 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81#comment-6047</guid>
		<description>Hi Grant,

Is the "only on ASF machines" restriction a data storage restriction, or a data processing restriction?

If it's the former, then there would be obvious work-arounds.

And in either case, yes it would be great to have data to use for relevance research with Lucene and other open source search engines.

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi Grant,</p>
<p>Is the &#8220;only on ASF machines&#8221; restriction a data storage restriction, or a data processing restriction?</p>
<p>If it&#8217;s the former, then there would be obvious work-arounds.</p>
<p>And in either case, yes it would be great to have data to use for relevance research with Lucene and other open source search engines.</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Naber</title>
		<link>http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/#comment-6046</link>
		<dc:creator>Daniel Naber</dc:creator>
		<pubDate>Sun, 18 May 2008 18:55:28 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=81#comment-6046</guid>
		<description>Grant, sounds like a great idea! Wikipedia data is available at http://stats.grok.se/, although that seems to be the page titles, not the user's original queries.</description>
		<content:encoded><![CDATA[<p>Grant, sounds like a great idea! Wikipedia data is available at <a href="http://stats.grok.se/" rel="nofollow">http://stats.grok.se/</a>, although that seems to be the page titles, not the user&#8217;s original queries.</p>
]]></content:encoded>
	</item>
</channel>
</rss>