<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; Search</title>
	<atom:link href="http://lucene.grantingersoll.com/category/lucene/search/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Mon, 06 Feb 2012 12:07:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Lucid Imagination » Getting Started with Payloads</title>
		<link>http://lucene.grantingersoll.com/2009/08/05/lucid-imagination-%c2%bb-getting-started-with-payloads/</link>
		<comments>http://lucene.grantingersoll.com/2009/08/05/lucid-imagination-%c2%bb-getting-started-with-payloads/#comments</comments>
		<pubDate>Wed, 05 Aug 2009 14:25:12 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=239</guid>
		<description><![CDATA[I just posted a brief intro on getting started with Apache Lucene payloads on Lucid&#8217;s blog for those who are interested.  Here&#8217;s the teaser: Like Spans, payloads involve the position of terms, but go one step further. Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a [...]]]></description>
			<content:encoded><![CDATA[<p>I just posted a brief intro on getting started with Apache Lucene payloads on Lucid&#8217;s blog for those who are interested.  Here&#8217;s the teaser:</p>
<blockquote><p>Like Spans, payloads involve the position of terms, but go one step further.  Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a specific token/term) in the index.  A payload can be used to store weights for specific terms or things like part of speech tags or other semantic information.</p></blockquote>
<p>via <a href="http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/">Lucid Imagination » Getting Started with Payloads</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2009/08/05/lucid-imagination-%c2%bb-getting-started-with-payloads/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Congrats to Tika and Welcome to the Lucene Stack!</title>
		<link>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/</link>
		<comments>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/#comments</comments>
		<pubDate>Thu, 13 Nov 2008 15:43:35 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[OpenNLP]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>
		<category><![CDATA[Tika]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=130</guid>
		<description><![CDATA[Congratulations to Apache Tika (never mind the incubator address, it&#8217;s still in the process of migrating) for graduating from Incubation!   And welcome to the Lucene project!  Tika is a content extraction framework that wraps many other content extraction libraries such as PDFBox, POI, and others into a single, easy-to-use framework that makes it easy [...]]]></description>
			<content:encoded><![CDATA[<p>Congratulations to <a href="http://incubator.apache.org/tika">Apache Tika</a> (never mind the incubator address, it&#8217;s still in the process of migrating) for graduating from Incubation!  And welcome to the Lucene project!  Tika is a content extraction framework that wraps many other content extraction libraries such as <a href="http://incubator.apache.org/projects/pdfbox.html">PDFBox</a>, <a href="http://poi.apache.org">POI</a>, and others into a single, easy-to-use framework that simplifies adding extracted content to Lucene, Solr and any other text application.  This is something many of us do whenever we work with file formats, and it also relates to one of the most frequently asked questions on the user mailing lists (that is, &#8220;how do I get text from Word/PDF/Excel?&#8221;).</p>
<p>Tika&#8217;s interface is very similar to SAX, so it is easy to think about extraction in terms of receiving SAX events, just as you do with XML.  That is also nice because it makes extraction streaming: you don&#8217;t have to load a whole document into memory before dealing with it.</p>
<p>I&#8217;m now in the process of incorporating Tika into Solr (see <a href="https://issues.apache.org/jira/browse/SOLR-284">SOLR-284</a>) and I think we&#8217;ll eventually see it hooked into the Data Import Handler (DIH) in Solr, too, such that one could easily get content from a DB or a URL with the DIH, and then extract it if it is a binary object.</p>
<p>Ultimately, Tika is one more piece to the puzzle when it comes to dealing with content and it fits well with my <strong>personal</strong> vision (i.e. removing my Lucene PMC hat) of what Lucene is and should become.  Namely, as we move forward beyond just search (since search is a commodity these days, thanks to Lucene), it is important to have a whole suite of tools to bring to bear on the problem of dealing with structured and unstructured data.  Thus, things like Lucene, Solr, Carrot2, UIMA, Mahout, Tika, OpenNLP and other tools all should be easily usable by text tamers (riffing on my &#8220;<a href="http://www.manning.com/ingersoll">Taming Text</a>&#8221; theme&#8230;)  in creating intelligent applications.  As Lucene continues to develop and grow, it should become easier and easier to build things using the Lucene Stack which should spur a new wave of ideas and opportunities for those paying attention.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/11/13/congrats-to-tika-and-welcome-to-the-lucene-stack/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Tao and the Art of Search: Yin Yang and TF-IDF</title>
		<link>http://lucene.grantingersoll.com/2008/11/08/tao-and-the-art-of-search-yin-yang-and-tf-idf/</link>
		<comments>http://lucene.grantingersoll.com/2008/11/08/tao-and-the-art-of-search-yin-yang-and-tf-idf/#comments</comments>
		<pubDate>Sat, 08 Nov 2008 14:21:06 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[yin yang]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/11/08/tao-and-the-art-of-search-yin-yang-and-tf-idf/</guid>
		<description><![CDATA[I often explain search and relevance at talks and training classes for Lucene and Solr.  In doing so, I often discuss the concepts of search term weighting and their typical instantiations via term frequency and inverse document frequency (abbreviated as TF-IDF) in light of either the vector space model or in terms of determining relevance. [...]]]></description>
			<content:encoded><![CDATA[<p>I often explain search and relevance at talks and <a href="http://www.lucenebootcamp.com">training classes</a> for Lucene and Solr.  In doing so, I often discuss the concepts of search term weighting and their typical instantiations via term frequency and inverse document frequency (abbreviated as <a href="http://en.wikipedia.org/wiki/Tf-idf">TF-IDF</a>) in light of either the vector space model or in terms of determining relevance.</p>
<p>The basic concept is that term frequency is the number of times the term occurs in a document, while the IDF is the inverse of the number of documents in which the term occurs.  Thus, the more often a term appears in a document, the more important the document is for that term.  The IDF then acts as a counterbalance to the term frequency by saying that the more documents the term appears in, the less important it is overall in determining the importance of the term and the containing document.  Hence, I usually explain TF-IDF as the &#8220;Yin and Yang of Search&#8221;, and this seems to resonate well with my students, as it pretty clearly demonstrates how the opposing forces work together to create meaningful results for end users.  Of course, as sometimes happens with opposing forces, one outweighs the other, leading to bad results.</p>
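<p>To make the counterbalance concrete, here is a minimal Python sketch of the textbook TF-IDF weight (raw term frequency times the log of total documents over document frequency).  It is an illustration only, not Lucene&#8217;s actual scoring formula; the toy corpus, the document ids and the <code>tf_idf</code> function are all made up for this example.</p>
<pre>
```python
import math
from collections import Counter

# Toy corpus: hypothetical documents, whitespace-tokenized for brevity.
docs = {
    "d1": "lucene search lucene index",
    "d2": "solr search server",
    "d3": "tika content extraction",
}


def tf_idf(term, doc_id):
    """Return raw term frequency times a log-scaled inverse document frequency."""
    tokens = docs[doc_id].split()
    # Yin: how often the term occurs in this one document.
    tf = Counter(tokens)[term]
    # Yang: how many documents in the corpus contain the term at all.
    df = sum(1 for text in docs.values() if term in text.split())
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf


for term in ("lucene", "search"):
    print(term, tf_idf(term, "d1"))
```
</pre>
<p>In this toy corpus, &#8220;lucene&#8221; occurs twice in d1 and in no other document, so it outweighs &#8220;search&#8221;, which occurs once in d1 and also appears in d2.</p>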
<p>For more on the yin yang, see <a href="http://en.wikipedia.org/wiki/Yin_and_yang">Yin and yang &#8211; Wikipedia, the free encyclopedia</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/11/08/tao-and-the-art-of-search-yin-yang-and-tf-idf/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>&#8220;What&#8217;s new with Apache Solr&#8221; now available at IBM developerWorks</title>
		<link>http://lucene.grantingersoll.com/2008/11/05/whats-new-with-apache-solr-now-available-at-ibm-developerworks/</link>
		<comments>http://lucene.grantingersoll.com/2008/11/05/whats-new-with-apache-solr-now-available-at-ibm-developerworks/#comments</comments>
		<pubDate>Wed, 05 Nov 2008 16:28:26 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[spell checking]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2008/11/05/whats-new-with-apache-solr-now-available-at-ibm-developerworks/</guid>
		<description><![CDATA[What&#8217;s new with Apache Solr. My latest article on Apache Solr, titled &#8220;What&#8217;s New with Apache Solr&#8221;, is now available over at IBM developerWorks.  It covers some of the new features like spell checking, Data Import Handler, distributed search, editorial results placement (a.k.a. &#8220;paid placement&#8221;), SolrJ and a variety of other pieces. Hope it is [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.ibm.com/developerworks/java/library/j-solr-update/?S_TACT=105AGX01&amp;S_CMP=HP">What&#8217;s new with Apache Solr</a>.</p>
<p>My latest article on Apache Solr, titled &#8220;What&#8217;s New with Apache Solr&#8221;, is now available over at IBM developerWorks.  It covers some of the new features like spell checking, Data Import Handler, distributed search, editorial results placement (a.k.a. &#8220;paid placement&#8221;), SolrJ and a variety of other pieces.</p>
<p>Hope it is helpful&#8230;  Feel free to give me any feedback.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/11/05/whats-new-with-apache-solr-now-available-at-ibm-developerworks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Lucene Boot Camp at ApacheCon US 2008</title>
		<link>http://lucene.grantingersoll.com/2008/10/23/lucene-boot-camp-at-apachecon-us-2008/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/23/lucene-boot-camp-at-apachecon-us-2008/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 18:39:14 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Boot Camp]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=118</guid>
		<description><![CDATA[Just a quick reminder that there is just over one week left before Lucene Boot Camp at this year&#8217;s ApacheCon. This year, it is a two-day training, but those who want to can sign up for just the first day of Lucene Boot Camp and then attend Solr Boot Camp on the second [...]]]></description>
			<content:encoded><![CDATA[<p>Just a quick reminder that there is just over one week left before <a href="http://www.lucenebootcamp.com">Lucene Boot Camp</a> at this year&#8217;s <a href="http://us.apachecon.com/">ApacheCon</a>.</p>
<p>This year, it is a two-day training, but those who want to can sign up for just the first day of Lucene Boot Camp and then attend <a href="http://www.solrbootcamp.com">Solr Boot Camp</a> on the second day.  I am going to structure the class such that the first day is more high-level stuff (but still with some coding) while the second day will be much more hands-on.  Thus, managers, CTOs, and non-programmers who just want to understand Lucene and Solr can easily plan on attending one day of each.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/23/lucene-boot-camp-at-apachecon-us-2008/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Some New Features in Solr</title>
		<link>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 12:41:08 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[spell checking]]></category>
		<category><![CDATA[Taming Text]]></category>
		<category><![CDATA[term vectors]]></category>
		<category><![CDATA[tokenization]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=116</guid>
		<description><![CDATA[I&#8217;ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr. First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist of it is that Solr can now serve up term vectors from Lucene.  For [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr.</p>
<p>First off is <a href="https://issues.apache.org/jira/browse/SOLR-651">SOLR-651</a>, which implements what I am calling a <a href="http://wiki.apache.org/solr/TermVectorComponent">Term Vector Component</a>.  The basic gist of it is that Solr can now serve up term vectors from Lucene.  For the uninitiated, term vectors store the term, term frequency and, optionally, position and offset information in a document-centric way in Lucene (as opposed to the inverted index storage used for searching).  Term vectors are often useful for doing things besides search, like highlighting, machine learning, and document-document similarity.  This component can provide:</p>
<ol>
<li>Term</li>
<li>Term Frequency</li>
<li>Position (based on analysis)</li>
<li>Offset (character based)</li>
<li>IDF &#8211; Inverse Document Frequency</li>
</ol>
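<p>As a rough sketch of what &#8220;document-centric&#8221; storage means, the Python below builds a term vector for a single document: each term mapped to its frequency, token positions and character offsets.  This is not how Lucene or the Term Vector Component is implemented; the <code>term_vector</code> function and its regex tokenizer (lowercased word characters) are assumptions made for illustration.  IDF is left out because it needs corpus-wide document counts rather than a single document.</p>
<pre>
```python
import re
from collections import defaultdict


def term_vector(text):
    """Map each term to its frequency, token positions, and character offsets."""
    vector = defaultdict(lambda: {"tf": 0, "positions": [], "offsets": []})
    # Hypothetical analysis step: lowercase, then split on runs of word characters.
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        entry = vector[match.group()]
        entry["tf"] += 1
        entry["positions"].append(position)
        entry["offsets"].append((match.start(), match.end()))
    return dict(vector)


tv = term_vector("Solr serves term vectors; term vectors help highlighting")
for term, info in tv.items():
    print(term, info)
```
</pre>
<p>A highlighter, for example, could use the character offsets to mark up matches without re-analyzing the original text.</p>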
<p>Combining all of these things, plus a couple of other features, can, I think, really enable Solr to act as a more general text server (which is what <a href="http://www.manning.com/ingersoll">Taming Text</a> is going to show).  For instance, the Analysis Request Handler can act as a Document Analyzer server, and the Luke Request Handler can provide all kinds of corpus statistics.  And I haven&#8217;t even mentioned search, faceting and spell checking yet.  Nor have I mentioned the other thing I am working on: adding search-result and document clustering to Solr.  This is taking place on <a href="https://issues.apache.org/jira/browse/SOLR-769">SOLR-769</a>.  The basic implementation I have now does search result clustering using the <a href="http://project.carrot2.org/">Carrot2</a> open source project.  After that, I plan on adding Mahout for document-based clustering.  I also know that Tom Morton, for Taming Text, has added <a href="http://opennlp.sourceforge.net/">OpenNLP</a>&#8217;s Named Entity Recognition to Solr.  At some point in the near future, I&#8217;ll put up a link to that code.</p>
<p>Bottom line: Solr ain&#8217;t just for search anymore!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Lucene Boot Camp at ApacheCon US</title>
		<link>http://lucene.grantingersoll.com/2008/08/20/lucene-boot-camp-at-apachecon-us/</link>
		<comments>http://lucene.grantingersoll.com/2008/08/20/lucene-boot-camp-at-apachecon-us/#comments</comments>
		<pubDate>Wed, 20 Aug 2008 12:25:31 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Lucene Boot Camp]]></category>
		<category><![CDATA[Lucid Imagination]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=95</guid>
		<description><![CDATA[Lucene Boot Camp (ApacheCon site) Lucene Boot Camp (http://www.lucenebootcamp.com) is scheduled this year for ApacheCon US on November 3 and 4 in New Orleans.  This year, I am doing a two-day event, as I felt the one-day event was just not enough time to get in all the goodness that is Lucene (not [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://us.apachecon.com/c/acus2008/sessions/69">Lucene Boot Camp (ApacheCon site)</a></p>
<p>Lucene Boot Camp (<a href="http://www.lucenebootcamp.com">http://www.lucenebootcamp.com</a>) is scheduled this year for ApacheCon US on November 3 and 4 in New Orleans.  This year, I am doing a two-day event, as I felt the one-day event was just not enough time to get in all the goodness that is Lucene (not that two days is either, but&#8230;).  Additionally, with two days, I will have more time to explore and develop the examples with the class, and more one-on-one time.  There is also a fair amount of new material in Lucene these days, so I am going to be updating the examples, etc. on the website and changing things up a bit.</p>
<p>This class has traditionally filled up pretty fast.  In Amsterdam this past April, there was even a waiting list, so I recommend signing up early.</p>
<p>Also, for those coming from the US, note that the second day of the class falls on Tuesday, Nov. 4, which is Election Day.  Not sure who decided that one, but if you&#8217;re planning on voting and attending my class, make sure you get an absentee ballot.  Which reminds me&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/08/20/lucene-boot-camp-at-apachecon-us/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>wpSearch &#8211; Lucene search for WordPress</title>
		<link>http://lucene.grantingersoll.com/2008/08/07/wpsearch-lucene-search-for-wordpress/</link>
		<comments>http://lucene.grantingersoll.com/2008/08/07/wpsearch-lucene-search-for-wordpress/#comments</comments>
		<pubDate>Thu, 07 Aug 2008 12:36:46 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[spell checking]]></category>
		<category><![CDATA[wpSearch]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=93</guid>
		<description><![CDATA[Code Fury The author of this nice plugin for WordPress contacted me today about his Lucene-based WordPress plugin, so I thought I would give it a try, as I&#8217;m obviously a big fan of Lucene and also never much cared for MySQL&#8217;s search (in)capabilities. The plugin is easy enough to install; the only thing that [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://codefury.net/">Code Fury</a></p>
<p>The author of this nice plugin for WordPress contacted me today about his Lucene-based WordPress plugin, so I thought I would give it a try, as I&#8217;m obviously a big fan of Lucene and also never much cared for MySQL&#8217;s search (in)capabilities.</p>
<p>The plugin is easy enough to install; the only thing that struck me as a little odd was the need to set 777 on the permissions.  Presumably, this is so it can write the index, but perhaps it would be better to store the index outside of the plugin infrastructure.  Of course, I don&#8217;t know what&#8217;s involved with writing plugins, etc., so I&#8217;m not sure if that makes sense or not.</p>
<p>Indexing and enabling search was a snap, and I do think the results are good, even if my sites don&#8217;t have a ton of posts.  Indexing on my &#8220;main&#8221; site (<a href="http://www.grantingersoll.com">http://www.grantingersoll.com</a>) took slightly longer than here, but I do have more posts and comments there.</p>
<p>The only minor suggestion I would have is that the default boosts for title and content aren&#8217;t all that great.  I think the title boost was 1.8 and the content boost was 1.3.  I changed mine to be title: 5 and content: 2.  Because boosting at indexing time has only 8 bits of granularity, there isn&#8217;t much difference between 1.8 and 1.3, and I tend to think title matches are much more important.  Thus, I made them greater.  Still, very cool that the author has hooked field boosting in to begin with.</p>
<p>Things I would love to see:</p>
<ol>
<li>Highlighting</li>
<li>Spell checking</li>
</ol>
<p>All in all, seems to be a great little plugin.  And now, I can &#8220;eat my own dogfood&#8221; too!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/08/07/wpsearch-lucene-search-for-wordpress/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Realtime Search for Lucene</title>
		<link>http://lucene.grantingersoll.com/2008/06/24/realtime-search-for-lucene/</link>
		<comments>http://lucene.grantingersoll.com/2008/06/24/realtime-search-for-lucene/#comments</comments>
		<pubDate>Tue, 24 Jun 2008 12:44:38 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Real Time Search]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=85</guid>
		<description><![CDATA[[#LUCENE-1313] Ocean Realtime Search &#8211; ASF JIRA Jason Rutherglen has been up to some interesting things with Lucene lately concerning real time search.  This has always been one of those parts of Lucene that has been needed over time by some people, but has never reached the critical mass whereby someone tackles it.  Looks like [...]]]></description>
			<content:encoded><![CDATA[<p><a href="https://issues.apache.org/jira/browse/LUCENE-1313">[#LUCENE-1313] Ocean Realtime Search &#8211; ASF JIRA</a></p>
<p>Jason Rutherglen has been up to some interesting things with Lucene lately concerning real time search.  This has always been one of those parts of Lucene that has been needed over time by some people, but has never reached the critical mass whereby someone tackles it.  Looks like Jason finally has, and I am glad to see it.  I have, on occasion, met with clients or users who truly have needed it, but most have usually been able to live with a little bit of a delay and the associated workarounds.  Now, it seems, they won&#8217;t have to.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/06/24/realtime-search-for-lucene/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open Source Search Relevance Follow Up</title>
		<link>http://lucene.grantingersoll.com/2008/05/22/open-source-search-relevance-follow-up/</link>
		<comments>http://lucene.grantingersoll.com/2008/05/22/open-source-search-relevance-follow-up/#comments</comments>
		<pubDate>Thu, 22 May 2008 10:49:39 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[queries]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[TREC]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=82</guid>
		<description><![CDATA[Jeff&#8217;s Search Engine Caffè Copyright and distribution issues Let&#8217;s say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that evolve over time. GOV2 is a static [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p><a href="http://www.searchenginecaffe.com/">Jeff&#8217;s Search Engine Caffè</a><br />
Copyright and distribution issues<br />
Let&#8217;s say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that evolve over time. GOV2 is a static crawl of 25 million government documents and can therefore be distributed without too much hassle. Not to mention there is little to no spam. However, there&#8217;s a problem: commercial documents are copyrighted! Is it possible to create a large-scale test collection of web documents that can be shared freely? I don&#8217;t know the answer to that question. Could that volume of data even be distributed?</p></blockquote>
<p>Right, we are not going to get into the distribution/copyright game.  We are going to focus on <strong>using</strong> collections that are freely available.  Each user would just be told what to download.</p>
<p>For example, we could do something like:</p>
<p>Have the user download a static version of Wikipedia from a specific date, index it however they see fit, run a set of queries we develop, rate the top 10 or 20 results, and post their findings, including their actual implementation.  That last part is usually missing, apart from the usual hand waving of saying &#8220;we did stemming and relevance feedback&#8221;.  We have the advantage in that we can say EXACTLY what we did, no question on implementation, so, gasp, others can repeat the exact experiments, like any good scientist does, before going on to improve it.  Then, when the next person comes along, they do the same thing.  If they disagree about the judgments for the same run, we have a discussion and one person convinces the other and we move on.  Next, someone will come along with a scoring improvement and post those results, and now people will know the current &#8220;best&#8221; algorithm for this set of data.</p>
<p>Lather, rinse, repeat for other collections, developed over time.  Any engine can submit, anybody can participate.  Open source at its best!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/05/22/open-source-search-relevance-follow-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

