<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Grant's Grunts: Lucene Edition &#187; term vectors</title>
	<atom:link href="http://lucene.grantingersoll.com/category/term-vectors/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Thu, 08 Jul 2010 17:23:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Some New Features in Solr</title>
		<link>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/</link>
		<comments>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 12:41:08 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Manning]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Taming Text]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[spell checking]]></category>
		<category><![CDATA[term vectors]]></category>
		<category><![CDATA[tokenization]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=116</guid>
		<description><![CDATA[I&#8217;ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance it. First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist is that Solr can now serve up term vectors from Lucene.  For [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance it.</p>
<p>First off is <a href="https://issues.apache.org/jira/browse/SOLR-651">SOLR-651</a>, which implements what I am calling a <a href="http://wiki.apache.org/solr/TermVectorComponent">Term Vector Component</a>. The basic gist is that Solr can now serve up term vectors from Lucene.  For the uninitiated, term vectors store the term, term frequency and, optionally, position and offset information in a document-centric way in Lucene (as opposed to the inverted index storage used for searching).  Term vectors are often useful for things besides search, such as highlighting, machine learning and document-document similarity.  This component can provide:</p>
<ol>
<li>Term</li>
<li>Term Frequency</li>
<li>Position (based on analysis)</li>
<li>Offset (character based)</li>
<li>IDF &#8211; Inverse Document Frequency</li>
</ol>
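<p>To make the list above concrete, here is a minimal sketch in plain Python of what a document-centric term vector holds. It does not use Lucene or Solr at all: the names <code>term_vector</code> and <code>idf</code> are hypothetical, the tokenizer is a toy regular expression standing in for real analysis, and the IDF shown is the textbook log(N / df), which is an assumption and not necessarily the formula Lucene applies.</p>

```python
import math
import re
from collections import defaultdict


def term_vector(text):
    """Build a toy term vector: term -> tf, positions, character offsets."""
    vector = defaultdict(lambda: {"tf": 0, "positions": [], "offsets": []})
    # Toy analysis: lowercased word tokens. A real analyzer does far more.
    for position, match in enumerate(re.finditer(r"\w+", text)):
        entry = vector[match.group().lower()]
        entry["tf"] += 1
        entry["positions"].append(position)  # token position after analysis
        entry["offsets"].append((match.start(), match.end()))  # character based
    return dict(vector)


def idf(term, vectors):
    """Textbook IDF, log(N / df); returns 0.0 for a term no document contains."""
    df = sum(1 for v in vectors if term in v)
    return math.log(len(vectors) / df) if df else 0.0


docs = ["Solr serves term vectors", "Term vectors store term frequency"]
vectors = [term_vector(d) for d in docs]
print(vectors[1]["term"])
print(idf("solr", vectors))
```

<p>Each document gets its own map keyed by term, which is the opposite of the inverted index, where each term points at the documents that contain it.</p>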
<p>I think that combining all of these things, plus a couple of other features, can really enable Solr to act as a more general text server (which is what <a href="http://www.manning.com/ingersoll">Taming Text</a> is going to show).  For instance, the Analysis Request Handler can act as a Document Analyzer server, and the Luke Request Handler can provide all kinds of corpus statistics.  And I haven&#8217;t even mentioned search, faceting and spell checking yet.  Nor have I mentioned the other thing I am working on: adding search-result and document clustering to Solr.  This is taking place on <a href="https://issues.apache.org/jira/browse/SOLR-769">SOLR-769</a>.  The basic implementation I have now does search result clustering using the <a href="http://project.carrot2.org/">Carrot2</a> open source project.  After that, I plan on adding in Mahout for document-based clustering.  I also know that Tom Morton, for Taming Text, has added <a href="http://opennlp.sourceforge.net/">OpenNLP</a>&#8217;s Named Entity Recognition to Solr.  At some point in the near future, I&#8217;ll put up a link to that code.</p>
<p>Bottom line: Solr ain&#8217;t just for search anymore!</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2008/10/23/some-new-features-in-solr/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Lucene goodness</title>
		<link>http://lucene.grantingersoll.com/2007/11/02/lucene-goodness/</link>
		<comments>http://lucene.grantingersoll.com/2007/11/02/lucene-goodness/#comments</comments>
		<pubDate>Sat, 03 Nov 2007 01:31:59 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[term vectors]]></category>
		<category><![CDATA[Apache]]></category>
		<category><![CDATA[search engines]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2007/11/02/lucene-goodness/</guid>
		<description><![CDATA[Lots of good things are happening in Lucene land lately, all of which should benefit users with faster indexing and searching capabilities.  Most notably, Lucene 2.3 (hopefully released this quarter) has some major changes in indexing memory management and performance.  I have personally clocked indexing going from about 400 records/s on release 2.2 (single threaded, Mac Pro [...]]]></description>
			<content:encoded><![CDATA[<p>Lots of good things are happening in Lucene land lately, all of which should benefit users with faster indexing and searching capabilities.  Most notably, Lucene 2.3 (hopefully released this quarter) has some major changes in indexing memory management and performance.  I have personally clocked indexing going from about 400 records/s on release 2.2 (single threaded, Mac Pro dual CPU/dual core, using the contrib/benchmark indexing.alg) to over 2,100 records/s on 2.3-dev (the latest trunk).  It also makes the indexing process easier to control: you specify how much memory to give it instead of tuning the confusing maxBufferedDocs factor.</p>
<p>Other work being undertaken should speed up reopening IndexReaders.  There are also a number of smaller changes, including a faster StandardTokenizer (the tokenizer most people use) and faster term vector access.</p>
<p>Of course, with that comes more testing and a greater need to make sure the next release is rock solid and backwards compatible.  So, if you are a Lucene user, I would encourage you to give trunk a try on some of your non-production indexes and help us test it out.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2007/11/02/lucene-goodness/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Advanced Lucene slides from ApacheCon Europe 2007</title>
		<link>http://lucene.grantingersoll.com/2007/05/07/advance-lucene-slides-from-apachecon-europe-2007/</link>
		<comments>http://lucene.grantingersoll.com/2007/05/07/advance-lucene-slides-from-apachecon-europe-2007/#comments</comments>
		<pubDate>Mon, 07 May 2007 14:31:12 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Europe]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[payloads]]></category>
		<category><![CDATA[queries]]></category>
		<category><![CDATA[term vectors]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2007/05/07/advance-lucene-slides-from-apachecon-europe-2007/</guid>
		<description><![CDATA[The latest version of my slides for &#8220;Advanced Lucene&#8221; is located at http://www.cnlp.org/presentations/present.asp?show=conference The talk covered term vectors, using various query types, and Lucene performance tips and tricks.]]></description>
			<content:encoded><![CDATA[<p>The latest version of my slides for &#8220;Advanced Lucene&#8221; is located at <a href="http://www.cnlp.org/presentations/present.asp?show=conference">http://www.cnlp.org/presentations/present.asp?show=conference</a></p>
<p>The talk covered term vectors, using various query types, and Lucene performance tips and tricks.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2007/05/07/advance-lucene-slides-from-apachecon-europe-2007/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ApacheCon Europe &#8220;Advanced Lucene&#8221; slides</title>
		<link>http://lucene.grantingersoll.com/2007/05/03/apachecon-europe-advanced-lucene-slides/</link>
		<comments>http://lucene.grantingersoll.com/2007/05/03/apachecon-europe-advanced-lucene-slides/#comments</comments>
		<pubDate>Thu, 03 May 2007 13:18:29 +0000</pubDate>
		<dc:creator>grant_ingersoll</dc:creator>
				<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Europe]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[payloads]]></category>
		<category><![CDATA[queries]]></category>
		<category><![CDATA[term vectors]]></category>

		<guid isPermaLink="false">http://lucene.grantingersoll.com/2007/05/03/apachecon-europe-advanced-lucene-slides/</guid>
		<description><![CDATA[My (slightly old) slides for ApacheCon Europe are now available in the conference proceedings at http://eu.apachecon.com/downloads/materials.zip I will post the latest version soon, but there is very little difference between this version and the latest. Topics covered include Lucene performance, term vectors and query tips and tricks. Feedback is always welcome.]]></description>
			<content:encoded><![CDATA[<p>My (slightly old) slides for ApacheCon Europe are now available in the conference proceedings at <a href="http://eu.apachecon.com/downloads/materials.zip">http://eu.apachecon.com/downloads/materials.zip</a></p>
<p>I will post the latest version soon, but there is very little difference between this version and the latest.</p>
<p>Topics covered include Lucene performance, term vectors and query tips and tricks.</p>
<p>Feedback is always welcome.</p>
]]></content:encoded>
			<wfw:commentRss>http://lucene.grantingersoll.com/2007/05/03/apachecon-europe-advanced-lucene-slides/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
