Archive for the 'Search' Category
I just posted a brief intro on getting started with Apache Lucene payloads on Lucid’s blog for those who are interested. Here’s the teaser:
Like Spans, payloads involve the position of terms, but go one step further. Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a [...]
August 5th, 2009 | Posted in Apache, Indexing, Lucene, Search, Solr | No Comments
Congratulations to Apache Tika (never mind the incubator address, it’s still in the process of migrating) for graduating from Incubation! And welcome to the Lucene project! Tika is a content extraction framework that wraps many other content extraction libraries such as PDFBox, POI, and others into a single, easy-to-use framework that makes it easy [...]
November 13th, 2008 | Posted in Apache, Java, Lucene, Mahout, Manning, OpenNLP, Search, Solr, Taming Text, Tika, clustering, machine learning | 3 Comments
I often explain search and relevance at talks and training classes for Lucene and Solr. In doing so, I usually discuss the concepts of search term weighting and their typical instantiations via term frequency and inverse document frequency (abbreviated as TF-IDF), either in light of the vector space model or in terms of determining relevance.
The [...]
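As a rough illustration of the TF-IDF weighting mentioned above, here is a minimal Python sketch. It assumes the textbook form (raw term count multiplied by log(N / document frequency)) and whitespace tokenization; the function name `tf_idf` and the sample documents are made up for illustration, and this is not Lucene's actual scoring formula.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: TF is the raw term count and IDF is log(N / df),
    where N is the number of documents and df is the number of
    documents containing the term.
    """
    n_docs = len(docs)
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical sample documents.
docs = ["lucene search engine", "solr search server", "lucene payloads"]
weights = tf_idf(docs)
```

In this sketch a term that appears in fewer documents (such as "engine") ends up with a higher weight than one that appears in several (such as "search"), and a term that appears in every document gets a weight of zero.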
November 8th, 2008 | Posted in Lucene, Search, Solr, relevance | 3 Comments
What’s new with Apache Solr.
My latest article on Apache Solr, titled “What’s New with Apache Solr,” is now available over at IBM developerWorks. It covers some of the new features like spell checking, Data Import Handler, distributed search, editorial results placement (a.k.a. “paid placement”), SolrJ and a variety of other pieces.
Hope it is helpful… Feel [...]
November 5th, 2008 | Posted in Indexing, Java, Lucene, Search, Solr, spell checking | 1 Comment
A quick reminder that there is just over one week left before Lucene Boot Camp at this year’s ApacheCon.
This year it is a two-day training, but those who want to can sign up for the first day of Lucene Boot Camp and then attend Solr Boot Camp on the second day. [...]
October 23rd, 2008 | Posted in Apache, ApacheCon, Indexing, Java, Lucene, Lucene Boot Camp, Search, Solr | 4 Comments
I’ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance it.
First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist of it is that Solr can now serve up term vectors from Lucene. For those [...]
October 23rd, 2008 | Posted in Apache, Java, Lucene, Mahout, Manning, Search, Solr, Taming Text, clustering, machine learning, spell checking, term vectors, tokenization | 1 Comment
Lucene Boot Camp (ApacheCon site)
Lucene Boot Camp (http://www.lucenebootcamp.com) is scheduled this year for ApacheCon US on November 3rd and 4th in New Orleans. This year, I am doing a two-day event, as I felt the one-day event was just not enough time to get in all the goodness that is Lucene (not that [...]
August 20th, 2008 | Posted in Apache, ApacheCon, Indexing, Java, Lucene, Lucene Boot Camp, Lucid Imagination, Search | No Comments
Code Fury
The author of this nice plugin for WordPress contacted me today about his Lucene-based WordPress plugin, so I thought I would give it a try, as I’m obviously a big fan of Lucene and also never much cared for MySQL’s search (in)capabilities.
The plugin is easy enough to install; the only thing that struck me [...]
August 7th, 2008 | Posted in Indexing, Lucene, MySQL, Search, spell checking, wpSearch | No Comments
[#LUCENE-1313] Ocean Realtime Search – ASF JIRA
Jason Rutherglen has been up to some interesting things with Lucene lately concerning real-time search. This has always been one of those parts of Lucene that some people have needed over time, but it has never reached the critical mass whereby someone tackles it. Looks like Jason [...]
June 24th, 2008 | Posted in Apache, Lucene, Real Time Search, Search | No Comments
Jeff’s Search Engine Caffè
Copyright and distribution issues
Let’s say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that evolve over time. GOV2 is a static crawl of [...]
May 22nd, 2008 | Posted in Lucene, Performance, Search, TREC, queries, relevance | No Comments