Archive for the 'Search' Category
I just posted a brief intro on getting started with Apache Lucene payloads on Lucid’s blog for those who are interested. Here’s the teaser: Like Spans, payloads involve the position of terms, but go one step further. Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a [...]
August 5th, 2009 | Posted in Apache, Indexing, Lucene, Search, Solr | No Comments
Congratulations to Apache Tika (never mind the incubator address, it’s still in the process of migrating) for graduating from Incubation! And welcome to the Lucene project! Tika is a content extraction framework that wraps many other content extraction libraries such as PDFBox, POI, and others into a single, easy-to-use framework that makes it easy [...]
November 13th, 2008 | Posted in Apache, clustering, Java, Lucene, machine learning, Mahout, Manning, OpenNLP, Search, Solr, Taming Text, Tika | 3 Comments
I often explain search and relevance at talks and training classes for Lucene and Solr. In doing so, I usually discuss the concept of search term weighting and its typical instantiation via term frequency and inverse document frequency (abbreviated as TF-IDF), either in light of the vector space model or in terms of determining relevance. [...]
November 8th, 2008 | Posted in Lucene, relevance, Search, Solr | 3 Comments
What’s new with Apache Solr. My latest article on Apache Solr, titled “What’s New with Apache Solr,” is now available over at IBM developerWorks. It covers some of the new features like spell checking, Data Import Handler, distributed search, editorial results placement (a.k.a. “paid placement”), SolrJ and a variety of other pieces. Hope it is [...]
November 5th, 2008 | Posted in Indexing, Java, Lucene, Search, Solr, spell checking | 1 Comment
Just a quick reminder that there is just over one week left before Lucene Boot Camp at this year’s ApacheCon. This year, it is a two-day training, but those who want to can sign up for the first day of Lucene Boot Camp, and then attend Solr Boot Camp on the second [...]
October 23rd, 2008 | Posted in Apache, ApacheCon, Indexing, Java, Lucene, Lucene Boot Camp, Search, Solr | 4 Comments
I’ve had a chance recently to work on some things in Solr that I think can, in the right circumstances, really enhance Solr. First off is SOLR-651, which implements what I am calling a Term Vector Component. The basic gist of it is that Solr can now serve up term vectors from Lucene. For [...]
October 23rd, 2008 | Posted in Apache, clustering, Java, Lucene, machine learning, Mahout, Manning, Search, Solr, spell checking, Taming Text, term vectors, tokenization | 1 Comment
Lucene Boot Camp (ApacheCon site) Lucene Boot Camp (http://www.lucenebootcamp.com) is scheduled this year for ApacheCon US on November 3rd and 4th in New Orleans. This year, I am doing a two-day event, as I felt the one-day event was just not enough time to get in all the goodness that is Lucene (not [...]
August 20th, 2008 | Posted in Apache, ApacheCon, Indexing, Java, Lucene, Lucene Boot Camp, Lucid Imagination, Search | No Comments
Code Fury The author of this nice plugin for WordPress contacted me today about his Lucene-based WordPress plugin, so I thought I would give it a try, as I’m obviously a big fan of Lucene and also never much cared for MySQL’s search (in)capabilities. The plugin is easy enough to install, only thing that [...]
August 7th, 2008 | Posted in Indexing, Lucene, MySQL, Search, spell checking, wpSearch | No Comments
[#LUCENE-1313] Ocean Realtime Search – ASF JIRA Jason Rutherglen has been up to some interesting things with Lucene lately concerning real-time search. This has always been one of those parts of Lucene that some people have needed over time, but that has never reached the critical mass whereby someone tackles it. Looks like [...]
June 24th, 2008 | Posted in Apache, Lucene, Real Time Search, Search | No Comments
Jeff’s Search Engine Caffè Copyright and distribution issues Let’s say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that evolve over time. GOV2 is a static [...]
May 22nd, 2008 | Posted in Lucene, Performance, queries, relevance, Search, TREC | No Comments