Archive for the 'Performance' Category

Assumptions (in Apache Lucene and Solr and pretty much everything else) Considered Harmful

I had a Football (American Football, that is, not soccer) coach who always used to drill into our heads what happens when one assumes something about our opponent for that week; he’d get all worked up, hoist up his coaching shorts (you know the ones, they should be banned…), puff out his chest, give you [...]

Lucid Imagination » Understanding Lucene Performance – Free online workshop

Andrzej Bialecki is giving a webinar for Lucid on Apache Lucene performance on Thursday.  More info is available at:
Lucid Imagination » Understanding Lucene Performance – Free online workshop.

Copying TREC is the Wrong Track for the Enterprise | The Noisy Channel

Daniel Tunkelang has written up an interesting post on the new Open Relevance Project that a few other Lucene people and I are starting up, and I thought I would respond here:
Little late to the conversation, but I think maybe we should back [...]

Solr 1.3.0 Released

Apache Solr 1.3.0 has been released.  This version contains many, many improvements and bug fixes.  High on my list are things like a good first step on distributed search support, integrated spell checking, support for Lucene’s “More Like This”, and the much-needed Data Import Handler.  Of course, one can’t forget about the numerous performance [...]

Text Processing: Why Servers Choke : Beyond Search

Text Processing: Why Servers Choke : Beyond Search
If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to suggest that Lucene is a very slow horse in a slow race.
Are we reading the same paper?  This hardly says Lucene is a slow horse in the race.  What it [...]

Apache Hadoop Wins Terabyte Sort Benchmark (Hadoop and Distributed Computing at Yahoo!)

Congrats to the Hadoop team!  Score one for Open Source!

Open Source Search Relevance Follow Up

Jeff’s Search Engine Caffè
Copyright and distribution issues
Let’s say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that evolve over time. GOV2 is a static crawl of [...]

Open Source Search Engine Relevance

For a while now, I have been trying to get my hands on TREC data for the Lucene project.  For those who aren’t familiar, TREC is an annual competition for search engines that provides a common set of documents to index, queries to execute and judgments to check your answers to see how good an [...]

FeatherCast » Blog Archive » Episode 43: Lucene

I did a FeatherCast today with Rich Bowen.  Dang, he is quick at editing…

Yahoo Search Wants to Be More Like Google, Embraces Hadoop

Yahoo Search Wants to Be More Like Google, Embraces Hadoop
Hadoop is an open-source implementation of Google’s MapReduce software and file system. It takes all the links on the Web found by a search engine’s crawlers and “reduces” them to a map of the Web so that ranking algorithms can be run against them.
Ahem, Hadoop [...]