Copying TREC is the Wrong Track for the Enterprise | The Noisy Channel
Daniel Tunkelang has written up an interesting post on the new Open Relevance Project (ORP) that a few other Lucene people and I are starting up, and I thought I would respond here:
A little late to the conversation, but I think maybe we should back up a little bit. I like a lot of the comments and wish they had actually been made on general@lucene.apache.org, where we are discussing the merits of the undertaking (see http://www.lucidimagination.com/search/document/76d7cdeed4882397), not that I expect that to happen given the way blogs work. At any rate, I’d like to add my two cents as the one who started the thread on general@lucene.apache.org.
First off, the ORP is VERY early stage brainstorming. ORP really doesn’t warrant much attention at this point and it is premature to even speculate about how it relates to TREC, Google, Yahoo!, Microsoft or anything else. I’m not even sure it has enough support to be a viable Lucene subproject! For now, I think most of us who are actually working on the genesis of the project are merely looking for a means to improve Lucene (and also Solr, Nutch and Mahout), despite what Otis says in his blog post about having grander notions for comparing across engines.
So, to the background…
This (ORP) is something I’ve been thinking about for a long time now and have discussed with a number of people in the past. The motivation comes from my frustration over the years in not being able to obtain data that everyone on Lucene can use without limitations, since I’ve almost always worked in places that had little money to spend on this kind of thing.
See http://www.lucidimagination.com/search/document/656d5ca50c8c9242, http://lucene.grantingersoll.com/category/trec/ and http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/ for background. The second motivation is simply to have practical, real world data driven by actual users.
In the past, I have talked with both NIST and Sheffield to try to work out terms by which the Lucene community could obtain TREC resources, but the licensing terms simply prevent a totally free redistribution. (BTW, this is not NIST/Sheffield’s fault, but that of the company that allows them to use the data. NIST/Sheffield are doing the best they can given their constraints.) I have also talked with a few commercial companies that redistribute data (blogs, etc.), all to no avail (it’s usually the copyright that kills it). If the ASF were to buy the dataset, we could distribute it to the committers on Lucene according to the licensing terms, but not to the broader community, and we’d have to maintain a list of who has it, etc. See the terms here, for instance. Since many of the best ideas in Open Source come from the community and you never know when and where they will come from, I deemed this unacceptable and decided not to pursue it, even though the ASF authorized me to go forward with it (i.e. spend the money) if I wanted to. After all, it is only a few hundred bucks.
To me, it is vital that there be an open and FREE means for doing relevance tests that the Lucene community can use to improve itself. If others can benefit, so be it. Much like Lucene developed a benchmarking tool for people to share performance tests (both speed and relevance) in a straightforward way (see the contrib/benchmark section of the Lucene distribution), so too is there a need for us (speaking, unofficially, for Lucene) to talk about relevance in a public way so we can compare notes just as any two researchers buried in the bowels of a commercial company might compare notes. Many, many people have used Lucene to do TREC (in fact, I have), but it is a showstopper when the other person you are discussing relevance with can’t just pick up the exact same bits (corpus, queries, judgments) and run the exact same tests. In other words, the goal, IMO, is not to compare competing offerings (although that will likely happen b/c that is human nature); it is to give Lucene users a common way of evaluating and talking about relevance.
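To make "the exact same bits" a little more concrete, here is a rough sketch of what two people sharing queries and judgments could run to compare notes. It is only an illustration, not anything ORP has defined: the query IDs, document IDs, function names and the choice of precision at k as the measure are all made up for the example.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked document IDs that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def mean_precision_at_k(
    run: Dict[str, List[str]],
    judgments: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average precision at k over every query in the run."""
    if not run:
        return 0.0
    scores = [
        precision_at_k(ranked, judgments.get(query_id, set()), k)
        for query_id, ranked in run.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical shared judgments: query ID -> IDs of documents judged relevant.
judgments = {"q1": {"d1", "d3"}, "q2": {"d7"}}

# Hypothetical ranked results from one engine configuration.
run_a = {
    "q1": ["d1", "d2", "d3", "d4"],
    "q2": ["d5", "d7", "d6", "d8"],
}

print(mean_precision_at_k(run_a, judgments, k=2))
```

Because the judgments are just shared data, anyone holding the same corpus and query set could produce their own run and get a number that is directly comparable, which is the whole point.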
As anyone familiar with Lucene knows, ORP will be driven by the people that show up and volunteer to contribute to it, as are all Apache projects. Thus, the slate really is clean. If anyone (and I truly mean anyone, not just Lucene users, even though that is the preliminary focus) is interested, please show up and discuss over at general@lucene.apache.org. We’d welcome the ideas and, moreover, any efforts.




Posts that got the discussion going and that Grant is referring to:
http://www.jroller.com/otis/entry/open_relevance_project
http://www.jroller.com/otis/entry/followup_open_relevance_project
Grant, I’m sorry that the discussion is spread over so many posts. My post was a response to Otis, and has triggered a long comment thread. I wish there were a way to merge comment threads across multiple related posts.
Anyway, as the historical records show, I’m not a fan of batch relevance testing at the query level. But I know that many people are, and it sounds like the ORP could serve as a Lucene-specific (and free) TREC, serving the needs of such people in the Lucene community. I wish you success in that endeavor, and I apologize if the cascade of extrapolations led to unnecessary confusion.