Open Source Search Engine Relevance
For a while now, I have been trying to get my hands on TREC data for the Lucene project. For those who aren’t familiar, TREC is an annual competition for search engines that provides a common set of documents to index, queries to execute, and judgments against which to check your answers to see how well an engine performs. While it isn’t the be-all and end-all for relevance, it is a pretty good sanity check on how you are doing. For instance, many search engines do OK out of the box on it, but once you tune them, they can do much better. Of course, you risk overtuning to TREC as well.
In TREC, the queries and the judgments are provided for free, but one has to pay for the data, or at least most of it, since it is usually owned by Reuters or some other organization. It isn’t expensive or anything, but it is a barrier nonetheless, especially for an open source project. Furthermore, the whole notion of paying for data in this day and age of open source and Creative Commons just doesn’t sit right with me. Don’t get me wrong, I’m a big fan of TREC, having participated in the past; it provides a valuable service to the proprietary/academic IR community.
So, what does this have to do with Lucene? When I say I am trying to get my hands on TREC data, I don’t mean just for me; I literally mean obtaining TREC data for Lucene. That is, I want the data to be made available, ideally, for all Lucene (and, for that matter, all open source search engine) users to use and run experiments on so as to spur innovation in Lucene’s scoring algorithms, etc. Now, I know the copyright owners will never allow this, as I have asked. So, my next thought was: let’s just get it for internal use by committers at Apache. So, I went back to TREC and we have an agreement to do this, more or less. The problem, however, is that they say we can only use the data on ASF (Apache) machines. Not a big deal, right? Kind of. The ASF doesn’t really have the hardware to run TREC-style experiments. We pretty much have one Solaris “zone” allotted to us (a “zone” is a running virtual machine guest image). Furthermore, the ASF is pretty much an all-volunteer, globally distributed organization. We do almost all of our work on our own machines as VOLUNTEERS. Practically speaking, the best way for any of us to take advantage of the data is to have it locally, which, I am told, isn’t going to happen.
So, what’s the point? I think it is time the open source search community (and I don’t mean just Lucene) develop and publish a set of TREC-style relevance judgments for freely available data that is easily obtained from the Internet. Simply put, I am wondering if there are volunteers out there who would be willing to develop a practical set of queries and judgments for datasets like Wikipedia, iBiblio, the Internet Archive, etc. We wouldn’t host these datasets, we would just provide the queries and judgments, as well as the info on how to obtain the data. Then, it is easy enough to provide simple scripts that do things like run Lucene’s contrib/benchmark Quality tasks against said data.
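To make "queries and judgments" concrete, here is a minimal sketch of how published judgments could be read, assuming they follow the four-column TREC qrels layout (topic, iteration, document id, relevance). The document ids below are hypothetical Wikipedia page titles, and skipping `#` comment lines is a convenience I am assuming, not part of the TREC format.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines: 'topic iteration doc_id relevance'.

    Returns {topic: {doc_id: relevance}}.
    """
    judgments = defaultdict(dict)
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' comments are allowed in our files.
        if not line or line.startswith("#"):
            continue
        topic, _iteration, doc_id, relevance = line.split()
        judgments[topic][doc_id] = int(relevance)
    return dict(judgments)


# Hypothetical judgments for two queries against a Wikipedia dump.
sample = [
    "# topic iteration doc_id relevance",
    "1 0 Apache_Lucene 1",
    "1 0 Lucerne 0",
    "2 0 Information_retrieval 1",
]
qrels = parse_qrels(sample)
```

A plain-text file like this is small enough to live on a wiki page or in version control, while the corpus itself stays wherever it is already hosted.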
Practically speaking, I don’t think we even need to go as deep as TREC. I think we would find the most use in making judgments on the top 10 or 20 results for any given query.
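With judgments only on the top 10 or 20 results, the natural measure is precision at K. A minimal sketch follows; the function name and the document ids are made up for illustration.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    # Divide by k, not len(top_k): returning fewer than k results
    # should not raise the score.
    return hits / k


# Hypothetical ranking returned by an engine, and the judged-relevant set.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

p_at_5 = precision_at_k(ranked, relevant, k=5)
p_at_2 = precision_at_k(ranked, relevant, k=2)
```

Here two of the top five results are relevant, so `p_at_5` is 0.4, and one of the top two is relevant, so `p_at_2` is 0.5.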
So, what do others think? Am I off my rocker? Are there any volunteers out there? I think we could do this pretty simply through some scripts, and the effective use of a wiki. I don’t think our goal is, in the short run, to be scientifically rigorous, but it should be over time. Instead, I think our goal is to run a practical relevance test like any organization should when deploying search: take 50 (top) queries and judge them, as well as 20 or so random queries and judge them. (I wonder if Wikipedia would give us their top 50 queries, or maybe they are already available.) Over time, we can add queries, and refine judgments using the web 2.0 mentality of the wisdom of crowds.
FWIW, there is probably some alignment with the Wikia search project.




Grant, sounds like a great idea! Wikipedia data is available at http://stats.grok.se/, although that seems to be the page titles, not users’ original queries.
Hi Grant,
Is the “only on ASF machines” restriction a data storage restriction, or a data processing restriction?
If it’s the former, then there would be obvious work-arounds.
And in either case, yes it would be great to have data to use for relevance research with Lucene and other open source search engines.
– Ken
The restriction, as I understand it, is one of distribution. Taking the docs off of an “ASF machine” implies distribution and is a violation of the terms of the agreement. We can copy it to machines where the ASF is responsible for setting up accounts, or where the accounts are based on the ASF credentials.
How do you plan to use TREC data after the eval? The data consists of evaluations of relevant/irrelevant/maybe-relevant for the top K (usually K = 100 or 200) results for each of the participants. This gives you exact precision-at-K for submitted systems, but doesn’t let you evaluate recall or even precision-at-K for systems developed after the eval.
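The problem for later systems can be sketched in code: any document the new system returns that was never in the judged pool has unknown relevance, so precision-at-K can only be bounded. This is a minimal illustration with made-up ids, not how TREC itself scores runs.

```python
def precision_bounds_at_k(ranked_doc_ids, judgments, k=10):
    """Bound precision@k when some returned documents were never judged.

    judgments maps doc_id -> relevance (0 = not relevant, > 0 = relevant).
    Returns (lower, upper, unjudged_count).
    """
    top_k = ranked_doc_ids[:k]
    relevant = sum(1 for d in top_k if judgments.get(d, 0) > 0)
    unjudged = sum(1 for d in top_k if d not in judgments)
    lower = relevant / k               # every unjudged doc is non-relevant
    upper = (relevant + unjudged) / k  # every unjudged doc is relevant
    return lower, upper, unjudged


# Hypothetical pool judged during the eval.
pool = {"d1": 1, "d2": 0, "d3": 1}
# A system built afterwards returns two documents nobody judged.
new_run = ["d1", "d9", "d3", "d8", "d2"]

lower, upper, unjudged = precision_bounds_at_k(new_run, pool, k=5)
```

For this run the true precision at 5 lies somewhere between 0.4 and 0.8, which is too wide to compare systems until someone judges the two missing documents.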
I don’t know that we need to use TREC data after the eval. I think I am proposing something that is completely open and repeatable by anyone using freely available data. Download X corpus, get queries from the project site, run them and make judgments on top 10-20 and post them/edit a wiki, lather, rinse, repeat.
It isn’t fully thought out as to how it all works. I’m just saying there is a need for some type of open source relevance project. It goes beyond Lucene, in my mind, but I’m happy to start it here.
I have no doubt there are smarter people out there than me who can figure out the details of how to make this rigorous. I just want to see if I can kickstart the discussion and get something going.
I don’t think it is the ‘data’ component that makes this difficult, but the analysis of which results are better.
Ideally you would need to use some kind of crowdsourcing to judge which result is more relevant (i.e., 95% of users click on X when they search for ‘moo’).
The problem then becomes one of privacy. You need to approach an organization to produce an anonymous version of their query logs (which is hard).
If you could convince Wikipedia to release just the search query and the result clicked, you could then use this as a basis of judgment.
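As a rough sketch of what that would look like, assume an anonymized log of (query, clicked page) pairs. No such log format has been published that I know of, so the layout and the page titles here are hypothetical.

```python
from collections import Counter, defaultdict


def click_shares(click_log):
    """For each query, compute the share of clicks each result received."""
    counts = defaultdict(Counter)
    for query, clicked_doc in click_log:
        # Normalize case and whitespace so 'Moo' and 'moo' pool together.
        counts[query.strip().lower()][clicked_doc] += 1
    shares = {}
    for query, per_doc in counts.items():
        total = sum(per_doc.values())
        shares[query] = {doc: n / total for doc, n in per_doc.items()}
    return shares


# Hypothetical anonymized log: only the query text and the clicked page.
log = [
    ("moo", "Cattle"),
    ("moo", "Cattle"),
    ("Moo", "Cattle"),
    ("moo", "MOO_(programming)"),
]
shares = click_shares(log)
```

Click shares are only a proxy for relevance, since users tend to click whatever is ranked first, so they would complement human judgments rather than replace them.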
I agree, Ian, but again, not sure if we need that data. As I understand TREC, they have analysts who read documents, etc. and come up with queries knowing somewhat that there are answers available.
I think we could have people come up with queries and then we make human judgments just as any organization would do in house.
The bigger question to me now is whether this is something people are interested in doing. That is, would people be interested in starting/joining a “Lucene Relevance” project as a subproject of Lucene? We can figure out the details from there.
Grant – Interesting ideas!
What use cases should the collections be designed for? Things that come to mind: enterprise search, web search, product search, etc…
Also, what about interactive retrieval scenarios? This is one of the major drawbacks to TREC-like evaluations. Pooling also has its limitations (see the recent Terabyte Track summaries).
Could a possible solution be a platform like Alexa’s Web Search Platform? The documents would be hosted on the cluster and not distributed (getting around copyright and distribution problems), but could be processed to create search indexes. You could even create a working service and collect real queries. Who knows, you could even perform A/B relevance tests on the best systems using live traffic.
See my blog for more thoughts.
I’ve been using http://www.ClassEngine.com for testing search result relevance. I suggest everybody try it; it works perfectly.
For example,
test search engines search results relevancy for Las Vegas Hotels in ask.com
http://www.ask.com/web?q=las+vegas+hotels