Open Source Search Relevance Follow Up

Jeff’s Search Engine Caffè
Copyright and distribution issues
Let’s say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that evolve over time. GOV2 is a static crawl of 25 million government documents and can therefore be distributed without too much hassle, not to mention that it contains little to no spam. However, there’s a problem: commercial documents are copyrighted! Is it possible to create a large-scale test collection of web documents that can be shared freely? I don’t know the answer to that question. Could that volume of data even be distributed?

Right, we are not going to get into the distribution/copyright game.  We are going to focus on using collections that are freely available.  Each user would just be told what to download.

For example, we could do something like:

Have the user download a static version of Wikipedia from a specific date and index it however they see fit. They then run a set of queries we develop, rate the top 10 or 20 results for each, and post their results along with their actual implementation. The implementation is the part that is always lacking, apart from the usual hand waving of saying “we did stemming and relevance feedback”.  We have the advantage that we can say EXACTLY what we did, with no question about the implementation, so, gasp, others can repeat the exact experiments, as any good scientist does, before going on to improve them.  Then, when the next person comes along, they do the same thing.  If they disagree about the judgments for the same run, we have a discussion, one person convinces the other, and we move on.  Next, someone will come along with a scoring improvement and post those results, and now people will know the current “best” algorithm for this set of data.
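
To make the "rate the top 10 or 20 and post the results" step concrete, here is a rough sketch of how a posted run could be scored and how two people's judgments could be compared. Everything in it is an assumption for illustration: the data shapes, the function names, and the choice of precision at k as the measure are mine, not an agreed format.

```python
from typing import Dict, List, Set

# Hypothetical data shapes (assumptions, not an agreed format):
# a run maps each query to its ranked list of document IDs, and
# judgments map each query to the set of document IDs rated relevant.
Run = Dict[str, List[str]]
Judgments = Dict[str, Set[str]]


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top k results that were judged relevant.

    Divides by k even if fewer than k results were returned, so an
    engine that returns a short list is not rewarded for it.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def mean_precision_at_k(run: Run, judgments: Judgments, k: int = 10) -> float:
    """Average precision at k over every query in the run."""
    if not run:
        return 0.0
    scores = [
        precision_at_k(ranked, judgments.get(query, set()), k)
        for query, ranked in run.items()
    ]
    return sum(scores) / len(scores)


def disagreements(a: Judgments, b: Judgments) -> Dict[str, Set[str]]:
    """Documents, per query, that exactly one of two assessors rated relevant."""
    result: Dict[str, Set[str]] = {}
    for query in set(a) | set(b):
        differing = a.get(query, set()) ^ b.get(query, set())
        if differing:
            result[query] = differing
    return result
```

The output of `disagreements` is the list of documents two people would need to discuss before moving on, and `mean_precision_at_k` gives one number per run that can be posted next to the implementation that produced it.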

Lather, rinse, repeat for other collections, developed over time.  Any engine can submit, anybody can participate.  Open source at its best!
