Opening up Academic Research on IR and Machine Learning

Kudos to Dr. Ted Pedersen for finally saying out loud (in the latest issue of Computational Linguistics, thanks to Bob Carpenter for the pointer) what I’ve long thought about academic publications on topics like information retrieval and machine learning:  namely, that publishing empirical results from software systems without publishing the software is a disservice to the community at best, and pointless at worst.  It hinders learning and it hinders the advancement of the field.  It accounts for a good chunk of the reason I started the Mahout project (now we can say “download Mahout and run X”), partially explains why I’m writing Taming Text and Paper of the Week (now we can say “here’s how this stuff really works in practice”), and is also why I wanted Lucene to have a built-in benchmarking tool where people can publish their configurations so others can try them out and repeat the experiments.

If you ever read papers in this field, you quickly notice they all have this nice theory and they make all these nice (grand?) claims, but at the end of the day, 99% of the interested population can’t reproduce them because they don’t have the software to do it.  As Dr. Pedersen also points out, oftentimes the creator can’t even reproduce it.  Even worse, when they do publish the software, it is abandoned or undocumented, and who knows what settings are required: the professor has moved on (“I got a new grant!”), the grad students have moved on (“I got my PhD.”) and, most importantly, the funding has moved on (Gov’t Program Manager: “I had another successful project, time for a bigger budget!”).  Sorry for the cynicism…

In fact, I think one thing Mahout can really offer researchers is that you can focus on the big ideas, and we’ll take care of making sure your prototype is scalable, documented and maintained!  Besides, wouldn’t you rather be a part of seeing people actually use your work instead of it just living on some piece of paper or locked away on your hard drive collecting cosmic dust?

4 Responses to “Opening up Academic Research on IR and Machine Learning”

  1. The tools that are easy to use get widely re-used. For instance, Adwait Ratnaparkhi’s POS tagger, Collins’ parser, and the Stanford NE tagger are usable out of the box and run from the command line. Then there are tools like Joachims’ SVMLight or Bottou’s SGD that are also easy to use from the command line.

    There’s a big problem in recreating all the features, because papers usually only contain sketches of what’s used. And that’s where the accuracy comes from on actual tasks. And there’s also a big problem recreating heuristic pre- and post-processing. And then there’s patching together whole pipelines using instances of various resources like Wikipedia, Wordnet, and so on.

    The big problem for sharing is scale. If I run over a terabyte of web data, or even over 50 GB, how do I share that? TREC has done things like distribute disk drives on their “large” scale tasks, which aren’t even large scale these days.

  2. Yeah, I totally agree with your points, Bob. The data problem is particularly true. It’s interesting, because if I were a physicist, I would be required to publish the steps I took so that others could reproduce the experiment. In our world, we just put up a formula or two, maybe a nice description and some pseudocode and then wave our hands and magically we get a nice, pretty, well-formatted table of results showing a 10% increase in mean average precision or some great F-measure. Pay no attention to the man behind the curtain.

    Sharing the code and configuration is a good starting point.

  3. I haven’t done any research with Mahout yet, but I think the Apache UIMA framework is a good example of how to provide a common platform.

    We started developing a niche text processing product 5 years ago. That’s when we got into IBM’s UIMA framework and integrated it into our software right from the beginning.
    Besides integrating our own text analysis methods, we looked a lot at university publications, which were sometimes either hard to understand or not practical at all.

    This changed a couple of years ago when I saw more and more folks at universities starting to work with UIMA as a platform. They even started exchanging components with each other. At a conference about UIMA last year I saw 8 presentations of different UIMA projects, 6 of them from universities and 2 from industry (1 from the Netherlands, 1 from Germany).

    Over here in Germany it looks like university research in text processing is already ahead of the industry, and the industry has to get involved to benefit from the research.

    I will do my Mahout homework this weekend.

  4. P.S.: Just bought your book and looked into the early access papers, congratulations!
