Apache Mahout Status
It’s been a while since I’ve said much about Mahout, but things seem to be trending upwards. Subjectively, it feels like there is more going on on both mahout-user@lucene.a.o and mahout-dev@lucene.a.o. It also feels like people are starting to kick the tires, which is vital. We also seem to have gained a few more contributors of late, which is, of course, key for any open source project. The coming months will be really important for us existing contributors to make sure we help newcomers have a good experience and address their issues.
In looking at the mailing list subscriptions, the user list is over 300 subscribers now (I believe a few months ago it was just under 250) and the dev list is up a little bit, although not a lot. Of course, subscription rates come and go, but we’ve been pretty solid around these numbers for a while, which is a good sign that people are still interested, IMO.
From a personal standpoint, I’ve had a bit more time to work on it, which gets me jazzed up. Right now, I’m working on getting some examples up for clustering documents (see https://issues.apache.org/jira/browse/MAHOUT-126 and https://issues.apache.org/jira/browse/MAHOUT-65), collaborative filtering (using Taste in Mahout) and categorization. The clustering stuff is likely still pretty naive on my part, but it’s a start. The clustering and categorization work will also feed into my book Taming Text.
We also have several Google Summer of Code students involved, which is always enjoyable and a learning experience. I even got to meet the student I’m mentoring (David Hall) in person this year, which was pretty cool. David is implementing Latent Dirichlet Allocation on MapReduce for Mahout. I’m not sure I understand it all just yet, but I trust David will make it clear to me by the end of the summer.
Speaking of meeting people, at that same meetup I finally got to meet fellow committers Jeff Eastman and Ted Dunning. It's always nice to put faces to names, having worked with them on Mahout for well over a year now.




Who’s mentoring whom?
Picking up a volunteer student like David Hall’s awesome — I loved the paper that came out of his undergrad thesis on applying LDA to the ACL Anthology corpus.
There are two aspects to scaling LDA — number of documents, and number of topics. The documents can be parallelized to some extent depending on the algorithm. For scaling number of topics in a sampling context, check out:
I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, M. Welling. “Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation.” ACM Knowledge Discovery and Data Mining (KDD), 2008.
http://www.ics.uci.edu/~asuncion/pubs/KDD_08.pdf
I had an experience like the one David’s likely to get when I went to SpeechWorks: I’d been a professor who released software and I knew algorithms very well; I just hadn’t had practical professional programming experience. I learned an incredible amount from coworkers who would’ve been my students had they bothered to go to grad school.
Yeah, I have already learned a lot from David! I’m often in the other boat: I’ve implemented a lot of the ideas, but don’t always know the theory. Mahout has been a great learning experience for me already. Hopefully, I can pass on what I know about communities and open source. In fact, learning more about ML was precisely one of my goals in starting the project. My main NLP background was in rule-based systems, but it was pretty obvious to me that ML approaches were the growing trend, so Mahout helps me learn and apply.
[...] only on the project's mailing list (Apache Mahout Status by Grant Ingersoll) is traffic increasing; the first projects successfully using Mahout are also [...]
[...] Apart from three new additions to the code base, summer also brought quite some traffic to the user list – not only in terms of subscriptions but also in terms of developers contributing to the discussions online. Currently, it looks like the project is really gaining momentum, as also noted in Grant Ingersoll’s post. [...]