Mahout News

Wow!  Mahout has just got me pumped up.  I feel like we’ve got a lot of positive momentum and that we are starting to get the various pieces of our suite of machine learning libraries in place.  Various news items include:

  1. Ted Dunning is now a committer!  Welcome Ted!
  2. I put up a patch for a map-reduce ready (MRR?) version of a Naive Bayes classifier.  Still needs some work, but feedback is appreciated.  It includes a sample of running it against the 20 Newsgroups data.
  3. Seems like there is someone new providing comments, etc. every day.  The mailing list continues to grow.
  4. We have slotted 3 Google Summer of Code participants and had many excellent proposals.  It was very difficult to decide.
  5. The Taste project is now officially accepted into Mahout and will be running in distributed mode in no time, I’m sure.
  6. Karl is working on a hierarchical clustering implementation, amongst other things.

We are also trying to obtain compute resources for committers so that we can do testing/benchmarking on a cluster, as opposed to some hodge-podge of resources any individual has access to at any given point in time.  If you are in a position to donate cloud time, please drop me a line.

What I’ve been up to lately: Lucid Imagination

Lucid Imagination

Well, the cat is out of the bag.  In case you haven’t heard, a few Lucene/Solr/Mahout committers (Erik Hatcher and Yonik Seeley) and I have teamed up with some other long time search veterans (Marc Krellenstein from Northern Light and former CTO of Reed Elsevier, amongst others) to build a company around providing product, support and other services related to the Lucene family of technologies.  Obviously, this is something I am quite excited about and have thought about for a long time.  Both Lucene and Solr are quite robust and ready for the enterprise in a lot of ways, but need that extra push a company can give to make them polished.  Our goals are to build a best of breed product (Lucid Focus™) based on them and to enrich the open source community as well.  Another exciting thing is we plan to fully support both our product AND the open source versions of Lucene and Solr.

At any rate, enough of the marketing.  If you’re interested in trying our beta or talking support, drop us an email at the address on our site.

Manning: Taming Text

Scary…  I guess it is real!

BarCampRDU

BarCamp wiki / BarCampRDU

Threw my name in the ring for BarCamp RDU today.  Haven’t been to BarCamp before, but Erik Hatcher suggested I go and check it out.

Also put in a Proposed Session of “Apache Mahout and Hadoop - Having fun with Map Reduce and distributed computing”.  Figure we’d talk about the basics of M/R, Hadoop and Mahout programming, look at some Mahout examples, run some code, maybe even go nuts and try setting up a distributed job across laptops just for the fun of it.  Was also thinking it might be fun to talk about Lucene/Solr, but figured one session was enough, especially since I am a BarCamp virgin.

Mahout Machine Learning Fun

It’s been an interesting few months over in Mahout land. First off, I am psyched about the response the project has been getting. Seems like there is a pent up demand for large scale machine learning these days.  I figured we would do all right in the early months, but I didn’t think we would have as many subscribers and participants as we do this early. Furthermore, the code contributions have started to come in and we had 15 or so applicants (for 2 or 3 spots) for our Google Summer of Code CFP.

Additionally, there were a fair number of inquiries about Mahout at ApacheCon EU and I got to meet Karl Wettin and Isabel Drost there (two Mahout committers). I went well over 3 years in Lucene Java land before meeting any of the other committers on Lucene.

Next, we added Jeff Eastman as a committer. Jeff has jumped in head first and is already helping out a lot.

Sean Owen has donated the Taste collaborative filtering project. We also made Sean a committer, and he is already contributing in other areas. We are still waiting to clear the legal hurdles here, but I think you will see Taste in the Mahout code base within the month.

Finally, it’s not official yet, but keep your eyes on Mahout for what should be another significant announcement of an NLP/ML project joining Mahout. I know, I know, I’m such a tease… ;-)

Why Lucene Isn’t That Good | Javalobby

Patches welcome…  I know that is an old saw, but that is the only way it’s going to get better.

There are some good points in here, and some stuff that is a bit dramatic.

We do try to keep adapting Lucene and make it better, but in some respects we are damned if we do, damned if we don’t.  The whole abstract vs. interface debate has been going on for a long time on Lucene.  If we switched to interfaces, then people would be complaining constantly about how we break their code every time we do a release that introduces new methods.  If we leave things as abstract classes, then people like Cedric complain that Lucene is hard to extend.

As for the “final” declarations, the reason it works that way is that we can’t see the future.  Oftentimes, the things that are final now are legacy from back in the early days when we couldn’t imagine some of the ways Lucene is now being used, or, as another commenter on the thread said, are there to avoid unintended consequences.  That’s why people submit patches and things get improved.  Sorry we can’t all work on Lucene nonstop all day.  If Lingway wants to hire me to do that, you know how to reach me! (At least as a contractor; I’m not interested in leaving my current employment, but I do offer consulting.)

As for SpanQueries, yes they can be slower.  I’d love to have a discussion with someone like you who is a heavy user to see how to improve them.  Please send your profiling info ASAP to the java-dev mailing list!  Please don’t let all those hours you spent go to waste.  Even a half baked patch is a starting point.

I do agree about scoring being pluggable, but I gotta tell ya’, scoring is hard and not for the faint of heart.  It’s a whole other layer and doing it right means being fast and accurate and doing it wrong means deep, dark, scary rabbit holes where you don’t see light for days.  One of the simplest ways, however, to improve scoring is to change the length normalization.
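To make the length normalization point a bit more concrete, here is a minimal sketch in Python (not Lucene's Java API).  The first function mirrors the 1/sqrt(number of terms) formula that Lucene's default similarity uses for length normalization; the second, `floored_length_norm`, and its `floor` parameter are purely hypothetical names for one possible tweak, where very short fields stop getting an outsized boost.

```python
from math import sqrt


def default_length_norm(num_terms):
    """1/sqrt(n): shorter fields get a larger norm, so matches in them score higher."""
    return 1.0 / sqrt(num_terms)


def floored_length_norm(num_terms, floor=50):
    """Hypothetical variant: treat every field shorter than `floor` terms
    as if it had exactly `floor` terms, so tiny fields are not over-rewarded."""
    return 1.0 / sqrt(max(num_terms, floor))


if __name__ == "__main__":
    # Compare the two norms for a few field lengths.
    for n in (4, 50, 400):
        print(n, default_length_norm(n), floored_length_norm(n))
```

In Lucene itself, a change like this would go into a custom Similarity rather than a free-standing function, and the right floor (if any) depends entirely on your collection.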

As for some of the other “higher” features, like crawling/clustering, those are nice, but they don’t belong in the core of Lucene b/c not everyone needs them, although the number is increasing.  How many people have collections that go beyond 10-20M documents?  What would be nice, however is a contrib module or a layer above Lucene that provides all those nice things you want (you know, you can embed Solr in no time, by the way).  Lucene is meant to be really fast on one machine and to also play nice when you put in the appropriate distributed pieces.  It’s unfortunate that no one has donated the distributed piece yet (although Solr does now have it, thanks to Yonik!)

At any rate, thanks for the ideas.  Hope to see your patches soon!

Jeff Eastman’s Marvelous Cloud Computing Adventure

Mahout’s newest committer, Jeff Eastman, has a new blog on Mahout and Hadoop…

SummerOfCode2008 - Looking for a summer project in Machine Learning?

SummerOfCode2008 - General Wiki

Check out the Apache Summer of Code page (link above) to see how you can spend the summer developing large scale machine learning algorithms and help out the Mahout project.  We’d love to have a few students put together some projects implementing one or more machine learning algorithms using Hadoop.  So, if you are interested, or know someone who might be, give ‘em this information!

Mahout: k-means Clustering

I committed a first crack at k-means clustering to Mahout last night, thanks again to Jeff Eastman’s excellent work.  This means Mahout now has two clustering algorithms designed to run on Hadoop’s MapReduce framework, so they should be able to scale up to very large data sets.

To learn more about k-means, see the Mahout wiki, specifically our page on k-means.
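For a rough feel of why k-means fits the map-reduce style, here is a small single-machine sketch in Python.  It is not the Mahout code, and the function names (`map_point`, `reduce_cluster`, `kmeans_iteration`) are made up for illustration.  The map step assigns each point to its nearest centroid, and the reduce step averages the points that landed on each centroid.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: the new centroid is the mean of the assigned points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One k-means pass: map every point, group by key, reduce each group."""
    groups = defaultdict(list)
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # A centroid that attracted no points keeps its old position.
    return [
        reduce_cluster(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


if __name__ == "__main__":
    points = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans_iteration(points, [(0, 0), (10, 10)]))
```

On a cluster, the grouping in the middle is what the framework's shuffle does for you, and you would repeat the iteration until the centroids stop moving much.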

FeatherCast » Blog Archive » Episode 43: Lucene

I did a FeatherCast today with Rich Bowen.  Dang, he is quick at editing…