In case you haven’t heard, and are in Europe this June (or want to be), you should check out the Berlin Buzzwords conference. It’s a great conference for all things related to Lucene, Solr, Hadoop, Mahout, NoSQL and generally scaling. The CFP is open now through March 11.
January 18th, 2012 | Posted in Lucene, Mahout, Solr | No Comments

Drew, Tom and I are feverishly working away on finishing up Taming Text. We are currently in the process of addressing the feedback we got from our final review and should have updates up soon. I have also posted all of the book’s source code up on Github under the Taming Text user. The source includes, amongst other things, a simple Question Answering system using Solr and OpenNLP, as well as analyzers for Lucene that use OpenNLP for sentence detection, part of speech tagging and Named Entity Recognition. As with most books, these examples are meant to be just that, examples.
December 27th, 2011 | Posted in Lucene, OpenNLP, Solr, Taming Text | No Comments
I’ve posted my review of “Mahout in Action” on Lucid’s website: Mahout in Action Review.
October 15th, 2011 | Posted in Mahout | No Comments
For those who have wanted other scoring models in Lucene/Solr (Okapi, others), more details can be found on Lucid’s blog: Lucid Imagination » Flexible ranking in Lucene 4.
September 12th, 2011 | Posted in Lucene | No Comments
Just ordered “R in Action” from Manning. Looking forward to learning more about R, as it comes up often when discussing how to solve problems smaller than what is appropriate for Apache Mahout. Hopefully, I will have time to post a review in the coming weeks.
September 2nd, 2011 | Posted in Lucene | 1 Comment
Triangle Hadoop Users Group, Next Meeting: Sept. 13 @ Bronto Software.
Ted Dunning of Mahout fame will be speaking at the next TriHUG meeting on MapR and its relationship with Hadoop, etc.
August 28th, 2011 | Posted in TriHUG | No Comments
It’s that time of year again: time to vote for SXSW talks. Last year I did a talk with RC Johnson of BazaarVoice on Solr as NoSQL; this year I thought I would try to fly solo and submitted a talk on Apache Mahout.
So, if you are so inclined to do the whole crowdsourcing thing, please go vote for my talk at SXSW 2012 – Apache Mahout: Bringing Intelligence to Your App and then maybe I will see you at SXSW in 2012.
August 15th, 2011 | Posted in Hadoop, machine learning, Mahout, Map Reduce | No Comments
After some time away, I’m happy to have had some time recently to work on Mahout again. Lots of goodness all over the place happening there that I’ll leave to others to explain while I focus in on a few recent things I’ve been doing.
First off, I was doing a fair amount of work calculating document similarities across whole collections using, at first, the RowSimilarityJob and later a map-side simplification I wrote, called the VectorDistanceSimilarityJob, that uses the distributed cache. Both of these come in handy when one wants to calculate pairwise similarity between all (or most) items in a collection. The original Mahout implementation was focused on providing recommendations, but as outlined in the Elsayed, Lin and Oard paper, it is quite useful for text as well in cases where one wants to precompute “more like this” for all documents. As for the need for two similar approaches, see the discussion at http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7. In essence, I didn’t need a fully generic implementation, which was a bit slower on larger matrices, since I mainly wanted to compare all my vectors in HDFS against a subset of “core” vectors that fit into memory. That being said, Sebastian is already hard at work on making the more generic version perform better when certain distance measures are used while still offering the full suite of capabilities of the existing RowSimilarityJob. See MAHOUT-767 for more info on that work.
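To make the map-side idea a bit more concrete, here is a minimal Python sketch of the general pattern: keep the small set of “core” vectors in memory and stream every other vector past them, emitting one distance per pair. This is not Mahout’s actual Java code. The function names, the dict-based sparse vector format and the choice of cosine distance are all assumptions made purely for illustration.

```python
import math
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical sparse vector format: term index -> weight.
Vector = Dict[int, float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def map_side_distances(
    all_vectors: Iterable[Tuple[str, Vector]],
    core_vectors: Dict[str, Vector],
    measure: Callable[[Vector, Vector], float] = cosine_distance,
) -> Iterator[Tuple[str, str, float]]:
    """Stream over every vector and compare it to each in-memory core vector.

    `core_vectors` stands in for the small set that each mapper would load
    into memory (for example from a distributed cache); `all_vectors` stands
    in for the large collection read record by record.
    """
    for doc_id, vec in all_vectors:
        for core_id, core_vec in core_vectors.items():
            yield doc_id, core_id, measure(vec, core_vec)


# Tiny illustrative data set (made up for this sketch).
core = {"c1": {0: 1.0}}
docs = [("d1", {0: 2.0}), ("d2", {1: 3.0})]
results = list(map_side_distances(docs, core))
```

Because each input vector is handled independently, no shuffle or reduce step is needed in this pattern, which is the trade-off against a fully generic all-pairs job: it only works when the core set fits in memory.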
Now, I’m looking into some more pruning techniques via MAHOUT-688. After that quick patch, I think I’m going to dig in a bit more to recommendations as well as run some tests on the ASF mail archives I posted a while back (see below for an update).
Also, I’ve switched to using Git and Github for managing my Mahout changes (as well as other work), so if you want to see what I’m up to, check out my Github account.
It’s not complete yet, but the ASF Public Mail archive I put up last September on Amazon AWS is getting a fresh new version. The interim solution is available at https://s3.amazonaws.com/asf-mail-archives-7-18-2011/index.html, but look for it to be a Public Data Set hosted by Amazon soon. The September version of this data contained roughly 6.7M emails sent to the public mailing lists at the Apache Software Foundation, so I suspect this version has somewhere in the range of 7M+ items, but I haven’t counted them. At any rate, I hope it is useful to people.
Finally, on a personal note, I’m back at Lucid Imagination after a brief move elsewhere, this time in a new role as Chief Scientist. Lucid is a company I co-founded and helped build up for the past 4 years. I’m looking forward to being back working closely with Lucene and Solr again and with a top-notch technical team. I’m also looking forward to working on Mahout more, as well as other technologies like Hadoop, Pig, HBase and the like, especially as they relate to search and recommendations.
August 5th, 2011 | Posted in Hadoop, Lucene, Mahout, Solr | No Comments
If you’re interested in working on large-scale problems with tools like Apache Hadoop, Lucene, Solr, Mahout, Cassandra, etc., and you live in the Raleigh/Durham/Chapel Hill or greater NC area, then you might be interested in the upcoming Scale-A-Thon event that several of us from the Triangle Hadoop User’s Group are putting on June 18th at Bronto Software. Check out our website for more information and to register!
May 31st, 2011 | Posted in Hadoop | No Comments