Lucid Imagination » Integrating Apache Mahout with Apache Lucene and Solr – Part I (of 3)

Readers might find my first post on integrating Lucene/Solr and Mahout interesting:

Lucid Imagination » Integrating Apache Mahout with Apache Lucene and Solr – Part I (of 3).

Lucid Imagination » Free Webinar: Mastering Solr 1.4 with Yonik Seeley

Lucid Imagination » Free Webinar: Mastering Solr 1.4 with Yonik Seeley.

The title says it all.

When using open source makes you an enemy of the state | Technology | guardian.co.uk

When using open source makes you an enemy of the state | Technology | guardian.co.uk.

Scary to think that me choosing to give my work away could be considered against the law in some places.

TriJUG: Intro to Mahout Slides and Demo examples

First off, a big thank you to TriJUG and all the attendees for allowing me to present Apache Mahout last night.  Also a big thank you to Red Hat for providing a most excellent meeting space.  Finally, thanks to Manning Publications for providing vouchers for Taming Text and Mahout In Action for the end-of-the-night raffle.  Overall, I think it went well, but that’s not for me to judge.  There were a lot of good questions and a good-sized audience.

The slides for the Monday, Feb. 15 TriJUG talk are at: Intro to Mahout Slides (Intro Mahout (PDF)).

For the “ugly demos”, below is a history of the commands I ran for setup, etc.  Keep in mind that you can almost always run bin/mahout <COMMAND> --help to get syntax help for any given command.

Here’s the preliminary setup stuff I did:

  1. Get and preprocess the Reuters content per http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
  2. Create the sequence files: bin/mahout seqdirectory --input <PATH>/content/reuters/reuters-out --output <PATH>/content/reuters/seqfiles --charset UTF-8
  3. Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF weight (for LDA): bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
  4. Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF-IDF weight (for Clustering): bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TFIDF --norm 2 --weight TFIDF
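
To make the difference between the two seq2sparse runs concrete, here is a minimal Python sketch of TF and TF-IDF weighting followed by Euclidean (L2) normalization. It is not Mahout code: the toy documents, the function names, and the plain log(N/df) IDF formula are all assumptions made for illustration, and Mahout's own weighting will produce different numbers.

```python
import math
from collections import Counter

def l2_normalize(vec):
    """Scale a {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def tf_vectors(docs):
    """Raw term counts per document, then L2-normalized."""
    return [l2_normalize(dict(Counter(doc))) for doc in docs]

def tfidf_vectors(docs):
    """Term count times log(N / document frequency), then L2-normalized."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weighted = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        vectors.append(l2_normalize(weighted))
    return vectors

# Hypothetical, already-tokenized documents.
docs = [["oil", "price", "oil"], ["grain", "price"], ["oil", "export", "price"]]
tf = tf_vectors(docs)
tfidf = tfidf_vectors(docs)
```

In this toy corpus "price" occurs in every document, so its TF-IDF weight drops to zero while its TF weight does not. That is the usual argument for TF-IDF when clustering: terms that appear everywhere stop dominating the distance between documents.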

For Latent Dirichlet Allocation I then ran:

  1. ./mahout lda --input <PATH>/content/reuters/seqfiles-TF/vectors/ --output <PATH>/content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 20
  2. ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile

For K-Means Clustering I ran:

  1. ./mahout kmeans --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters
  2. Print out the clusters: ./mahout clusterdump --seqFileDir /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20
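
For readers new to k-means, the core loop can be sketched in a few lines of Python. This is a plain single-machine version of the standard assign-then-recompute iteration, not Mahout's MapReduce implementation; the 2-D toy points, the hand-picked starting centroids, and the function names are assumptions made for illustration.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, max_iter=10):
    """Alternate assignment and centroid recomputation until nothing moves."""
    centroids = list(centroids)
    assignment = []
    for _ in range(max_iter):
        assignment = [closest(p, centroids) for p in points]
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment

# Hypothetical data: two obvious groups, seeded with one point from each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, assignment = kmeans(points, centroids=[points[0], points[3]])
```

In the demo above the "points" are the TF-IDF document vectors, and clusterdump prints the top terms for each resulting centroid.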

For Frequent Pattern Mining:

  1. Download the accidents.dat data set from http://fimi.cs.helsinki.fi/data/
  2. ./mahout fpg -i <PATH>/content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
  3. ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000
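
If "frequent pattern" is an unfamiliar term, the following Python sketch may help. It is deliberately not FP-Growth (which is what the fpg command runs); it is a brute-force count of item combinations, assuming each input line is one transaction of space-separated item IDs. The sample transactions, the function name, and the min_support and max_size parameters are hypothetical and do not correspond to the fpg flags.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(lines, min_support, max_size=2):
    """Count every itemset up to max_size and keep those seen often enough."""
    counts = Counter()
    for line in lines:
        # One transaction per line; de-duplicate and sort so tuples are canonical.
        items = sorted(set(line.split()))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical transactions.
transactions = ["1 2 3", "1 2", "2 3", "1 2 4"]
patterns = frequent_itemsets(transactions, min_support=3)
```

Brute force like this blows up quickly as transactions get longer, which is why algorithms such as FP-Growth exist: they find the same kind of co-occurring itemsets without enumerating every combination.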

Apache Mahout talk at Triangle Java User’s Group

For those who live in the Triangle, I’ll be giving an intro talk on Mahout next Monday.  See Welcome to the Triangle Java Users Group for more details.  Do note the location is no longer in RTP, but at the Red Hat campus at NCSU.

Hope to see you there!

FAST to Solr for *NIX

It’s hardly news anymore that MS is dropping support for *NIX platforms, but I especially like J Lawson’s nice little quote about their switch from FAST to Apache Solr/Lucene:

We very quickly switched to Apache Lucene, Solr and are very happy with the result. Performance is good, and we have cut hosting costs by a staggering 400%.

via J Lawson’s Blog Space » 2010 » February.

I’ve seen similar reductions (and sometimes more significant ones) in other replacements I’ve been involved with at Lucid Imagination.  To me, the likely reason is the flexibility that Lucene/Solr provide for an application to model its domain as efficiently as needed: not too much, not too little, instead of the “kitchen sink” approach which turns on every feature by default.  Too often, I think buyers are seduced by a really long feature list, even though their application may only need 80% of those features.  After all, would you pay extra for a car with heated seats if you lived in the tropics?  With Solr, you get rock solid search capabilities (no one disputes that, because all the vendors pretty much use the same model) and can easily turn on/off most of the other features.  Furthermore, if you need something not included (most of the time you don’t, because it’s already there), you can choose the best-of-breed implementation of that feature and integrate it.

At any rate, as Shalin said the other day, Apache L/S welcomes all FAST *NIX users.

Lucid Imagination » The Seven Deadly Sins of Solr

Props to Jay Hill on an excellent article on things to watch for when setting up Solr: Lucid Imagination » The Seven Deadly Sins of Solr.

Just posted on: Apache Lucene Connector Framework now in Incubation at the ASF

I just put up some initial info on the new Apache Lucene Connector Framework project that is now in ASF Incubation.  See Lucid Imagination » Apache Lucene Connector Framework now in Incubation at the ASF.

Measuring Measures: Learning About Statistical Learning

All you Mahouts out there might find some background help in Bradford Cross’ blog post: Measuring Measures: Learning About Statistical Learning.

Spatial Search Article is Live

My latest article is up at IBM’s developerWorks on spatial search with Lucene and Solr.  Have a look at: Location-aware search with Apache Lucene and Solr.