TriJUG: Intro to Mahout Slides and Demo examples

First off, a big thank you to TriJUG and all the attendees for allowing me to present Apache Mahout last night.  Also, a big thank you to Red Hat for providing a most excellent meeting space, and to Manning Publications for providing vouchers for Taming Text and Mahout in Action for the end-of-the-night raffle.  Overall, I think it went well, but that’s not for me to judge.  There were a lot of good questions and a good-sized audience.

The slides for the Monday, Feb. 15 TriJUG talk are at: Intro to Mahout Slides (Intro Mahout (PDF)).

For the “ugly demos”, below is a history of the commands I ran for setup, etc.  Keep in mind that you can almost always run bin/mahout <COMMAND> --help to get syntax help for any given command.

Here’s the preliminary setup stuff I did:

  1. Get and preprocess the Reuters content per http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
  2. Create the sequence files: bin/mahout seqdirectory --input <PATH>/content/reuters/reuters-out --output <PATH>/content/reuters/seqfiles --charset UTF-8
  3. Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF weight (for LDA): bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
  4. Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF-IDF weight (for Clustering): bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TFIDF --norm 2 --weight TFIDF
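
To make the "Euclidean norm" part of steps 3 and 4 concrete, here is a toy sketch of what normalizing to unit Euclidean (L2) length means for one document vector. It is only an illustration in awk, not Mahout code; the function name l2_normalize and the sample counts are made up for the example.

```shell
#!/usr/bin/env bash
# Toy illustration, not Mahout code: scale a vector of term weights
# so that its Euclidean (L2) length becomes 1.
l2_normalize() {
  # $1: space-separated term weights for one document (hypothetical input format)
  echo "$1" | LC_ALL=C awk '{
    sumsq = 0
    for (i = 1; i <= NF; i++) sumsq += $i * $i      # sum of squares
    norm = sqrt(sumsq)                              # Euclidean length
    for (i = 1; i <= NF; i++)
      printf "%s%.2f", (i > 1 ? " " : ""), (norm > 0 ? $i / norm : 0)
    printf "\n"
  }'
}

l2_normalize "3 4 0"   # prints: 0.60 0.80 0.00
```

The point of the scaling is that a long document and a short one with the same mix of terms end up with comparable vectors.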

For Latent Dirichlet Allocation I then ran:

  1. ./mahout lda --input <PATH>/content/reuters/seqfiles-TF/vectors/ --output <PATH>/content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 20
  2. ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile

For K-Means Clustering I ran:

  1. ./mahout kmeans --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters
  2. Print out the clusters: ./mahout clusterdump --seqFileDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary <PATH>/content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20

For Frequent Pattern Mining:

  1. Download the accidents.dat data set from http://fimi.cs.helsinki.fi/data/
  2. ./mahout fpg -i <PATH>/content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
  3. ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000
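
For a feel of the kind of input the fpg step works on, here is a toy sketch. It assumes each line of the data file is one transaction made of space-separated item IDs, which is what the -regex [\ ] splitter in step 2 suggests. The script counts how many transactions contain each item (the item's support). It is plain awk on made-up data, not Mahout's implementation, and item_support is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Toy illustration, not Mahout code: per-item support counts
# for a file with one space-separated transaction per line.
item_support() {
  # $1: path to the transactions file
  awk '{
    split("", seen)                      # reset the per-line "already counted" set
    for (i = 1; i <= NF; i++)
      if (!($i in seen)) { seen[$i] = 1; count[$i]++ }
  }
  END { for (item in count) print item, count[item] }' "$1" |
    sort -k2,2nr -k1,1n                  # highest support first, ties by item ID
}

# Made-up sample data: three transactions.
toy="$(mktemp)"
printf '%s\n' '1 2 3' '1 2' '2 4' > "$toy"
item_support "$toy"
rm -f "$toy"
```

On the sample data, item 2 appears in all three transactions and item 1 in two, so those are listed first. Frequent pattern mining extends the same counting idea from single items to sets of items that occur together.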

9 Responses to “TriJUG: Intro to Mahout Slides and Demo examples”

  1. Grant, I can not download slides (404), is it just me?

  2. it’s happening to me as well.

  3. ok.. it’s a typo in the link
    try http://lucene.grantingersoll.com/wp-content/uploads/2010/02/intro-mahout.pptx

  4. thnx, cool

  5. When I do step 2:
    # Create the sequence files: bin/mahout seqdirectory --input /content/reuters/reuters-out --output /content/reuters/seqfiles --charset UTF-8

    I get this error
    $~/src/java/mahout/bin/mahout seqdirectory -input ~/src/java/reuters-mahout/data/reuters-out/ -output ~/src/java/reuters-mahout/data/seqfiles --charset UTF-8
    Exception in thread "main" org.apache.commons.cli2.OptionException: Unexpected -input while processing Options
    at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:205)

    I also tried using “--”
    ~/src/java/mahout/bin/mahout seqdirectory --input ~/src/java/reuters-mahout/data/reuters-out/ --output ~/src/java/reuters-mahout/data/seqfiles --charset UTF-8
    Exception in thread "main" org.apache.commons.cli2.OptionException: Unexpected --input while processing Options
    at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:205)

  6. It should be --input (two dashes) if that helps at all.

  7. [...] ranging from clustering to classification and collaborative filtering.  For more on Mahout, see my TriJUG talk or my developerWorks article.  Instead of going over the litany of things implemented in Mahout, [...]

  8. Dear Grant,

    I got an exception when trying mahout 0.3 locally at step 2 in LDA test:

    CMD is:
    ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input /content/reuters/seqfiles-TF/lda-output/state-19 --dict /content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output /content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile

    And the warning & exception is:

    WARNING: No org.apache.mahout.clustering.lda.LDAPrintTopics.props found on classpath, will use command-line arguments only
    May 20, 2010 10:13:27 PM org.slf4j.impl.JCLLoggerAdapter error
    SEVERE: MahoutDriver failed with args: [--input, tf_sparse_seq/lda/state-0, --dict, tf_sparse_seq/dictionary.file-0, --words, 10, --output, tf_sparse_seq/lda/topics, --dictionaryType, sequencefile, null]
    31659
    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 31659
    at java.util.Arrays$ArrayList.get(Arrays.java:3393)
    at org.apache.mahout.clustering.lda.LDAPrintTopics.topWordsForTopics(LDAPrintTopics.java:214)
    at org.apache.mahout.clustering.lda.LDAPrintTopics.main(LDAPrintTopics.java:153)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)

    And I found that *LDAPrintTopics.props* is indeed not in the $MAHOUT_HOME/conf directory. How should I solve the problem?

    Thank you!

  9. [...] here you can find some more code and some background info at metaoptimize Explore posts in the same [...]
