February 24th, 2010 | Posted in Solr | No Comments
When using open source makes you an enemy of the state | Technology | guardian.co.uk.
Scary to think that me choosing to give my work away could be considered against the law in some places.
February 24th, 2010 | Posted in Lucene | No Comments
First off, a big thank you to TriJUG and all the attendees for allowing me to present Apache Mahout last night. Also a big thank you to Red Hat for providing a most excellent meeting space. Finally, thanks to Manning Publications for providing vouchers for Taming Text and Mahout in Action for the end-of-the-night raffle. Overall, I think it went well, but that’s not for me to judge. There were a lot of good questions and a good-sized audience.
The slides for the Monday, Feb. 15 TriJUG talk are at: Intro to Mahout Slides (Intro Mahout (PDF)).
For the “ugly demos”, below is a history of the commands I ran for setup, etc. Keep in mind that you can almost always run bin/mahout <COMMAND> --help to get syntax help for any given command.
Here’s the preliminary setup stuff I did:
- Get and preprocess the Reuters content per http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
- Create the sequence files: bin/mahout seqdirectory --input <PATH>/content/reuters/reuters-out --output <PATH>/content/reuters/seqfiles --charset UTF-8
- Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF weight (for LDA): bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
- Convert the Sequence Files to Sparse Vectors, using the Euclidean norm and the TF-IDF weight (for Clustering): bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
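The three setup commands can be collected into a small script. This is only a sketch: the CONTENT_DIR and MAHOUT variables and the prepare_reuters function name are made up for illustration and are not part of Mahout, and the default is a dry run that just prints each command so you can check the paths before running anything.

```shell
#!/bin/sh
# Dry-run sketch of the setup steps above.
# CONTENT_DIR, MAHOUT and prepare_reuters are hypothetical names introduced
# here; only the mahout sub-commands and flags come from the post.
CONTENT_DIR="${CONTENT_DIR:-/path/to/content/reuters}"
# Defaults to "echo bin/mahout" so nothing is executed; set MAHOUT=bin/mahout
# to run the commands for real.
MAHOUT="${MAHOUT:-echo bin/mahout}"

prepare_reuters() {
    # Step 1: directory of text files -> sequence files
    $MAHOUT seqdirectory --input "$CONTENT_DIR/reuters-out" \
        --output "$CONTENT_DIR/seqfiles" --charset UTF-8

    # Step 2: sequence files -> TF-weighted sparse vectors (used for LDA)
    $MAHOUT seq2sparse --input "$CONTENT_DIR/seqfiles" \
        --output "$CONTENT_DIR/seqfiles-TF" --norm 2 --weight TF

    # Step 3: sequence files -> TF-IDF-weighted sparse vectors (used for clustering)
    $MAHOUT seq2sparse --input "$CONTENT_DIR/seqfiles" \
        --output "$CONTENT_DIR/seqfiles-TF-IDF" --norm 2 --weight TFIDF
}

prepare_reuters
```

Running it as-is prints three bin/mahout command lines; running it with MAHOUT=bin/mahout from the Mahout directory would execute them instead.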
For Latent Dirichlet Allocation I then ran:
- ./mahout lda --input <PATH>/content/reuters/seqfiles-TF/vectors/ --output <PATH>/content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 20
- ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile
For K-Means Clustering I ran:
- ./mahout kmeans --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters
- Print out the clusters: ./mahout clusterdump --seqFileDir /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary /Volumes/Content/grantingersoll/content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20
For Frequent Pattern Mining:
- Download the accidents data set (accidents.dat) from http://fimi.cs.helsinki.fi/data/
- ./mahout fpg -i <PATH>/content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
- ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000
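The -regex [\ ] argument to fpg suggests that each input line is treated as one transaction whose items are separated by spaces. As a rough illustration of the simplest quantity a frequent pattern miner works from, the per-item support count, here is a sketch in plain sh and awk. The sample transactions and the item_support function are invented for illustration; this is not how Mahout implements FP-Growth, just a way to get a feel for the input format.

```shell
#!/bin/sh
# Hypothetical sample data: one transaction per line, items separated by
# single spaces. These four transactions are made up.
sample=$(mktemp)
printf '%s\n' '1 2 3' '1 2' '2 4' '1 2 4' > "$sample"

# item_support FILE MIN
# Prints "item count" for every item that occurs in at least MIN transactions,
# most frequent first. An item is counted at most once per transaction.
item_support() {
    awk -v min="$2" '
        {
            split("", seen)                      # clear the per-line set
            for (i = 1; i <= NF; i++)
                if (!($i in seen)) { seen[$i] = 1; count[$i]++ }
        }
        END {
            for (item in count)
                if (count[item] >= min) print item, count[item]
        }
    ' "$1" | sort -k2,2nr -k1,1n
}

item_support "$sample" 2
# The temporary file is left in place; remove it with: rm -f "$sample"
```

With the sample above and a minimum support of 2, item 2 comes out on top because it appears in all four transactions, and item 3 is dropped because it appears only once.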
February 16th, 2010 | Posted in Lucene, Mahout | 7 Comments
For those who live in the Triangle, I’ll be giving an intro talk on Mahout next Monday. See Welcome to the Triangle Java Users Group for more details. Do note that the location is no longer in RTP, but at the Red Hat campus at NCSU.
Hope to see you there!
February 9th, 2010 | Posted in Apache, Cary, Chapel Hill, Durham, Hadoop, Java, Mahout, North Carolina, Raleigh, Triangle | No Comments
It’s hardly news anymore that MS is dropping FAST support for *NIX platforms, but I especially like J Lawson’s nice little quote about their switch from FAST to Apache Solr/Lucene:
We very quickly switched to Apache Lucene, Solr and are very happy with the result. Performance is good, and we have cut hosting costs by a staggering 400%.
via J Lawson’s Blog Space » 2010 » February.
I’ve seen similar reductions (and sometimes more significant ones) in other replacements I’ve been involved with at Lucid Imagination. To me, the likely reason is the flexibility that Lucene/Solr provide for an application to model its domain as efficiently as needed: not too much, not too little, instead of the “kitchen sink” approach which turns on every feature by default. Too often, I think buyers are seduced by a really long feature list, even though their application may only need 80% of those features. After all, would you pay extra for a car with heated seats if you lived in the tropics? With Solr, you get rock solid search capabilities (no one disputes that, because all the vendors pretty much use the same model) and can easily turn on/off most of the other features. Furthermore, if you need something not included (most of the time you don’t, b/c it’s already there), you can choose the best-of-breed implementation of that feature and integrate it.
At any rate, as Shalin said the other day, Apache L/S welcomes all FAST *NIX users.
February 8th, 2010 | Posted in Apache, Lucene, Solr | 4 Comments
Props to Jay Hill on an excellent article on things to watch for when setting up Solr: Lucid Imagination » The Seven Deadly Sins of Solr.
January 22nd, 2010 | Posted in Lucid Imagination, Solr | No Comments
I just put up some initial info on the new Apache Lucene Connector Framework project that is now in ASF Incubation. See Lucid Imagination » Apache Lucene Connector Framework now in Incubation at the ASF.
January 20th, 2010 | Posted in Apache, Lucene, Lucene Connector Framework, Solr | No Comments
All you Mahouts out there might find some background help in Bradford Cross’ blog post: Measuring Measures: Learning About Statistical Learning.
January 16th, 2010 | Posted in Mahout | No Comments
My latest article is up at IBM’s developerWorks on spatial search with Lucene and Solr. Have a look at: Location-aware search with Apache Lucene and Solr.
January 12th, 2010 | Posted in Apache, Lucene, Solr | 2 Comments