Interview with Ian Holsman of Relegence | Enterprise Search support for Apache Lucene and Solr by Lucid Imagination.
Had a really great conversation with Ian about AOL/Relegence’s use of Solr. Topics range from technical details to business level discussions. Really cool to hear how Solr is powering a large amount of AOL’s search capabilities. Tens of millions of queries a day. High availability, etc. etc.
Check it out and give me some feedback.
June 30th, 2009 | Posted in Apache, Lucene, Solr | No Comments
Welcome to the Open Relevance Project!
I finally got around to putting the Open Relevance Project website up. Click the link above to check it out. Be forewarned, it is barebones. Patches welcome, although I suspect most of the work will be on the Wiki once the ASF infrastructure gets that set up.
June 25th, 2009 | Posted in Lucene, Open Relevance | No Comments
It’s been a while since I’ve said much about Mahout, but it seems like things are trending upwards. From a subjective standpoint, it just feels like there is more going on both on mahout-user@lucene.a.o and mahout-dev@lucene.a.o. It also feels like people are starting to kick the tires, which is vital. Furthermore, it seems like we’ve gotten a few more contributors of late, which is, of course, key for any open source project as well. The coming months will be really important for us existing contributors to make sure we help newcomers have a good experience and address their issues.
In looking at the mailing list subscriptions, the user list is over 300 subscribers now (I believe a few months ago it was just under 250) and the dev list is up a little bit, although not a lot. Of course, subscription rates come and go, but we’ve been pretty solid around these numbers for a while, which is a good sign that people are still interested, IMO.
From a personal standpoint, I’ve had a bit more time to work on it, which gets me jazzed up. Right now, I’m working on getting some examples up for clustering documents (see https://issues.apache.org/jira/browse/MAHOUT-126 and https://issues.apache.org/jira/browse/MAHOUT-65), collaborative filtering (using Taste in Mahout) and categorization. The clustering stuff is likely still pretty naive on my part, but it’s a start. The clustering and categorization work will also feed into my book Taming Text.
We also have several Google Summer of Code students involved, which is always enjoyable and a learning experience. I even got to meet the student I’m mentoring (David Hall) in person this year, which was pretty cool. David is implementing Latent Dirichlet Allocation on MapReduce for Mahout. I’m not sure I understand it all just yet, but I trust David will make it clear to me by the end of the summer.
Speaking of meeting people, at that same meetup I finally got to meet fellow committers Jeff Eastman and Ted Dunning. Always nice to put faces to names, having worked with them on Mahout now for well over a year.
June 16th, 2009 | Posted in Latent Dirichlet Allocation, Lucene, Mahout | No Comments
A short high level piece that I did with CIOZone.com on Lucene/Solr and Lucid is available at CIOZone.com – Professional Network for CIOs and IT Professionals.
June 12th, 2009 | Posted in Lucene, Lucid Imagination, Solr | No Comments
Just wanted to follow up on last night’s Lucene/Solr Meetup in San Francisco.
First off, special thanks to all the speakers (Jason Rutherglen, Michael Busch, Erik Hatcher and everyone who gave lightning talks). We had a lot of excellent talks ranging from low level Lucene details on payloads and real time search to high level discussions on new features in Solr and best practices for working on stopwords and relevance. We also had intros to Mahout, Tika and the new Open Relevance project at Lucene. I’ll post the slides on the Meetup site when they are available (I am still waiting to get them from the speakers).
Second, I really enjoyed engaging with so many people about what they are working on in Lucene/Solr. It is always fun to hear all the different ways people are (ab)using Lucene/Solr to do cool things, etc. It was especially good to meet some fellow Mahout committers (Ted Dunning and Jeff Eastman) for the first time, as well as one of Mahout’s Google Summer of Code students, David Hall, who is working on adding Latent Dirichlet Allocation.
Finally, I look forward to doing more of these. Right now, I’m looking for interest in Raleigh, NC, but I know we’ll likely have another one in the Bay Area again soon.
June 4th, 2009 | Posted in Droids, Hadoop, Java, Latent Dirichlet Allocation, Lucene, Lucid Imagination, Mahout, Open Relevance, Real Time Search, Solr, Tika, canopy clustering, machine learning, relevance | No Comments
Lucid Imagination » Filtered query performance increases for Solr 1.4.
Yonik has a good post on filtered query performance in Solr 1.4 that people might find interesting.
May 28th, 2009 | Posted in Solr | No Comments
SFBay Apache Lucene/Solr Meetup San Mateo, CA – Meetup.com.
Lucene/Solr Meetup, June 3
http://www.meetup.com/SFBay-Lucene-Solr-Meetup/
Join us for an evening of presentations and discussion on Lucene/Solr/Nutch/Mahout (and the rest of the Lucene ecosystem), the Apache Open Source Search Engine/Platform, featuring:
-Erik Hatcher, Apache Lucene/Solr PMC: Solr power your data: How to get up and running in 20 minutes or less
-Grant Ingersoll, Apache Lucene/Solr PMC: New in Apache Solr 1.4 — faster performance, better replication, and more
-Additional topics to be posted at the URL shortly
We’d also like to have 15 minute lightning talks where people present their uses of Lucene/Solr/Tika/Mahout/Nutch/Droids.
We’ll have some food and beverages.
RSVP — seats are limited — at http://www.meetup.com/SFBay-Lucene-Solr-Meetup/
Sponsored by: Lucid Imagination
Please email questions off list to talks@lucidimagination.com
May 23rd, 2009 | Posted in Lucene, Mahout, Nutch, Solr | No Comments
Copying TREC is the Wrong Track for the Enterprise | The Noisy Channel.
Daniel Tunkelang has written up an interesting post on the new Open Relevance Project that a few other Lucene people and I are starting up, and I thought I would respond here:
Little late to the conversation, but I think maybe we should back up a little bit. I like a lot of the comments and wish they were actually made on general@lucene.apache.org where we are discussing the merits of the undertaking (see http://www.lucidimagination.com/search/document/76d7cdeed4882397), not that I expect that to happen given the way blogs work. At any rate, I’d like to add my two cents as the one who started the thread on general@lucene.apache.org.
First off, the ORP is VERY early stage brainstorming. ORP really doesn’t warrant much attention at this point and it is premature to even speculate about how it relates to TREC, Google, Yahoo!, Microsoft or anything else. I’m not even sure it has enough support to be a viable Lucene subproject! For now, I think most of us who are actually working on the genesis of the project are merely looking for a means to improve Lucene (and also Solr, Nutch and Mahout), despite what Otis says in his blog post about having grander notions for comparing across engines.
So, to the background…
This (ORP) is something I’ve been thinking about for a long time now and have discussed with a number of people in the past. The motivation comes from my frustration over the years in not being able to obtain data that everyone on Lucene can use without limitations, since I’ve almost always worked in places that had little money to spend on this kind of thing.
See http://www.lucidimagination.com/search/document/656d5ca50c8c9242, http://lucene.grantingersoll.com/category/trec/ and http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/ for background. The second motivation is simply to have practical, real world data driven by actual users.
In the past, I have talked with both NIST and Sheffield to try to work out terms by which the Lucene community could obtain TREC resources, but the licensing terms simply prevent a totally free redistribution. (BTW, this is not NIST/Sheffield’s fault, but that of the company that allows them to use the data. NIST/Sheffield are doing the best they can given their constraints.) I have also talked with a few commercial companies that redistribute data (blogs, etc.), all to no avail (it’s usually the copyright that kills it). If the ASF were to buy the dataset, we could distribute it to the committers on Lucene according to the licensing terms, but not to the broader community, and we’d have to maintain a list of who has it, etc. See the terms here, for instance. Since many of the best ideas come from the community in Open Source and you never know when and where they come from, I deemed this unacceptable and decided not to pursue it even though the ASF authorized me to go forward with it (i.e. spend the money) if I wanted to. After all, it is only a few hundred bucks.
To me, it is vital that there be an open and FREE means for doing relevance tests that the Lucene community can use to improve itself. If others can benefit, so be it. Much like Lucene developed a benchmarking tool for people to share performance tests (both speed and relevance) in a straightforward way (see the contrib/benchmark section of the Lucene distribution), so too is there a need for us (speaking, unofficially, for Lucene) to talk about relevance in a public way so we can compare notes just as any two researchers buried in the bowels of a commercial company might compare notes. Many, many people have used Lucene to do TREC (in fact, I have), but it is a showstopper when the other person you are discussing relevance with can’t just pick up the exact same bits (corpus, queries, judgments) and run the exact same tests. In other words, the goal is not to compare competing offerings, IMO (although it will likely happen b/c that is human nature); it is to give Lucene users a common way of evaluating and talking about relevance.
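To make the "same bits, same tests" idea a bit more concrete, here is a minimal sketch of what two people could both run once they share queries and judgments. It computes precision at k and mean average precision, two standard IR measures. The query IDs, document IDs and function names are made up for illustration; this is not ORP code, nor the contrib/benchmark API.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are judged relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           judgments: Dict[str, Set[str]]) -> float:
    """Average of average_precision over every judged query."""
    scores = [average_precision(runs.get(query, []), relevant)
              for query, relevant in judgments.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical shared judgments: query ID -> set of relevant document IDs.
judgments = {"q1": {"d1", "d3"}, "q2": {"d2"}}

# Hypothetical ranked results from one engine configuration.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d2"]}

map_score = mean_average_precision(runs, judgments)
p_at_2 = precision_at_k(runs["q1"], judgments["q1"], 2)
print(f"MAP={map_score:.4f}  P@2(q1)={p_at_2:.2f}")
```

With a shared `judgments` file, anyone can swap in their own `runs` and get numbers that are directly comparable, which is the whole point.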
As anyone familiar with Lucene knows, ORP will be driven by the people that show up and volunteer to contribute to it, as are all Apache projects. Thus, the slate really is clean. If anyone (and I truly mean anyone, not just Lucene users, even though that is the preliminary focus) is interested, please show up and discuss over at general@lucene.apache.org. We’d welcome the ideas and, moreover, any efforts.
May 18th, 2009 | Posted in Apache, Lucene, Mahout, Open Relevance, Performance, Solr, machine learning, relevance | 2 Comments
Lucid Imagination » Exploring Lucene and Solr’s TrieRange Capabilities.
Post title says it all. Just ran a few simple experiments with the TrieRange capabilities in Lucene and Solr. Pretty cool stuff and much needed.
May 13th, 2009 | Posted in Lucene, Solr | No Comments
Lucid Imagination » Lucene/Solr Meetup / May 20th, Reston VA, 6-8:30 pm.
FYI, Lucid is sponsoring an Apache Lucene/Solr meetup on May 20th in Reston VA.
May 11th, 2009 | Posted in Apache, Lucene, Solr | No Comments