<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Opening up Academic Research on IR and Machine Learning</title>
	<atom:link href="http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/</link>
	<description>Thoughts on Apache Lucene, Mahout, Solr, Tika and Nutch</description>
	<lastBuildDate>Wed, 24 Feb 2010 13:22:54 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Copying TREC is the Wrong Track for the Enterprise &#124; The Noisy Channel</title>
		<link>http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/comment-page-1/#comment-6531</link>
		<dc:creator>Copying TREC is the Wrong Track for the Enterprise &#124; The Noisy Channel</dc:creator>
		<pubDate>Tue, 19 May 2009 03:23:20 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=110#comment-6531</guid>
		<description>[...] http://lucene.grantingersoll.com/category/trec/ and http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/ for background.  The second motivation is simply to have practical, real world data driven by [...]</description>
		<content:encoded><![CDATA[<p>[...] <a href="http://lucene.grantingersoll.com/category/trec/" rel="nofollow">http://lucene.grantingersoll.com/category/trec/</a> and <a href="http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/" rel="nofollow">http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/</a> for background.  The second motivation is simply to have practical, real world data driven by [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hannes Carl Meyer</title>
		<link>http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/comment-page-1/#comment-6131</link>
		<dc:creator>Hannes Carl Meyer</dc:creator>
		<pubDate>Sat, 20 Sep 2008 12:47:34 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=110#comment-6131</guid>
		<description>P.S.: Just bought your book and looked into the early access papers, congratulations!</description>
		<content:encoded><![CDATA[<p>P.S.: Just bought your book and looked into the early access papers, congratulations!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hannes Carl Meyer</title>
		<link>http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/comment-page-1/#comment-6130</link>
		<dc:creator>Hannes Carl Meyer</dc:creator>
		<pubDate>Sat, 20 Sep 2008 11:59:41 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=110#comment-6130</guid>
		<description>I haven&#039;t done any research on Mahout yet, but I think the Apache UIMA framework is a good example of how to provide a common platform.

We started development on a niche piece of text processing software 5 years ago. That&#039;s when we got into IBM&#039;s UIMA framework and integrated it into our software right from the beginning.
Besides integrating our own text analysis methods, we were looking a lot into university paper publications, which were sometimes hard to understand or not practical at all.

This changed a couple of years ago when I saw more and more folks at universities starting to work with UIMA as a platform. They even started exchanging components with each other. At a conference about UIMA last year I saw 8 presentations of different UIMA projects, 6 of them from universities and 2 from industry (one from the Netherlands, one from Germany).

Over here in Germany, it looks like university research in text processing is already ahead of industry, and industry has to get involved to benefit from that research.

I will do my Mahout homework this weekend.</description>
		<content:encoded><![CDATA[<p>I haven&#8217;t done any research on Mahout yet, but I think the Apache UIMA framework is a good example of how to provide a common platform.</p>
<p>We started development on a niche piece of text processing software 5 years ago. That&#8217;s when we got into IBM&#8217;s UIMA framework and integrated it into our software right from the beginning.<br />
Besides integrating our own text analysis methods, we were looking a lot into university paper publications, which were sometimes hard to understand or not practical at all.</p>
<p>This changed a couple of years ago when I saw more and more folks at universities starting to work with UIMA as a platform. They even started exchanging components with each other. At a conference about UIMA last year I saw 8 presentations of different UIMA projects, 6 of them from universities and 2 from industry (one from the Netherlands, one from Germany).</p>
<p>Over here in Germany, it looks like university research in text processing is already ahead of industry, and industry has to get involved to benefit from that research.</p>
<p>I will do my Mahout homework this weekend.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: grant_ingersoll</title>
		<link>http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/comment-page-1/#comment-6129</link>
		<dc:creator>grant_ingersoll</dc:creator>
		<pubDate>Fri, 19 Sep 2008 19:18:24 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=110#comment-6129</guid>
		<description>Yeah, I totally agree with your points, Bob.  The data problem is particularly real.  It&#039;s interesting, because if I were a physicist, I would be required to publish the steps I took so that others could reproduce the experiment.  In our world, we just put up a formula or two, maybe a nice description and some pseudocode, and then wave our hands and magically we get a nice, pretty, well-formatted table of results showing a 10% increase in mean average precision or some great F-measure.  Pay no attention to the man behind the curtain.

Sharing the code and configuration is a good starting point.</description>
		<content:encoded><![CDATA[<p>Yeah, I totally agree with your points, Bob.  The data problem is particularly real.  It&#8217;s interesting, because if I were a physicist, I would be required to publish the steps I took so that others could reproduce the experiment.  In our world, we just put up a formula or two, maybe a nice description and some pseudocode, and then wave our hands and magically we get a nice, pretty, well-formatted table of results showing a 10% increase in mean average precision or some great F-measure.  Pay no attention to the man behind the curtain.</p>
<p>Sharing the code and configuration is a good starting point.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bob Carpenter</title>
		<link>http://lucene.grantingersoll.com/2008/09/18/opening-up-academic-research-on-ir-and-machine-learning/comment-page-1/#comment-6128</link>
		<dc:creator>Bob Carpenter</dc:creator>
		<pubDate>Fri, 19 Sep 2008 16:39:14 +0000</pubDate>
		<guid isPermaLink="false">http://lucene.grantingersoll.com/?p=110#comment-6128</guid>
		<description>The tools that are easy to use get widely re-used.  For instance, Adwait Ratnaparkhi&#039;s POS tagger or Collins&#039; parser or the Stanford NE tagger, which present usable tools out of the box that run from the command line.  Then there are tools like Joachims&#039; SVMLight or Bottou&#039;s SGD that are easy to use from the command line.

There&#039;s a big problem in recreating all the features, because papers usually only contain sketches of what&#039;s used.  And that&#039;s where the accuracy comes from on actual tasks.  And there&#039;s also a big problem recreating heuristic pre- and post-processing.  And then there&#039;s patching together whole pipelines using instances of various resources like Wikipedia, WordNet, and so on.

The big problem for sharing is scale.  If I run over a terabyte of web data, or even over 50 GB, how do I share that?  TREC has done things like distribute disk drives on their &quot;large&quot; scale tasks, which aren&#039;t even large scale these days.</description>
		<content:encoded><![CDATA[<p>The tools that are easy to use get widely re-used.  For instance, Adwait Ratnaparkhi&#8217;s POS tagger or Collins&#8217; parser or the Stanford NE tagger, which present usable tools out of the box that run from the command line.  Then there are tools like Joachims&#8217; SVMLight or Bottou&#8217;s SGD that are easy to use from the command line.</p>
<p>There&#8217;s a big problem in recreating all the features, because papers usually only contain sketches of what&#8217;s used.  And that&#8217;s where the accuracy comes from on actual tasks.  And there&#8217;s also a big problem recreating heuristic pre- and post-processing.  And then there&#8217;s patching together whole pipelines using instances of various resources like Wikipedia, WordNet, and so on.</p>
<p>The big problem for sharing is scale.  If I run over a terabyte of web data, or even over 50 GB, how do I share that?  TREC has done things like distribute disk drives on their &#8220;large&#8221; scale tasks, which aren&#8217;t even large scale these days.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
