Roundup Tracker - Issues

Message4058

Author jvstein
Recipients ThomasAH, ber, jvstein, olly
Date 2010-05-10.22:15:47
Message-id <1273529748.4.0.476912199405.issue2550583@psf.upfronthosting.co.za>
In-reply-to
Bernhard,

From what I understand, Roundup uses the Porter2 stemming algorithm exposed by Xapian.
    http://snowball.tartarus.org/algorithms/english/stemmer.html

The original Porter algorithm requires lowercase input. See the reference implementations
here:
    http://tartarus.org/~martin/PorterStemmer/

The only mention of this I found in the Xapian documentation is on their intro page, and it is
hardly prescriptive (http://xapian.org/docs/intro_ir.html).
   "Usually they are converted to lower case, and often a stemming algorithm is applied"

The problem is that stemming doesn't work properly on input that is not lowercase. "Silently"
should stem to "silent", not "SILENTLi". A search for "silently" should return pages that contain
the word "silent", and vice versa.
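
To illustrate why case matters, here is a heavily reduced sketch of two Porter2 rules (step 1c and the "li" rule from step 2). It is not the full algorithm and not Xapian's code: `toy_stem` and the constant names are hypothetical, and the R1 region check is left out. It only assumes that the suffix and letter tables are lowercase, as in the published algorithm.

```python
# Toy model of two Porter2 rules; NOT the full stemmer and NOT Xapian's code.
# toy_stem, VOWELS and VALID_LI_ENDINGS are hypothetical names for illustration.
VOWELS = set("aeiouy")                # the algorithm lists vowels in lowercase only
VALID_LI_ENDINGS = set("cdeghkmnrt")  # letters that may precede a removable "li"


def toy_stem(word):
    # Step 1c: a final y/Y becomes "i" after a non-vowel that is not the first letter.
    if len(word) > 2 and word[-1] in "yY" and word[-2] not in VOWELS:
        word = word[:-1] + "i"
    # Step 2 (one rule only, R1 check omitted): drop "li" after a valid li-ending.
    if word.endswith("li") and len(word) > 2 and word[-3] in VALID_LI_ENDINGS:
        word = word[:-2]
    return word


print(toy_stem("silently"))          # silent
print(toy_stem("SILENTLY"))          # SILENTLi
print(toy_stem("SILENTLY".lower()))  # silent
```

With uppercase input the "Y" rule still fires, but the lowercase "li" rule no longer matches, which would explain the odd "SILENTLi" term.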

A simple test would be to index a document containing the word "silently" and ensure that a 
search on the term "silent" returns the same document.
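
Roughly, such a test could look like the sketch below. It uses a toy in-memory index rather than the Roundup or Xapian APIs: `ToyIndex` and `strip_ly` are hypothetical stand-ins, and the stand-in stemmer only knows the lowercase "ly" suffix. The point is just that terms are lowercased before stemming, both when indexing and when searching.

```python
# Sketch of the proposed test with a toy in-memory index.
# ToyIndex and strip_ly are hypothetical stand-ins, not Roundup or Xapian APIs.
from collections import defaultdict


def strip_ly(word):
    # Stand-in stemmer: like the real one, it only recognises lowercase suffixes.
    return word[:-2] if word.endswith("ly") else word


class ToyIndex:
    def __init__(self, stem):
        self.stem = stem
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def normalize(self, word):
        # Lowercase first, then stem, so index terms and query terms agree.
        return self.stem(word.lower())

    def add(self, doc_id, text):
        for word in text.split():
            self.postings[self.normalize(word)].add(doc_id)

    def search(self, word):
        return set(self.postings.get(self.normalize(word), set()))


index = ToyIndex(strip_ly)
index.add("doc1", "The process failed Silently")
assert index.search("silent") == {"doc1"}
assert index.search("silently") == {"doc1"}
```

A real test would do the same through the Xapian indexer backend instead of `ToyIndex`.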

--Jeff
History
Date                 User     Action  Args
2010-05-10 22:15:48  jvstein  set     messageid: <1273529748.4.0.476912199405.issue2550583@psf.upfronthosting.co.za>
2010-05-10 22:15:48  jvstein  set     recipients: + jvstein, ber, ThomasAH, olly
2010-05-10 22:15:48  jvstein  link    issue2550583 messages
2010-05-10 22:15:47  jvstein  create