Roundup Tracker - Issues

Message3864

Author ThomasAH
Recipients ThomasAH, ber
Date 2009-09-01.15:26:05
Message-id <1251818768.07.0.621541080617.issue2550584@psf.upfronthosting.co.za>
In-reply-to
In roundup-1.4.9 and current SVN, the regular expression that finds all
words to be indexed is \b\w{2,25}\b, but the comments and the find code
use 3 as the lower limit.
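
A minimal sketch of the mismatch, assuming a simplified find step that only
checks word length. The pattern is the one quoted above; FIND_MIN_LENGTH,
words_indexed and words_searched are hypothetical names, not Roundup's actual
code.

```python
import re

# Pattern quoted in the report: words of 2 to 25 characters get indexed.
INDEX_WORD_RE = re.compile(r"\b\w{2,25}\b")

# Hypothetical stand-in for the lower limit used by the comments and find code.
FIND_MIN_LENGTH = 3


def words_indexed(text):
    """Words the indexing pattern would pick up."""
    return INDEX_WORD_RE.findall(text)


def words_searched(query):
    """Words a find step with the stricter lower limit would keep."""
    return [w for w in query.split() if len(w) >= FIND_MIN_LENGTH]


text = "Windows XP on HP UX"
indexed = words_indexed(text)
searched = words_searched(text)

# Two-letter words are stored in the index but can never be searched for.
unreachable = [w for w in indexed if w not in searched]
```

With this input, every two-letter word ends up in `unreachable`: it occupies
space in the index without ever being usable in a search.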

indexer_dbm.py and indexer_rdbms.py do not filter stopwords from the
list of words to be searched. Searching for "foo with bar" will
therefore never find anything, because WITH is in the default STOPWORDS
and so is never indexed.
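
The effect can be reproduced with a toy index. This is only a sketch under
stated assumptions: the STOPWORDS set below contains just the one word named
above, and build_index and find are hypothetical helpers, not the functions in
indexer_dbm.py or indexer_rdbms.py.

```python
# Assumed subset of the default stop words; only WITH is named in the report.
STOPWORDS = {"WITH"}


def build_index(documents):
    """Map each non-stop word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.upper().split():
            if word in STOPWORDS:
                continue  # stop words never enter the index
            index.setdefault(word, set()).add(doc_id)
    return index


def find(index, query, filter_stopwords):
    """Return the ids of documents that contain every search word."""
    words = query.upper().split()
    if filter_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    if not words:
        return set()
    result = None
    for word in words:
        ids = index.get(word, set())
        result = set(ids) if result is None else result & ids
    return result


docs = {1: "foo with bar", 2: "foo only"}
index = build_index(docs)

# Without filtering, WITH matches nothing and empties the intersection.
unfiltered = find(index, "foo with bar", filter_stopwords=False)
# With filtering, only FOO and BAR are looked up.
filtered = find(index, "foo with bar", filter_stopwords=True)
```

Here `unfiltered` is empty even though document 1 contains the exact phrase,
while `filtered` finds it.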

The attached patch (against SVN; it also works against 1.4.9) defines the
minimum and maximum length of words to be indexed in one source location
and uses these limits consistently when generating and when using the
index. This could easily be extended to a config option in a separate
patch, if anyone wants that.
Additionally, the patch consistently applies the stopwords when finding
(the xapian find already did this).
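
One way such a single source location might look is sketched below. All
identifiers (MIN_WORD_LENGTH, MAX_WORD_LENGTH, usable_words) are hypothetical,
and the STOPWORDS set is an assumed one-word subset; the patch itself may be
structured differently.

```python
import re

# Hypothetical names: the limits live in exactly one place.
MIN_WORD_LENGTH = 2
MAX_WORD_LENGTH = 25

# Assumed subset of the default stop words, for illustration only.
STOPWORDS = {"WITH"}

# The pattern is derived from the constants instead of hard-coding 2 and 25.
WORD_RE = re.compile(r"\b\w{%d,%d}\b" % (MIN_WORD_LENGTH, MAX_WORD_LENGTH))


def usable_words(text):
    """Shared by indexing and finding, so both apply identical rules."""
    return [w for w in WORD_RE.findall(text.upper()) if w not in STOPWORDS]


index_terms = usable_words("Windows XP with HP UX")
query_terms = usable_words("hp ux")
```

Because indexing and finding call the same helper, changing a limit or the
stop word set cannot make the two sides disagree.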

I chose 2 as the minimum word length for two reasons:
1. Existing indexes will already have words of this length included.
   (In one of our trackers there are about 50000 entries with two
   letters and about 700000 entries with more letters; about 150000
   entries would be added when using an empty STOPWORDS set.)
2. Searching for two-letter words could really be useful, e.g. for
   search terms like "HP UX" or "Windows XP".
History
Date                 User      Action  Args
2009-09-01 15:26:08  ThomasAH  set     messageid: <1251818768.07.0.621541080617.issue2550584@psf.upfronthosting.co.za>
2009-09-01 15:26:08  ThomasAH  set     recipients: + ThomasAH, ber
2009-09-01 15:26:07  ThomasAH  link    issue2550584 messages
2009-09-01 15:26:06  ThomasAH  create