Issue 2550584
Created on 2009-09-01 15:26 by ThomasAH, last changed 2009-09-11 15:56 by ber.
msg3864 |
Author: [hidden] (ThomasAH) |
Date: 2009-09-01 15:26 |
|
In roundup-1.4.9 and current SVN, the regular expression \b\w{2,25}\b
is used to find all words to be indexed, but the comments and the
find code use 3 as the lower limit.
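A minimal sketch of the mismatch, not the actual roundup code: the pattern is the one quoted above, while FIND_MIN_LENGTH is a hypothetical stand-in for the lower limit used on the find side.

```python
import re

# The indexing pattern quoted in this report: words of 2 to 25 characters.
index_pattern = re.compile(r'\b\w{2,25}\b')

# Hypothetical name for the lower limit that the find code applies.
FIND_MIN_LENGTH = 3

text = "HP UX on Windows XP"

# Words that end up in the index (upper-cased here for illustration).
indexed = [w.upper() for w in index_pattern.findall(text)]

# Words that a find with the stricter limit would still accept.
searchable = [w for w in indexed if len(w) >= FIND_MIN_LENGTH]

print(indexed)     # two-letter words are indexed
print(searchable)  # but only longer words can be searched for
```

With these assumptions, two-letter words are stored in the index but can never be matched by a search.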
indexer_dbm.py and indexer_rdbms.py do not filter stopwords from the
list of words to be searched, therefore searching for "foo with bar"
will never find anything, because WITH is in the STOPWORDS default.
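The failure can be reproduced with a toy in-memory index. This is only an illustration under assumptions: build_index and find_all_words are hypothetical helpers, not functions from indexer_dbm.py or indexer_rdbms.py, and the STOPWORDS set here contains just the one word named above.

```python
# Hypothetical subset of the stopword list; only WITH is confirmed above.
STOPWORDS = {"WITH"}

def build_index(documents):
    """Map each non-stopword to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.upper().split():
            if word in STOPWORDS:
                continue  # stopwords are never stored in the index
            index.setdefault(word, set()).add(doc_id)
    return index

def find_all_words(index, query):
    """Return ids of documents containing every query word (no stopword filter)."""
    result = None
    for word in query.upper().split():
        hits = index.get(word, set())
        result = hits if result is None else result & hits
    return result or set()

index = build_index({1: "foo with bar", 2: "foo only"})

# WITH is not in the index, so the intersection is always empty.
print(find_all_words(index, "foo with bar"))
# Dropping the stopword from the query finds the document again.
print(find_all_words(index, "foo bar"))
```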
The attached patch (against SVN, works against 1.4.9, too) makes minimum
and maximum length of words to be indexed easily changeable in one
source location, which could easily be extended to a config option if
anyone wants (in a separate patch), and consistently uses this while
generating and using the index.
Additionally it consistently uses the stopwords when finding (the xapian
find already did this).
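The idea of the patch can be sketched roughly as follows. All names here (MIN_WORD_LENGTH, MAX_WORD_LENGTH, is_indexable, words_to_index, words_to_find) are hypothetical and chosen for illustration; the attached patch is the authoritative version.

```python
import re

# Single source location for the limits; both indexing and finding use them.
MIN_WORD_LENGTH = 2
MAX_WORD_LENGTH = 25

# Hypothetical subset of the stopword list.
STOPWORDS = {"WITH"}

# Pattern built from the same limits, so it cannot drift apart from them.
WORD_PATTERN = re.compile(r'\b\w{%d,%d}\b' % (MIN_WORD_LENGTH, MAX_WORD_LENGTH))

def is_indexable(word):
    """Apply the same length and stopword rules everywhere."""
    return (MIN_WORD_LENGTH <= len(word) <= MAX_WORD_LENGTH
            and word.upper() not in STOPWORDS)

def words_to_index(text):
    """Words that would be written to the index."""
    return [w.upper() for w in WORD_PATTERN.findall(text) if is_indexable(w)]

def words_to_find(query_words):
    """Words that would actually be looked up when searching."""
    return [w.upper() for w in query_words if is_indexable(w)]

print(words_to_index("foo with bar on HP UX"))
print(words_to_find(["foo", "with", "bar", "x"]))
```

Because both paths go through is_indexable, a later config option would only need to change the two constants.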
I chose 2 as the minimum word length for two reasons:
1. Existing indexes will already have words of this length included.
(in one of our trackers there are about 50000 entries with two
letters, about 700000 entries with more letters and about 150000 entries
would be added when using an empty STOPWORDS set)
2. Searching for two-letter words could really be useful, e.g. for
search terms like "HP UX" or "Windows XP".
|
msg3876 |
Author: [hidden] (ber) |
Date: 2009-09-11 15:56 |
|
Enhanced test_indexer.py to at least trigger each problem once,
committed with revision 4355.
Then committed an improved fix with revision 4356.
|
|
Date |
User |
Action |
Args |
2009-09-11 15:56:44 | ber | set | status: new -> closed resolution: fixed messages: + msg3876 |
2009-09-11 14:42:10 | ber | set | priority: urgent assignee: ber |
2009-09-01 15:26:07 | ThomasAH | create | |
|