Issue 1046612
Created on 2004-10-13 21:26 by richard, last changed 2004-10-14 06:03 by a1s.
msg1480 |
Author: [hidden] (richard) |
Date: 2004-10-13 21:26 |
|
The full-text indexer doesn't have any stopwords. Don't
know how I could have forgotten them. The following
words should never be indexed (from ZCTextIndex, which
takes its words from Lucene):
_words = [
"a", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
]
(not that 2-char words are ever indexed anyway).
Also, for the roundup project, "roundup" should be in the
stopwords list.
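
A minimal sketch of how such a list could be applied at indexing time. The helper name `index_terms`, the tokenizing regex, and the choice to upper-case stored words are assumptions made for illustration, not the indexer's actual code:

```python
import re

# Stopword list quoted above (from ZCTextIndex, which takes it from Lucene).
STOPWORDS = frozenset([
    "a", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
])


def index_terms(text, stopwords=STOPWORDS):
    """Return the words of `text` that would be indexed (hypothetical helper)."""
    # Assumption: a word is a run of ASCII letters or digits.
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Compare case-insensitively; store upper-cased (also an assumption).
    return [w.upper() for w in words if w.lower() not in stopwords]


sample = "The indexer is not aware of stopwords"
print(index_terms(sample))
# A tracker-specific word such as "roundup" can be added to the set:
print(index_terms("Roundup is a tracker", STOPWORDS | {"roundup"}))
```

Using a `frozenset` keeps the membership test constant-time, which matters when every word of every message passes through the filter.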
|
msg1481 |
Author: [hidden] (jlgijsbers) |
Date: 2004-10-13 22:16 |
|
On the pydotorg database with all SF issues imported:
SELECT count(*) FROM __words WHERE _word IN ('A', 'AND',
'ARE', 'AS', 'AT', 'BE', 'BUT', 'BY', 'FOR', 'IF', 'IN',
'INTO', 'IS', 'IT', 'NO', 'NOT', 'OF', 'ON', 'OR', 'SUCH',
'THAT', 'THE', 'THEIR', 'THEN', 'THERE', 'THESE', 'THEY',
'THIS', 'TO', 'WAS', 'WILL', 'WITH')
gets me
count
--------
222001
(1 row)
So this would eliminate about 10% of the __words table. Nice!
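
The same measurement can be sketched in Python. This is only an illustration: it uses an in-memory SQLite database with toy rows so that it runs anywhere, keeps the `__words` table and `_word` column names from the query above, and the helper name `stopword_share` is hypothetical.

```python
import sqlite3

# Upper-cased copy of the stopword list used in the query above.
STOPWORDS = [
    "A", "AND", "ARE", "AS", "AT", "BE", "BUT", "BY", "FOR", "IF", "IN",
    "INTO", "IS", "IT", "NO", "NOT", "OF", "ON", "OR", "SUCH",
    "THAT", "THE", "THEIR", "THEN", "THERE", "THESE", "THEY",
    "THIS", "TO", "WAS", "WILL", "WITH",
]


def stopword_share(conn, stopwords):
    """Return the fraction of __words rows whose _word is a stopword."""
    # Build one "?" placeholder per stopword so the values are bound safely.
    placeholders = ",".join("?" for _ in stopwords)
    hits = conn.execute(
        f"SELECT count(*) FROM __words WHERE _word IN ({placeholders})",
        list(stopwords),
    ).fetchone()[0]
    total = conn.execute("SELECT count(*) FROM __words").fetchone()[0]
    return hits / total if total else 0.0


# Toy data standing in for the real table (assumption: one row per occurrence).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE __words (_word TEXT)")
conn.executemany(
    "INSERT INTO __words VALUES (?)",
    [("THE",), ("PATCH",), ("TO",), ("PYTHON",), ("OF",)],
)
share = stopword_share(conn, STOPWORDS)
print(f"{share:.0%} of rows are stopwords")
```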
|
msg1482 |
Author: [hidden] (a1s) |
Date: 2004-10-14 06:03 |
|
Please keep in mind that the set of stopwords is NLS-dependent: each language needs its own list.
|
msg1483 |
Author: [hidden] (richard) |
Date: 2004-10-14 08:45 |
|
The indexer itself is very much ASCII-limited at the moment - see
bug
https://sourceforge.net/tracker/index.php?func=detail&aid=780733&group_id=31577&atid=402788
for another of its limitations in this area.
These stopwords will be a good start.
Johannes - what's your count for "python"? Just thinking that we
should allow the tracker config to define additional stopwords
specific to the tracker. Actually, given the loading you've done,
I'd be interested to see a "select word,count(*) ... group by word"
output, sorted by the count... there might be other words we can
ignore too...
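
One way to spell out that frequency query, sketched against an in-memory SQLite table with made-up rows. The `__words` and `_word` names come from the query earlier in this issue; the helper `top_words` and the sample data are assumptions.

```python
import sqlite3


def top_words(conn, limit=20):
    """Return (word, count) pairs, most frequent first (hypothetical helper)."""
    return conn.execute(
        "SELECT _word, count(*) AS n FROM __words "
        "GROUP BY _word ORDER BY n DESC, _word LIMIT ?",
        (limit,),
    ).fetchall()


# Toy rows; a real tracker database would hold one row per indexed occurrence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE __words (_word TEXT)")
conn.executemany(
    "INSERT INTO __words VALUES (?)",
    [("TO",), ("PYTHON",), ("TO",), ("PATCH",), ("PYTHON",), ("TO",)],
)
for word, count in top_words(conn, limit=3):
    print(f"{count:6d} | {word}")
```

Words near the top that carry no meaning for a given tracker are the candidates for a tracker-specific stopword setting.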
|
msg1484 |
Author: [hidden] (richard) |
Date: 2004-10-14 09:07 |
|
Just out of curiosity, I poked at our database, which has
154524 words in it. The top 20 words are:
1948 | TO
1597 | AND
1567 | OF
1514 | APPLICATION
1482 | IN
1481 | COM
1083 | MSWORD
1073 | FOR
1071 | IS
1065 | AU
1047 | ON
963 | DOC
913 | BE
908 | COMMONGROUND
880 | IT
792 | WE
696 | AT
694 | AS
642 | HAVE
639 | PDF
(note of course that my statement about 2-char words being
discarded is crap - it's *1* char and >25 char words that are
discarded)
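
The length rule as described in that note, combined with a stopword check, might look like the sketch below. The constants and the `is_indexable` helper are hypothetical names; the bounds are taken from the note, not checked against the indexer source.

```python
MIN_LEN = 2   # 1-char words are discarded
MAX_LEN = 25  # words longer than 25 chars are discarded


def is_indexable(word, stopwords=frozenset()):
    """Return True if `word` passes the length rule and is not a stopword."""
    if not MIN_LEN <= len(word) <= MAX_LEN:
        return False
    # Assumption: stopwords are stored upper-cased, like the words listed above.
    return word.upper() not in stopwords


# Two-letter words such as "to" pass the length rule on their own,
# which is why they top the list until a stopword set is applied.
print(is_indexable("to"))
print(is_indexable("to", stopwords=frozenset({"TO", "OF", "IN"})))
```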
|
msg1485 |
Author: [hidden] (jlgijsbers) |
Date: 2004-10-14 09:20 |
|
The top 20 words on my database are:
    _word    | count
-------------+-------
 TO          | 21574
 IN          | 18284
 IS          | 17096
 IT          | 16271
 OF          | 14541
 FOR         | 14351
 AND         | 14346
 USER        | 11894
 SOURCEFORGE | 11692
 BE          | 11637
 ON          | 11147
 VALUE       | 10203
 IF          | 10093
 DATE        |  9798
 PYTHON      |  9511
 NOT         |  9420
 PATCH       |  8551
 100         |  8426
 OLD         |  8048
 YOU         |  7933
|
Date | User | Action | Args
2004-10-13 21:26:51 | richard | create |