Roundup Tracker - Issues

Issue 1046612

Classification
Title: Stopwords in full-text indexer
Type:                     Severity: normal
Components: Database      Versions:
Process
Status: closed fixed
Assigned To: richard      Nosy List: a1s, jlgijsbers, richard
Priority: urgent

Created on 2004-10-13 21:26 by richard, last changed 2004-10-14 06:03 by a1s.

Messages
msg1480 Author: [hidden] (richard) Date: 2004-10-13 21:26
The full-text indexer doesn't have any stopwords. Don't 
know how I could have forgotten them. The following 
words should never be indexed (from ZCTextIndex, which 
takes its words from Lucene): 
 
_words = [ 
    "a", "and", "are", "as", "at", "be", "but", "by", 
    "for", "if", "in", "into", "is", "it", 
    "no", "not", "of", "on", "or", "such", 
    "that", "the", "their", "then", "there", "these", 
    "they", "this", "to", "was", "will", "with" 
] 
 
(not that 2-char words are ever indexed anyway). 
 
Also, for the roundup project, "roundup" should be in the 
stopwords list. 
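
The filtering proposed here can be sketched as follows. This is a minimal illustration, not Roundup's actual indexer API: the names `STOPWORDS`, `split_words`, and the `extra_stopwords` parameter are all hypothetical, and the word-splitting regex is an assumption.

```python
import re

# The stopword list proposed above (from ZCTextIndex, via Lucene).
STOPWORDS = {
    "a", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
}

def split_words(text, extra_stopwords=()):
    """Split text into uppercase index words, dropping stopwords.

    extra_stopwords is a hypothetical hook for per-tracker words,
    e.g. "roundup" for the roundup project's own tracker.
    """
    skip = STOPWORDS | {w.lower() for w in extra_stopwords}
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [w.upper() for w in words if w.lower() not in skip]

print(split_words("The roundup indexer is fast", extra_stopwords=["roundup"]))
# ['INDEXER', 'FAST']
```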
 
msg1481 Author: [hidden] (jlgijsbers) Date: 2004-10-13 22:16
On the pydotorg database with all SF issues imported:

SELECT count(*) FROM __words WHERE _word IN ('A', 'AND',
'ARE', 'AS', 'AT', 'BE', 'BUT', 'BY', 'FOR', 'IF', 'IN',
'INTO', 'IS', 'IT', 'NO', 'NOT', 'OF', 'ON', 'OR', 'SUCH',
'THAT', 'THE', 'THEIR', 'THEN', 'THERE', 'THESE', 'THEY',
'THIS', 'TO', 'WAS', 'WILL', 'WITH')

gets me

 count
--------
 222001
(1 row)

So this would eliminate about 10% of the __words table. Nice!
msg1482 Author: [hidden] (a1s) Date: 2004-10-14 06:03
Please keep in mind that the set of stopwords is NLS-dependent (it varies with the language/locale).
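
The locale dependence raised here could be handled with per-locale stopword sets. A hypothetical sketch, assuming a simple locale-to-set lookup; the German words and the fallback scheme are illustrative only:

```python
# Hypothetical per-locale stopword tables; entries are examples.
STOPWORDS_BY_LOCALE = {
    "en": {"the", "and", "of", "to", "is"},
    "de": {"der", "die", "das", "und", "ist"},
}

def stopwords_for(locale):
    """Return the stopword set for a locale, falling back to English."""
    return STOPWORDS_BY_LOCALE.get(locale, STOPWORDS_BY_LOCALE["en"])

print(sorted(stopwords_for("de")))
# ['das', 'der', 'die', 'ist', 'und']
```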
msg1483 Author: [hidden] (richard) Date: 2004-10-14 08:45
The indexer itself is very much ASCII-limited at the moment - see 
bug 
https://sourceforge.net/tracker/index.php?func=detail&aid=780733&group_id=31577&atid=402788 
for another of its limitations in this area. 
 
These stopwords will be a good start. 
 
Johannes - what's your count for "python"? Just thinking that we 
should allow the tracker config to define additional stopwords 
specific to the tracker. Actually, given the loading you've done, 
I'd be interested to see a "select word,count(*) ... group by word" 
output, sorted by the count... there might be other words we can 
ignore too... 
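
The group-by query sketched in the message above might be run like this. A hedged sketch against a toy in-memory copy of the __words table; only the _word column (seen in the earlier query) is assumed, the rest of the schema is not:

```python
import sqlite3

# Toy stand-in for the tracker's __words table: one row per
# indexed word occurrence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE __words (_word TEXT)")
conn.executemany(
    "INSERT INTO __words (_word) VALUES (?)",
    [("TO",), ("TO",), ("TO",), ("PYTHON",), ("PYTHON",), ("PATCH",)],
)

# The "select word, count(*) ... group by word" idea, sorted by count.
rows = conn.execute(
    "SELECT _word, count(*) AS n FROM __words "
    "GROUP BY _word ORDER BY n DESC"
).fetchall()
for word, n in rows:
    print(word, n)
# TO 3
# PYTHON 2
# PATCH 1
```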
msg1484 Author: [hidden] (richard) Date: 2004-10-14 09:07
Just out of curiosity, I poked at our database, which has 
154524 words in it. The top 20 words are: 
 
  1948 | TO 
  1597 | AND 
  1567 | OF 
  1514 | APPLICATION 
  1482 | IN 
  1481 | COM 
  1083 | MSWORD 
  1073 | FOR 
  1071 | IS 
  1065 | AU 
  1047 | ON 
   963 | DOC 
   913 | BE 
   908 | COMMONGROUND 
   880 | IT 
   792 | WE 
   696 | AT 
   694 | AS 
   642 | HAVE 
   639 | PDF 
 
(note of course that my statement about 2-char words being 
discarded is crap - it's *1* char and >25 char words that are 
discarded) 
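
The corrected length rule (1-char and >25-char words discarded) is simple to state in code. A minimal sketch; the function name is illustrative, not Roundup's actual predicate:

```python
def is_indexable(word):
    """Length rule as corrected above: keep words of 2..25 characters."""
    return 1 < len(word) <= 25

print([w for w in ("a", "to", "x" * 26) if is_indexable(w)])
# ['to']
```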
msg1485 Author: [hidden] (jlgijsbers) Date: 2004-10-14 09:20
The top 20 words on my database are:

     _word    | count
-------------+-------
 TO          | 21574
 IN          | 18284
 IS          | 17096
 IT          | 16271
 OF          | 14541
 FOR         | 14351
 AND         | 14346
 USER        | 11894
 SOURCEFORGE | 11692
 BE          | 11637
 ON          | 11147
 VALUE       | 10203
 IF          | 10093
 DATE        |  9798
 PYTHON      |  9511
 NOT         |  9420
 PATCH       |  8551
 100         |  8426
 OLD         |  8048
 YOU         |  7933
History
Date User Action Args
2004-10-13 21:26:51  richard  create