Roundup Tracker - Issues

Issue 2551189

classification
search indexer should allow longer words
Type: rfe Severity: normal
Components: Database Versions: 2.2.0
process
Status: fixed fixed
:
: rouilj : ThomasAH, rouilj
Priority: high : Effort-Low

Created on 2022-01-25 07:41 by ThomasAH, last changed 2022-01-26 21:13 by rouilj.

Messages
msg7441 Author: [hidden] (ThomasAH) Date: 2022-01-25 07:41
We recently had some support requests where the browser indicated an
error message with "SSL_ERROR_HANDSHAKE_FAILURE_ALERT". Roundup's search
wasn't able to find these messages, because the indexer is hard coded to
25 characters.

We had other searches in the past where searching failed due to this low
limit, as the German language can easily produce long words, here some
words that have actually been used in our trackers :)
Immatrikulationsbescheinigung (29 letters)
Softwareentwicklungsdienstleistungen (36 letters)
Auftragsdatenverarbeitungsvereinbarungen (40 letters)
Betriebshaftpflichtversicherungsbedingungen (43 letters)

So I guess the maximum length for the search index should be at least 45
characters, but analysis of the data in our trackers shows that even 50
characters does not increase the size of the index too much.
Most "junk" lines are base64 encoded things, which are usually between
64 and 76 characters long.

Problem is, that some database tables need to become wider to allow
increasing the size of the search index, but there has been a recent
change that required changes to the database structure anyway, so this
might be a good time to do this? (Priority "high" because of this)

Here is a patch I use for an older version (almost 1.6) of roundup.
I increased 25 to 50, and some places which had 30 to 60.


diff -r 45ad337e0c4a roundup/backends/back_mysql.py
--- a/roundup/backends/back_mysql.py	Thu Nov 14 13:00:49 2019 +0100
+++ b/roundup/backends/back_mysql.py	Tue Nov 16 17:23:10 2021 +0100
@@ -228,7 +228,7 @@
         self.sql('''CREATE TABLE __textids (_class VARCHAR(255),
             _itemid VARCHAR(255), _prop VARCHAR(255), _textid INT)
             ENGINE=%s'''%self.mysql_backend)
-        self.sql('''CREATE TABLE __words (_word VARCHAR(30),
+        self.sql('''CREATE TABLE __words (_word VARCHAR(60),
             _textid INT) ENGINE=%s'''%self.mysql_backend)
         self.sql('CREATE INDEX words_word_ids ON __words(_word)')
         self.sql('CREATE INDEX words_by_id ON __words (_textid)')
diff -r 45ad337e0c4a roundup/backends/back_postgresql.py
--- a/roundup/backends/back_postgresql.py	Thu Nov 14 13:00:49 2019 +0100
+++ b/roundup/backends/back_postgresql.py	Tue Nov 16 17:23:10 2021 +0100
@@ -206,7 +206,7 @@
         self.sql('''CREATE TABLE __textids (
             _textid integer primary key, _class VARCHAR(255),
             _itemid VARCHAR(255), _prop VARCHAR(255))''')
-        self.sql('''CREATE TABLE __words (_word VARCHAR(30),
+        self.sql('''CREATE TABLE __words (_word VARCHAR(60),
             _textid integer)''')
         self.sql('CREATE INDEX words_word_idx ON __words(_word)')
         self.sql('CREATE INDEX words_by_id ON __words (_textid)')
diff -r 45ad337e0c4a roundup/backends/indexer_common.py
--- a/roundup/backends/indexer_common.py	Thu Nov 14 13:00:49 2019 +0100
+++ b/roundup/backends/indexer_common.py	Tue Nov 16 17:23:10 2021 +0100
@@ -19,10 +19,10 @@
         self.stopwords = set(STOPWORDS)
         for word in db.config[('main', 'indexer_stopwords')]:
             self.stopwords.add(word)
-        # Do not index anything longer than 25 characters since that'll be
+        # Do not index anything longer than 50 characters since that'll be
         # gibberish (encoded text or somesuch) or shorter than 2 characters
         self.minlength = 2
-        self.maxlength = 25
+        self.maxlength = 50
 
     def is_stopword(self, word):
         return word in self.stopwords
diff -r 45ad337e0c4a roundup/cgi/actions.py
--- a/roundup/cgi/actions.py	Thu Nov 14 13:00:49 2019 +0100
+++ b/roundup/cgi/actions.py	Tue Nov 16 17:23:10 2021 +0100
@@ -1319,7 +1319,7 @@
         # full-text search
         if request.search_text:
             matches = self.db.indexer.search(
-                re.findall(r'[\w%]{2,25}', request.search_text), klass)
+                re.findall(r'[\w%]{2,50}', request.search_text), klass)
         else:
             matches = None
 
diff -r 45ad337e0c4a roundup/cgi/templating.py
--- a/roundup/cgi/templating.py	Thu Nov 14 13:00:49 2019 +0100
+++ b/roundup/cgi/templating.py	Tue Nov 16 17:23:10 2021 +0100
@@ -2946,7 +2946,7 @@
         if self.search_text:
             matches = self.client.db.indexer.search(
                 [w.upper().encode("utf-8", "replace") for w in re.findall(
-                    r'(?u)[\w%]{2,25}',
+                    r'(?u)[\w%]{2,50}',
                     unicode(self.search_text, "utf-8", "replace")
                 )], klass)
         else:
msg7442 Author: [hidden] (rouilj) Date: 2022-01-25 18:27
Hi Thomas:

I have started on this. My first commit:

 changeset:   6593:e70e2789bc2c

removed all the magic numbers 25, 30 from the code and replaced them with references to
indexer.maxlength. I added testing for the code paths and sent it to CI.

If nothing else, this reduces your diff to the one liner:

-        self.maxlength = 25
+        self.maxlength = 50

I'll try to get to the rest of the changes (db migration, modify maxlength, tests) tomorrow
or the next day I hope.
msg7443 Author: [hidden] (rouilj) Date: 2022-01-26 21:13
It should be set to 50 in changeset:   6599:39189dd94f2c.

While it would be nice to allow the user to customize this, the current
mechanism for implementing database schema changes doesn't really allow
for that.

But at least local customization of this is now a one line change 8-).

One thing to note if you are testing this is that db version 7 has one more change
coming: adding python FTS support. So upgrading from this version will require 
manual database changes to get postgres FTS.
History
Date User Action Args
2022-01-26 21:13:48rouiljsetstatus: open -> fixed
resolution: fixed
messages: + msg7443
versions: + 2.2.0
2022-01-25 18:27:32rouiljsetstatus: new -> open
assignee: rouilj
2022-01-25 18:27:18rouiljsetnosy: + rouilj
messages: + msg7442
2022-01-25 07:41:06ThomasAHcreate