Issue 2551189: search indexer should allow longer words - Roundup tracker

classification

Title:	search indexer should allow longer words
Type:	rfe	Severity:	normal
Components:	Database	Versions:	2.2.0

process

Status:	fixed	Resolution:	fixed
Dependencies		Superseder:
Assigned To:	rouilj	Nosy List:	ThomasAH, rouilj
Priority:	high	Keywords:	Effort-Low

Created on 2022-01-25 07:41 by ThomasAH, last changed 2022-01-26 21:13 by rouilj.

Messages
msg7441	Author: [hidden] (ThomasAH)	Date: 2022-01-25 07:41
We recently had some support requests where the browser indicated an error message with "SSL_ERROR_HANDSHAKE_FAILURE_ALERT". Roundup's search wasn't able to find these messages, because the indexer is hard coded to 25 characters. We had other searches in the past where searching failed due to this low limit, as the German language can easily produce long words, here some words that have actually been used in our trackers :) Immatrikulationsbescheinigung (29 letters) Softwareentwicklungsdienstleistungen (36 letters) Auftragsdatenverarbeitungsvereinbarungen (40 letters) Betriebshaftpflichtversicherungsbedingungen (43 letters) So I guess the maximum length for the search index should be at least 45 characters, but analysis of the data in our trackers shows that even 50 characters does not increase the size of the index too much. Most "junk" lines are base64 encoded things, which are usually between 64 and 76 characters long. Problem is, that some database tables need to become wider to allow increasing the size of the search index, but there has been a recent change that required changes to the database structure anyway, so this might be a good time to do this? (Priority "high" because of this) Here is a patch I use for an older version (almost 1.6) of roundup. I increased 25 to 50, and some places which had 30 to 60. diff -r 45ad337e0c4a roundup/backends/back_mysql.py --- a/roundup/backends/back_mysql.py Thu Nov 14 13:00:49 2019 +0100 +++ b/roundup/backends/back_mysql.py Tue Nov 16 17:23:10 2021 +0100 @@ -228,7 +228,7 @@ self.sql('''CREATE TABLE __textids (_class VARCHAR(255), _itemid VARCHAR(255), _prop VARCHAR(255), _textid INT) ENGINE=%s'''%self.mysql_backend) - self.sql('''CREATE TABLE __words (_word VARCHAR(30), + self.sql('''CREATE TABLE __words (_word VARCHAR(60), _textid INT) ENGINE=%s'''%self.mysql_backend) self.sql('CREATE INDEX words_word_ids ON __words(_word)') self.sql('CREATE INDEX words_by_id ON __words (_textid)') diff -r 45ad337e0c4a roundup/backends/back_postgresql.py --- a/roundup/backends/back_postgresql.py Thu Nov 14 13:00:49 2019 +0100 +++ b/roundup/backends/back_postgresql.py Tue Nov 16 17:23:10 2021 +0100 @@ -206,7 +206,7 @@ self.sql('''CREATE TABLE __textids ( _textid integer primary key, _class VARCHAR(255), _itemid VARCHAR(255), _prop VARCHAR(255))''') - self.sql('''CREATE TABLE __words (_word VARCHAR(30), + self.sql('''CREATE TABLE __words (_word VARCHAR(60), _textid integer)''') self.sql('CREATE INDEX words_word_idx ON __words(_word)') self.sql('CREATE INDEX words_by_id ON __words (_textid)') diff -r 45ad337e0c4a roundup/backends/indexer_common.py --- a/roundup/backends/indexer_common.py Thu Nov 14 13:00:49 2019 +0100 +++ b/roundup/backends/indexer_common.py Tue Nov 16 17:23:10 2021 +0100 @@ -19,10 +19,10 @@ self.stopwords = set(STOPWORDS) for word in db.config[('main', 'indexer_stopwords')]: self.stopwords.add(word) - # Do not index anything longer than 25 characters since that'll be + # Do not index anything longer than 50 characters since that'll be # gibberish (encoded text or somesuch) or shorter than 2 characters self.minlength = 2 - self.maxlength = 25 + self.maxlength = 50 def is_stopword(self, word): return word in self.stopwords diff -r 45ad337e0c4a roundup/cgi/actions.py --- a/roundup/cgi/actions.py Thu Nov 14 13:00:49 2019 +0100 +++ b/roundup/cgi/actions.py Tue Nov 16 17:23:10 2021 +0100 @@ -1319,7 +1319,7 @@ # full-text search if request.search_text: matches = self.db.indexer.search( - re.findall(r'[\w%]{2,25}', request.search_text), klass) + re.findall(r'[\w%]{2,50}', request.search_text), klass) else: matches = None diff -r 45ad337e0c4a roundup/cgi/templating.py --- a/roundup/cgi/templating.py Thu Nov 14 13:00:49 2019 +0100 +++ b/roundup/cgi/templating.py Tue Nov 16 17:23:10 2021 +0100 @@ -2946,7 +2946,7 @@ if self.search_text: matches = self.client.db.indexer.search( [w.upper().encode("utf-8", "replace") for w in re.findall( - r'(?u)[\w%]{2,25}', + r'(?u)[\w%]{2,50}', unicode(self.search_text, "utf-8", "replace") )], klass) else:
msg7442	Author: [hidden] (rouilj)	Date: 2022-01-25 18:27
Hi Thomas: I have started on this. My first commit: changeset: 6593:e70e2789bc2c removed all the magic numbers 25, 30 from the code and replaced them with references to indexer.maxlength. I added testing for the code paths and sent it to CI. If nothing else, this reduces your diff to the one liner: - self.maxlength = 25 + self.maxlength = 50 I'll try to get to the rest of the changes (db migration, modify maxlength, tests) tomorrow or the next day I hope.
msg7443	Author: [hidden] (rouilj)	Date: 2022-01-26 21:13
It should be set to 50 in changeset: 6599:39189dd94f2c. While it would be nice to allow the user to customize this, the current mechanism for implementing database schema changes doesn't really allow for that. But at least local customization of this is now a one line change 8-). One thing to note if you are testing this is that db version 7 has one more change coming: adding python FTS support. So upgrading from this version will require manual database changes to get postgres FTS.

History
Date	User	Action	Args
2022-01-26 21:13:48	rouilj	set	status: open -> fixed resolution: fixed messages: + msg7443 versions: + 2.2.0
2022-01-25 18:27:32	rouilj	set	status: new -> open assignee: rouilj
2022-01-25 18:27:18	rouilj	set	nosy: + rouilj messages: + msg7442
2022-01-25 07:41:06	ThomasAH	create