Roundup Tracker - Issues

Issue 1344046

classification
Title: Search for "All text" can't find some Unicode words
Type: behavior
Severity: normal
Components: Database
Versions:
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: richard
Nosy List: pefu, richard, rouilj
Priority: normal
Keywords: patch

Created on 2005-10-31 16:33 by anonymous, last changed 2019-04-29 00:45 by rouilj.

Messages
msg2046 Author: [hidden] (anonymous) Date: 2005-10-31 16:33
The full-text search implemented in
backends\indexer_rdbms.py cannot find words that
contain certain Unicode characters. One such
character is the German 'u umlaut' ('ü'), which does
not survive the upper() call in find().

E.g., if you search for 'Sprünge', wordlist first
contains 'SPR\xc3\x9cNGE', then 'SPR\xc3\x8cNGE'.

To fix this, I replaced line 82:

82c82,83
<         l = [word.upper() for word in wordlist if 26 > len(word) > 1]
---
>         l = [unicode(word, "utf-8", "replace").upper().encode("utf-8", "replace")
>             for word in wordlist if 26 > len(word) > 1]

woe@gmx.net
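
For illustration, here is a minimal Python 3 sketch of the difference between uppercasing UTF-8 bytes directly and uppercasing the decoded text, as the patch above proposes. The patch itself targets Python 2; the helper names upper_bytes and upper_via_text are hypothetical and not part of roundup. In Python 3, bytes.upper() changes only ASCII letters, so the exact bytes differ from those reported above, but the resulting mismatch is of the same kind.

```python
# Hypothetical helpers for illustration only; not part of roundup.

def upper_bytes(word: bytes) -> bytes:
    """Uppercase UTF-8 bytes directly: only ASCII letters are changed."""
    return word.upper()


def upper_via_text(word: bytes) -> bytes:
    """Decode to text, uppercase, then re-encode (the approach of the patch)."""
    return word.decode("utf-8", "replace").upper().encode("utf-8", "replace")


word = "Sprünge".encode("utf-8")
print(upper_bytes(word))     # b'SPR\xc3\xbcNGE'  (the 'ü' stays lowercase)
print(upper_via_text(word))  # b'SPR\xc3\x9cNGE'  (the 'ü' becomes 'Ü')
```

If the index and the search term are normalized by different routes, the two byte strings do not compare equal and the word is not found.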
msg2047 Author: [hidden] (anonymous) Date: 2007-01-29 18:25

Words containing UTF-8 characters are detected incorrectly in the indexer_ backends.
Multi-byte UTF-8 characters currently split words.

Original in indexer_xapian.py:
for match in re.finditer(r'\b\w{2,25}\b', text.upper()):
  word = match.group(0)

OK:
for match in re.finditer(r'\b\w{2,25}\b', unicode(text, "utf-8","replace").upper(), re.UNICODE):
  word = match.group(0).encode("utf-8", "replace")
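
The word-splitting effect can be sketched in Python 3 as follows. This is an illustration under assumptions, not the roundup code: the names WORD_BYTES, WORD_TEXT, words_from_bytes and words_from_text are hypothetical, and the original snippets above are Python 2. With a bytes pattern, \w matches only ASCII word characters, so the two bytes of 'ü' act as a word boundary.

```python
import re

# Hypothetical names for illustration only; not part of roundup.
WORD_BYTES = re.compile(rb"\b\w{2,25}\b")           # \w is ASCII-only on bytes
WORD_TEXT = re.compile(r"\b\w{2,25}\b", re.UNICODE)  # \w is Unicode-aware on str


def words_from_bytes(data: bytes) -> list:
    """Tokenize raw UTF-8 bytes: non-ASCII bytes split words."""
    return WORD_BYTES.findall(data.upper())


def words_from_text(data: bytes) -> list:
    """Decode first, tokenize the text, then re-encode each word."""
    text = data.decode("utf-8", "replace").upper()
    return [m.group(0).encode("utf-8", "replace")
            for m in WORD_TEXT.finditer(text)]


data = "Sprünge".encode("utf-8")
print(words_from_bytes(data))  # [b'SPR', b'NGE']
print(words_from_text(data))   # [b'SPR\xc3\x9cNGE']
```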
msg6451 Author: [hidden] (pefu) Date: 2019-04-03 10:46
Today a coworker surprised me: he was indeed unable to find a certain
issue in our database because his search term contained a German
umlaut.

Is there any technical reason why the proposed patch has not been
applied to the roundup code base during the past twelve years?

Best regards, Peter Funk
msg6469 Author: [hidden] (rouilj) Date: 2019-04-29 00:45
Hi Peter. I don't know of a technical reason why the code wasn't changed.
This patch came in before I was involved with roundup.

I thought full-text indexers like xapian or whoosh were meant
to handle this better than the native indexer.

From a process standpoint, there are no tests associated with this
patch. Support for this needs testing on all rdbms back ends at
minimum.

Also, support for anydbm should be added, or documentation updates are
needed to explain this change.
History
Date                 User       Action  Args
2019-04-29 00:45:08  rouilj     set     nosy: + rouilj; messages: + msg6469
2019-04-03 10:46:07  pefu       set     nosy: + pefu; messages: + msg6451
2016-06-26 19:29:22  rouilj     set     keywords: + patch; type: behavior
2005-10-31 16:33:51  anonymous  create