Roundup Tracker - Issues

Issue 1344046

classification
Title: Search for "All text" can't find some Unicode words
Type: behavior Severity: normal
Components: Database Versions:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: richard Nosy List: pefu, richard
Priority: normal Keywords: patch

Created on 2005-10-31 16:33 by anonymous, last changed 2019-04-03 10:46 by pefu.

Messages
msg2046 Author: [hidden] (anonymous) Date: 2005-10-31 16:33
The full-text search implemented in
backends\indexer_rdbms.py cannot find words that
contain certain Unicode characters. One such
character is the German u-umlaut ('ü'), which does
not survive the upper() call in find().

E.g., if you search for 'Sprünge', wordlist first
contains 'SPR\xc3\x9cNGE', then 'SPR\xc3\x8cNGE'.

To fix this, I replaced line 82:

82c82,83
<         l = [word.upper() for word in wordlist if 26 > len(word) > 1]
---
>         l = [unicode(word, "utf-8", "replace").upper().encode("utf-8", "replace")
>             for word in wordlist if 26 > len(word) > 1]

woe@gmx.net
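
For readers on Python 3, the idea behind the patch above can be sketched as follows. This is only an illustration, not the Roundup code: the function name normalize_words is hypothetical, and the sketch assumes the words arrive as UTF-8 encoded bytes, as they did in the Python 2 indexer. The point is that upper-casing has to happen on decoded text, because a byte-wise upper() does not know about multi-byte characters.

```python
def normalize_words(wordlist, encoding="utf-8"):
    """Upper-case encoded words by way of text (hypothetical helper).

    Keeps the original length filter (2..25, measured on the encoded
    word), decodes each word, upper-cases it as text and re-encodes it.
    """
    return [
        word.decode(encoding, "replace").upper().encode(encoding, "replace")
        for word in wordlist
        if 26 > len(word) > 1
    ]


raw = "Sprünge".encode("utf-8")

# bytes.upper() in Python 3 only changes ASCII letters, so the two
# bytes of the umlaut (\xc3\xbc) are left untouched.
print(raw.upper())

# Decoding first gives the proper upper-case umlaut (\xc3\x9c).
print(normalize_words([raw]))
```

Whatever normalization is used has to be applied identically at indexing time and at search time, otherwise stored and searched words will not match.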
msg2047 Author: [hidden] (anonymous) Date: 2007-01-29 18:25
Words containing UTF-8 characters are detected incorrectly in the indexer_ backends.
Multi-byte UTF-8 characters currently split words.

Original in indexer_xapian.py:
for match in re.finditer(r'\b\w{2,25}\b', text.upper()):
  word = match.group(0)

OK:
for match in re.finditer(r'\b\w{2,25}\b', unicode(text, "utf-8","replace").upper(), re.UNICODE):
  word = match.group(0).encode("utf-8", "replace")
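
The word-splitting effect described above can be reproduced with a short Python 3 sketch. It is an illustration under assumptions, not the indexer_xapian.py code: the names index_words_bytes and index_words_text are hypothetical. In Python 3 a bytes pattern treats \w as ASCII-only, which mimics the old behaviour, while a str pattern is Unicode-aware by default, so the re.UNICODE flag from the Python 2 fix is no longer needed.

```python
import re


def index_words_bytes(data):
    """Match words on raw UTF-8 bytes; \\w is ASCII-only for bytes patterns."""
    return [m.group(0) for m in re.finditer(rb"\b\w{2,25}\b", data.upper())]


def index_words_text(text):
    """Match words on decoded text, then encode each word as UTF-8."""
    return [
        m.group(0).encode("utf-8", "replace")
        for m in re.finditer(r"\b\w{2,25}\b", text.upper())
    ]


sample = "Sprünge machen"

# On bytes, the umlaut bytes are not word characters, so the word splits.
print(index_words_bytes(sample.encode("utf-8")))
# → [b'SPR', b'NGE', b'MACHEN']

# On text, the umlaut is a word character and the word stays whole.
print(index_words_text(sample))
# → [b'SPR\xc3\x9cNGE', b'MACHEN']
```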
msg6451 Author: [hidden] (pefu) Date: 2019-04-03 10:46
Today a coworker surprised me: he was indeed unable to find a certain
issue in our database because the search term he used contained a German
umlaut.

Is there any technical reason why the proposed patch has not been
applied to the Roundup code base during the past twelve years?

Best regards, Peter Funk
History
Date                 User       Action  Args
2019-04-03 10:46:07  pefu       set     nosy: + pefu; messages: + msg6451
2016-06-26 19:29:22  rouilj     set     keywords: + patch; type: behavior
2005-10-31 16:33:51  anonymous  create