Roundup Tracker - Issues

Issue 1344046

classification
Title: Search for "All text" can't find some Unicode words
Type: behavior Severity: normal
Components: Database Versions:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: richard Nosy List: richard
Priority: normal Keywords: patch

Created on 2005-10-31 16:33 by anonymous, last changed 2016-06-26 19:29 by rouilj.

Messages
msg2046 Author: [hidden] (anonymous) Date: 2005-10-31 16:33
The full-text search implemented in
backends\indexer_rdbms.py cannot find words that
contain certain Unicode characters. One such
character is the German u-umlaut ('ü'), which does
not survive the upper() call in find().

For example, if you search for 'Sprünge', wordlist first
contains 'SPR\xc3\x9cNGE', then 'SPR\xc3\x8cNGE'.

To fix this, I replaced line 82:

82c82,83
<         l = [word.upper() for word in wordlist if 26 > len(word) > 1]
---
>         l = [unicode(word, "utf-8", "replace").upper().encode("utf-8", "replace")
>             for word in wordlist if 26 > len(word) > 1]

woe@gmx.net
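
The idea behind the patch above can be illustrated with a minimal sketch. It is written for Python 3, where bytes objects stand in for the Python 2 str values discussed in the report; the helper names upper_bytes and upper_text are hypothetical and do not exist in Roundup. It only shows why upper-casing must happen on decoded text rather than on the encoded bytes.

```python
# Illustrative sketch only: Python 3 bytes stand in for Python 2 str.
# upper_bytes and upper_text are hypothetical names, not Roundup functions.


def upper_bytes(word: bytes) -> bytes:
    """Upper-case at the byte level: only ASCII letters are changed."""
    return word.upper()


def upper_text(word: bytes, encoding: str = "utf-8") -> bytes:
    """Decode, upper-case as text, then re-encode (the patch's approach)."""
    return word.decode(encoding, "replace").upper().encode(encoding, "replace")


raw = "Sprünge".encode("utf-8")

# The two bytes of the encoded 'ü' are left alone by the byte-level call.
print(upper_bytes(raw))  # b'SPR\xc3\xbcNGE'

# Decoding first lets 'ü' become 'Ü' before re-encoding.
print(upper_text(raw))   # b'SPR\xc3\x9cNGE'
```

With the byte-level variant, the search term and the indexed word only compare equal if both went through exactly the same, locale-dependent transformation, which is not something the indexer can rely on.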
msg2047 Author: [hidden] (anonymous) Date: 2007-01-29 18:25
Words containing UTF-8 characters are detected incorrectly in the indexer_ backends.
Currently, a UTF-8 character splits the word it appears in.

Original in indexer_xapian.py:
for match in re.finditer(r'\b\w{2,25}\b', text.upper()):
  word = match.group(0)

Fixed:
for match in re.finditer(r'\b\w{2,25}\b', unicode(text, "utf-8", "replace").upper(), re.UNICODE):
  word = match.group(0).encode("utf-8", "replace")
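
The word-splitting effect described in this message can be reproduced with a short sketch. It is written for Python 3, using a bytes pattern to mimic matching on undecoded UTF-8 and a str pattern to mimic the decoded variant; in Python 3 a str pattern is Unicode-aware by default, so re.UNICODE is not needed there. The names tokenize_bytes and tokenize_text are hypothetical and are not part of Roundup.

```python
import re

# Same pattern as in the message, once for bytes and once for text.
WORD_RE_BYTES = re.compile(rb"\b\w{2,25}\b")
WORD_RE_TEXT = re.compile(r"\b\w{2,25}\b")


def tokenize_bytes(text: bytes) -> list:
    """Match on raw UTF-8 bytes: \\w only covers ASCII word characters."""
    return [m.group(0) for m in WORD_RE_BYTES.finditer(text.upper())]


def tokenize_text(text: bytes) -> list:
    """Decode first, match on text, re-encode each word for storage."""
    decoded = text.decode("utf-8", "replace").upper()
    return [m.group(0).encode("utf-8", "replace")
            for m in WORD_RE_TEXT.finditer(decoded)]


raw = "zwei Sprünge".encode("utf-8")

# The encoded 'ü' is not a word character, so the word is cut in two.
print(tokenize_bytes(raw))  # [b'ZWEI', b'SPR', b'NGE']

# After decoding, 'Ü' counts as a word character and the word stays whole.
print(tokenize_text(raw))   # [b'ZWEI', b'SPR\xc3\x9cNGE']
```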
History
Date User Action Args
2016-06-26 19:29:22 rouilj set keywords: + patch; type: behavior
2005-10-31 16:33:51 anonymous create