Roundup Tracker - Issues

Issue 1344046

Title: Search for "All text" can't find some Unicode words
Type: behavior Severity: normal
Components: Database Versions:
Status: open Resolution:
Dependencies: Superseder:
Assigned To: rouilj Nosy List: ezio.melotti, pefu, richard, rouilj
Priority: normal Keywords: patch

Created on 2005-10-31 16:33 by anonymous, last changed 2019-10-11 01:27 by rouilj.

msg2046 Author: [hidden] (anonymous) Date: 2005-10-31 16:33
The full-text search implemented in
backends\ is not able to find words
that contain certain Unicode characters. One such
character is the German 'u umlaut' ('ü'), which does
not survive the upper() call in find().

E.g., if you search for 'Sprünge', wordlist first
contains 'SPR\xc3\x9cNGE', then 'SPR\xc3\x8cNGE'.

To fix this, I replaced line 82:

<         l = [word.upper() for word in wordlist if 26 > len(word) > 1]
>         l = [unicode(word, "utf-8", "replace").upper().encode("utf-8", "replace")
>             for word in wordlist if 26 > len(word) > 1]
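
For illustration, here is a minimal Python 3 sketch of the same idea. The patch above targets Python 2 (it uses unicode()), so this is not the patch itself. The helper name normalize_words is hypothetical and not part of Roundup. It assumes the words arrive as UTF-8 encoded bytes and keeps the byte-length filter from the original line.

```python
def normalize_words(wordlist):
    """Uppercase UTF-8 encoded words (hypothetical helper, not Roundup code).

    Words are decoded to text before upper() so that non-ASCII letters
    such as 'ü' are uppercased too, then encoded back to UTF-8 bytes.
    The length filter works on the byte length, as in the original line.
    """
    result = []
    for word in wordlist:
        if 26 > len(word) > 1:
            text = word.decode("utf-8", "replace")
            result.append(text.upper().encode("utf-8", "replace"))
    return result


raw = "Sprünge".encode("utf-8")

# bytes.upper() only changes ASCII letters, so the two bytes of 'ü' stay as they are.
print(raw.upper())

# Decoding first lets upper() turn 'ü' into 'Ü' before re-encoding.
print(normalize_words([raw]))
```

The point of the sketch is only the order of operations: decode, uppercase, encode. Whether search terms and indexed words go through the same normalization is something the real code would have to guarantee.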
msg2047 Author: [hidden] (anonymous) Date: 2007-01-29 18:25

Words containing UTF-8 characters are detected incorrectly in the indexer_ backends.
Currently, UTF-8 characters split words.

Original in
for match in re.finditer(r'\b\w{2,25}\b', text.upper()):
  word =

for match in re.finditer(r'\b\w{2,25}\b', unicode(text, "utf-8","replace").upper(), re.UNICODE):
  word ="utf-8", "replace")
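
The word-splitting effect described above can be reproduced with a short Python 3 sketch. This is an illustration under assumptions, not the Roundup code: the function names split_bytes and split_text are hypothetical, and using match.group(0) as the word is an assumption, since the assignment in the snippets above is cut off. In Python 3, a bytes pattern makes \w match ASCII only, while a str pattern is Unicode-aware by default.

```python
import re

# Same pattern as in the message above, once for bytes and once for text.
WORD_RE_BYTES = re.compile(rb"\b\w{2,25}\b")
WORD_RE_TEXT = re.compile(r"\b\w{2,25}\b")


def split_bytes(data):
    """Tokenize raw UTF-8 bytes (hypothetical helper): \\w matches ASCII only."""
    return [m.group(0) for m in WORD_RE_BYTES.finditer(data.upper())]


def split_text(data):
    """Decode first, then tokenize (hypothetical helper): \\w is Unicode-aware."""
    text = data.decode("utf-8", "replace").upper()
    return [m.group(0).encode("utf-8", "replace")
            for m in WORD_RE_TEXT.finditer(text)]


raw = "Sprünge".encode("utf-8")
print(split_bytes(raw))  # the bytes of 'ü' act as a word separator
print(split_text(raw))   # the word stays in one piece
```

With the byte-level pattern, the word falls apart at the non-ASCII character, which matches the behaviour reported in this message.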
msg6451 Author: [hidden] (pefu) Date: 2019-04-03 10:46
Today a coworker surprised me: Indeed he was unable to find a certain
issue in our database because the search term he used contained a German

Is there any technical reason why the proposed patch has not been
applied in the roundup code base during the past twelve years?

Best regards, Peter Funk
msg6469 Author: [hidden] (rouilj) Date: 2019-04-29 00:45
Hi Peter. I don't know a technical reason why the code wasn't changed.
This patch came in when I wasn't involved with roundup.

I thought full-text indexers like xapian or whoosh were meant
to handle this better than the native indexer.

From a process level, there are no tests associated with this
patch. Support for this needs testing on all rdbms back ends at

Also support for anydbm should be added, or documentation updates are 
needed to explain this change.
msg6729 Author: [hidden] (rouilj) Date: 2019-10-11 01:27

Can you take a look at this ticket and patch? You have more knowledge about Unicode and conversions than I do.

If you say it looks good and it passes the existing tests (or if you have a suggestion on how to add a test for it), I'll commit it for 2.0.0 alpha.

-- rouilj
Date                 User       Action  Args
2019-10-11 01:27:40  rouilj     set     assignee: richard -> rouilj
                                        messages: + msg6729
                                        nosy: + ezio.melotti
2019-04-29 00:45:08  rouilj     set     nosy: + rouilj
                                        messages: + msg6469
2019-04-03 10:46:07  pefu       set     nosy: + pefu
                                        messages: + msg6451
2016-06-26 19:29:22  rouilj     set     keywords: + patch
                                        type: behavior
2005-10-31 16:33:51  anonymous  create