Roundup Tracker - Issues

Issue 1344046

classification
Search for "All text" can't find some Unicode words
Type: behavior Severity: normal
Components: Database Versions:
process
Status: fixed fixed
:
: rouilj : ezio.melotti, pefu, richard, rouilj
Priority: normal : patch

Created on 2005-10-31 16:33 by anonymous, last changed 2019-10-30 22:07 by rouilj.

Messages
msg2046 Author: [hidden] (anonymous) Date: 2005-10-31 16:33
The fulltext search implemented in
backends\indexer_rdbms.py is not able to find words
having specific unicode characters in them. One such
character is the german 'u umlaut' ('ü'), which does
not survive the upper() statement in find().

E. g., if you search for 'Sprünge', wordlist first
contains 'SPR\xc3\x9cNGE', then 'SPR\xc3\x8cNGE'.

To fix this, i replaced line 82:

82c82,83
<         l = [word.upper() for word in wordlist if 26
> len(word) > 1]
---
>         l = [unicode(word, "utf-8",
"replace").upper().encode("utf-8", "replace")
>             for word in wordlist if 26 > len(word) > 1]

woe@gmx.net
msg2047 Author: [hidden] (anonymous) Date: 2007-01-29 18:25
Logged In: NO 

Words with UTF-8 characters are wrongly detected in indexer_ backends.
UTF-8 characters splits words now.

Original in indexer_xapian.py:
for match in re.finditer(r'\b\w{2,25}\b', text.upper()):
  word = match.group(0)

OK:
for match in re.finditer(r'\b\w{2,25}\b', unicode(text, "utf-8","replace").upper(), re.UNICODE):
  word = match.group(0).encode("utf-8", "replace")
msg6451 Author: [hidden] (pefu) Date: 2019-04-03 10:46
Today a coworker surprised me: Indeed he was unable to find a certain
issue in our database because the search term he used contained a german
umlaut. 

Is there any technical reason why the proposed patch has not been
applied in the roundup code base during the past twelve years?

Best regards, Peter Funk
msg6469 Author: [hidden] (rouilj) Date: 2019-04-29 00:45
Hi Peter. I don't know a technical reason why the code wasn't changed.
This patch came in when I wasn't involved with roundup.

Full text indexers like xapian or whoosh I thought were meant
to handle this better than native.

From a process level, there are no tests associated with this
patch. Support for this needs testing on all rdbms back ends at
minimum.

Also support for anydbm should be added, or documentation updates are 
needed to explain this change.
msg6729 Author: [hidden] (rouilj) Date: 2019-10-11 01:27
Ezio:

Can you take a look at this ticket and patch. You have more knowledge about unicode and conversions than I do.

If you say it looks good and it passes existing tests (or if you hve a suggestion on how to add a test for it) I'll commit for 2.0.0 alpha.

-- rouilj
msg6778 Author: [hidden] (ezio.melotti) Date: 2019-10-26 19:21
msg2047 suggests to change the add_text() function in roundup/backends/indexer_xapian.py to decode the text before feeding it to re.finditer() and reencoding the words afterwards (implying text is initially bytes).  

add_text() has changed a bit now, but the call to re.finditer() is still there.  At least on Python 3, the text it receives already seem to be unicode, so the suggested decoding/reencoding is no longer needed.  If it was bytes, the re.finditer() will break because the regex is unicode (and incompatible with bytes).  In Python 2 however, the args might still be bytes (depending on the args sent by the caller), and in that case it would need to be decoded, either in add_text() (probably by using b2s()), or before they get passed to add_text().  add_text() gets called in a number of places (e.g. roundup/backends/rdbms_common.py) and seems to have some tests (in test/test_indexer.py).

Note that the identifier and words are encoded into bytes before sending them to xapian (this might not be necessary, at least on Python 3)

I think in Python 3 the values passed to add_text() should be unicode-only and bytes should be rejected (even if this might be tricky with a shared codebase).  On Python 2 it should support both for backward compatibility, and if the args are bytes they should be decoded before being passed to re.finditer().  If xapian accepts and works with unicode, it might be better to pass unicode in both Py 2 and 3, if not, the identifier and words should be encoded (as it already happens with s2b).  Also consider using .casefold() instead of lower().  Tests should be improved by adding both unicode and bytes args (if supported), and non=ascii text.
msg6779 Author: [hidden] (ezio.melotti) Date: 2019-10-26 19:32
msg2046 suggests to change the find() function in roundup/backends/indexer_rdbms.py to decode the words before passing them to the db (implying the wordlist initially contains bytes).  

As in my previous message, in Python 3 the word list should already be a list of unicode strings, but on Python 2 it might be either a list of bytes, or a list of unicode strings.  Decoding the bytes with b2s in Python 2 should be ok, as long as the db can accept unicode.  Tests should be added to ensure that find() works with both bytes and unicode in Python 2 and that it handles non-ascii words correctly on both versions.
msg6782 Author: [hidden] (rouilj) Date: 2019-10-30 00:53
Taggnostr: I think on python3 everything should work, since it's all
    unicode already but on python2 it might be both, so things might
    not work when it's bytes
rouilj: will do. So on python3, it should just work (assuming index
    can handle unicode)
Taggnostr: on python3 it will also break if it's bytes fwiw all
    operations on text (splitting, slicing, upper/lowercase, etc)
    should be done on unicode only if the text is ascii-only, most
    operations happen to work on bytes too, but if it isn't things
    will break
rouilj: you mention using b2s(text) on python2, but b2s is a no-op in
    python 2.
Taggnostr: maybe it's s2u?
rouilj: wouldn't I want to do an s2u pn python2 to make a unicode
    string then mangle it with case, slicing etc.
rouilj: ok. that makes a little more sense.
Taggnostr: not sure what the name is, but it should decode the input
    if it's bytes, and do nothing if it's unicode
rouilj: def s2u(s, errors='strict'):
rouilj:     """Convert a string object to a Unicode string."""
Taggnostr: and this does nothing if s is already unicode, right?
rouilj: actually it doesn't check: 
rouilj:     if _py3:
rouilj:         return s
rouilj:     else:
rouilj:         return unicode(s, 'utf-8', errors)
Taggnostr: so maybe it's not what we want
rouilj: however in python2, doc in roundup/anypy/string.py says that
    unicode is rarely used except for passing to third parties that
    require unicode.
Taggnostr: if you try to decode a unicode string in python2, python
    will implicitly try to encode it using ascii, and then decode it
    again
Taggnostr: >>> u'Sprünge'.decode('utf-8')
Taggnostr: UnicodeEncodeError: 'ascii' codec can't encode character
    u'\xfc' in position 3: ordinal not in range(128)
Taggnostr: this is what happens in py2
Taggnostr: it is odd, that's why python3 fixed it
Taggnostr: decoding only makes sense when you do bytes -> unicode, and
    encoding only when you do unicode -> bytes, but when the unicode
    type was added it was supposed to work where str worked, so it
    needed to have the same methods (even though they only work with
    ascii and create problems in all other cases)
Taggnostr: so if you get .decode/.encode mixed up, python will try to
    convert the input in a type that can be decoded/encoded, using
    ascii as default encoding
rouilj: gotcha.  AFAICT woosh, xapian mysql, sqlite and postgres full
    text search all support unicode data.
rouilj: looking at test/test_indexer I don't see any unicode strings.
Taggnostr: in that case it would be better to use unicode throughout,
    but I'm afraid it might break things if they already contain bytes
    and we start passing unicode instead
Taggnostr: it might be better to leave python2 alone and focus on
    python3, where everything should already work
rouilj: "they already contain" what is they? The index?
Taggnostr: yes, or the db
Taggnostr: but I'm not too familiar with it, maybe it can be rebuilt
    easily with only unicode strings with no side effects
rouilj: yeah same here. Databases are a black art. I run them but
    don;t program them. I usually leave he db stuff to Ralf.
Taggnostr: having everything unicode is probably the best option,
    second best is having everything bytes, having a mix of the two
    might cause more problems
Taggnostr: so if we start decoding stuff now and get unicode, either
    we can also get everything else as unicode too, or we can just
    reencode before adding to the index/db (which is what I think is
    already happening)
rouilj: So if I add some test cases for unicode (e.g
    add_text(u'Sprunge") and then do a find on the same text I should
    get a hit in python3, but fail on python 2. Is that a correct
    asessment?
Taggnostr: on python3 u'text' and 'text' should be equivalent (at
    least from py 3.3+)
Taggnostr: the tests now have add_text('text') -- this is testing with
    bytes on python2 and with unicode on python3
rouilj: but tha would also be true on python2 right? u'text' can make
    the ascii transition without a failure.
Taggnostr: that depends on how you save the file :)
Taggnostr: if you write u'Sprünge' in the .py file, the ü will be
    represented as a different sequence of bytes depending on what
    encoding you are using to save the file with
Taggnostr: you can specify the encoding with a comment at the top of
    the file
Taggnostr: or, you an just use 'Spr\xfcnge' and keep the source ascii
rouilj: #-*- encoding: utf-8 -*-
Taggnostr: yep, if you add this and save the file as utf-8 (no bom
    needed), then you can write u'Sprünge' directly in the .py file
Taggnostr: keep in mind that utf8 and iso-8859-1 are supersets of ascii
Taggnostr: so if you keep the source ascii-only, it will always work
rouilj: what happens if I add text Spr\xfcnge and search for sprunge,
    what do you expect will happen?
Taggnostr: if you use non-ascii characters (like ü) then you have to
    tell python what encoding have you used to save the file, using
    #-*- encoding: utf-8 -*- (and of course they must match)
rouilj: well I was going to keep it all in ascii using \xfc for the
    umlatted u
Taggnostr: >>> u'Spr\u00FCnge' == u'Spr\xfcnge' == u'Sprünge'
Taggnostr: True
rouilj: but not sprunge (regular u not umlauted)
Taggnostr: these are all different ways of spelling the same things,
    the first two ways are ascii-only so they work with
    ascii/utf8/iso-8859-1, the third is non-ascii so you have to tell
    python what encoding you are using in the file
Taggnostr: >>> u'Sprünge' == u'Sprunge'
Taggnostr: False
Taggnostr: unless the indexer strips diacritics
msg6783 Author: [hidden] (rouilj) Date: 2019-10-30 01:40
Commit rev5960:0db2621b6fee has a new test for unicode strings in the indexer. For python2, the test is marked xfail, but I have passing tests for whoosh and sqlite.

indexer (native) and xapian both fail. Native crashes while trying
to convert the unicode string to python2 string. Xapian just fails
to find the word in the index.

Commit is running through CI at: https://travis-ci.org/roundup-tracker/roundup/builds/604758991
msg6785 Author: [hidden] (rouilj) Date: 2019-10-30 22:07
Final fix in rev5964:5bf7b5debb09.

Works for all indexers in python2 and python 3.

Closing.
History
Date User Action Args
2019-10-30 22:07:06rouiljsetstatus: open -> fixed
resolution: fixed
messages: + msg6785
2019-10-30 01:40:57rouiljsetmessages: + msg6783
2019-10-30 00:53:22rouiljsetmessages: + msg6782
2019-10-26 19:32:06ezio.melottisetmessages: + msg6779
2019-10-26 19:21:25ezio.melottisetmessages: + msg6778
2019-10-11 01:27:40rouiljsetassignee: richard -> rouilj
messages: + msg6729
nosy: + ezio.melotti
2019-04-29 00:45:08rouiljsetnosy: + rouilj
messages: + msg6469
2019-04-03 10:46:07pefusetnosy: + pefu
messages: + msg6451
2016-06-26 19:29:22rouiljsetkeywords: + patch
type: behavior
2005-10-31 16:33:51anonymouscreate