Issue 1344046: Search for "All text" can't find some Unicode words

classification

Title:	Search for "All text" can't find some Unicode words
Type:	behavior	Severity:	normal
Components:	Database	Versions:

process

Status:	fixed	Resolution:	fixed
Dependencies		Superseder:
Assigned To:	rouilj	Nosy List:	ezio.melotti, pefu, richard, rouilj
Priority:	normal	Keywords:	patch

Created on 2005-10-31 16:33 by anonymous, last changed 2019-10-30 22:07 by rouilj.

Messages
msg2046	Author: [hidden] (anonymous)	Date: 2005-10-31 16:33
The fulltext search implemented in backends\indexer_rdbms.py is not able to find words having specific unicode characters in them. One such character is the german 'u umlaut' ('ü'), which does not survive the upper() statement in find(). E. g., if you search for 'Sprünge', wordlist first contains 'SPR\xc3\x9cNGE', then 'SPR\xc3\x8cNGE'. To fix this, i replaced line 82: 82c82,83 < l = [word.upper() for word in wordlist if 26 > len(word) > 1] --- > l = [unicode(word, "utf-8", "replace").upper().encode("utf-8", "replace") > for word in wordlist if 26 > len(word) > 1] woe@gmx.net
msg2047	Author: [hidden] (anonymous)	Date: 2007-01-29 18:25
Logged In: NO Words with UTF-8 characters are wrongly detected in indexer_ backends. UTF-8 characters splits words now. Original in indexer_xapian.py: for match in re.finditer(r'\b\w{2,25}\b', text.upper()): word = match.group(0) OK: for match in re.finditer(r'\b\w{2,25}\b', unicode(text, "utf-8","replace").upper(), re.UNICODE): word = match.group(0).encode("utf-8", "replace")
msg6451	Author: [hidden] (pefu)	Date: 2019-04-03 10:46
Today a coworker surprised me: Indeed he was unable to find a certain issue in our database because the search term he used contained a german umlaut. Is there any technical reason why the proposed patch has not been applied in the roundup code base during the past twelve years? Best regards, Peter Funk
msg6469	Author: [hidden] (rouilj)	Date: 2019-04-29 00:45
Hi Peter. I don't know a technical reason why the code wasn't changed. This patch came in when I wasn't involved with roundup. Full text indexers like xapian or whoosh I thought were meant to handle this better than native. From a process level, there are no tests associated with this patch. Support for this needs testing on all rdbms back ends at minimum. Also support for anydbm should be added, or documentation updates are needed to explain this change.
msg6729	Author: [hidden] (rouilj)	Date: 2019-10-11 01:27
Ezio: Can you take a look at this ticket and patch. You have more knowledge about unicode and conversions than I do. If you say it looks good and it passes existing tests (or if you hve a suggestion on how to add a test for it) I'll commit for 2.0.0 alpha. -- rouilj
msg6778	Author: [hidden] (ezio.melotti)	Date: 2019-10-26 19:21
msg2047 suggests to change the add_text() function in roundup/backends/indexer_xapian.py to decode the text before feeding it to re.finditer() and reencoding the words afterwards (implying text is initially bytes). add_text() has changed a bit now, but the call to re.finditer() is still there. At least on Python 3, the text it receives already seem to be unicode, so the suggested decoding/reencoding is no longer needed. If it was bytes, the re.finditer() will break because the regex is unicode (and incompatible with bytes). In Python 2 however, the args might still be bytes (depending on the args sent by the caller), and in that case it would need to be decoded, either in add_text() (probably by using b2s()), or before they get passed to add_text(). add_text() gets called in a number of places (e.g. roundup/backends/rdbms_common.py) and seems to have some tests (in test/test_indexer.py). Note that the identifier and words are encoded into bytes before sending them to xapian (this might not be necessary, at least on Python 3) I think in Python 3 the values passed to add_text() should be unicode-only and bytes should be rejected (even if this might be tricky with a shared codebase). On Python 2 it should support both for backward compatibility, and if the args are bytes they should be decoded before being passed to re.finditer(). If xapian accepts and works with unicode, it might be better to pass unicode in both Py 2 and 3, if not, the identifier and words should be encoded (as it already happens with s2b). Also consider using .casefold() instead of lower(). Tests should be improved by adding both unicode and bytes args (if supported), and non=ascii text.
msg6779	Author: [hidden] (ezio.melotti)	Date: 2019-10-26 19:32
msg2046 suggests to change the find() function in roundup/backends/indexer_rdbms.py to decode the words before passing them to the db (implying the wordlist initially contains bytes). As in my previous message, in Python 3 the word list should already be a list of unicode strings, but on Python 2 it might be either a list of bytes, or a list of unicode strings. Decoding the bytes with b2s in Python 2 should be ok, as long as the db can accept unicode. Tests should be added to ensure that find() works with both bytes and unicode in Python 2 and that it handles non-ascii words correctly on both versions.
msg6782	Author: [hidden] (rouilj)	Date: 2019-10-30 00:53
Taggnostr: I think on python3 everything should work, since it's all unicode already but on python2 it might be both, so things might not work when it's bytes rouilj: will do. So on python3, it should just work (assuming index can handle unicode) Taggnostr: on python3 it will also break if it's bytes fwiw all operations on text (splitting, slicing, upper/lowercase, etc) should be done on unicode only if the text is ascii-only, most operations happen to work on bytes too, but if it isn't things will break rouilj: you mention using b2s(text) on python2, but b2s is a no-op in python 2. Taggnostr: maybe it's s2u? rouilj: wouldn't I want to do an s2u pn python2 to make a unicode string then mangle it with case, slicing etc. rouilj: ok. that makes a little more sense. Taggnostr: not sure what the name is, but it should decode the input if it's bytes, and do nothing if it's unicode rouilj: def s2u(s, errors='strict'): rouilj: """Convert a string object to a Unicode string.""" Taggnostr: and this does nothing if s is already unicode, right? rouilj: actually it doesn't check: rouilj: if _py3: rouilj: return s rouilj: else: rouilj: return unicode(s, 'utf-8', errors) Taggnostr: so maybe it's not what we want rouilj: however in python2, doc in roundup/anypy/string.py says that unicode is rarely used except for passing to third parties that require unicode. Taggnostr: if you try to decode a unicode string in python2, python will implicitly try to encode it using ascii, and then decode it again Taggnostr: >>> u'Sprünge'.decode('utf-8') Taggnostr: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 3: ordinal not in range(128) Taggnostr: this is what happens in py2 Taggnostr: it is odd, that's why python3 fixed it Taggnostr: decoding only makes sense when you do bytes -> unicode, and encoding only when you do unicode -> bytes, but when the unicode type was added it was supposed to work where str worked, so it needed to have the same methods (even though they only work with ascii and create problems in all other cases) Taggnostr: so if you get .decode/.encode mixed up, python will try to convert the input in a type that can be decoded/encoded, using ascii as default encoding rouilj: gotcha. AFAICT woosh, xapian mysql, sqlite and postgres full text search all support unicode data. rouilj: looking at test/test_indexer I don't see any unicode strings. Taggnostr: in that case it would be better to use unicode throughout, but I'm afraid it might break things if they already contain bytes and we start passing unicode instead Taggnostr: it might be better to leave python2 alone and focus on python3, where everything should already work rouilj: "they already contain" what is they? The index? Taggnostr: yes, or the db Taggnostr: but I'm not too familiar with it, maybe it can be rebuilt easily with only unicode strings with no side effects rouilj: yeah same here. Databases are a black art. I run them but don;t program them. I usually leave he db stuff to Ralf. Taggnostr: having everything unicode is probably the best option, second best is having everything bytes, having a mix of the two might cause more problems Taggnostr: so if we start decoding stuff now and get unicode, either we can also get everything else as unicode too, or we can just reencode before adding to the index/db (which is what I think is already happening) rouilj: So if I add some test cases for unicode (e.g add_text(u'Sprunge") and then do a find on the same text I should get a hit in python3, but fail on python 2. Is that a correct asessment? Taggnostr: on python3 u'text' and 'text' should be equivalent (at least from py 3.3+) Taggnostr: the tests now have add_text('text') -- this is testing with bytes on python2 and with unicode on python3 rouilj: but tha would also be true on python2 right? u'text' can make the ascii transition without a failure. Taggnostr: that depends on how you save the file :) Taggnostr: if you write u'Sprünge' in the .py file, the ü will be represented as a different sequence of bytes depending on what encoding you are using to save the file with Taggnostr: you can specify the encoding with a comment at the top of the file Taggnostr: or, you an just use 'Spr\xfcnge' and keep the source ascii rouilj: #-- encoding: utf-8 -- Taggnostr: yep, if you add this and save the file as utf-8 (no bom needed), then you can write u'Sprünge' directly in the .py file Taggnostr: keep in mind that utf8 and iso-8859-1 are supersets of ascii Taggnostr: so if you keep the source ascii-only, it will always work rouilj: what happens if I add text Spr\xfcnge and search for sprunge, what do you expect will happen? Taggnostr: if you use non-ascii characters (like ü) then you have to tell python what encoding have you used to save the file, using #-- encoding: utf-8 -- (and of course they must match) rouilj: well I was going to keep it all in ascii using \xfc for the umlatted u Taggnostr: >>> u'Spr\u00FCnge' == u'Spr\xfcnge' == u'Sprünge' Taggnostr: True rouilj: but not sprunge (regular u not umlauted) Taggnostr: these are all different ways of spelling the same things, the first two ways are ascii-only so they work with ascii/utf8/iso-8859-1, the third is non-ascii so you have to tell python what encoding you are using in the file Taggnostr: >>> u'Sprünge' == u'Sprunge' Taggnostr: False Taggnostr: unless the indexer strips diacritics
msg6783	Author: [hidden] (rouilj)	Date: 2019-10-30 01:40
Commit rev5960:0db2621b6fee has a new test for unicode strings in the indexer. For python2, the test is marked xfail, but I have passing tests for whoosh and sqlite. indexer (native) and xapian both fail. Native crashes while trying to convert the unicode string to python2 string. Xapian just fails to find the word in the index. Commit is running through CI at: https://travis-ci.org/roundup-tracker/roundup/builds/604758991
msg6785	Author: [hidden] (rouilj)	Date: 2019-10-30 22:07
Final fix in rev5964:5bf7b5debb09. Works for all indexers in python2 and python 3. Closing.

History
Date	User	Action	Args
2019-10-30 22:07:06	rouilj	set	status: open -> fixed resolution: fixed messages: + msg6785
2019-10-30 01:40:57	rouilj	set	messages: + msg6783
2019-10-30 00:53:22	rouilj	set	messages: + msg6782
2019-10-26 19:32:06	ezio.melotti	set	messages: + msg6779
2019-10-26 19:21:25	ezio.melotti	set	messages: + msg6778
2019-10-11 01:27:40	rouilj	set	assignee: richard -> rouilj messages: + msg6729 nosy: + ezio.melotti
2019-04-29 00:45:08	rouilj	set	nosy: + rouilj messages: + msg6469
2019-04-03 10:46:07	pefu	set	nosy: + pefu messages: + msg6451
2016-06-26 19:29:22	rouilj	set	keywords: + patch type: behavior
2005-10-31 16:33:51	anonymous	create

Roundup Tracker - Issues

Issue 1344046