Roundup Tracker - Issues

Message6778

Author ezio.melotti
Recipients ezio.melotti, pefu, richard, rouilj
Date 2019-10-26.19:21:25
Message-id <1572117685.62.0.572766142679.issue1344046@roundup.psfhosted.org>
In-reply-to
msg2047 suggests changing the add_text() function in roundup/backends/indexer_xapian.py to decode the text before feeding it to re.finditer() and to re-encode the words afterwards (implying that the text is initially bytes).

add_text() has changed a bit since then, but the call to re.finditer() is still there.  At least on Python 3, the text it receives already seems to be unicode, so the suggested decoding/re-encoding is no longer needed.  If it were bytes, re.finditer() would break because the regex is a unicode pattern (and incompatible with bytes).  In Python 2, however, the args might still be bytes (depending on what the caller sends), and in that case they would need to be decoded, either in add_text() (probably by using b2s()) or before they get passed to add_text().  add_text() gets called in a number of places (e.g. roundup/backends/rdbms_common.py) and seems to have some tests (in test/test_indexer.py).
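To illustrate the incompatibility described above, here is a minimal sketch for Python 3. WORD_RE and find_words() are hypothetical stand-ins, not the actual pattern or code in indexer_xapian.py; the point is only that a str pattern accepts str input and raises TypeError on bytes.

```python
import re

# Hypothetical stand-in for the indexer's word pattern; the real regex
# used by add_text() may differ.
WORD_RE = re.compile(r"\b\w{2,25}\b")


def find_words(text):
    """Return the words matched by the (str) pattern in text."""
    return [m.group(0) for m in WORD_RE.finditer(text)]


# str input works, including non-ASCII characters.
print(find_words("héllo wörld"))

# bytes input is rejected on Python 3, because a str pattern cannot be
# applied to a bytes-like object.
try:
    find_words("héllo wörld".encode("utf-8"))
except TypeError as exc:
    print("bytes rejected:", exc)
```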

Note that the identifier and words are encoded into bytes before being sent to xapian (this might not be necessary, at least on Python 3).

I think in Python 3 the values passed to add_text() should be unicode-only and bytes should be rejected (even if this might be tricky with a shared codebase).  On Python 2 it should support both for backward compatibility, and if the args are bytes they should be decoded before being passed to re.finditer().  If xapian accepts and works with unicode, it might be better to pass unicode in both Py 2 and 3; if not, the identifier and words should be encoded (as already happens with s2b()).  Also consider using .casefold() instead of .lower().  Tests should be improved by adding both unicode and bytes args (if supported), and non-ASCII text.
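A rough sketch of the normalization proposed above, assuming Python 3 (str.casefold() does not exist on Python 2) and UTF-8 input. to_text(), index_terms() and WORD_RE are hypothetical names used only for illustration; they are not Roundup's API, and whether the terms need encoding before reaching xapian is left as a flag because that question is still open.

```python
import re

# Hypothetical word pattern; the real one in indexer_xapian.py may differ.
WORD_RE = re.compile(r"\b\w{2,25}\b")


def to_text(value, encoding="utf-8"):
    """Decode bytes to str; return str unchanged (the lenient variant)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def index_terms(value, encode=True):
    """Split value into case-folded words, optionally encoded as UTF-8."""
    text = to_text(value).casefold()
    words = [m.group(0) for m in WORD_RE.finditer(text)]
    if encode:
        return [w.encode("utf-8") for w in words]
    return words


# casefold() maps "ß" to "ss", so both spellings yield the same term;
# lower() would leave "straße" and "strasse" distinct.
print(index_terms("Straße", encode=False))
print(index_terms("STRASSE", encode=False))

# str and bytes input produce the same terms.
print(index_terms("héllo wörld") == index_terms("héllo wörld".encode("utf-8")))
```

The same inputs (str, bytes, and non-ASCII text) are the cases the improved tests would need to cover.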

History
Date                 User          Action  Args
2019-10-26 19:21:25  ezio.melotti  set     messageid: <1572117685.62.0.572766142679.issue1344046@roundup.psfhosted.org>
2019-10-26 19:21:25  ezio.melotti  set     recipients: + ezio.melotti, richard, rouilj, pefu
2019-10-26 19:21:25  ezio.melotti  link    issue1344046 messages
2019-10-26 19:21:25  ezio.melotti  create