Roundup Tracker - Issues

Message6782

Author rouilj
Recipients ezio.melotti, pefu, richard, rouilj
Date 2019-10-30.00:53:22
Message-id <1572396802.56.0.822867569749.issue1344046@roundup.psfhosted.org>
In-reply-to
Taggnostr: I think on python3 everything should work, since it's all
    unicode already but on python2 it might be both, so things might
    not work when it's bytes
rouilj: will do. So on python3, it should just work (assuming index
    can handle unicode)
Taggnostr: on python3 it will also break if it's bytes. fwiw, all
    operations on text (splitting, slicing, upper/lowercase, etc)
    should be done on unicode only. if the text is ascii-only, most
    operations happen to work on bytes too, but if it isn't, things
    will break
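To illustrate the point about text operations on bytes, here is a minimal Python 3 sketch. The sample word and the helper name is_valid_utf8 are chosen for illustration only and do not come from Roundup.

```python
# Python 3 sketch: text operations on bytes only behave for ASCII-only input.
text = 'Spr\xfcnge'            # unicode string, same as 'Sprünge'
data = text.encode('utf-8')    # UTF-8 bytes; the ü occupies two bytes

# Case conversion: str handles ü, bytes.upper() only touches ASCII letters.
upper_text = text.upper()
upper_data = data.upper()

# Slicing: str slices by character, bytes slices by byte and can
# cut a multi-byte character in half.
first4_text = text[:4]
first4_data = data[:4]


def is_valid_utf8(raw):
    """Return True if raw decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
```

With ASCII-only input both columns would agree; with the ü the bytes version leaves the character lowercase and the byte slice is no longer valid UTF-8.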
rouilj: you mention using b2s(text) on python2, but b2s is a no-op in
    python 2.
Taggnostr: maybe it's s2u?
rouilj: wouldn't I want to do an s2u on python2 to make a unicode
    string then mangle it with case, slicing etc.?
rouilj: ok. that makes a little more sense.
Taggnostr: not sure what the name is, but it should decode the input
    if it's bytes, and do nothing if it's unicode
rouilj: def s2u(s, errors='strict'):
rouilj:     """Convert a string object to a Unicode string."""
Taggnostr: and this does nothing if s is already unicode, right?
rouilj: actually it doesn't check: 
rouilj:     if _py3:
rouilj:         return s
rouilj:     else:
rouilj:         return unicode(s, 'utf-8', errors)
Taggnostr: so maybe it's not what we want
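The behaviour Taggnostr describes (decode if the input is bytes, do nothing if it is already unicode) could be sketched as below. The name to_unicode is hypothetical; it is not the s2u quoted above and is not claimed to exist in roundup/anypy.

```python
def to_unicode(s, encoding='utf-8', errors='strict'):
    """Decode s if it is bytes; return it unchanged if it is already text.

    Hypothetical helper for illustration. On Python 2, bytes is an alias
    for str, so the same check separates str from unicode there as well.
    """
    if isinstance(s, bytes):
        return s.decode(encoding, errors)
    return s
```

Because the type is checked first, calling it twice on the same value is harmless, which avoids the implicit ascii round trip discussed next.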
rouilj: however in python2, doc in roundup/anypy/string.py says that
    unicode is rarely used except for passing to third parties that
    require unicode.
Taggnostr: if you try to decode a unicode string in python2, python
    will implicitly try to encode it using ascii, and then decode it
    again
Taggnostr: >>> u'Sprünge'.decode('utf-8')
Taggnostr: UnicodeEncodeError: 'ascii' codec can't encode character
    u'\xfc' in position 3: ordinal not in range(128)
Taggnostr: this is what happens in py2
Taggnostr: it is odd, that's why python3 fixed it
Taggnostr: decoding only makes sense when you do bytes -> unicode, and
    encoding only when you do unicode -> bytes, but when the unicode
    type was added it was supposed to work where str worked, so it
    needed to have the same methods (even though they only work with
    ascii and create problems in all other cases)
Taggnostr: so if you get .decode/.encode mixed up, python will try to
    convert the input into a type that can be decoded/encoded, using
    ascii as default encoding
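Python 3 removed str.decode, so the Python 2 failure cannot be reproduced there directly. The sketch below only emulates the implicit step under that assumption; py2_style_decode is a made-up name for illustration.

```python
def py2_style_decode(text, encoding='utf-8'):
    """Mimic what Python 2 did for unicode.decode(): encode with ascii
    first, then decode the resulting bytes (illustrative emulation)."""
    raw = text.encode('ascii')      # the implicit step Python 2 inserted
    return raw.decode(encoding)


# ASCII-only text survives the round trip unchanged.
ascii_result = py2_style_decode(u'Sprunge')

# Non-ASCII text fails in the hidden encode step, not in the decode.
try:
    py2_style_decode(u'Spr\xfcnge')
    error = None
except UnicodeEncodeError as exc:
    error = exc
```

The captured exception names the ascii codec and points at index 3, the ü, which matches the traceback quoted above.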
rouilj: gotcha.  AFAICT whoosh, xapian, mysql, sqlite and postgres full
    text search all support unicode data.
rouilj: looking at test/test_indexer I don't see any unicode strings.
Taggnostr: in that case it would be better to use unicode throughout,
    but I'm afraid it might break things if they already contain bytes
    and we start passing unicode instead
Taggnostr: it might be better to leave python2 alone and focus on
    python3, where everything should already work
rouilj: "they already contain" what is they? The index?
Taggnostr: yes, or the db
Taggnostr: but I'm not too familiar with it, maybe it can be rebuilt
    easily with only unicode strings with no side effects
rouilj: yeah same here. Databases are a black art. I run them but
    don't program them. I usually leave the db stuff to Ralf.
Taggnostr: having everything unicode is probably the best option,
    second best is having everything bytes, having a mix of the two
    might cause more problems
Taggnostr: so if we start decoding stuff now and get unicode, either
    we can also get everything else as unicode too, or we can just
    reencode before adding to the index/db (which is what I think is
    already happening)
rouilj: So if I add some test cases for unicode (e.g.
    add_text(u'Sprunge')) and then do a find on the same text I should
    get a hit in python3, but fail on python 2. Is that a correct
    assessment?
Taggnostr: on python3 u'text' and 'text' should be equivalent (at
    least from py 3.3+)
Taggnostr: the tests now have add_text('text') -- this is testing with
    bytes on python2 and with unicode on python3
rouilj: but that would also be true on python2 right? u'text' can make
    the ascii transition without a failure.
Taggnostr: that depends on how you save the file :)
Taggnostr: if you write u'Sprünge' in the .py file, the ü will be
    represented as a different sequence of bytes depending on what
    encoding you are using to save the file with
Taggnostr: you can specify the encoding with a comment at the top of
    the file
Taggnostr: or, you can just use 'Spr\xfcnge' and keep the source ascii
rouilj: #-*- encoding: utf-8 -*-
Taggnostr: yep, if you add this and save the file as utf-8 (no bom
    needed), then you can write u'Sprünge' directly in the .py file
Taggnostr: keep in mind that utf8 and iso-8859-1 are supersets of ascii
Taggnostr: so if you keep the source ascii-only, it will always work
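A small sketch of why the declared encoding matters for a literal ü but not for an ASCII-only escape. The variable names are illustrative only.

```python
word = u'Spr\xfcnge'   # ASCII-only spelling of 'Sprünge'

# The same character becomes different bytes depending on the encoding
# used to save the file.
utf8_bytes = word.encode('utf-8')          # b'Spr\xc3\xbcnge'
latin1_bytes = word.encode('iso-8859-1')   # b'Spr\xfcnge'

# Reading UTF-8 bytes as iso-8859-1 silently yields the wrong text.
mojibake = utf8_bytes.decode('iso-8859-1')

# The escaped literal, as it would be typed in a .py file, is pure ASCII,
# so it is byte-for-byte identical in all three encodings.
ascii_source = u"u'Spr\\xfcnge'"
same_in_all = (ascii_source.encode('ascii')
               == ascii_source.encode('utf-8')
               == ascii_source.encode('iso-8859-1'))
```

This is why the coding comment and the encoding actually used to save the file have to match once a non-ASCII character appears in the source.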
rouilj: what happens if I add text Spr\xfcnge and search for sprunge,
    what do you expect will happen?
Taggnostr: if you use non-ascii characters (like ü) then you have to
    tell python what encoding you have used to save the file, using
    #-*- encoding: utf-8 -*- (and of course they must match)
rouilj: well I was going to keep it all in ascii using \xfc for the
    umlauted u
Taggnostr: >>> u'Spr\u00FCnge' == u'Spr\xfcnge' == u'Sprünge'
Taggnostr: True
rouilj: but not sprunge (regular u not umlauted)
Taggnostr: these are all different ways of spelling the same things,
    the first two ways are ascii-only so they work with
    ascii/utf8/iso-8859-1, the third is non-ascii so you have to tell
    python what encoding you are using in the file
Taggnostr: >>> u'Sprünge' == u'Sprunge'
Taggnostr: False
Taggnostr: unless the indexer strips diacritics
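What "strips diacritics" could mean in practice is sketched below with the standard unicodedata module. Whether any Roundup indexer backend does this is not established in this conversation; strip_diacritics is a hypothetical helper.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after NFD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Without normalization the two spellings are different strings.
matches_raw = u'Spr\xfcnge' == u'Sprunge'

# After stripping (and lowercasing), a search for 'sprunge' would match.
matches_stripped = strip_diacritics(u'Spr\xfcnge').lower() == u'sprunge'
```

A test that adds 'Spr\xfcnge' and searches for 'sprunge' would therefore only be expected to hit if the backend performs a folding step like this one.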
History
Date User Action Args
2019-10-30 00:53:22  rouilj  set  messageid: <1572396802.56.0.822867569749.issue1344046@roundup.psfhosted.org>
2019-10-30 00:53:22  rouilj  set  recipients: + rouilj, richard, pefu, ezio.melotti
2019-10-30 00:53:22  rouilj  link  issue1344046 messages
2019-10-30 00:53:22  rouilj  create