Message6782
Taggnostr: I think on python3 everything should work, since it's all
unicode already, but on python2 it might be either, so things might
not work when it's bytes
rouilj: will do. So on python3, it should just work (assuming index
can handle unicode)
Taggnostr: on python3 it will also break if it's bytes. fwiw, all
operations on text (splitting, slicing, upper/lowercase, etc)
should be done on unicode only. If the text is ascii-only, most
operations happen to work on bytes too, but if it isn't, things
will break
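[Note: a minimal sketch of the point above, assuming Python 3 and UTF-8 encoded bytes. The variable names and the is_valid_utf8 helper are illustrative only and do not come from the Roundup code base.]

```python
# The same word as text (str) and as UTF-8 encoded bytes.
text = 'Spr\xfcnge'            # str: 7 characters, \xfc is u-umlaut
data = text.encode('utf-8')    # bytes: 8 bytes, the umlaut takes two

# Case mapping: str handles the umlaut, bytes only touches ASCII letters.
upper_text = text.upper()      # the umlaut becomes U+00DC
upper_data = data.upper()      # the two umlaut bytes are left as-is

# Slicing: str cuts between characters, bytes can cut inside one.
head_text = text[:4]           # 'Spr' plus the umlaut
head_data = data[:4]           # b'Spr\xc3', half of a two-byte sequence


def is_valid_utf8(b):
    """Return True if the bytes object decodes cleanly as UTF-8."""
    try:
        b.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
```

With ASCII-only input both columns would give matching results, which is why bytes often appear to work until a non-ASCII character shows up.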
rouilj: you mention using b2s(text) on python2, but b2s is a no-op in
python 2.
Taggnostr: maybe it's s2u?
rouilj: wouldn't I want to do an s2u on python2 to make a unicode
string, then mangle it with case, slicing etc.?
rouilj: ok. that makes a little more sense.
Taggnostr: not sure what the name is, but it should decode the input
if it's bytes, and do nothing if it's unicode
rouilj: def s2u(s, errors='strict'):
rouilj: """Convert a string object to a Unicode string."""
Taggnostr: and this does nothing if s is already unicode, right?
rouilj: actually it doesn't check:
rouilj: if _py3:
rouilj: return s
rouilj: else:
rouilj: return unicode(s, 'utf-8', errors)
Taggnostr: so maybe it's not what we want
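[Note: a minimal sketch of the helper described above, one that decodes bytes and leaves unicode alone. The name to_unicode is hypothetical and is not part of roundup/anypy; UTF-8 is assumed as the default encoding because the s2u quoted above uses it.]

```python
def to_unicode(s, encoding='utf-8', errors='strict'):
    """Hypothetical helper: decode bytes, return text unchanged.

    On Python 2, bytes is an alias for str, so byte strings are
    decoded and unicode objects pass through untouched.
    """
    if isinstance(s, bytes):
        return s.decode(encoding, errors)
    return s
```

Because the type check happens before decoding, calling it twice on the same value is harmless, which avoids the implicit ascii round trip that a blind decode would trigger on python2.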
rouilj: however in python2, doc in roundup/anypy/string.py says that
unicode is rarely used except for passing to third parties that
require unicode.
Taggnostr: if you try to decode a unicode string in python2, python
will implicitly try to encode it using ascii, and then decode it
again
Taggnostr: >>> u'Sprünge'.decode('utf-8')
Taggnostr: UnicodeEncodeError: 'ascii' codec can't encode character
u'\xfc' in position 3: ordinal not in range(128)
Taggnostr: this is what happens in py2
Taggnostr: it is odd, that's why python3 fixed it
Taggnostr: decoding only makes sense when you do bytes -> unicode, and
encoding only when you do unicode -> bytes, but when the unicode
type was added it was supposed to work where str worked, so it
needed to have the same methods (even though they only work with
ascii and create problems in all other cases)
Taggnostr: so if you get .decode/.encode mixed up, python2 will try
to convert the input into a type that can be decoded/encoded,
using ascii as the default encoding
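[Note: a short sketch of the two valid directions, assuming Python 3, where the mixed-up methods no longer exist. Variable names are illustrative.]

```python
text = 'Spr\xfcnge'                  # unicode text (str on Python 3)

data = text.encode('utf-8')          # encode: unicode -> bytes
round_trip = data.decode('utf-8')    # decode: bytes -> unicode

# Python 3 removed the methods that allowed the mix-up on Python 2:
# str has no .decode() and bytes has no .encode().
str_has_decode = hasattr(text, 'decode')
bytes_has_encode = hasattr(data, 'encode')
```

On Python 3 a mixed-up call fails immediately with AttributeError instead of silently going through ascii.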
rouilj: gotcha. AFAICT whoosh, xapian, mysql, sqlite and postgres
full text search all support unicode data.
rouilj: looking at test/test_indexer I don't see any unicode strings.
Taggnostr: in that case it would be better to use unicode throughout,
but I'm afraid it might break things if they already contain bytes
and we start passing unicode instead
Taggnostr: it might be better to leave python2 alone and focus on
python3, where everything should already work
rouilj: "they already contain" what is they? The index?
Taggnostr: yes, or the db
Taggnostr: but I'm not too familiar with it, maybe it can be rebuilt
easily with only unicode strings with no side effects
rouilj: yeah same here. Databases are a black art. I run them but
don't program them. I usually leave the db stuff to Ralf.
Taggnostr: having everything unicode is probably the best option,
second best is having everything bytes, having a mix of the two
might cause more problems
Taggnostr: so if we start decoding stuff now and get unicode, either
we can also get everything else as unicode too, or we can just
reencode before adding to the index/db (which is what I think is
already happening)
rouilj: So if I add some test cases for unicode (e.g.
add_text(u'Sprunge')) and then do a find on the same text I should
get a hit in python3, but fail on python 2. Is that a correct
assessment?
Taggnostr: on python3 u'text' and 'text' should be equivalent (at
least from py 3.3+)
Taggnostr: the tests now have add_text('text') -- this is testing with
bytes on python2 and with unicode on python3
rouilj: but that would also be true on python2 right? u'text' can make
the ascii transition without a failure.
Taggnostr: that depends on how you save the file :)
Taggnostr: if you write u'Sprünge' in the .py file, the ü will be
represented as a different sequence of bytes depending on what
encoding you are using to save the file with
Taggnostr: you can specify the encoding with a comment at the top of
the file
Taggnostr: or, you can just use 'Spr\xfcnge' and keep the source ascii
rouilj: #-*- encoding: utf-8 -*-
Taggnostr: yep, if you add this and save the file as utf-8 (no bom
needed), then you can write u'Sprünge' directly in the .py file
Taggnostr: keep in mind that utf8 and iso-8859-1 are supersets of ascii
Taggnostr: so if you keep the source ascii-only, it will always work
rouilj: what happens if I add text Spr\xfcnge and search for sprunge,
what do you expect will happen?
Taggnostr: if you use non-ascii characters (like ü) then you have to
tell python what encoding you used to save the file, using
#-*- encoding: utf-8 -*- (and of course they must match)
rouilj: well I was going to keep it all in ascii using \xfc for the
umlauted u
Taggnostr: >>> u'Spr\u00FCnge' == u'Spr\xfcnge' == u'Sprünge'
Taggnostr: True
rouilj: but not sprunge (regular u not umlauted)
Taggnostr: these are all different ways of spelling the same things,
the first two ways are ascii-only so they work with
ascii/utf8/iso-8859-1, the third is non-ascii so you have to tell
python what encoding you are using in the file
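[Note: a small sketch of why the file encoding matters, assuming Python 3. The escaped spellings are identical text, but the same text becomes different bytes under utf-8 and iso-8859-1. Variable names are illustrative.]

```python
a = 'Spr\u00FCnge'
b = 'Spr\xfcnge'
same_text = (a == b)                      # both spell the same 7 characters

utf8_bytes = a.encode('utf-8')            # umlaut stored as two bytes
latin1_bytes = a.encode('iso-8859-1')     # umlaut stored as one byte

# Reading UTF-8 bytes as iso-8859-1 does not fail, it yields wrong text.
misread = utf8_bytes.decode('iso-8859-1')

# Reading iso-8859-1 bytes as UTF-8 fails outright.
try:
    latin1_bytes.decode('utf-8')
    latin1_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_as_utf8_ok = False
```

This is the mismatch a wrong or missing coding comment produces: the bytes on disk are interpreted with an encoding other than the one used to save them.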
Taggnostr: >>> u'Sprünge' == u'Sprunge'
Taggnostr: False
Taggnostr: unless the indexer strips diacritics
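[Note: a sketch of what stripping diacritics could look like, using the standard unicodedata module on Python 3. The function name strip_diacritics is hypothetical; nothing in this conversation says whether any Roundup indexer backend actually does this.]

```python
import unicodedata


def strip_diacritics(text):
    """Hypothetical helper: drop combining marks after decomposition."""
    # NFKD splits u-umlaut into 'u' followed by a combining diaeresis.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep only characters that are not combining marks.
    return ''.join(ch for ch in decomposed
                   if not unicodedata.combining(ch))
```

If an indexer applied something like this to both the stored text and the query, a search for Sprunge would match Sprünge; without it the two strings compare unequal.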
Date | User | Action | Args
2019-10-30 00:53:22 | rouilj | set | messageid: <1572396802.56.0.822867569749.issue1344046@roundup.psfhosted.org>
2019-10-30 00:53:22 | rouilj | set | recipients: + rouilj, richard, pefu, ezio.melotti
2019-10-30 00:53:22 | rouilj | link | issue1344046 messages
2019-10-30 00:53:22 | rouilj | create |