Roundup Tracker - Issues

Issue 2551115

classification
Use utf8mb4 as a default for MySQL instead of utf8
Type: rfe Severity: normal
Components: Documentation, Database Versions: devel
process
Status: pending remind
:
: rouilj : rouilj, schlatterbeck, whunger
Priority: :

Created on 2021-03-10 14:51 by whunger, last changed 2024-04-07 02:53 by rouilj.

Messages
msg7090 Author: [hidden] (whunger) Date: 2021-03-10 14:51
I just stumbled over the upgrading instructions to 2.x for MySQL users (https://www.roundup-tracker.org/docs/upgrading.html#python-2-mysql-users-must-read), to which i would like to propose an improvement.

I would not advise to set the MySQL database encoding to utf, but to utf8mb4 instead. Utf8 in MySQL is limited to use only up to 3 bytes per character, and thus cannot store all characters that you would expect maybe. Especially Emojis will not work, which is a pity, as these have a high probability of arriving through the Email interface :-)

https://stackoverflow.com/questions/30074492/what-is-the-difference-between-utf8mb4-and-utf8-charsets-in-mysql
msg7094 Author: [hidden] (rouilj) Date: 2021-03-10 15:36
Hi Werner:

In message <1615387874.91.0.944369845492.issue2551115@roundup.psfhosted.org>,
Werner Hunger writes:
>I just stumbled over the upgrading instructions to 2.x for MySQL
>users
>(https://www.roundup-tracker.org/docs/upgrading.html#python-2-mysql-users-must-read),

Have you successfully upgraded? Those docs were written with minimal if
any testing. Nobody on the core team runs mysql. Any feedback on those
docs would be appreciated.

>to which i would like to propose an improvement.
>
>I would not advise to set the MySQL database encoding to utf, but to
>utf8mb4 instead. Utf8 in MySQL is limited to use only up to 3 bytes
>per character, and thus cannot store all characters that you would
>expect maybe. Especially Emojis will not work, which is a pity, as
>these have a high probability of arriving through the Email interface

This makes sense. Nobody on the primary team runs mysql so we weren't
up on the nuances. I have updated the docs and changed the config
variable to use utf8mb4. The diffs are below:

diff -r 6a6b4651be1f doc/upgrading.txt
--- a/doc/upgrading.txt Thu Mar 04 12:55:21 2021 +0100
+++ b/doc/upgrading.txt Wed Mar 10 10:30:03 2021 -0500
@@ -77,10 +77,10 @@
 python3. See `Python 3 support`_.

 In an upgraded ``config.ini`` (see next section) the ``[rdbms]``
-section has a key ``mysql_charset`` set by default to ``utf-8``.
-
-It should be possible to change utf-8 to any mysql charset. So if you
-know what charset is enabled (e.g. via a setting in ~roundup/.my.cnf,
+section has a key ``mysql_charset`` set by default to ``utf8mb4``.
+
+It should be possible to change ``utf8mb4`` to any mysql charset. So
+if you know what charset is enabled (e.g. via a setting in ~roundup/.my.cnf,
 or the default charset for the database) you can set it in
 ``config.ini`` and not need to covert the database. However the
 underlying issues with misconverted data and bad searches will still
diff -r 6a6b4651be1f roundup/configuration.py
--- a/roundup/configuration.py  Thu Mar 04 12:55:21 2021 +0100
+++ b/roundup/configuration.py  Wed Mar 10 10:30:03 2021 -0500
@@ -1048,7 +1048,7 @@
         (NullableOption, 'read_default_group', 'roundup',
             "Name of the group to use in the MySQL defaults file (.my.cnf).\n"
             "Only used in MySQL connections."),
-        (Option, 'mysql_charset', 'utf8',
+        (Option, 'mysql_charset', 'utf8mb4',
             "Charset to use for mysql connection,\n"
             "use 'default' for the mysql default, no charset option\n"
             "is used when creating the connection in that case.\n"

committed in rev bd84f43e1d13.

Does this fix your request?
msg7121 Author: [hidden] (rouilj) Date: 2021-03-18 01:15
Hi Werner, does my change address your issue?

-- rouilj
msg7130 Author: [hidden] (whunger) Date: 2021-03-22 19:38
Sorry for answering so late.

I think your changes are correct as such, but i'm not sure about
back_mysql in total, and i did not do any testing yet. db_create() for
example still contains "CHARACTER SET utf8" for "CREATE DATABASE", and
i'm not sure whether a proper collation definition should also be part
of database creation/configuration.

The problem is, our version of Roundup is a heavily patched 1.4.16 for
which the MySQL-dependent code differs substantially from the upstream
code, so i can't compare their behaviour without some effort. Our
version of the backend has been changed to work around the n+1 problem
for tree-like and other strongly linked data structures that we have
in our tracker, and that is now up with data from 2004 on (about 1.5
Mio messages). The original code (in combination with our page
templates) generates way too much database queries for one web request
- MySQL is not able to handle this load with decent performance.
Therefor we added support for bulk node prefetching with node id lists
anticipated from preceeding query results, among other things.

The reason i was looking into the current docs was to start estimating
whether it would be feasible to migrate and merge our backend changes
to Roundup 2.x - and thus gain Python 3 compatibility. If you'd like
to, i can keep you posted once we decide to upgrade and the merge has
been done.

Best regards - Werner.
--
PGP-Schlüssel mit "Betreff: Send public key" abrufbar.
msg7131 Author: [hidden] (rouilj) Date: 2021-03-22 20:30
Hi Werner:

In message <20210322193839.Horde.X35YQqE4kZiE_wnnKiMRRjO@www.rxd.de>,
Werner Hunger writes:
>I think your changes are correct as such, but i'm not sure about
>back_mysql in total, and i did not do any testing yet. db_create() for
>example still contains "CHARACTER SET utf8" for "CREATE DATABASE", and
>i'm not sure whether a proper collation definition should also be part
>of database creation/configuration.

Understood. Would you recommend using the character set from the
config (mysql_charset) in the CREATE DATABASE?

>The problem is, our version of Roundup is a heavily patched 1.4.16 for
>which the MySQL-dependent code differs substantially from the upstream
>code, so i can't compare their behaviour without some effort. Our
>version of the backend has been changed to work around the n+1 problem
>for tree-like and other strongly linked data structures that we have
>in our tracker, and that is now up with data from 2004 on (about 1.5
>Mio messages). The original code (in combination with our page
>templates) generates way too much database queries for one web request
>- MySQL is not able to handle this load with decent performance.
>Therefor we added support for bulk node prefetching with node id lists
>anticipated from preceeding query results, among other things.

Hmm that sounds interesting. Might some of the ideas in:
https://issues.roundup-tracker.org/issue2550514 help reduce the need
for your changes?

>The reason i was looking into the current docs was to start estimating
>whether it would be feasible to migrate and merge our backend changes
>to Roundup 2.x - and thus gain Python 3 compatibility. If you'd like
>to, i can keep you posted once we decide to upgrade and the merge has
>been done.

I would very much like that. Also please consider joining and keeping
roundup-devel at lists.sourceforge.net in the loop.
msg7133 Author: [hidden] (schlatterbeck) Date: 2021-03-23 07:21
On Mon, Mar 22, 2021 at 07:38:41PM +0000, Werner Hunger wrote:
> Mio messages). The original code (in combination with our page
> templates) generates way too much database queries for one web request
> - MySQL is not able to handle this load with decent performance.
> Therefor we added support for bulk node prefetching with node id lists
> anticipated from preceeding query results, among other things.

There is a filter_iter instead of filter in back_anydbm (that also works
for mysql, at least it's tested :-)

For the typical loop

for id in db.someclass.filter(...):
    n = db.someclass.getnode(id)

This will not produce a database query for each getnode call when using
filter_iter instead of filter because filter_iter fetches the whole row
(not just the id) and pre-populates the cache in roundup. So the getnode
call already fetches from the cache.

This is not (yet?) used in the templates but it may speed up things for
some workloads.

> The reason i was looking into the current docs was to start estimating
> whether it would be feasible to migrate and merge our backend changes
> to Roundup 2.x - and thus gain Python 3 compatibility. If you'd like
> to, i can keep you posted once we decide to upgrade and the merge has
> been done.

Yes, I'd also like to know what you decide.

Ralf
-- 
Dr. Ralf Schlatterbeck                  Tel:   +43/2243/26465-16
Open Source Consulting                  www:   www.runtux.com
Reichergasse 131, A-3411 Weidling       email: office@runtux.com
msg7718 Author: [hidden] (rouilj) Date: 2023-02-20 00:06
To partly address this.

The charset is selectable. By default as of 2022-07-13 version 2.2.0
collation is set to: utf8_general_ci by default for new db's.

Werner did you complete your upgrade?
msg7776 Author: [hidden] (rouilj) Date: 2023-05-28 13:53
Werner, I realize it's been a while but have you been able to upgrade to a newer Roundup?
Was Ralf's suggestion to use filter_iter helpful?

-- rouilj
msg7980 Author: [hidden] (rouilj) Date: 2024-04-07 02:53
Hello Werner:

I have made some changes to Roundup that I think addresses your issues.

From the changelog note:
===
Fix issues with utf8 support in Roundup. By default using:

 utf8mb4 charset
 utf8mb4_unicode_ci collation (case insensitive)
 utf8mb4_0900_ci collation (case sensitive)

which are settable from config.ini. Sadly I couldn't come up with a way
to mange these from one parameter. Doing a compatibility lookup table
would have increased the maintenance burden and have me chasing MySQL
changes. So I opted for the easy path and have the admins (with more
MySQL experience) make the choices.

Conversion directions added to upgrading.txt. I don't have any good
testing for this. I was able to generate utf8/utf8mb3 tables and load
a little data and convert. However this is a poor substitute for a
conversion on a working tracker 8-(.
===

You can see the changes at:

https://sourceforge.net/p/roundup/code/ci/8b31893f5930cbea37031d49bd08a631d1b6b654/

If you can please review them and see if I missed anything. Your expertise
in this area would be valuable.

Thanks and have a great week.
History
Date User Action Args
2024-04-07 02:53:04rouiljsetresolution: remind
messages: + msg7980
2023-05-28 13:53:10rouiljsetmessages: + msg7776
2023-02-20 00:06:01rouiljsetmessages: + msg7718
2021-03-23 07:21:56schlatterbecksetmessages: + msg7133
2021-03-22 20:30:44rouiljsetmessages: + msg7131
2021-03-22 19:38:41whungersetmessages: + msg7130
2021-03-18 01:15:11rouiljsetmessages: + msg7121
2021-03-12 22:42:22rouiljsetstatus: open -> pending
2021-03-11 07:56:19schlatterbecksetnosy: + schlatterbeck
2021-03-10 15:39:53rouiljsetstatus: new -> open
assignee: rouilj
2021-03-10 15:36:59rouiljsetnosy: + rouilj
messages: + msg7094
2021-03-10 14:51:14whungercreate