Issue 2551115: Use utf8mb4 as a default for MySQL instead of utf8

classification

Title:	Use utf8mb4 as a default for MySQL instead of utf8
Type:	rfe	Severity:	normal
Components:	Documentation, Database	Versions:	devel

process

Status:	fixed	Resolution:	fixed
Dependencies		Superseder:
Assigned To:	rouilj	Nosy List:	rouilj, schlatterbeck, whunger
Priority:		Keywords:

Created on 2021-03-10 14:51 by whunger, last changed 2024-07-13 22:41 by rouilj.

Messages
msg7090	Author: [hidden] (whunger)	Date: 2021-03-10 14:51
I just stumbled over the upgrading instructions to 2.x for MySQL users (https://www.roundup-tracker.org/docs/upgrading.html#python-2-mysql-users-must-read), to which i would like to propose an improvement. I would not advise to set the MySQL database encoding to utf, but to utf8mb4 instead. Utf8 in MySQL is limited to use only up to 3 bytes per character, and thus cannot store all characters that you would expect maybe. Especially Emojis will not work, which is a pity, as these have a high probability of arriving through the Email interface :-) https://stackoverflow.com/questions/30074492/what-is-the-difference-between-utf8mb4-and-utf8-charsets-in-mysql
msg7094	Author: [hidden] (rouilj)	Date: 2021-03-10 15:36
Hi Werner: In message <1615387874.91.0.944369845492.issue2551115@roundup.psfhosted.org>, Werner Hunger writes: >I just stumbled over the upgrading instructions to 2.x for MySQL >users >(https://www.roundup-tracker.org/docs/upgrading.html#python-2-mysql-users-must-read), Have you successfully upgraded? Those docs were written with minimal if any testing. Nobody on the core team runs mysql. Any feedback on those docs would be appreciated. >to which i would like to propose an improvement. > >I would not advise to set the MySQL database encoding to utf, but to >utf8mb4 instead. Utf8 in MySQL is limited to use only up to 3 bytes >per character, and thus cannot store all characters that you would >expect maybe. Especially Emojis will not work, which is a pity, as >these have a high probability of arriving through the Email interface This makes sense. Nobody on the primary team runs mysql so we weren't up on the nuances. I have updated the docs and changed the config variable to use utf8mb4. The diffs are below: diff -r 6a6b4651be1f doc/upgrading.txt --- a/doc/upgrading.txt Thu Mar 04 12:55:21 2021 +0100 +++ b/doc/upgrading.txt Wed Mar 10 10:30:03 2021 -0500 @@ -77,10 +77,10 @@ python3. See `Python 3 support`_. In an upgraded ``config.ini`` (see next section) the ``[rdbms]`` -section has a key ``mysql_charset`` set by default to ``utf-8``. - -It should be possible to change utf-8 to any mysql charset. So if you -know what charset is enabled (e.g. via a setting in ~roundup/.my.cnf, +section has a key ``mysql_charset`` set by default to ``utf8mb4``. + +It should be possible to change ``utf8mb4`` to any mysql charset. So +if you know what charset is enabled (e.g. via a setting in ~roundup/.my.cnf, or the default charset for the database) you can set it in ``config.ini`` and not need to covert the database. However the underlying issues with misconverted data and bad searches will still diff -r 6a6b4651be1f roundup/configuration.py --- a/roundup/configuration.py Thu Mar 04 12:55:21 2021 +0100 +++ b/roundup/configuration.py Wed Mar 10 10:30:03 2021 -0500 @@ -1048,7 +1048,7 @@ (NullableOption, 'read_default_group', 'roundup', "Name of the group to use in the MySQL defaults file (.my.cnf).\n" "Only used in MySQL connections."), - (Option, 'mysql_charset', 'utf8', + (Option, 'mysql_charset', 'utf8mb4', "Charset to use for mysql connection,\n" "use 'default' for the mysql default, no charset option\n" "is used when creating the connection in that case.\n" committed in rev bd84f43e1d13. Does this fix your request?
msg7121	Author: [hidden] (rouilj)	Date: 2021-03-18 01:15
Hi Werner, does my change address your issue? -- rouilj
msg7130	Author: [hidden] (whunger)	Date: 2021-03-22 19:38
Sorry for answering so late. I think your changes are correct as such, but i'm not sure about back_mysql in total, and i did not do any testing yet. db_create() for example still contains "CHARACTER SET utf8" for "CREATE DATABASE", and i'm not sure whether a proper collation definition should also be part of database creation/configuration. The problem is, our version of Roundup is a heavily patched 1.4.16 for which the MySQL-dependent code differs substantially from the upstream code, so i can't compare their behaviour without some effort. Our version of the backend has been changed to work around the n+1 problem for tree-like and other strongly linked data structures that we have in our tracker, and that is now up with data from 2004 on (about 1.5 Mio messages). The original code (in combination with our page templates) generates way too much database queries for one web request - MySQL is not able to handle this load with decent performance. Therefor we added support for bulk node prefetching with node id lists anticipated from preceeding query results, among other things. The reason i was looking into the current docs was to start estimating whether it would be feasible to migrate and merge our backend changes to Roundup 2.x - and thus gain Python 3 compatibility. If you'd like to, i can keep you posted once we decide to upgrade and the merge has been done. Best regards - Werner. -- PGP-Schlüssel mit "Betreff: Send public key" abrufbar.
msg7131	Author: [hidden] (rouilj)	Date: 2021-03-22 20:30
Hi Werner: In message <20210322193839.Horde.X35YQqE4kZiE_wnnKiMRRjO@www.rxd.de>, Werner Hunger writes: >I think your changes are correct as such, but i'm not sure about >back_mysql in total, and i did not do any testing yet. db_create() for >example still contains "CHARACTER SET utf8" for "CREATE DATABASE", and >i'm not sure whether a proper collation definition should also be part >of database creation/configuration. Understood. Would you recommend using the character set from the config (mysql_charset) in the CREATE DATABASE? >The problem is, our version of Roundup is a heavily patched 1.4.16 for >which the MySQL-dependent code differs substantially from the upstream >code, so i can't compare their behaviour without some effort. Our >version of the backend has been changed to work around the n+1 problem >for tree-like and other strongly linked data structures that we have >in our tracker, and that is now up with data from 2004 on (about 1.5 >Mio messages). The original code (in combination with our page >templates) generates way too much database queries for one web request >- MySQL is not able to handle this load with decent performance. >Therefor we added support for bulk node prefetching with node id lists >anticipated from preceeding query results, among other things. Hmm that sounds interesting. Might some of the ideas in: https://issues.roundup-tracker.org/issue2550514 help reduce the need for your changes? >The reason i was looking into the current docs was to start estimating >whether it would be feasible to migrate and merge our backend changes >to Roundup 2.x - and thus gain Python 3 compatibility. If you'd like >to, i can keep you posted once we decide to upgrade and the merge has >been done. I would very much like that. Also please consider joining and keeping roundup-devel at lists.sourceforge.net in the loop.
msg7133	Author: [hidden] (schlatterbeck)	Date: 2021-03-23 07:21
On Mon, Mar 22, 2021 at 07:38:41PM +0000, Werner Hunger wrote: > Mio messages). The original code (in combination with our page > templates) generates way too much database queries for one web request > - MySQL is not able to handle this load with decent performance. > Therefor we added support for bulk node prefetching with node id lists > anticipated from preceeding query results, among other things. There is a filter_iter instead of filter in back_anydbm (that also works for mysql, at least it's tested :-) For the typical loop for id in db.someclass.filter(...): n = db.someclass.getnode(id) This will not produce a database query for each getnode call when using filter_iter instead of filter because filter_iter fetches the whole row (not just the id) and pre-populates the cache in roundup. So the getnode call already fetches from the cache. This is not (yet?) used in the templates but it may speed up things for some workloads. > The reason i was looking into the current docs was to start estimating > whether it would be feasible to migrate and merge our backend changes > to Roundup 2.x - and thus gain Python 3 compatibility. If you'd like > to, i can keep you posted once we decide to upgrade and the merge has > been done. Yes, I'd also like to know what you decide. Ralf -- Dr. Ralf Schlatterbeck Tel: +43/2243/26465-16 Open Source Consulting www: www.runtux.com Reichergasse 131, A-3411 Weidling email: office@runtux.com
msg7718	Author: [hidden] (rouilj)	Date: 2023-02-20 00:06
To partly address this. The charset is selectable. By default as of 2022-07-13 version 2.2.0 collation is set to: utf8_general_ci by default for new db's. Werner did you complete your upgrade?
msg7776	Author: [hidden] (rouilj)	Date: 2023-05-28 13:53
Werner, I realize it's been a while but have you been able to upgrade to a newer Roundup? Was Ralf's suggestion to use filter_iter helpful? -- rouilj
msg7980	Author: [hidden] (rouilj)	Date: 2024-04-07 02:53
Hello Werner: I have made some changes to Roundup that I think addresses your issues. From the changelog note: === Fix issues with utf8 support in Roundup. By default using: utf8mb4 charset utf8mb4_unicode_ci collation (case insensitive) utf8mb4_0900_ci collation (case sensitive) which are settable from config.ini. Sadly I couldn't come up with a way to mange these from one parameter. Doing a compatibility lookup table would have increased the maintenance burden and have me chasing MySQL changes. So I opted for the easy path and have the admins (with more MySQL experience) make the choices. Conversion directions added to upgrading.txt. I don't have any good testing for this. I was able to generate utf8/utf8mb3 tables and load a little data and convert. However this is a poor substitute for a conversion on a working tracker 8-(. === You can see the changes at: https://sourceforge.net/p/roundup/code/ci/8b31893f5930cbea37031d49bd08a631d1b6b654/ If you can please review them and see if I missed anything. Your expertise in this area would be valuable. Thanks and have a great week.
msg8099	Author: [hidden] (rouilj)	Date: 2024-07-13 22:41
2.4.0 is released with utf8mb4 as default.

History
Date	User	Action	Args
2024-07-13 22:41:49	rouilj	set	status: pending -> fixed resolution: remind -> fixed messages: + msg8099
2024-04-07 02:53:04	rouilj	set	resolution: remind messages: + msg7980
2023-05-28 13:53:10	rouilj	set	messages: + msg7776
2023-02-20 00:06:01	rouilj	set	messages: + msg7718
2021-03-23 07:21:56	schlatterbeck	set	messages: + msg7133
2021-03-22 20:30:44	rouilj	set	messages: + msg7131
2021-03-22 19:38:41	whunger	set	messages: + msg7130
2021-03-18 01:15:11	rouilj	set	messages: + msg7121
2021-03-12 22:42:22	rouilj	set	status: open -> pending
2021-03-11 07:56:19	schlatterbeck	set	nosy: + schlatterbeck
2021-03-10 15:39:53	rouilj	set	status: new -> open assignee: rouilj
2021-03-10 15:36:59	rouilj	set	nosy: + rouilj messages: + msg7094
2021-03-10 14:51:14	whunger	create

Roundup Tracker - Issues

Issue 2551115