MySQL, 4-byte unicode and utf8mb4 charset

Describe the problem you’re having:

4-byte unicode characters are broken when inserted into ttrss_entries table. I have modded my MySQL database to use utf8mb4 charset in order to properly hold 4-byte unicode characters, however those characters are still munged when put into the database because of rssutils.php. I’m not sure what the default MySQL schema does with these tables, so I’m assuming I’m in unsupported territory. (See https://tt-rss.org/oldforum/viewtopic.php?t=2716 to see what I did.)

If possible include steps to reproduce the problem:

Update a feed that contains 4-byte unicode characters (such as emoji). Example: IRembemberYou comments on 🔥Giant rhino beetle! - (Dynastes hercules)🔥

tt-rss version (including git commit id):

tt-rss via git, up to commit 20d2195f13948024a4eacf3595055678a77855c2 (latest as of 2017/08/08)

Platform (i.e. Linux distro, PHP, PostgreSQL, etc) versions:

Windows 2008R2, PHP 5.6.31, MySQL 5.5.34

Please provide any additional information below:

I could create a pull request to fix this (comment out line 695 in the current classes/rssutils.php plus another recommended change to classes/db/mysqli.php) if desired, but that might break things for others who haven’t updated their schema as I have.

it is unfortunately a very invasive change. risking breaking stuff for thousands of people because of emoji of all things just doesn’t seem worth it.

if you don’t want to maintain your modifications i can suggest switching to postgresql which should have this working out of the box.

e: remember that there’s tons of people running tt-rss on shit tier webhosting with old and busted mysql versions, and i need to keep everything consistent and backwards compatible. again, it’s just not worth the trouble imo.

I really hate getting off trunk/origin/master etc. Would you be willing to have a change that modified the check on line 690 into something like:

if (DB_TYPE == "mysql" && (!defined('MYSQL_CHARSET') || MYSQL_CHARSET != 'utf8mb4') ) { }

This at least would prevent the screwed up 4-byte chars for the folks that knew what they were doing without breaking the installs for the unwashed masses. The other change I mentioned is using PHP’s recommended method for setting the charset, i.e. line 79 of mysqli.php:

mysqli_set_charset($this->link, MYSQL_CHARSET);

See PHP: mysqli::set_charset - Manual for the recommendation against using “SET NAMES CHARSET.”
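For anyone following along, here is a minimal sketch of what setting the connection charset through the mysqli API looks like outside of tt-rss. The function name connect_with_charset() and its parameters are made up for illustration and are not part of tt-rss; only the mysqli_* calls are real PHP functions.

```php
<?php
// Hypothetical helper (not tt-rss code): open a connection and set its
// charset through the mysqli API rather than by sending "SET NAMES ...".
function connect_with_charset($host, $user, $pass, $db, $charset) {
    $link = mysqli_connect($host, $user, $pass, $db);
    if (!$link) {
        return false; // connection failed
    }
    // mysqli_set_charset() also tells the client library which charset is
    // in use, so mysqli_real_escape_string() escapes correctly for it.
    if (!mysqli_set_charset($link, $charset)) {
        mysqli_close($link);
        return false; // e.g. the server does not know this charset
    }
    return $link;
}
```

Calling it as connect_with_charset('localhost', 'user', 'pass', 'ttrss', 'utf8mb4') with your own credentials should return a link whose mysqli_character_set_name() reports utf8mb4, assuming the server supports that charset.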

e: BTW, thanks for the advice, you are certainly right about moving to Postgres. That is on the list, as they say.

This was actually discussed at length in two other threads here and here. The second link is where I discussed it in detail, including patches and proposed solutions. At the end of the day there are a variety of potential solutions but like fox mentioned the potential for breaking installations that have been running for years is pretty high.

What happens now is what should happen when handling unsupported multibyte characters: the unsupported characters are replaced with the “substitute character”.
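To make the behaviour concrete, here is a rough sketch of that kind of substitution combined with the charset check proposed earlier in the thread. This is not the actual code from classes/rssutils.php: the function name replace_4byte_chars() and its $charset parameter are hypothetical, and I am assuming the substitute character is U+FFFD.

```php
<?php
// Hypothetical helper, not the actual tt-rss code. Replaces every code point
// that needs 4 bytes in UTF-8 with U+FFFD, unless the charset can store it.
function replace_4byte_chars($text, $charset) {
    if ($charset === 'utf8mb4') {
        return $text; // utf8mb4 can hold 4-byte sequences, leave them alone
    }
    // U+10000..U+10FFFF are exactly the code points encoded as 4 bytes.
    // "\xEF\xBF\xBD" is the UTF-8 encoding of U+FFFD (assumed substitute).
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $text);
}

// "\xF0\x9F\x94\xA5" is the UTF-8 encoding of the fire emoji (U+1F525).
$title = "\xF0\x9F\x94\xA5Giant rhino beetle!";
echo replace_4byte_chars($title, 'utf8'), "\n";    // emoji replaced
echo replace_4byte_chars($title, 'utf8mb4'), "\n"; // emoji kept
```

Characters that fit in 3 bytes or fewer (accented Latin, CJK, and so on) pass through untouched either way, so only emoji and other supplementary-plane characters are affected.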

FWIW in my development environment I changed just the ttrss_entries table to utf8mb4 and it worked for months without issue. The problem is that future schema changes could fail to be applied if people start deviating from what’s in the core.

using mysqli_set_charset() instead of set names might be a good idea, i’m not changing anything else because, like i said already, in my opinion it’s not worth the trouble.

e: set charset is in trunk

Roger that. Many thanks.