MySQL 中 utf8 和 utf8mb4 编码

2020.02.25

最近在开发过程中碰到数据库写入失败的情况。

“Incorrect string value: ‘\xF0\x9F\x91\x91M\xE3…” for column ‘nick_name’ at row 1”

排查发现是由于该用户的昵称含有 emoji 表情包导致的,含有表情包和普通字符串有什么区别呢?

原来 MySQL 的默认 utf8 (也称 utf8mb3) 只支持三个字节的字符。而 emoji 表情包需要四个字节表示。所以存储就会报错。不过 MySQL 在 5.5.3 之后增加了这个 utf8mb4 的编码,用来兼容四字节的 Unicode。

为了解决这个问题,需要修改数据库的字符编码。

> show variables like '%char%';
+---------------------------+----------------------------+
| Variable_name             | Value                      |
+---------------------------+----------------------------+
| character_set_client      | utf8                       |
| character_set_connection  | utf8                       |
| character_set_database    | utf8mb4                    |
| character_set_filesystem  | binary                     |
| character_set_results     | utf8                       |
| character_set_server      | utf8mb4                    |
| character_set_system      | utf8                       |
| character_sets_dir        | /opt/mysql/share/charsets/ |
| ft_query_extra_word_chars | OFF                        |
+---------------------------+----------------------------+

> show create database xxx;
+-------------------------+----------------------------------------------------------------------------------------------------------------+
| Database                | Create Database                                                                                                |
+-------------------------+----------------------------------------------------------------------------------------------------------------+
| xxx | CREATE DATABASE `xxx` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */ |
+-------------------------+----------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

> show create table user;
| user  | CREATE TABLE `user` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(80) COLLATE utf8mb4_unicode_ci NOT NULL COMMENT '用户ID',
  `nick_name` varchar(120) COLLATE utf8mb4_unicode_ci NOT NULL COMMENT '昵称',
  `avatar` varchar(300) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '头像数据',
  `update_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '记录更新时间',
  PRIMARY KEY (`id`),
  UNIQUE KEY `user_id` (`user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=157 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

通过查看数据库、表的字符编码发现已经是 utf8mb4,那么问题可能出在客户端。查看 Flask 项目的配置文件,发现客户端配置的编码为 utf8。将编码格式修改为 utf8mb4 即可解决问题。

class devConfig(baseConfig):
    SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://root:xxxx@localhost/ncov?charset=utf8mb4'

UTF-8 is a variable-length encoding. In the case of UTF-8, this means that storing one code point requires one to four bytes. However, MySQL’s encoding called “utf8” (alias of “utf8mb3”) only stores a maximum of three bytes per code point.

So the character set “utf8”/“utf8mb3” cannot store all Unicode code points: it only supports the range 0x000 to 0xFFFF, which is called the “Basic Multilingual Plane”. See also Comparison of Unicode encodings.

参考:What is the difference between utf8mb4 and utf8 charsets in MySQL?

三个字节的 uft8 支持编码的 Unicode 字符范围是 0x000-0xffff,称之为 Unicode 的基本多文种平面(BMP),不在基本多文种平面的 Unicode 字符都无法使用 MySQL 中的 utf8 字符集存储。

通过以下命令可直接将数据库字符集变更为 utf8mb4 编码。

# 修改数据库:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# 修改表:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# 修改表字段:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;