- Docs Home
- About TiDB
- Quick Start
- Develop
- Overview
- Quick Start
- Build a TiDB Cluster in TiDB Cloud (Developer Tier)
- CRUD SQL in TiDB
- Build a Simple CRUD App with TiDB
- Example Applications
- Connect to TiDB
- Design Database Schema
- Write Data
- Read Data
- Transaction
- Optimize
- Troubleshoot
- Reference
- Cloud Native Development Environment
- Third-party Support
- Deploy
- Software and Hardware Requirements
- Environment Configuration Checklist
- Plan Cluster Topology
- Install and Start
- Verify Cluster Status
- Test Cluster Performance
- Migrate
- Overview
- Migration Tools
- Migration Scenarios
- Migrate from Aurora
- Migrate MySQL of Small Datasets
- Migrate MySQL of Large Datasets
- Migrate and Merge MySQL Shards of Small Datasets
- Migrate and Merge MySQL Shards of Large Datasets
- Migrate from CSV Files
- Migrate from SQL Files
- Migrate from One TiDB Cluster to Another TiDB Cluster
- Migrate from TiDB to MySQL-compatible Databases
- Advanced Migration
- Integrate
- Overview
- Integration Scenarios
- Maintain
- Monitor and Alert
- Troubleshoot
- TiDB Troubleshooting Map
- Identify Slow Queries
- Analyze Slow Queries
- SQL Diagnostics
- Identify Expensive Queries Using Top SQL
- Identify Expensive Queries Using Logs
- Statement Summary Tables
- Troubleshoot Hotspot Issues
- Troubleshoot Increased Read and Write Latency
- Save and Restore the On-Site Information of a Cluster
- Troubleshoot Cluster Setup
- Troubleshoot High Disk I/O Usage
- Troubleshoot Lock Conflicts
- Troubleshoot TiFlash
- Troubleshoot Write Conflicts in Optimistic Transactions
- Troubleshoot Inconsistency Between Data and Indexes
- Performance Tuning
- Tuning Guide
- Configuration Tuning
- System Tuning
- Software Tuning
- SQL Tuning
- Overview
- Understanding the Query Execution Plan
- SQL Optimization Process
- Overview
- Logic Optimization
- Physical Optimization
- Prepare Execution Plan Cache
- Control Execution Plans
- Tutorials
- TiDB Tools
- Overview
- Use Cases
- Download
- TiUP
- Documentation Map
- Overview
- Terminology and Concepts
- Manage TiUP Components
- FAQ
- Troubleshooting Guide
- Command Reference
- Overview
- TiUP Commands
- TiUP Cluster Commands
- Overview
- tiup cluster audit
- tiup cluster check
- tiup cluster clean
- tiup cluster deploy
- tiup cluster destroy
- tiup cluster disable
- tiup cluster display
- tiup cluster edit-config
- tiup cluster enable
- tiup cluster help
- tiup cluster import
- tiup cluster list
- tiup cluster patch
- tiup cluster prune
- tiup cluster reload
- tiup cluster rename
- tiup cluster replay
- tiup cluster restart
- tiup cluster scale-in
- tiup cluster scale-out
- tiup cluster start
- tiup cluster stop
- tiup cluster template
- tiup cluster upgrade
- TiUP DM Commands
- Overview
- tiup dm audit
- tiup dm deploy
- tiup dm destroy
- tiup dm disable
- tiup dm display
- tiup dm edit-config
- tiup dm enable
- tiup dm help
- tiup dm import
- tiup dm list
- tiup dm patch
- tiup dm prune
- tiup dm reload
- tiup dm replay
- tiup dm restart
- tiup dm scale-in
- tiup dm scale-out
- tiup dm start
- tiup dm stop
- tiup dm template
- tiup dm upgrade
- TiDB Cluster Topology Reference
- DM Cluster Topology Reference
- Mirror Reference Guide
- TiUP Components
- PingCAP Clinic Diagnostic Service
- TiDB Operator
- Dumpling
- TiDB Lightning
- TiDB Data Migration
- About TiDB Data Migration
- Architecture
- Quick Start
- Deploy a DM cluster
- Tutorials
- Advanced Tutorials
- Maintain
- Cluster Upgrade
- Tools
- Performance Tuning
- Manage Data Sources
- Manage Tasks
- Export and Import Data Sources and Task Configurations of Clusters
- Handle Alerts
- Daily Check
- Reference
- Architecture
- Command Line
- Configuration Files
- OpenAPI
- Compatibility Catalog
- Secure
- Monitoring and Alerts
- Error Codes
- Glossary
- Example
- Troubleshoot
- Release Notes
- Backup & Restore (BR)
- Point-in-Time Recovery
- TiDB Binlog
- TiCDC
- Dumpling
- sync-diff-inspector
- TiSpark
- Reference
- Cluster Architecture
- Key Monitoring Metrics
- Secure
- Privileges
- SQL
- SQL Language Structure and Syntax
- SQL Statements
ADD COLUMN
ADD INDEX
ADMIN
ADMIN CANCEL DDL
ADMIN CHECKSUM TABLE
ADMIN CHECK [TABLE|INDEX]
ADMIN SHOW DDL [JOBS|QUERIES]
ADMIN SHOW TELEMETRY
ALTER DATABASE
ALTER INDEX
ALTER INSTANCE
ALTER PLACEMENT POLICY
ALTER TABLE
ALTER TABLE COMPACT
ALTER TABLE SET TIFLASH MODE
ALTER USER
ANALYZE TABLE
BACKUP
BATCH
BEGIN
CHANGE COLUMN
COMMIT
CHANGE DRAINER
CHANGE PUMP
CREATE [GLOBAL|SESSION] BINDING
CREATE DATABASE
CREATE INDEX
CREATE PLACEMENT POLICY
CREATE ROLE
CREATE SEQUENCE
CREATE TABLE LIKE
CREATE TABLE
CREATE USER
CREATE VIEW
DEALLOCATE
DELETE
DESC
DESCRIBE
DO
DROP [GLOBAL|SESSION] BINDING
DROP COLUMN
DROP DATABASE
DROP INDEX
DROP PLACEMENT POLICY
DROP ROLE
DROP SEQUENCE
DROP STATS
DROP TABLE
DROP USER
DROP VIEW
EXECUTE
EXPLAIN ANALYZE
EXPLAIN
FLASHBACK TABLE
FLUSH PRIVILEGES
FLUSH STATUS
FLUSH TABLES
GRANT <privileges>
GRANT <role>
INSERT
KILL [TIDB]
LOAD DATA
LOAD STATS
MODIFY COLUMN
PREPARE
RECOVER TABLE
RENAME INDEX
RENAME TABLE
REPLACE
RESTORE
REVOKE <privileges>
REVOKE <role>
ROLLBACK
SAVEPOINT
SELECT
SET DEFAULT ROLE
SET [NAMES|CHARACTER SET]
SET PASSWORD
SET ROLE
SET TRANSACTION
SET [GLOBAL|SESSION] <variable>
SHOW ANALYZE STATUS
SHOW [BACKUPS|RESTORES]
SHOW [GLOBAL|SESSION] BINDINGS
SHOW BUILTINS
SHOW CHARACTER SET
SHOW COLLATION
SHOW [FULL] COLUMNS FROM
SHOW CONFIG
SHOW CREATE PLACEMENT POLICY
SHOW CREATE SEQUENCE
SHOW CREATE TABLE
SHOW CREATE USER
SHOW DATABASES
SHOW DRAINER STATUS
SHOW ENGINES
SHOW ERRORS
SHOW [FULL] FIELDS FROM
SHOW GRANTS
SHOW INDEX [FROM|IN]
SHOW INDEXES [FROM|IN]
SHOW KEYS [FROM|IN]
SHOW MASTER STATUS
SHOW PLACEMENT
SHOW PLACEMENT FOR
SHOW PLACEMENT LABELS
SHOW PLUGINS
SHOW PRIVILEGES
SHOW [FULL] PROCESSSLIST
SHOW PROFILES
SHOW PUMP STATUS
SHOW SCHEMAS
SHOW STATS_HEALTHY
SHOW STATS_HISTOGRAMS
SHOW STATS_META
SHOW STATUS
SHOW TABLE NEXT_ROW_ID
SHOW TABLE REGIONS
SHOW TABLE STATUS
SHOW [FULL] TABLES
SHOW [GLOBAL|SESSION] VARIABLES
SHOW WARNINGS
SHUTDOWN
SPLIT REGION
START TRANSACTION
TABLE
TRACE
TRUNCATE
UPDATE
USE
WITH
- Data Types
- Functions and Operators
- Overview
- Type Conversion in Expression Evaluation
- Operators
- Control Flow Functions
- String Functions
- Numeric Functions and Operators
- Date and Time Functions
- Bit Functions and Operators
- Cast Functions and Operators
- Encryption and Compression Functions
- Locking Functions
- Information Functions
- JSON Functions
- Aggregate (GROUP BY) Functions
- Window Functions
- Miscellaneous Functions
- Precision Math
- Set Operations
- List of Expressions for Pushdown
- TiDB Specific Functions
- Clustered Indexes
- Constraints
- Generated Columns
- SQL Mode
- Table Attributes
- Transactions
- Garbage Collection (GC)
- Views
- Partitioning
- Temporary Tables
- Cached Tables
- Character Set and Collation
- Placement Rules in SQL
- System Tables
mysql
- INFORMATION_SCHEMA
- Overview
ANALYZE_STATUS
CLIENT_ERRORS_SUMMARY_BY_HOST
CLIENT_ERRORS_SUMMARY_BY_USER
CLIENT_ERRORS_SUMMARY_GLOBAL
CHARACTER_SETS
CLUSTER_CONFIG
CLUSTER_HARDWARE
CLUSTER_INFO
CLUSTER_LOAD
CLUSTER_LOG
CLUSTER_SYSTEMINFO
COLLATIONS
COLLATION_CHARACTER_SET_APPLICABILITY
COLUMNS
DATA_LOCK_WAITS
DDL_JOBS
DEADLOCKS
ENGINES
INSPECTION_RESULT
INSPECTION_RULES
INSPECTION_SUMMARY
KEY_COLUMN_USAGE
METRICS_SUMMARY
METRICS_TABLES
PARTITIONS
PLACEMENT_POLICIES
PROCESSLIST
REFERENTIAL_CONSTRAINTS
SCHEMATA
SEQUENCES
SESSION_VARIABLES
SLOW_QUERY
STATISTICS
TABLES
TABLE_CONSTRAINTS
TABLE_STORAGE_STATS
TIDB_HOT_REGIONS
TIDB_HOT_REGIONS_HISTORY
TIDB_INDEXES
TIDB_SERVERS_INFO
TIDB_TRX
TIFLASH_REPLICA
TIKV_REGION_PEERS
TIKV_REGION_STATUS
TIKV_STORE_STATUS
USER_PRIVILEGES
VARIABLES_INFO
VIEWS
METRICS_SCHEMA
- UI
- TiDB Dashboard
- Overview
- Maintain
- Access
- Overview Page
- Cluster Info Page
- Top SQL Page
- Key Visualizer Page
- Metrics Relation Graph
- SQL Statements Analysis
- Slow Queries Page
- Cluster Diagnostics
- Monitoring Page
- Search Logs Page
- Instance Profiling
- Session Management and Configuration
- FAQ
- CLI
- Command Line Flags
- Configuration File Parameters
- System Variables
- Storage Engines
- Telemetry
- Errors Codes
- Table Filter
- Schedule Replicas by Topology Labels
- FAQs
- Release Notes
- All Releases
- Release Timeline
- TiDB Versioning
- TiDB Installation Packages
- v6.2
- v6.1
- v6.0
- v5.4
- v5.3
- v5.2
- v5.1
- v5.0
- v4.0
- v3.1
- v3.0
- v2.1
- v2.0
- v1.0
- Glossary
Character Set and Collation
This document introduces the character sets and collations supported by TiDB.
Concepts
A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above.
A collation is a set of rules for comparing characters in a character set, and the sorting order of characters. For example in a binary collation A
and a
do not compare as equal:
SET NAMES utf8mb4 COLLATE utf8mb4_bin;
SELECT 'A' = 'a';
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
SELECT 'A' = 'a';
mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
| 0 |
+-----------+
1 row in set (0.00 sec)
mysql> SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
| 1 |
+-----------+
1 row in set (0.00 sec)
TiDB defaults to using a binary collation. This differs from MySQL, which uses a case-insensitive collation by default.
Character sets and collations supported by TiDB
Currently, TiDB supports the following character sets:
SHOW CHARACTER SET;
+---------+-------------------------------------+-------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+-------------------------------------+-------------------+--------+
| ascii | US ASCII | ascii_bin | 1 |
| binary | binary | binary | 1 |
| gbk | Chinese Internal Code Specification | gbk_bin | 2 |
| latin1 | Latin1 | latin1_bin | 1 |
| utf8 | UTF-8 Unicode | utf8_bin | 3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_bin | 4 |
+---------+-------------------------------------+-------------------+--------+
6 rows in set (0.00 sec)
TiDB supports the following collations:
mysql> show collation;
+-------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+-------------+---------+------+---------+----------+---------+
| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 |
| latin1_bin | latin1 | 47 | Yes | Yes | 1 |
| binary | binary | 63 | Yes | Yes | 1 |
| ascii_bin | ascii | 65 | Yes | Yes | 1 |
| utf8_bin | utf8 | 83 | Yes | Yes | 1 |
| gbk_bin | gbk | 87 | Yes | Yes | 1 |
+-------------+---------+------+---------+----------+---------+
6 rows in set (0.00 sec)
TiDB incorrectly treats latin1 as a subset of utf8. This can lead to unexpected behaviors when you store characters that differ between latin1 and utf8 encodings. It is strongly recommended to the utf8mb4 character set. See TiDB #18955 for more details.
The default collations in TiDB (binary collations, with the suffix _bin
) are different than the default collations in MySQL (typically general collations, with the suffix _general_ci
). This can cause incompatible behavior when specifying an explicit character set but relying on the implicit default collation to be chosen.
You can use the following statement to view the collations (under the new framework for collations) that corresponds to the character set.
SHOW COLLATION WHERE Charset = 'utf8mb4';
+--------------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+--------------------+---------+------+---------+----------+---------+
| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 |
| utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 |
| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 1 |
+--------------------+---------+------+---------+----------+---------+
3 rows in set (0.00 sec)
For details about the TiDB support of the GBK character set, see GBK.
utf8
and utf8mb4
in TiDB
In MySQL, the character set utf8
is limited to a maximum of three bytes. This is sufficient to store characters in the Basic Multilingual Plane (BMP), but not enough to store characters such as emojis. For this, it is recommended to use the character set utf8mb4
instead.
By default, TiDB provides the same 3-byte limit on utf8
to ensure that data created in TiDB can still safely be restored in MySQL. This can be disabled by changing the value of check-mb4-value-in-utf8
to FALSE
in your TiDB configuration file.
The following demonstrates the default behavior when inserting a 4-byte emoji character into a table. The INSERT
statement fails for the utf8
character set, but succeeds for utf8mb4
:
mysql> CREATE TABLE utf8_test (
-> c char(1) NOT NULL
-> ) CHARACTER SET utf8;
Query OK, 0 rows affected (0.09 sec)
mysql> CREATE TABLE utf8m4_test (
-> c char(1) NOT NULL
-> ) CHARACTER SET utf8mb4;
Query OK, 0 rows affected (0.09 sec)
mysql> INSERT INTO utf8_test VALUES ('😉');
ERROR 1366 (HY000): incorrect utf8 value f09f9889(😉) for column c
mysql> INSERT INTO utf8m4_test VALUES ('😉');
Query OK, 1 row affected (0.02 sec)
mysql> SELECT char_length(c), length(c), c FROM utf8_test;
Empty set (0.01 sec)
mysql> SELECT char_length(c), length(c), c FROM utf8m4_test;
+----------------+-----------+------+
| char_length(c) | length(c) | c |
+----------------+-----------+------+
| 1 | 4 | 😉 |
+----------------+-----------+------+
1 row in set (0.00 sec)
Character set and collation in different layers
The character set and collation can be set at different layers.
Database character set and collation
Each database has a character set and a collation. You can use the following statements to specify the database character set and collation:
CREATE DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]
ALTER DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]
DATABASE
can be replaced with SCHEMA
here.
Different databases can use different character sets and collations. Use the character_set_database
and collation_database
to see the character set and collation of the current database:
CREATE SCHEMA test1 CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.09 sec)
USE test1;
Database changed
SELECT @@character_set_database, @@collation_database;
+--------------------------|----------------------+
| @@character_set_database | @@collation_database |
+--------------------------|----------------------+
| utf8mb4 | utf8mb4_general_ci |
+--------------------------|----------------------+
1 row in set (0.00 sec)
CREATE SCHEMA test2 CHARACTER SET latin1 COLLATE latin1_bin;
Query OK, 0 rows affected (0.09 sec)
USE test2;
Database changed
SELECT @@character_set_database, @@collation_database;
+--------------------------|----------------------+
| @@character_set_database | @@collation_database |
+--------------------------|----------------------+
| latin1 | latin1_bin |
+--------------------------|----------------------+
1 row in set (0.00 sec)
You can also see the two values in INFORMATION_SCHEMA
:
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db_name';
Table character set and collation
You can use the following statement to specify the character set and collation for tables:
CREATE TABLE tbl_name (column_list)
[[DEFAULT] CHARACTER SET charset_name]
[COLLATE collation_name]]
ALTER TABLE tbl_name
[[DEFAULT] CHARACTER SET charset_name]
[COLLATE collation_name]
For example:
CREATE TABLE t1(a int) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.08 sec)
If the table character set and collation are not specified, the database character set and collation are used as their default values.
Column character set and collation
You can use the following statement to specify the character set and collation for columns:
col_name {CHAR | VARCHAR | TEXT} (col_length)
[CHARACTER SET charset_name]
[COLLATE collation_name]
col_name {ENUM | SET} (val_list)
[CHARACTER SET charset_name]
[COLLATE collation_name]
If the column character set and collation are not specified, the table character set and collation are used as their default values.
String character sets and collation
Each string corresponds to a character set and a collation. When you use a string, this option is available:
[_charset_name]'string' [COLLATE collation_name]
Example:
SELECT 'string';
SELECT _utf8mb4'string';
SELECT _utf8mb4'string' COLLATE utf8mb4_general_ci;
Rules:
- Rule 1: If you specify
CHARACTER SET charset_name
andCOLLATE collation_name
, then thecharset_name
character set and thecollation_name
collation are used directly. - Rule 2: If you specify
CHARACTER SET charset_name
but do not specifyCOLLATE collation_name
, thecharset_name
character set and the default collation ofcharset_name
are used. - Rule 3: If you specify neither
CHARACTER SET charset_name
norCOLLATE collation_name
, the character set and collation given by the system variablescharacter_set_connection
andcollation_connection
are used.
Client connection character set and collation
The server character set and collation are the values of the
character_set_server
andcollation_server
system variables.The character set and collation of the default database are the values of the
character_set_database
andcollation_database
system variables.
You can use character_set_connection
and collation_connection
to specify the character set and collation for each connection. The character_set_client
variable is to set the client character set.
Before returning the result, the character_set_results
system variable indicates the character set in which the server returns query results to the client, including the metadata of the result.
You can use the following statement to set the character set and collation that is related to the client:
SET NAMES 'charset_name' [COLLATE 'collation_name']
SET NAMES
indicates what character set the client will use to send SQL statements to the server.SET NAMES utf8mb4
indicates that all the requests from the client use utf8mb4, as well as the results from the server.The
SET NAMES 'charset_name'
statement is equivalent to the following statement combination:SET character_set_client = charset_name; SET character_set_results = charset_name; SET character_set_connection = charset_name;
COLLATE
is optional, if absent, the default collation of thecharset_name
is used to set thecollation_connection
.SET CHARACTER SET 'charset_name'
Similar to
SET NAMES
, theSET NAMES 'charset_name'
statement is equivalent to the following statement combination:SET character_set_client = charset_name; SET character_set_results = charset_name; SET charset_connection = @@charset_database; SET collation_connection = @@collation_database;
Selection priorities of character sets and collations
String > Column > Table > Database > Server
General rules on selecting character sets and collation
- Rule 1: If you specify
CHARACTER SET charset_name
andCOLLATE collation_name
, then thecharset_name
character set and thecollation_name
collation are used directly. - Rule 2: If you specify
CHARACTER SET charset_name
and do not specifyCOLLATE collation_name
, then thecharset_name
character set and the default collation ofcharset_name
are used. - Rule 3: If you specify neither
CHARACTER SET charset_name
norCOLLATE collation_name
, the character set and collation with higher optimization levels are used.
Validity check of characters
If the specified character set is utf8
or utf8mb4
, TiDB only supports the valid utf8
characters. For invalid characters, TiDB reports the incorrect utf8 value
error. This validity check of characters in TiDB is compatible with MySQL 8.0 but incompatible with MySQL 5.7 or earlier versions.
To disable this error reporting, use set @@tidb_skip_utf8_check=1;
to skip the character check.
If the character check is skipped, TiDB might fail to detect illegal UTF-8 characters written by the application, cause decoding errors when ANALYZE
is executed, and introduce other unknown encoding issues. If your application cannot guarantee the validity of the written string, it is not recommended to skip the character check.
Collation support framework
The syntax support and semantic support for the collation are influenced by the new_collations_enabled_on_first_bootstrap
configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter indicates that TiDB can correctly use collations when comparing strings.
Before v4.0, TiDB provides only the old framework for collations. In this framework, TiDB supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations.
Since v4.0, TiDB supports a new framework for collations. In this framework, TiDB semantically parses different collations and strictly follows the collations when comparing strings.
Old framework for collations
Before v4.0, you can specify most of the MySQL collations in TiDB, and these collations are processed according to the default collations, which means that the byte order determines the character order. Different from MySQL, TiDB does not handle the trailing spaces of a character, which causes the following behavior differences:
CREATE TABLE t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci PRIMARY KEY);
Query OK, 0 rows affected
INSERT INTO t VALUES ('A');
Query OK, 1 row affected
INSERT INTO t VALUES ('a');
Query OK, 1 row affected # In TiDB, it is successfully executed. In MySQL, because utf8mb4_general_ci is case-insensitive, the `Duplicate entry 'a'` error is reported.
INSERT INTO t1 VALUES ('a ');
Query OK, 1 row affected # In TiDB, it is successfully executed. In MySQL, because comparison is performed after the spaces are filled in, the `Duplicate entry 'a '` error is returned.
New framework for collations
Since TiDB v4.0, a complete framework for collations is introduced.
This new framework supports semantically parsing collations and introduces the new_collations_enabled_on_first_bootstrap
configuration item to decide whether to enable the new framework when a cluster is first initialized. To enable the new framework, set new_collations_enabled_on_first_bootstrap
to true
. For details, see new_collations_enabled_on_first_bootstrap
. If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through the new_collation_enabled
variable in the mysql
.tidb
table:
SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME='new_collation_enabled';
+----------------+
| VARIABLE_VALUE |
+----------------+
| True |
+----------------+
1 row in set (0.00 sec)
This new framework supports semantically parsing collations. TiDB enables the new framework by default when a cluster is first initialized.
Under the new framework, TiDB supports the utf8_general_ci
, utf8mb4_general_ci
, utf8_unicode_ci
, utf8mb4_unicode_ci
, gbk_chinese_ci
, and gbk_bin
collations, which is compatible with MySQL.
When one of utf8_general_ci
, utf8mb4_general_ci
, utf8_unicode_ci
, utf8mb4_unicode_ci
, and gbk_chinese_ci
is used, the string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the collation's PADDING
behavior:
CREATE TABLE t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci PRIMARY KEY);
Query OK, 0 rows affected (0.00 sec)
INSERT INTO t VALUES ('A');
Query OK, 1 row affected (0.00 sec)
INSERT INTO t VALUES ('a');
ERROR 1062 (23000): Duplicate entry 'a' for key 'PRIMARY' # TiDB is compatible with the case-insensitive collation of MySQL.
INSERT INTO t VALUES ('a ');
ERROR 1062 (23000): Duplicate entry 'a ' for key 'PRIMARY' # TiDB modifies the `PADDING` behavior to be compatible with MySQL.
The implementation of padding in TiDB is different from that in MySQL. In MySQL, padding is implemented by filling in spaces. In TiDB, padding is implemented by cutting out the spaces at the end. The two approaches are the same in most cases. The only exception is when the end of the string contains characters that are less than spaces (0x20). For example, the result of 'a' < 'a\t'
in TiDB is 1
, but in MySQL, 'a' < 'a\t'
is equivalent to 'a ' < 'a\t'
, and the result is 0
.
Coercibility values of collations in expressions
If an expression involves multiple clauses of different collations, you need to infer the collation used in the calculation. The rules are as follows:
- The coercibility value of the explicit
COLLATE
clause is0
. - If the collations of two strings are incompatible, the coercibility value of the concatenation of two strings with different collations is
1
. - The collation of the column,
CAST()
,CONVERT()
, orBINARY()
has a coercibility value of2
. - The system constant (the string returned by
USER ()
orVERSION ()
) has a coercibility value of3
. - The coercibility value of constants is
4
. - The coercibility value of numbers or intermediate variables is
5
. NULL
or expressions derived fromNULL
has a coercibility value of6
.
When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values. If the coercibility values of two clauses are the same, the collation is determined according to the following priority:
binary > utf8mb4_bin > (utf8mb4_general_ci = utf8mb4_unicode_ci) > utf8_bin > (utf8_general_ci = utf8_unicode_ci) > latin1_bin > ascii_bin
TiDB cannot infer the collation and reports an error in the following situations:
- If the collations of two clauses are different and the coercibility value of both clauses is
0
. - If the collations of two clauses are incompatible and the returned type of expression is
String
.
COLLATE
clause
TiDB supports using the COLLATE
clause to specify the collation of an expression. The coercibility value of this expression is 0
, which has the highest priority. See the following example:
SELECT 'a' = _utf8mb4 'A' collate utf8mb4_general_ci;
+-----------------------------------------------+
| 'a' = _utf8mb4 'A' collate utf8mb4_general_ci |
+-----------------------------------------------+
| 1 |
+-----------------------------------------------+
1 row in set (0.00 sec)
For more details, see Connection Character Sets and Collations.
- Concepts
- Character sets and collations supported by TiDB
- utf8 and utf8mb4 in TiDB
- Character set and collation in different layers
- Selection priorities of character sets and collations
- General rules on selecting character sets and collation
- Validity check of characters
- Collation support framework
- Coercibility values of collations in expressions
- COLLATE clause