TiDB Storage
This document introduces some design ideas and key concepts of TiKV.
Key-Value pairs
The first thing to decide for a data storage system is the data storage model, that is, the form in which the data is saved. TiKV chooses the Key-Value model and provides an ordered traversal method. There are two key points about TiKV's data storage model:
- This is a huge Map (similar to `std::map` in C++), which stores Key-Value pairs.
- The Key-Value pairs in the Map are ordered according to the Keys' binary order, which means you can Seek the position of a particular Key and then call the Next method to get the Key-Value pairs larger than this Key in increasing order (see the sketch after this list).
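To make the Seek-and-Next access pattern concrete, the following minimal Rust sketch uses the standard library's `BTreeMap` as a stand-in for TiKV's ordered Key-Value Map. The keys and values here are purely illustrative:

```rust
use std::collections::BTreeMap;

fn main() {
    // An ordered in-memory Key-Value map, standing in for TiKV's storage model.
    let mut store: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
    store.insert(b"user1".to_vec(), b"alice".to_vec());
    store.insert(b"user2".to_vec(), b"bob".to_vec());
    store.insert(b"user3".to_vec(), b"carol".to_vec());

    // "Seek" to a Key, then call "Next" repeatedly: range() positions the
    // iterator at the first Key >= the seek Key, in binary order.
    let seek_key = b"user2".to_vec();
    for (k, v) in store.range(seek_key..) {
        println!(
            "{} -> {}",
            String::from_utf8_lossy(k),
            String::from_utf8_lossy(v)
        );
    }
    // Prints user2 and user3, in increasing Key order.
}
```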
Note that the KV storage model of TiKV described in this document has nothing to do with SQL tables. This document does not discuss any concepts related to SQL and only focuses on how to implement a high-performance, highly reliable, distributed Key-Value storage such as TiKV.
Local storage (RocksDB)
For any persistent storage engine, data is eventually saved on disk, and TiKV is no exception. TiKV does not write data directly to disk; instead, it stores data in RocksDB, which is responsible for the actual data storage. The reason is that developing a standalone storage engine costs a lot, especially a high-performance standalone engine that requires careful optimization.
RocksDB is an excellent standalone storage engine open-sourced by Facebook. This engine can meet various requirements of TiKV for a single engine. Here, you can simply consider RocksDB as a single persistent Key-Value Map.
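As a rough illustration of "RocksDB as a single persistent Key-Value Map", the following Rust sketch writes and reads one pair through the community `rocksdb` crate. The crate version and database path are assumptions for the sketch, not part of TiKV:

```rust
// Cargo.toml (assumed): rocksdb = "0.21"
use rocksdb::DB;

fn main() -> Result<(), rocksdb::Error> {
    // RocksDB persists Key-Value pairs on local disk; TiKV delegates
    // single-node storage to it rather than writing to disk directly.
    let db = DB::open_default("/tmp/rocksdb-demo")?;

    db.put(b"user1", b"alice")?;

    // Reads go back through the same engine; the data survives restarts.
    if let Some(value) = db.get(b"user1")? {
        println!("user1 -> {}", String::from_utf8_lossy(&value));
    }
    Ok(())
}
```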
Raft protocol
What's more, the implementation of TiKV faces a more difficult problem: how to keep data safe when a single machine fails.
A simple way is to replicate data to multiple machines, so that even if one machine fails, the replicas on other machines are still available. In other words, you need a data replication scheme that is reliable, efficient, and able to handle the situation of a failed replica. All of these are made possible by the Raft algorithm.
Raft is a consensus algorithm. This document only briefly introduces Raft. For more details, see In Search of an Understandable Consensus Algorithm. Raft provides several important features:
- Leader election
- Membership changes (such as adding replicas, deleting replicas, transferring leaders, and so on)
- Log replication
TiKV uses Raft to perform data replication. Each data change is recorded as a Raft log. Through Raft log replication, data is safely and reliably replicated to multiple nodes of the Raft group. However, according to the Raft protocol, a write succeeds as soon as the data is replicated to a majority of the nodes.
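The majority rule is just a quorum count. The following toy Rust function (not TiKV code) shows the arithmetic: with 3 replicas, 2 acknowledgements are enough for a write to be considered committed.

```rust
/// A write is considered committed by Raft once it has been replicated to a
/// majority of the replicas in the Raft group (the leader counts itself).
fn is_committed(acknowledged_replicas: usize, total_replicas: usize) -> bool {
    acknowledged_replicas >= total_replicas / 2 + 1
}

fn main() {
    // With 3 replicas, 2 acknowledgements are enough; 1 is not.
    assert!(is_committed(2, 3));
    assert!(!is_committed(1, 3));
    // With 5 replicas, the quorum is 3.
    assert!(is_committed(3, 5));
    println!("quorum checks passed");
}
```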
In summary, TiKV can quickly store data on disk via the standalone RocksDB engine, and replicate data to multiple machines via Raft in case of machine failure. Data is written through the Raft interface instead of directly to RocksDB. With the implementation of Raft, TiKV becomes a distributed Key-Value storage. Even if a few machines fail, TiKV can automatically replenish replicas by virtue of the native Raft protocol, without impacting the application.
Region
To make it easy to understand, let's assume that all data has only one replica. As mentioned earlier, TiKV can be regarded as a large, ordered KV Map, so data is distributed across multiple machines in order to achieve horizontal scalability. For a KV system, there are two typical solutions for distributing data across multiple machines:
- Hash: compute a hash of the Key and select the corresponding storage node according to the hash value.
- Range: divide the Key space into ranges, where a segment of consecutive Keys is stored on one node.
TiKV chooses the second solution and divides the whole Key-Value space into a series of consecutive Key segments. Each segment is called a Region. There is a size limit on the amount of data each Region can store (the default value is 96 MB and the size can be configured). Each Region can be described by `[StartKey, EndKey)`, a left-closed, right-open interval.
Note that the Region here has nothing to do with the table in SQL. In this document, forget about SQL and focus on KV for now. After dividing data into Regions, TiKV will perform two important tasks:
- Distributing data to all nodes in the cluster, using the Region as the basic unit, and trying its best to ensure that the number of Regions on each node is roughly the same.
- Performing Raft replication and membership management on a per-Region basis.
These two tasks are very important and will be introduced one by one.
First, data is divided into many Regions according to Key, and the data for each Region is stored on only one node (ignoring multiple replicas). The TiDB system has a PD component that is responsible for spreading Regions as evenly as possible across all nodes in the cluster. In this way, on one hand, the storage capacity is scaled horizontally (Regions on the other nodes are automatically scheduled to the newly added node); on the other hand, load balancing is achieved (the situation where one node has a lot of data while the others have little will not occur).
At the same time, to ensure that upper-layer clients can access the data they need, there is a component (PD) in the system that records the distribution of Regions across nodes. That is, given any Key, you can query which Region the Key falls in and which node that Region is placed on, as sketched below.
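The following Rust sketch shows, in simplified form, how a Key can be mapped to its Region once the sorted Region boundaries are known. In a real cluster this lookup goes through PD and its routing information; the `Region` struct and the linear scan here are illustrative only:

```rust
/// A simplified Region descriptor: a left-closed, right-open Key interval.
/// An empty end_key stands for "unbounded" (the last Region).
struct Region {
    start_key: Vec<u8>,
    end_key: Vec<u8>,
}

/// Find the Region that contains `key`, assuming `regions` is sorted by
/// start_key and the intervals cover the whole Key space without gaps.
fn locate_region<'a>(regions: &'a [Region], key: &[u8]) -> Option<&'a Region> {
    regions.iter().find(|r| {
        key >= r.start_key.as_slice() && (r.end_key.is_empty() || key < r.end_key.as_slice())
    })
}

fn main() {
    let regions = vec![
        Region { start_key: b"".to_vec(), end_key: b"g".to_vec() },
        Region { start_key: b"g".to_vec(), end_key: b"n".to_vec() },
        Region { start_key: b"n".to_vec(), end_key: b"".to_vec() },
    ];

    // "hello" >= "g" and "hello" < "n", so it falls into the second Region.
    let r = locate_region(&regions, b"hello").expect("ranges cover the whole Key space");
    println!(
        "key \"hello\" is in Region [{:?}, {:?})",
        String::from_utf8_lossy(&r.start_key),
        String::from_utf8_lossy(&r.end_key)
    );
}
```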
For the second task, TiKV replicates data in Regions, which means that the data in one Region is kept in multiple copies, each called a "Replica". Multiple Replicas of a Region are stored on different nodes to form a Raft Group, which is kept consistent through the Raft algorithm.
One of the Replicas serves as the Leader of the Group and the others serve as Followers. By default, all reads and writes are processed through the Leader: reads are served on the Leader, and writes are replicated to the Followers. The following diagram shows the whole picture of Regions and Raft Groups.
As we distribute and replicate data in Regions, we get a distributed Key-Value system that, to some extent, has the capability of disaster recovery. You no longer need to worry about capacity, or data loss caused by disk failure.
MVCC
Many databases implement multi-version concurrency control (MVCC), and TiKV is no exception. Imagine a situation where two clients modify the value of a Key at the same time. Without MVCC, the data would need to be locked, which in a distributed scenario might cause performance problems and deadlocks. TiKV implements MVCC by appending a version number to the Key. In short, without MVCC, TiKV's data layout can be seen as follows:
Key1 -> Value
Key2 -> Value
……
KeyN -> Value
With MVCC, the key arrangement of TiKV is like this:
Key1_Version3 -> Value
Key1_Version2 -> Value
Key1_Version1 -> Value
……
Key2_Version4 -> Value
Key2_Version3 -> Value
Key2_Version2 -> Value
Key2_Version1 -> Value
……
KeyN_Version2 -> Value
KeyN_Version1 -> Value
……
Note that for multiple versions of the same Key, versions with larger numbers are placed first (see the Key-Value pairs section above, where Keys are arranged in order), so that when you obtain the Value through Key + Version, the MVCC Key can be constructed from the Key and the Version, that is, `Key_Version`. Then you can directly locate the first position greater than or equal to this `Key_Version` through RocksDB's `SeekPrefix(Key_Version)` API.
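The following Rust sketch shows one common way to build such a `Key_Version` encoding on top of an ordered Map: appending the bitwise-NOT of the version in big-endian makes larger versions sort first, so a seek lands on the newest visible version. It uses `BTreeMap::range` as a stand-in for RocksDB's seek; the exact encoding TiKV uses differs in detail.

```rust
use std::collections::BTreeMap;

/// Build an MVCC key `Key_Version`. Encoding the bitwise-NOT of the version
/// in big-endian makes larger version numbers sort *before* smaller ones in
/// plain byte order, which gives the "newest version first" layout described
/// above.
fn mvcc_key(user_key: &[u8], version: u64) -> Vec<u8> {
    let mut k = user_key.to_vec();
    k.extend_from_slice(&(!version).to_be_bytes());
    k
}

fn main() {
    // BTreeMap stands in for RocksDB: an ordered Key-Value store with seek.
    let mut store: BTreeMap<Vec<u8>, &str> = BTreeMap::new();
    store.insert(mvcc_key(b"Key1", 1), "value at version 1");
    store.insert(mvcc_key(b"Key1", 2), "value at version 2");
    store.insert(mvcc_key(b"Key1", 3), "value at version 3");

    // Read Key1 as of version 2: seek to Key1_Version2 and take the first
    // entry. Entries with versions > 2 encode to smaller byte strings and are
    // skipped; a real implementation would also check that the found key
    // still starts with the "Key1" prefix.
    let seek = mvcc_key(b"Key1", 2);
    let (_, value) = store.range(seek..).next().unwrap();
    println!("Key1 @ version 2 -> {}", value); // prints "value at version 2"
}
```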
Distributed ACID transaction
TiKV's transactions adopt the model used by Google in BigTable: Percolator. TiKV's implementation is inspired by this paper, with a lot of optimizations. See transaction overview for details.