PITR Monitoring and Alert

PITR exposes monitoring metrics that Prometheus can collect. Currently, all of these metrics are built into TiKV.

Monitoring configuration

For clusters deployed using TiUP, Prometheus automatically collects monitoring metrics.
For clusters deployed manually, follow the instructions in TiDB Cluster Monitoring Deployment to add TiKV-related jobs to the scrape_configs section of the Prometheus configuration file.
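For a manually deployed cluster, the TiKV job in scrape_configs might look like the following sketch. The job name and the target addresses are placeholders, and 20180 is assumed to be the TiKV status port; adjust both to match your deployment.

```yaml
# Fragment of the Prometheus configuration file.
# All addresses below are hypothetical; list the status address of every TiKV node.
scrape_configs:
  - job_name: 'tikv'
    # Keep the label values exposed by TiKV when they conflict with server-side labels.
    honor_labels: true
    static_configs:
      - targets:
          - '10.0.1.1:20180'
          - '10.0.1.2:20180'
          - '10.0.1.3:20180'
```

After Prometheus reloads its configuration, the tikv_log_backup_* metrics listed in the next section should become queryable.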

Monitoring metrics

Metrics	Type	Description
tikv_log_backup_interal_actor_acting_duration_sec	Histogram	The duration of handling all internal messages and events. `message :: TaskType`
tikv_log_backup_initial_scan_reason	Counter	Statistics of the reasons why an initial scan is triggered. The main reasons are leader transfer and Region version change. `reason :: {"leader-changed", "region-changed", "retry"}`
tikv_log_backup_event_handle_duration_sec	Histogram	The duration of handling KV events. Compared with `tikv_log_backup_on_event_duration_seconds`, this metric also includes the duration of internal conversion. `stage :: {"to_stream_event", "save_to_temp_file"}`
tikv_log_backup_handle_kv_batch	Histogram	Region-level statistics of the sizes of KV pair batches sent by Raftstore.
tikv_log_backup_initial_scan_disk_read	Counter	The size of data read from the disk during initial scan. On Linux, this value comes from procfs and reflects the amount of data actually read from the block device. The configuration item `initial-scan-rate-limit` applies to this metric.
tikv_log_backup_incremental_scan_bytes	Histogram	The size of KV pairs actually generated during initial scan. Because of compression and read amplification, this value might be different from that of `tikv_log_backup_initial_scan_disk_read`.
tikv_log_backup_skip_kv_count	Counter	The number of Raft events being skipped during the log backup because they are not helpful to the backup.
tikv_log_backup_errors	Counter	The errors that can be retried or ignored during the log backup. `type :: ErrorType`
tikv_log_backup_fatal_errors	Counter	The errors that cannot be retried or ignored during the log backup. When an error of this type occurs, the log backup is paused. `type :: ErrorType`
tikv_log_backup_heap_memory	Gauge	The memory occupied by events that are unconsumed and found by initial scan during log backup.
tikv_log_backup_on_event_duration_seconds	Histogram	The duration of storing KV events to temporary files. `stage :: {"write_to_tempfile", "syscall_write"}`
tikv_log_backup_store_checkpoint_ts	Gauge	The store-level checkpoint TS, which is deprecated. It is close to the GC safepoint registered by the current store. `task :: string`
tikv_log_backup_flush_duration_sec	Histogram	The duration of moving local temporary files to the external storage. `stage :: {"generate_metadata", "save_files", "clear_temp_files"}`
tikv_log_backup_flush_file_size	Histogram	Statistics of the sizes of files generated during the backup.
tikv_log_backup_initial_scan_duration_sec	Histogram	The statistics of the overall duration of initial scan.
tikv_log_backup_skip_retry_observe	Counter	Statistics of the errors that can be ignored during log backup, or the reasons why retry is skipped. `reason :: {"region-absent", "not-leader", "stale-command"}`
tikv_log_backup_initial_scan_operations	Counter	Statistics of RocksDB-related operations during initial scan. `cf :: {"default", "write", "lock"}, op :: RocksDBOP`
tikv_log_backup_enabled	Counter	Whether log backup is enabled. A value greater than `0` means that log backup is enabled.
tikv_log_backup_observed_region	Gauge	The number of Regions being listened to.
tikv_log_backup_task_status	Gauge	The status of the log backup task. `0` means running. `1` means paused. `2` means error. `task :: string`
tikv_log_backup_pending_initial_scan	Gauge	Statistics of pending initial scans. `stage :: {"queuing", "executing"}`

Grafana configuration

For clusters deployed using TiUP, Grafana already contains the PITR panel: it is the Backup Log panel in the TiKV-Details dashboard.
For clusters deployed manually, refer to Import a Grafana dashboard and upload the tikv_details JSON file to Grafana. Then find the Backup Log panel in the TiKV-Details dashboard.

Alert configuration

Currently, PITR does not have built-in alert items. This section describes how to configure alert items for PITR and recommends several of them.

To configure alert items in PITR, follow these steps:

1. On the node where Prometheus is located, create a configuration file (for example, pitr.rules.yml) for the alert rules. Fill in the alert rules according to the Prometheus documentation, the recommended alert items in this section, and the configuration sample.
2. In the rule_files field of the Prometheus configuration file, add the path of the alert rule file.
3. Reload the Prometheus configuration. Either send a SIGHUP signal to the Prometheus process (kill -HUP pid), or send an HTTP POST request to http://prometheus-addr/-/reload. The HTTP method works only if Prometheus was started with the --web.enable-lifecycle parameter.
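The rule_files entry from the steps above might look like the following sketch. The path is hypothetical; point it at wherever you saved the rule file.

```yaml
# Fragment of the Prometheus configuration file.
# The path is a placeholder for the alert rule file created in the first step.
rule_files:
  - '/etc/prometheus/rules/pitr.rules.yml'
```

If promtool is installed alongside Prometheus, running promtool check rules on the rule file before reloading can catch syntax errors early.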

The recommended alert items are as follows:

LogBackupRunningRPOMoreThan10m

Alert item: max(time() - tikv_log_backup_store_checkpoint_ts / 262144000) by (task) / 60 > 10 and max(tikv_log_backup_store_checkpoint_ts) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 0
Alert level: warning
Description: The log data has not been persisted to the storage for more than 10 minutes. This alert item is a reminder. In most cases, it does not affect log backup.
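The constant 262144000 in the expression can be read as a unit conversion. The following derivation is a sketch that assumes the checkpoint TS uses the usual TiDB TSO layout, in which a physical timestamp in milliseconds is shifted left by 18 bits and combined with an 18-bit logical counter. The symbols t_ms and l are introduced here for illustration only.

```latex
\begin{aligned}
\mathrm{ts} &= t_{\mathrm{ms}} \cdot 2^{18} + \ell, \qquad 0 \le \ell < 2^{18} \\
262144000 &= 2^{18} \cdot 1000 \\
\frac{\mathrm{ts}}{262144000} &\approx \frac{t_{\mathrm{ms}}}{1000} \quad \text{(Unix time in seconds)} \\
\mathrm{lag}_{\mathrm{min}} &= \frac{\mathtt{time()} - \mathrm{ts}/262144000}{60}
\end{aligned}
```

Under this assumption, the alert fires when the largest lag reported for a task exceeds 10 minutes while the task is still running (status 0).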

A configuration sample of this alert item is as follows:

groups:
- name: PiTR
  rules:
  - alert: LogBackupRunningRPOMoreThan10m
    expr: max(time() - tikv_log_backup_store_checkpoint_ts / 262144000) by (task) / 60 > 10 and max(tikv_log_backup_store_checkpoint_ts) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 0
    labels:
      severity: warning
    annotations:
      summary: RPO of log backup is high
      message: RPO of the log backup task {{ $labels.task }} is more than 10m

LogBackupRunningRPOMoreThan30m

Alert item: max(time() - tikv_log_backup_store_checkpoint_ts / 262144000) by (task) / 60 > 30 and max(tikv_log_backup_store_checkpoint_ts) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 0
Alert level: critical
Description: The log data has not been persisted to the storage for more than 30 minutes. This alert often indicates an anomaly. Check the TiKV logs to find the cause.
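A configuration sample for this alert item could follow the same pattern as the 10-minute rule. In the sketch below, the expression is copied from the alert item above, while the summary and message texts are illustrative wording rather than official strings.

```yaml
groups:
- name: PiTR
  rules:
  - alert: LogBackupRunningRPOMoreThan30m
    expr: max(time() - tikv_log_backup_store_checkpoint_ts / 262144000) by (task) / 60 > 30 and max(tikv_log_backup_store_checkpoint_ts) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 0
    labels:
      severity: critical
    annotations:
      # Annotation texts are assumptions; adapt them to your conventions.
      summary: RPO of log backup is very high
      message: RPO of the log backup task {{ $labels.task }} is more than 30m
```

If the rule file already defines the PiTR group, append only the rule entry to its existing rules list, because Prometheus rejects duplicate group names within one file.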

LogBackupPausingMoreThan2h

Alert item: max(time() - tikv_log_backup_store_checkpoint_ts / 262144000) by (task) / 3600 > 2 and max(tikv_log_backup_store_checkpoint_ts) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 1
Alert level: warning
Description: The log backup task has been paused for more than 2 hours. This alert item is a reminder. Run br log resume as soon as possible.

LogBackupPausingMoreThan12h

Alert item: max(time() - tikv_log_backup_store_checkpoint_ts / 262144000) by (task) / 3600 > 12 and max(tikv_log_backup_store_checkpoint_ts) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 1
Alert level: critical
Description: The log backup task has been paused for more than 12 hours. Run br log resume as soon as possible to resume the task. A log backup task that stays paused for too long risks data loss.
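The two pause alerts differ only in threshold and severity, so they can be sketched as a pair. The expressions are copied from the alert items above; the summary and message texts are illustrative wording.

```yaml
groups:
- name: PiTR
  rules:
  - alert: LogBackupPausingMoreThan2h
    expr: max(time() - tikv_log_backup_store_checkpoint_ts / 262144000) by (task) / 3600 > 2 and max(tikv_log_backup_store_checkpoint_ts) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 1
    labels:
      severity: warning
    annotations:
      # Annotation texts are assumptions; adapt them to your conventions.
      summary: Log backup task is paused
      message: The log backup task {{ $labels.task }} has been paused for more than 2h
  - alert: LogBackupPausingMoreThan12h
    expr: max(time() - tikv_log_backup_store_checkpoint_ts / 262144000) by (task) / 3600 > 12 and max(tikv_log_backup_store_checkpoint_ts) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 1
    labels:
      severity: critical
    annotations:
      summary: Log backup task is paused for too long
      message: The log backup task {{ $labels.task }} has been paused for more than 12h
```

Here the division by 3600 converts the lag to hours, and the status value 1 restricts both rules to paused tasks.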

LogBackupFailed

Alert item: max(tikv_log_backup_task_status) by (task) == 2 and max(tikv_log_backup_store_checkpoint_ts) by (task) > 0
Alert level: critical
Description: The log backup task has failed. Run br log status to see the failure reason. If necessary, check the TiKV logs for more details.
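A configuration sample for this alert item might look like the following sketch. The expression is copied from the alert item above; the summary and message texts are illustrative wording.

```yaml
groups:
- name: PiTR
  rules:
  - alert: LogBackupFailed
    expr: max(tikv_log_backup_task_status) by (task) == 2 and max(tikv_log_backup_store_checkpoint_ts) by (task) > 0
    labels:
      severity: critical
    annotations:
      # Annotation texts are assumptions; adapt them to your conventions.
      summary: Log backup task failed
      message: The log backup task {{ $labels.task }} has failed
```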

LogBackupGCSafePointExceedsCheckpoint

Alert item: min(tikv_log_backup_store_checkpoint_ts) by (instance) - max(tikv_gcworker_autogc_safe_point) by (instance) < 0
Alert level: critical
Description: Some data was garbage-collected before it was backed up. This means that some data has been lost, which is very likely to affect your services.
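A configuration sample for this alert item might look like the following sketch. The expression is copied from the alert item above. Because it aggregates by instance rather than by task, the illustrative message refers to the instance label; the summary and message texts are assumptions.

```yaml
groups:
- name: PiTR
  rules:
  - alert: LogBackupGCSafePointExceedsCheckpoint
    expr: min(tikv_log_backup_store_checkpoint_ts) by (instance) - max(tikv_gcworker_autogc_safe_point) by (instance) < 0
    labels:
      severity: critical
    annotations:
      # Annotation texts are assumptions; adapt them to your conventions.
      summary: GC safe point exceeds the log backup checkpoint
      message: The GC safe point on {{ $labels.instance }} is ahead of the log backup checkpoint
```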
