Skip to content

Improve like perf for utf8 CI collation #10399

@ChangRui-Ryan

Description

@ChangRui-Ryan

Enhancement

Performance Comparison: utf8*_ci vs. Bin Collations for LIKE Operations (TiFlash + TPCH Dataset)

1. Test Background & Objective

Verify the performance difference between utf8_ci collations (case-insensitive, UTF-8-aware)* and bin collations (binary, byte-level) when executing LIKE pattern matching queries in TiFlash (TiDB's columnar storage engine). The core focus is on comparing query latency under the two collation types, with specific attention to the overhead of UTF-8 processing in utf8*_ci collations.

2. Test Environment & Configuration

Configuration Item Details
Dataset TPCH 10G Dataset (only the orders table is used, which contains the text field o_comment stored in UTF-8)
Execution Engine TiFlash (TiDB Columnar Storage Engine)
Concurrency Control Restrict TiFlash to 1 concurrent thread via set tidb_max_tiflash_threads=1; to eliminate multi-threading interference
Test Field orders.o_comment (UTF-8 text field used for LIKE pattern matching)
Collation Types Two collations are compared:
- utf8*_ci Collation: UTF-8-aware case-insensitive collations (e.g., utf8mb4_general_ci, utf8mb4_unicode_ci)
- Bin Collation: Binary collations (e.g., utf8mb4_bin)

3. Test Steps

  1. Environment Initialization: Load the TPCH 10G Dataset and ensure the orders table has complete, UTF-8-encoded data.
  2. Concurrency Configuration: Execute the following command in the TiDB client to fix TiFlash's concurrency and isolate performance variables:
    set tidb_max_tiflash_threads=1;
  3. Query Execution (utf8_ci Collation Scenario)*: Set the o_comment field to a utf8*_ci collation (e.g., utf8mb4_general_ci), then run the core LIKE query from TPCH Query 13 (Q13):
    SELECT COUNT(*) FROM orders WHERE o_comment LIKE "%special%requests%";
  4. Query Execution (Bin Collation Scenario): Change the o_comment field to a bin collation (e.g., utf8mb4_bin), then re-run the same query from Step 3 to ensure consistency.
  5. Latency Measurement: Record the total execution time (latency) of the query in both scenarios (from request initiation to result return), averaging 3 runs to minimize variance.

4. Test Results

Collation Type Collation Example Average Query Latency
utf8*_ci Collation utf8mb4_general_ci ~5 seconds
Bin Collation utf8mb4_bin ~0.7 seconds

5. Conclusion

Under TiFlash's single-threaded concurrency, for the LIKE "%special%requests%" pattern matching query on the UTF-8-encoded orders.o_comment field:

  • The latency of bin collations (~0.7 seconds) is only 14% of that of utf8*_ci collations (~5 seconds), demonstrating a dramatic performance advantage.
  • Primary Root Cause: utf8*_ci collations incur far higher overhead because they require UTF-8 decoding and character normalization as a prerequisite for comparison. Raw UTF-8 byte sequences must first be decoded into logical Unicode characters; additional processing is then needed to handle character equivalence (e.g., case folding like A vs a, or accented character normalization like é vs ). This two-step Unicode-aware processing adds substantial computational cost, especially for large text fields in bulk queries.
  • By contrast, bin collations bypass all Unicode-related overhead: they compare strings directly as raw UTF-8 byte sequences, with no decoding, normalization, or case-insensitive checks. This byte-level comparison logic is minimal and efficient, resulting in drastically lower latency for LIKE operations.

Metadata

Metadata

Assignees

No one assigned

    Labels

    type/enhancementThe issue or PR belongs to an enhancement.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions