-
Notifications
You must be signed in to change notification settings - Fork 415
Open
Labels
type/enhancementThe issue or PR belongs to an enhancement.The issue or PR belongs to an enhancement.
Description
Enhancement
Performance Comparison: utf8*_ci vs. Bin Collations for LIKE Operations (TiFlash + TPCH Dataset)
1. Test Background & Objective
Verify the performance difference between utf8_ci collations (case-insensitive, UTF-8-aware)* and bin collations (binary, byte-level) when executing LIKE
pattern matching queries in TiFlash (TiDB's columnar storage engine). The core focus is on comparing query latency under the two collation types, with specific attention to the overhead of UTF-8 processing in utf8*_ci collations.
2. Test Environment & Configuration
Configuration Item | Details |
---|---|
Dataset | TPCH 10G Dataset (only the orders table is used, which contains the text field o_comment stored in UTF-8) |
Execution Engine | TiFlash (TiDB Columnar Storage Engine) |
Concurrency Control | Restrict TiFlash to 1 concurrent thread via set tidb_max_tiflash_threads=1; to eliminate multi-threading interference |
Test Field | orders.o_comment (UTF-8 text field used for LIKE pattern matching) |
Collation Types | Two collations are compared: - utf8*_ci Collation: UTF-8-aware case-insensitive collations (e.g., utf8mb4_general_ci , utf8mb4_unicode_ci )- Bin Collation: Binary collations (e.g., utf8mb4_bin ) |
3. Test Steps
- Environment Initialization: Load the TPCH 10G Dataset and ensure the
orders
table has complete, UTF-8-encoded data. - Concurrency Configuration: Execute the following command in the TiDB client to fix TiFlash's concurrency and isolate performance variables:
set tidb_max_tiflash_threads=1;
- Query Execution (utf8_ci Collation Scenario)*: Set the
o_comment
field to a utf8*_ci collation (e.g.,utf8mb4_general_ci
), then run the coreLIKE
query from TPCH Query 13 (Q13):SELECT COUNT(*) FROM orders WHERE o_comment LIKE "%special%requests%";
- Query Execution (Bin Collation Scenario): Change the
o_comment
field to a bin collation (e.g.,utf8mb4_bin
), then re-run the same query from Step 3 to ensure consistency. - Latency Measurement: Record the total execution time (latency) of the query in both scenarios (from request initiation to result return), averaging 3 runs to minimize variance.
4. Test Results
Collation Type | Collation Example | Average Query Latency |
---|---|---|
utf8*_ci Collation | utf8mb4_general_ci |
~5 seconds |
Bin Collation | utf8mb4_bin |
~0.7 seconds |
5. Conclusion
Under TiFlash's single-threaded concurrency, for the LIKE "%special%requests%"
pattern matching query on the UTF-8-encoded orders.o_comment
field:
- The latency of bin collations (~0.7 seconds) is only 14% of that of utf8*_ci collations (~5 seconds), demonstrating a dramatic performance advantage.
- Primary Root Cause: utf8*_ci collations incur far higher overhead because they require UTF-8 decoding and character normalization as a prerequisite for comparison. Raw UTF-8 byte sequences must first be decoded into logical Unicode characters; additional processing is then needed to handle character equivalence (e.g., case folding like
A
vsa
, or accented character normalization likeé
vse´
). This two-step Unicode-aware processing adds substantial computational cost, especially for large text fields in bulk queries. - By contrast, bin collations bypass all Unicode-related overhead: they compare strings directly as raw UTF-8 byte sequences, with no decoding, normalization, or case-insensitive checks. This byte-level comparison logic is minimal and efficient, resulting in drastically lower latency for
LIKE
operations.
Metadata
Metadata
Assignees
Labels
type/enhancementThe issue or PR belongs to an enhancement.The issue or PR belongs to an enhancement.