Improve like perf for utf8 ci collations #10400
dbms/src/TiDB/Collation/Collator.cpp
Outdated
}

template <typename Collator>
void Pattern<Collator>::try_compile_ascii_ci(const std::string & pattern, char escape)
We use camelCase for function names.
OK
dbms/src/TiDB/Collation/Collator.cpp
Outdated
auto ret = try_match_ascii_ci(s, length);
if (ret == 1)
{
    return true;
}
else if (ret == 0)
{
    return false;
}
// if ret == -1, means the string contains non-ASCII characters, continue to check
Maybe we can simplify to something like:
if (auto ret = try_match_ascii_ci(s, length); ret >= 0) {
return ret;
}
Good point
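A minimal sketch of the suggested shape, assuming C++17 (for the `if` statement with initializer). The two helper bodies are hypothetical stand-ins, not the PR's code: the fast path here only recognizes the literal "abc" ignoring ASCII case, and the UTF-8 path just counts how often it is reached. Note that the sketch returns `ret == 1` rather than `ret`, which makes the int-to-bool conversion explicit.

```cpp
#include <cstddef>

static int slow_path_calls = 0;

// Hypothetical stand-in for the ASCII fast path.
// Returns 1 = match, 0 = no match, -1 = non-ASCII byte seen (undecided).
static int tryMatchAsciiCi(const char * s, size_t length)
{
    static const char folded[] = "ABC";
    for (size_t i = 0; i < length; ++i)
        if (static_cast<unsigned char>(s[i]) >= 0x80)
            return -1;
    if (length != 3)
        return 0;
    for (size_t i = 0; i < 3; ++i)
    {
        char c = s[i];
        if (c >= 'a' && c <= 'z')
            c = static_cast<char>(c - 'a' + 'A');
        if (c != folded[i])
            return 0;
    }
    return 1;
}

// Hypothetical stand-in for the full UTF-8 matching path.
static bool matchUtf8(const char *, size_t)
{
    ++slow_path_calls;
    return false;
}

static bool match(const char * s, size_t length)
{
    // `ret` is scoped to the if statement; -1 falls through to the slow path.
    if (auto ret = tryMatchAsciiCi(s, length); ret >= 0)
        return ret == 1;
    return matchUtf8(s, length);
}
```

With these stand-ins, `match("AbC", 3)` is decided entirely by the fast path, while an input containing a byte >= 0x80 reaches `matchUtf8`.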
dbms/src/TiDB/Collation/Collator.h
Outdated
virtual ~IPattern() = default;

virtual void compile(const std::string & pattern, char escape) = 0;
virtual void try_compile_ascii_ci(const std::string & pattern, char escape) = 0;
camelCase
else
{
    if (s[str_idx] < 0)
    {
add a comment here?
OK
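For context on why a comment helps here: `s[str_idx] < 0` detects a non-ASCII byte only where plain `char` is signed, and the signedness of `char` is implementation-defined in C++. A signedness-independent form could look like the sketch below; `isNonAsciiByte` is a hypothetical helper name, not something taken from the PR.

```cpp
// Hypothetical helper: true for any byte with the high bit set, i.e. a byte
// that is part of a multi-byte UTF-8 sequence rather than an ASCII character.
// The cast makes the result independent of whether `char` is signed.
inline bool isNonAsciiByte(char c)
{
    return static_cast<unsigned char>(c) >= 0x80;
}
```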
dbms/src/TiDB/Collation/Collator.cpp
Outdated
    return -1;
}
if ((match_types[p_idx] == Match
     && weight_ascii_ci[s[str_idx]] == weight_ascii_ci[ascii_ci_pattern[p_idx]])
Why not just save the converted pattern in ascii_ci_pattern during try_compile_ascii? Then, during match, there is no need to look up weight_ascii_ci for the pattern chars.
Good point
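The idea in this thread can be sketched as follows: fold the pattern through the weight table once at compile time, so that matching needs one table lookup per position (for the input byte only). Everything here is an assumption for illustration: the table contents (`a`..`z` folded onto `A`..`Z`), the `AsciiCiLiteral` type, and the restriction to literal patterns without wildcards.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical 128-entry weight table: folds 'a'..'z' onto 'A'..'Z',
// identity for every other ASCII byte.
static std::array<char, 128> makeWeights()
{
    std::array<char, 128> w{};
    for (int c = 0; c < 128; ++c)
        w[c] = (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : static_cast<char>(c);
    return w;
}
static const std::array<char, 128> kWeights = makeWeights();

struct AsciiCiLiteral
{
    std::string folded; // pattern bytes already mapped through kWeights

    // Returns false (leaving `folded` empty) if the pattern has a non-ASCII byte.
    bool compile(const std::string & pattern)
    {
        folded.clear();
        for (unsigned char c : pattern)
        {
            if (c >= 0x80)
            {
                folded.clear();
                return false;
            }
            folded.push_back(kWeights[c]);
        }
        return true;
    }

    // Only the input byte is folded here; the pattern side was folded in compile().
    bool equals(const char * s, size_t length) const
    {
        if (length != folded.size())
            return false;
        for (size_t i = 0; i < length; ++i)
        {
            const auto c = static_cast<unsigned char>(s[i]);
            if (c >= 0x80 || kWeights[c] != folded[i])
                return false;
        }
        return true;
    }
};
```

After `compile("Hello")`, `folded` holds `"HELLO"`, and each comparison in `equals` costs a single lookup instead of two.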
const std::string & s = inner_c.first;
res[idx] = pattern->match(s.data(), s.length());
bool ans = std::get<Collator::collation_case>(inner_c.second);
std::cout << "Pattern case (" << p << ", " << inner_c.first << ", " << ans << ")" << std::endl;
this line should be removed?
This line of code is from the original line 352. To facilitate locating specific cases when errors occur, let's leave it unchanged this time?
ok
lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: solotzg, windtalker
What problem does this PR solve?
Issue Number: close #10399
Problem Summary:
When performing a LIKE operation on strings in SQL, the underlying code logic varies depending on the collation used, leading to significant performance differences.
Currently, the LIKE operation performs well with BIN-type collations, but it is substantially slower with CI-type collations.
What is changed and how it works?
Key Optimizations
1. Prioritize ASCII_CI Comparison for LIKE Patterns
Check if the LIKE pattern consists of only ASCII characters first. If so, prioritize case-insensitive (CI) comparison using the small (128-entry) ASCII character table (avoiding expensive UTF-8 decoding initially). Fall back to UTF-8 decoding only when non-ASCII characters are detected during matching.
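The fast path described above can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's implementation: the weight table simply folds `a`..`z` onto `A`..`Z`, only the `%` and `_` wildcards are handled (no escape character), the caller is assumed to have verified that the pattern is pure ASCII, and the function names are hypothetical. The tri-state result mirrors the convention quoted in the review: 1 for a match, 0 for no match, -1 when a non-ASCII byte is encountered and the caller must fall back to the UTF-8 path.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical 128-entry case-folding table for ASCII.
static std::array<unsigned char, 128> buildAsciiCiWeights()
{
    std::array<unsigned char, 128> w{};
    for (int c = 0; c < 128; ++c)
        w[c] = static_cast<unsigned char>((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
    return w;
}

// Precondition: `pattern` contains only ASCII bytes.
// Returns 1 = match, 0 = no match, -1 = non-ASCII byte seen in `s`.
static int tryMatchAsciiCi(const std::string & pattern, const std::string & s)
{
    static const auto weights = buildAsciiCiWeights();
    size_t p = 0, i = 0;
    size_t resume_p = std::string::npos; // pattern index just after the last '%'
    size_t resume_i = 0;                 // string index where that '%' started matching
    while (i < s.size())
    {
        const auto c = static_cast<unsigned char>(s[i]);
        if (c >= 0x80)
            return -1; // undecided: the caller decodes UTF-8 instead
        if (p < pattern.size() && pattern[p] == '%')
        {
            resume_p = ++p;
            resume_i = i;
        }
        else if (p < pattern.size()
                 && (pattern[p] == '_' || weights[static_cast<unsigned char>(pattern[p])] == weights[c]))
        {
            ++p;
            ++i;
        }
        else if (resume_p != std::string::npos)
        {
            // Backtrack: let the last '%' absorb one more byte.
            p = resume_p;
            i = ++resume_i;
        }
        else
        {
            return 0;
        }
    }
    while (p < pattern.size() && pattern[p] == '%')
        ++p;
    return p == pattern.size() ? 1 : 0;
}
```

Returning -1 as soon as any non-ASCII byte is reached is deliberately conservative: the sketch never tries to decide a match across a multi-byte character, so the table lookup stays within its 128 entries.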
2. Optimize Backtracking Logic for Patterns Starting with "%"
When a pattern starts with %, restart backtracking directly from the character after the % (instead of restarting from the % itself). When the first character after backtracking does not match, immediately advance the string pointer to skip redundant loop iterations, which significantly reduces the total number of operations.
Benchmark
To eliminate interference and facilitate comparison, a single-node TiFlash was used, and single concurrency was configured with the following setting:
set tidb_max_tiflash_threads=1;
For the query (under TPCH 10G data):
SELECT COUNT(*) FROM orders WHERE o_comment LIKE ?;
The comparison of execution time before and after optimization is as follows (unit: seconds):
"%ZZZ%"
"%中国%"
"%special%requests%" (from TPCH Q13)
Check List
Tests
Side effects
Documentation
Release note