fix(block_list): Speed up Split method #5748
Before:
After:
src/core/search/block_list.cc
std::nth_element(all_entries.begin(), all_entries.begin() + initial_size / 2, all_entries.end(),
                 [](const Entry& l, const Entry& r) { return l.second < r.second; });
std::nth_element(entries_indexes.begin(), entries_indexes.begin() + elements_count / 2,
I am confused. Are all elements already sorted by score? Do you use `PushBack` because they are sorted?
In that case, why do you need to partially sort `entries_indexes` here? Wouldn't it be enough to advance the `block_list` iterator by `elements_count / 2` steps?
The elements are `std::pair<DocId, double>`, meaning we store a numeric value associated with its doc ID. In the `BlockList`, elements are naturally sorted by `DocId`. Therefore, to find the median, we need to "sort" them by their values instead.
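To illustrate the reply above, here is a minimal sketch of a median split over entries that arrive sorted by `DocId`. It is not the code from this PR: `SplitResult` and `SplitByMedian` are hypothetical names, and re-sorting each half of the index vector to restore `DocId` order is an assumption about what a caller would want before appending the halves to new lists.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

using DocId = std::uint32_t;
using Entry = std::pair<DocId, double>;

// Hypothetical result type: two halves plus the value used as the pivot.
struct SplitResult {
  std::vector<Entry> left;   // entries with the smaller values, ordered by DocId
  std::vector<Entry> right;  // entries with the larger values, ordered by DocId
  double median = 0.0;
};

// `entries` is assumed to be sorted by DocId, as in the discussion above.
SplitResult SplitByMedian(const std::vector<Entry>& entries) {
  assert(!entries.empty());

  // Work on indexes so the entries themselves are never moved.
  std::vector<std::size_t> idx(entries.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});

  // Partition the indexes around the median *value* (average linear time).
  auto mid = idx.begin() + idx.size() / 2;
  std::nth_element(idx.begin(), mid, idx.end(), [&](std::size_t l, std::size_t r) {
    return entries[l].second < entries[r].second;
  });

  SplitResult res;
  res.median = entries[*mid].second;

  // Index order equals DocId order, so sorting each half of the indexes
  // restores ascending DocId within that half.
  std::sort(idx.begin(), mid);
  std::sort(mid, idx.end());

  for (auto it = idx.begin(); it != mid; ++it) res.left.push_back(entries[*it]);
  for (auto it = mid; it != idx.end(); ++it) res.right.push_back(entries[*it]);
  return res;
}
```

Simply advancing an iterator by half the element count would split by `DocId` rather than by value, which is why the partial sort is still needed.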
Signed-off-by: Stepan Bagritsevich <[email protected]>
In the `Split` method we had a rather inefficient algorithm.

First, we used `Insert`, which performs a binary search to find the block and then inserts a new entry. However, during splitting we already have the document IDs sorted, meaning every new entry is simply appended to the end of the block list.

The second issue was that `Insert` always triggered block splitting inside `BlockList`, which is a linear operation. Together, these introduced significant overhead in the split method.

To address this, I added a `PushBack` method to `BlockList` that bypasses binary search and avoids block splitting.
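The difference between the two paths can be sketched with a toy container. `SimpleBlockList` below is a hypothetical, heavily simplified stand-in and not the project's `BlockList`: the block layout, the capacity rule, and the accessor names are all assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using DocId = std::uint32_t;
using Entry = std::pair<DocId, double>;

// Hypothetical toy container: entries sorted by DocId, stored in bounded blocks.
class SimpleBlockList {
 public:
  explicit SimpleBlockList(std::size_t capacity) : capacity_(capacity) {
    assert(capacity_ >= 2);
  }

  // General path: binary search for the block, ordered insert, split if overfull.
  void Insert(Entry e) {
    ++size_;
    if (blocks_.empty()) {
      blocks_.push_back(std::vector<Entry>{e});
      return;
    }
    // First block whose smallest DocId exceeds e.first; the target is the one before it.
    auto it = std::upper_bound(
        blocks_.begin(), blocks_.end(), e.first,
        [](DocId id, const std::vector<Entry>& b) { return id < b.front().first; });
    if (it != blocks_.begin()) --it;

    auto pos = std::lower_bound(
        it->begin(), it->end(), e,
        [](const Entry& l, const Entry& r) { return l.first < r.first; });
    it->insert(pos, e);

    if (it->size() > capacity_) {
      // Linear-time split: move the upper half into a new block.
      auto half = it->begin() + it->size() / 2;
      std::vector<Entry> tail(half, it->end());
      it->erase(half, it->end());
      blocks_.insert(it + 1, std::move(tail));
    }
  }

  // Fast path: the caller guarantees strictly increasing DocIds, so no search
  // and no split are needed; a full last block just starts a new one.
  void PushBack(Entry e) {
    assert(blocks_.empty() || blocks_.back().back().first < e.first);
    if (blocks_.empty() || blocks_.back().size() >= capacity_) blocks_.emplace_back();
    blocks_.back().push_back(e);
    ++size_;
  }

  std::size_t Size() const { return size_; }
  std::size_t BlockCount() const { return blocks_.size(); }

  // Returns all entries in DocId order.
  std::vector<Entry> Flatten() const {
    std::vector<Entry> out;
    for (const auto& b : blocks_) out.insert(out.end(), b.begin(), b.end());
    return out;
  }

 private:
  std::size_t capacity_;
  std::size_t size_ = 0;
  std::vector<std::vector<Entry>> blocks_;
};
```

When the input is already sorted by `DocId`, both methods produce the same sequence of entries, but `PushBack` only ever touches the last block.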