BagritsevichStepan (Contributor) commented Aug 29, 2025

In the Split method we had a rather inefficient algorithm.

First, we used Insert, which performs a binary search to find the block and then inserts a new entry. However, during splitting we already have the document IDs sorted, meaning every new entry is simply appended to the end of the block list.

The second issue was that Insert always triggered block splitting inside BlockList, which is a linear operation. Together, these introduced significant overhead in the Split method.

To address this, I added a PushBack method to BlockList that bypasses binary search and avoids block splitting.
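
A minimal sketch of the idea, not the actual Dragonfly BlockList (block layout, the capacity value, and member names are assumptions for illustration): PushBack appends to the tail block, skipping both the binary search and the split work that Insert performs.

```cpp
#include <cstdint>
#include <vector>

using DocId = uint64_t;

// Simplified stand-in for a block list: a sequence of sorted blocks of doc ids.
struct BlockList {
  std::vector<std::vector<DocId>> blocks;
  size_t block_capacity = 128;  // illustrative capacity

  // Insert would binary-search for the target block, insert in order, and
  // split the block when it overflows (linear in the block size). Omitted here.

  // PushBack: the ids produced during Split are already sorted, so we can
  // append to the last block and avoid both the search and the split.
  void PushBack(DocId id) {
    if (blocks.empty() || blocks.back().size() >= block_capacity) {
      blocks.emplace_back();
      blocks.back().reserve(block_capacity);
    }
    blocks.back().push_back(id);
  }
};
```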

BagritsevichStepan (Contributor Author) commented Aug 29, 2025

Before:

| Benchmark | Time (ns) | CPU (ns) | Iterations |
|---|---|---|---|
| BM_SearchRangeTreeSplits/block_size:100000 | 58,746,631 | 58,736,587 | 10 |
| BM_SearchRangeTreeSplits/block_size:1000000 | 569,836,676 | 569,791,624 | 1 |

After:

| Benchmark | Time (ns) | CPU (ns) | Iterations |
|---|---|---|---|
| BM_SearchRangeTreeSplits/block_size:100000 | 18,841,322 | 18,837,811 | 39 |
| BM_SearchRangeTreeSplits/block_size:1000000 | 193,917,459 | 193,877,924 | 4 |


```cpp
std::nth_element(all_entries.begin(), all_entries.begin() + initial_size / 2, all_entries.end(),
                 [](const Entry& l, const Entry& r) { return l.second < r.second; });
std::nth_element(entries_indexes.begin(), entries_indexes.begin() + elements_count / 2,
```
Collaborator commented:
I am confused. Are all elements already sorted by score? Is that why you use PushBack?
In that case, why do you need to partially sort entries_indexes here? Wouldn't it be enough to just advance the block_list iterator by elements_count / 2 steps?

BagritsevichStepan (Contributor Author) replied:

The elements are std::pair<DocId, double>, meaning we store a numeric value associated with each doc id. In the BlockList, elements are naturally sorted by DocId. Therefore, to find the median, we need to "sort" them by their values instead.
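
For illustration, a self-contained sketch of that median selection. The Entry type and the comparator mirror the quoted diff; the function name MedianByScore, the DocId alias, and the surrounding scaffolding are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using DocId = uint64_t;
using Entry = std::pair<DocId, double>;  // (doc id, score)

// Entries come out of the BlockList ordered by DocId, not by score, so
// nth_element places the score-median at the midpoint in O(n) on average
// without fully sorting the range.
Entry MedianByScore(std::vector<Entry> entries) {
  assert(!entries.empty());
  auto mid = entries.begin() + entries.size() / 2;
  std::nth_element(entries.begin(), mid, entries.end(),
                   [](const Entry& l, const Entry& r) { return l.second < r.second; });
  return *mid;
}
```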

BagritsevichStepan merged commit 0a6fe04 into dragonflydb:main on Sep 11, 2025
10 checks passed
BagritsevichStepan deleted the search/speed-up-split-method branch on September 11, 2025 08:05