[feature](function) add approx_top_k aggregation function #44082
## Proposed changes

1. Add the `approx_top_k` aggregate function. Example: `select approx_top_k(clientip, status, size, 10, 300) from tbl;`

This code implements an approximate Top-N query function based on the SpaceSaving algorithm, a streaming algorithm commonly used to find frequent elements in large datasets. Its main functionality:

(1) Data structures and memory management: The `SpaceSavingArena` class provides a memory pool that manages allocation and deallocation. For keys of type `StringRef`, it copies the string into the pool. The `Counter` struct stores the key, count, and error for each element, and provides serialization and deserialization functions.

(2) Insertion and updates: The `insert` method inserts a new element or updates the count of an existing one. If the structure is below capacity, the new element is inserted; if it is full, the new element replaces the element with the smallest count, chosen by comparing count and error.

(3) Merge: The `merge` method merges two SpaceSaving objects. During the merge, it adjusts the counts and errors of the existing elements so that the result keeps the correct order.

(4) Top-K query: The `top_k` method returns the current Top-K most frequent elements, sorted by count and error.

(5) Capacity expansion and shrinking: The `resize` method adjusts the storage capacity and recalculates the size of `alpha_map` accordingly.

(6) Serialization and deserialization: The `write` and `read` methods serialize the SpaceSaving structure to disk and read it back.

(7) Optimization and performance: Lookup and storage are hash-table based, and the `alpha_map` size is adjusted dynamically to improve performance and reduce memory waste.
In summary, the SpaceSaving class efficiently implements Top-N queries for large data streams within limited memory, with efficient insertion, updating, and merging mechanisms.

Co-authored-by: zzzxl1993 <[email protected]>
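To make the insert and Top-K behaviour described above concrete, here is a minimal, simplified sketch of the SpaceSaving idea. It is not the code in this PR: the names `SketchCounter` and `SpaceSavingSketch` are hypothetical, keys are plain `std::string`, eviction uses a linear scan instead of the PR's hash-table and `alpha_map` machinery, and the arena, `merge`, `resize`, and serialization are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified counter: not the PR's Counter struct.
struct SketchCounter {
    std::string key;
    uint64_t count = 0; // estimated frequency (may over-count)
    uint64_t error = 0; // upper bound on how much `count` over-counts
};

// Hypothetical, simplified SpaceSaving summary holding at most `capacity` keys.
class SpaceSavingSketch {
public:
    explicit SpaceSavingSketch(size_t capacity) : _capacity(capacity) {
        assert(capacity > 0);
    }

    void insert(const std::string& key, uint64_t increment = 1) {
        auto it = _counters.find(key);
        if (it != _counters.end()) {
            // Known key: just bump its count.
            it->second.count += increment;
            return;
        }
        if (_counters.size() < _capacity) {
            // Room left: track the new key exactly.
            _counters.emplace(key, SketchCounter{key, increment, 0});
            return;
        }
        // Full: evict the key with the smallest count. The newcomer inherits
        // that count as its starting point and records it as its error.
        auto min_it = std::min_element(
                _counters.begin(), _counters.end(),
                [](const auto& a, const auto& b) { return a.second.count < b.second.count; });
        const uint64_t min_count = min_it->second.count;
        _counters.erase(min_it);
        _counters.emplace(key, SketchCounter{key, min_count + increment, min_count});
    }

    // Returns up to k counters, highest count first; ties prefer smaller error.
    std::vector<SketchCounter> top_k(size_t k) const {
        std::vector<SketchCounter> out;
        out.reserve(_counters.size());
        for (const auto& entry : _counters) {
            out.push_back(entry.second);
        }
        std::sort(out.begin(), out.end(), [](const SketchCounter& a, const SketchCounter& b) {
            if (a.count != b.count) return a.count > b.count;
            if (a.error != b.error) return a.error < b.error;
            return a.key < b.key;
        });
        if (out.size() > k) {
            out.resize(k);
        }
        return out;
    }

private:
    size_t _capacity;
    std::unordered_map<std::string, SketchCounter> _counters;
};
```

With a capacity of 2, inserting `a` three times, then `b` once, then `c` once evicts `b`: `c` enters with count 2 and error 1, which shows how the error field bounds the over-count introduced by replacement.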
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
…3643) Problem Summary: 1. The function approx_top_sum has been implemented. Here is an example of its usage: select approx_top_sum(c1, c2, c3, 10, 300) from tbl. Add new function `approx_top_sum`.
44c4a05 to 6096e39
run buildall
clang-tidy made some suggestions
#pragma once

#include <rapidjson/encodings.h>
warning: 'rapidjson/encodings.h' file not found [clang-diagnostic-error]
#include <rapidjson/encodings.h>
^
namespace doris::vectorized {

template <typename T>
inline uint32_t get_leading_zero_bits_unsafe(T x) {
warning: unknown type name 'uint32_t' [clang-diagnostic-error]
inline uint32_t get_leading_zero_bits_unsafe(T x) {
^
}

template <typename T>
inline uint32_t bit_scan_reverse(T x) {
warning: unknown type name 'uint32_t' [clang-diagnostic-error]
inline uint32_t bit_scan_reverse(T x) {
^
template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
warning: use of undeclared identifier 'std' [clang-diagnostic-error]
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
^
template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
warning: use of undeclared identifier 'size_t'; did you mean 'sizeof'? [clang-diagnostic-error]
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
return (std::max<sizeof>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
warning: expected expression [clang-diagnostic-error]
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
^
template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
warning: left operand of comma operator has no effect [clang-diagnostic-unused-value]
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
^
#pragma once

#include <boost/range/adaptor/reversed.hpp>
warning: 'boost/range/adaptor/reversed.hpp' file not found [clang-diagnostic-error]
#include <boost/range/adaptor/reversed.hpp>
^
// specific language governing permissions and limitations
// under the License.

#include "vec/common/space_saving.h"
warning: 'vec/common/space_saving.h' file not found [clang-diagnostic-error]
#include "vec/common/space_saving.h"
^
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
pick #40813
pick #43643