zzzxl1993
Contributor

pick #40813
pick #43643

## Proposed changes

1. Example usage: `select approx_top_k(clientip, status, size, 10, 300) from tbl;`

This code implements an approximate Top-N query function based on the
SpaceSaving algorithm, an efficient streaming algorithm commonly used to
find the most frequent elements in large datasets. The main functionality
is described below:
      
(1) Data Structures and Memory Management:
The SpaceSavingArena class provides a memory pool that manages memory
allocation and deallocation. For keys of type StringRef, it copies the
string data into the pool.
The Counter struct stores the key, count, and error for each element,
and provides serialization and deserialization functions.

(2) Insertion and Updates:
The insert method inserts a new element or updates the count of an
existing one. If the structure has not reached capacity, the new element
is simply added; once it is full, the new element replaces the element
with the smallest count, based on the element's count and error.
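
As an illustration of the insert path described above, here is a minimal sketch of the textbook SpaceSaving update. It is not the code from this PR: `SpaceSavingSketch` and `SketchCounter` are hypothetical names, keys are plain `std::string`, and eviction does a linear scan for the minimum instead of using an ordered counter list or an alpha_map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical, simplified counter; not the Counter struct from this PR.
struct SketchCounter {
    std::uint64_t count; // estimated frequency of a tracked key
    std::uint64_t error; // upper bound on how much `count` may overestimate
};

// Textbook SpaceSaving with a fixed number of tracked keys.
class SpaceSavingSketch {
public:
    explicit SpaceSavingSketch(std::size_t capacity) : _capacity(capacity) {
        assert(capacity > 0);
    }

    void insert(const std::string& key, std::uint64_t increment = 1) {
        auto it = _counters.find(key);
        if (it != _counters.end()) {
            // Key already tracked: just bump its count.
            it->second.count += increment;
            return;
        }
        if (_counters.size() < _capacity) {
            // Free slot available: track the key exactly, with zero error.
            _counters.emplace(key, SketchCounter{increment, 0});
            return;
        }
        // Full: evict the key with the smallest count. The newcomer starts
        // from that count and records it as its possible overestimation.
        auto min_it = std::min_element(
                _counters.begin(), _counters.end(),
                [](const auto& a, const auto& b) { return a.second.count < b.second.count; });
        const std::uint64_t min_count = min_it->second.count;
        _counters.erase(min_it);
        _counters.emplace(key, SketchCounter{min_count + increment, min_count});
    }

    // Returns nullptr when the key is not currently tracked.
    const SketchCounter* find(const std::string& key) const {
        auto it = _counters.find(key);
        return it == _counters.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return _counters.size(); }

private:
    std::size_t _capacity;
    std::unordered_map<std::string, SketchCounter> _counters;
};
```

With a capacity of 2, inserting `a`, `a`, `b`, `c` evicts `b` and leaves `c` with count 2 and error 1, which shows why the error field is needed to interpret the counts.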

(3) Merge Operation:
The merge method allows merging two SpaceSaving objects. During the
merge, it adjusts the counts and errors of the existing elements,
ensuring that the result maintains the correct order.
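
To make the count and error adjustment more concrete, the sketch below shows one common way to merge two SpaceSaving summaries. The names `MergeCounter`, `Summary`, and `merge_summaries` are hypothetical, and the exact adjustment rule is an assumption rather than a description of the merge method in this PR: a key missing from a full summary is assumed to have occurred at most that summary's minimum count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types for illustration only.
struct MergeCounter {
    std::uint64_t count;
    std::uint64_t error;
};
using Summary = std::unordered_map<std::string, MergeCounter>;

// Smallest tracked count if the summary is full, otherwise 0 (a summary that
// is not full has seen every key it was given, so absent keys occurred 0 times).
inline std::uint64_t min_count_if_full(const Summary& s, std::size_t capacity) {
    if (s.size() < capacity) return 0;
    std::uint64_t m = s.begin()->second.count;
    for (const auto& kv : s) m = std::min(m, kv.second.count);
    return m;
}

inline Summary merge_summaries(Summary lhs, const Summary& rhs, std::size_t capacity) {
    assert(capacity > 0);
    const std::uint64_t m1 = min_count_if_full(lhs, capacity);
    const std::uint64_t m2 = min_count_if_full(rhs, capacity);

    for (auto& kv : lhs) {
        auto it = rhs.find(kv.first);
        if (it != rhs.end()) {
            // Tracked on both sides: add counts and errors.
            kv.second.count += it->second.count;
            kv.second.error += it->second.error;
        } else {
            // Missing from rhs: it may have occurred up to m2 times there.
            kv.second.count += m2;
            kv.second.error += m2;
        }
    }
    for (const auto& kv : rhs) {
        if (lhs.find(kv.first) == lhs.end()) {
            // Missing from lhs: it may have occurred up to m1 times there.
            lhs.emplace(kv.first, MergeCounter{kv.second.count + m1, kv.second.error + m1});
        }
    }

    // Restore order (count descending, error ascending) and keep `capacity` entries.
    std::vector<std::pair<std::string, MergeCounter>> entries(lhs.begin(), lhs.end());
    std::sort(entries.begin(), entries.end(), [](const auto& a, const auto& b) {
        return a.second.count != b.second.count ? a.second.count > b.second.count
                                                : a.second.error < b.second.error;
    });
    if (entries.size() > capacity) entries.resize(capacity);
    return Summary(entries.begin(), entries.end());
}
```

Taking `lhs` by value keeps both inputs unchanged; an in-place merge would avoid the copy at the cost of mutating the left-hand summary.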

(4) Top-K Query:
The top_k method returns the current Top-K most frequent elements,
sorted by their count and error.
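
A rough sketch of such a query over a flat list of counters follows. `RankedCounter` and `top_k_sketch` are hypothetical names, and the tie-breaking rule (smaller error first when counts are equal) is an assumption about what "sorted by count and error" means here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flat counter record for illustration only.
struct RankedCounter {
    std::string key;
    std::uint64_t count;
    std::uint64_t error;
};

// Returns up to k counters ordered by count descending, then error ascending.
inline std::vector<RankedCounter> top_k_sketch(std::vector<RankedCounter> counters,
                                               std::size_t k) {
    const std::size_t n = std::min(k, counters.size());
    // Only the first n positions need to be in final order.
    std::partial_sort(counters.begin(),
                      counters.begin() + static_cast<std::ptrdiff_t>(n),
                      counters.end(),
                      [](const RankedCounter& a, const RankedCounter& b) {
                          return a.count != b.count ? a.count > b.count : a.error < b.error;
                      });
    counters.resize(n);
    return counters;
}
```

`std::partial_sort` is used instead of a full sort because k is typically much smaller than the number of tracked counters.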

(5) Capacity Expansion and Shrinking:
The resize method allows adjusting the storage capacity, and it
recalculates the size of the alpha_map accordingly.

(6) Serialization and Deserialization:
The write and read methods are provided for serializing the SpaceSaving
structure to disk or reading data from disk.
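
For readers unfamiliar with this kind of round trip, the sketch below serializes a single counter into a byte buffer and reads it back. The layout (little-endian 64-bit key length, key bytes, count, error) and the names `WireCounter`, `write_counter`, and `read_counter` are assumptions made for illustration; the PR's actual on-disk format may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical counter record for illustration only.
struct WireCounter {
    std::string key;
    std::uint64_t count;
    std::uint64_t error;
};

// Appends v as 8 little-endian bytes, independent of host byte order.
inline void append_u64(std::string& out, std::uint64_t v) {
    for (std::size_t i = 0; i < 8; ++i) {
        out.push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
    }
}

// Reads 8 little-endian bytes at pos; returns false if the buffer is too short.
inline bool read_u64(const std::string& buf, std::size_t& pos, std::uint64_t& v) {
    if (pos > buf.size() || buf.size() - pos < 8) return false;
    v = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        v |= static_cast<std::uint64_t>(static_cast<unsigned char>(buf[pos + i])) << (8 * i);
    }
    pos += 8;
    return true;
}

inline void write_counter(std::string& out, const WireCounter& c) {
    append_u64(out, c.key.size()); // length prefix
    out.append(c.key);
    append_u64(out, c.count);
    append_u64(out, c.error);
}

// Returns false on truncated input; pos and c may be partially updated then.
inline bool read_counter(const std::string& buf, std::size_t& pos, WireCounter& c) {
    std::uint64_t len = 0;
    if (!read_u64(buf, pos, len)) return false;
    if (len > buf.size() - pos) return false; // reject lengths past the buffer end
    c.key.assign(buf, pos, static_cast<std::size_t>(len));
    pos += static_cast<std::size_t>(len);
    return read_u64(buf, pos, c.count) && read_u64(buf, pos, c.error);
}
```

Checking the length prefix against the remaining buffer before copying matters when the bytes come from disk, since a corrupted length would otherwise read out of bounds.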

(7) Optimization and Performance:
The code uses a hash table-based approach for lookup and storage, and
dynamically adjusts the alpha_map size to optimize performance and
reduce memory waste.

In summary, the SpaceSaving class efficiently implements Top-N queries
for large data streams within limited memory, with efficient insertion,
updating, and merging mechanisms.

Co-authored-by: zzzxl1993 <[email protected]>
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@zzzxl1993
Contributor Author

run buildall

…3643)

Problem Summary:

1. The function `approx_top_sum` has been implemented. Here is an example
of its usage: `select approx_top_sum(c1, c2, c3, 10, 300) from tbl`.

Add new function `approx_top_sum`.
@zzzxl1993 zzzxl1993 force-pushed the branch-3.0.202409131832 branch from 44c4a05 to 6096e39 Compare November 17, 2024 03:27
@zzzxl1993
Contributor Author

run buildall


@github-actions github-actions bot left a comment

clang-tidy made some suggestions


#pragma once

#include <rapidjson/encodings.h>

warning: 'rapidjson/encodings.h' file not found [clang-diagnostic-error]

#include <rapidjson/encodings.h>
         ^


#pragma once

#include <rapidjson/encodings.h>

warning: 'rapidjson/encodings.h' file not found [clang-diagnostic-error]

#include <rapidjson/encodings.h>
         ^

namespace doris::vectorized {

template <typename T>
inline uint32_t get_leading_zero_bits_unsafe(T x) {

warning: unknown type name 'uint32_t' [clang-diagnostic-error]

inline uint32_t get_leading_zero_bits_unsafe(T x) {
       ^

}

template <typename T>
inline uint32_t bit_scan_reverse(T x) {

warning: unknown type name 'uint32_t' [clang-diagnostic-error]

inline uint32_t bit_scan_reverse(T x) {
       ^


template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -

warning: use of undeclared identifier 'std' [clang-diagnostic-error]

    return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
            ^


template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -

warning: use of undeclared identifier 'size_t'; did you mean 'sizeof'? [clang-diagnostic-error]

Suggested change (original line, followed by clang-tidy's automatic replacement):
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
return (std::max<sizeof>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -


template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -

warning: expected expression [clang-diagnostic-error]

    return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
                           ^


template <typename T>
inline uint32_t bit_scan_reverse(T x) {
return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -

warning: left operand of comma operator has no effect [clang-diagnostic-unused-value]

    return (std::max<size_t>(sizeof(T), sizeof(unsigned int))) * 8 - 1 -
                             ^


#pragma once

#include <boost/range/adaptor/reversed.hpp>

warning: 'boost/range/adaptor/reversed.hpp' file not found [clang-diagnostic-error]

#include <boost/range/adaptor/reversed.hpp>
         ^

// specific language governing permissions and limitations
// under the License.

#include "vec/common/space_saving.h"

warning: 'vec/common/space_saving.h' file not found [clang-diagnostic-error]

#include "vec/common/space_saving.h"
         ^


PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Nov 18, 2024

PR approved by anyone and no changes requested.

@qidaye qidaye merged commit 15e26de into apache:branch-3.0 Nov 18, 2024
22 of 25 checks passed