Benchmark for filter+concat and take+concat into even sized record batches #7589

@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It is a common pattern in DataFusion to take some subset of rows from an input stream of RecordBatches and produce evenly sized (e.g. 8192 rows) output RecordBatches. The subset is either described by a BooleanArray (filter) or by a set of indices (take). This is explained here:

I have hit the need for the same low-level primitive while working on improving Parquet filtering performance.

The current way to accomplish this is to apply the filter kernel to make several smaller RecordBatches and then concat to put them all together (see the sketch after this list). This mechanism has two disadvantages:

  1. It copies the data twice (which is especially slow for Utf8/Binary and other variable-length arrays) and performs multiple allocations
  2. It requires buffering all the input batches until enough are ready to form the output. This is especially problematic for Utf8View and other view types, where the result of filtering may still hold significant amounts of memory and sometimes requires copying the views more than once (see here)
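
To make the two-copy pattern concrete, here is a minimal sketch of the current approach using the existing `filter_record_batch` and `concat_batches` kernels. The `filter_then_concat` helper and its `target_rows` parameter are hypothetical, for illustration only (it assumes a non-empty input and produces batches of *at least* `target_rows` rows rather than exactly that size):

```rust
use arrow::array::BooleanArray;
use arrow::compute::{concat_batches, filter_record_batch};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Filter each input batch, buffer the survivors until at least
/// `target_rows` rows have accumulated, then concat them into a
/// single output batch. Note the two copies: one in `filter`,
/// one in `concat`.
fn filter_then_concat(
    inputs: &[(RecordBatch, BooleanArray)],
    target_rows: usize,
) -> Result<Vec<RecordBatch>, ArrowError> {
    let schema = inputs[0].0.schema();
    let mut buffered = Vec::new();
    let mut buffered_rows = 0;
    let mut output = Vec::new();
    for (batch, predicate) in inputs {
        // First copy: `filter_record_batch` materializes the selected rows
        let filtered = filter_record_batch(batch, predicate)?;
        buffered_rows += filtered.num_rows();
        buffered.push(filtered);
        // Buffer until enough rows are ready to form an output batch
        if buffered_rows >= target_rows {
            // Second copy: `concat_batches` copies the buffered rows again
            output.push(concat_batches(&schema, &buffered)?);
            buffered.clear();
            buffered_rows = 0;
        }
    }
    if !buffered.is_empty() {
        output.push(concat_batches(&schema, &buffered)?);
    }
    Ok(output)
}
```

The take+concat variant is analogous: replace the BooleanArray predicate with an index array and `filter_record_batch` with the `take` kernel applied to each column.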

However, it turns out that the filter and concat kernels are already very highly optimized, so getting better performance is actually quite challenging, as described in #7513.

Describe the solution you'd like
I would like to optimize the filter+concat operation, especially for variable-length data arrays.

To do so, let's start with a benchmark.
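
A starting point could be a Criterion benchmark along these lines. This is a sketch only: the `make_batch` and `make_predicate` helpers, the column name, the row counts, and the 25% selectivity are illustrative assumptions, not the final benchmark:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, BooleanArray, StringArray};
use arrow::compute::{concat_batches, filter_record_batch};
use arrow::record_batch::RecordBatch;
use criterion::{criterion_group, criterion_main, Criterion};

/// Build a single-column Utf8 batch with `num_rows` rows (illustrative data).
fn make_batch(num_rows: usize) -> RecordBatch {
    let values =
        StringArray::from_iter_values((0..num_rows).map(|i| format!("value-{i}")));
    RecordBatch::try_from_iter([("c0", Arc::new(values) as ArrayRef)]).unwrap()
}

/// Build a predicate that keeps one row in every `keep_one_in` rows.
fn make_predicate(num_rows: usize, keep_one_in: usize) -> BooleanArray {
    (0..num_rows).map(|i| Some(i % keep_one_in == 0)).collect()
}

fn filter_concat_benchmark(c: &mut Criterion) {
    let batch = make_batch(8192);
    let predicate = make_predicate(8192, 4); // keep ~25% of rows
    let num_batches = 64;

    c.bench_function("filter+concat utf8, 25% selectivity", |b| {
        b.iter(|| {
            // Filter each input batch, then concat the pieces back together
            let filtered: Vec<RecordBatch> = (0..num_batches)
                .map(|_| filter_record_batch(&batch, &predicate).unwrap())
                .collect();
            concat_batches(&batch.schema(), &filtered).unwrap()
        })
    });
}

criterion_group!(benches, filter_concat_benchmark);
criterion_main!(benches);
```

Varying the selectivity, the data type (Utf8 vs. Utf8View vs. primitive), and the output batch size would cover the cases described above; a matching take+concat benchmark can reuse the same generated batches.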

Describe alternatives you've considered

Additional context

Labels

arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog), performance
