Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It is a common pattern in DataFusion to take some subset of rows from an input stream of RecordBatches and produce evenly sized (e.g. 8192 bytes) output RecordBatches. The subset is either described by a BooleanArray (`filter`) or a set of indices (`take`). This is explained here:
I have hit the need for the same low-level primitive while working on improving parquet filtering performance.
The current way to accomplish this is to apply the `filter` kernel to make several smaller RecordBatches and then `concat` to put them all together. This mechanism has two disadvantages:
- It copies the data twice (which is especially slow for Utf8/Binary and other variable length arrays), with multiple allocations
- It requires buffering all the input batches until enough are ready to form the output (which is especially problematic for Utf8View and other view types, where the result of filtering may still hold significant amounts of memory, and sometimes requires copying the views more than once; see here)
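The double copy above can be sketched with plain Vecs standing in for RecordBatches (a simplified illustration, not arrow's actual `filter`/`concat` API):

```rust
// Minimal sketch of the filter + concat pattern. Plain Vec<i32> stands in
// for a RecordBatch column; the real kernels are arrow's `filter` and
// `concat`, this just makes the two copies visible.

/// First copy: keep only the rows selected by the boolean mask.
fn filter(values: &[i32], mask: &[bool]) -> Vec<i32> {
    values
        .iter()
        .zip(mask)
        .filter_map(|(v, &keep)| keep.then_some(*v))
        .collect()
}

/// Second copy: concatenate the buffered filtered pieces into one output.
fn concat(batches: &[Vec<i32>]) -> Vec<i32> {
    batches.iter().flatten().copied().collect()
}

fn main() {
    // Two small input "batches", each with its own selection mask.
    let b1 = filter(&[1, 2, 3, 4], &[true, false, true, false]);
    let b2 = filter(&[5, 6, 7, 8], &[false, true, true, true]);
    // All filtered batches must be buffered before concat can run.
    let out = concat(&[b1, b2]);
    assert_eq!(out, vec![1, 3, 6, 7, 8]);
    println!("{:?}", out);
}
```

Every selected row is copied once by `filter` and again by `concat`, and the intermediate filtered batches must all stay alive until the final concatenation.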
However, it turns out that the `filter` and `concat` kernels are very highly optimized, so getting better performance is actually quite challenging, as described in #7513.
Describe the solution you'd like
I would like to optimize the filter + concat operation, especially for variable length data arrays.
To do so, let's start with a benchmark.
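One way to picture the desired single-pass primitive (hypothetical names, plain Vecs again, not arrow's actual API): append the selected rows of each incoming batch directly into one growing output buffer, so the data is copied once and no intermediate filtered batches are materialized.

```rust
// Hypothetical single-pass alternative to filter + concat: filtered rows
// are written straight into the output buffer as each batch arrives.
struct Coalescer {
    output: Vec<i32>, // grows until a target output batch size is reached
}

impl Coalescer {
    fn new(capacity: usize) -> Self {
        Self {
            output: Vec::with_capacity(capacity),
        }
    }

    /// Copy the rows selected by `mask` directly into the output (one copy).
    fn push_filtered(&mut self, values: &[i32], mask: &[bool]) {
        for (v, &keep) in values.iter().zip(mask) {
            if keep {
                self.output.push(*v);
            }
        }
    }
}

fn main() {
    let mut c = Coalescer::new(8192);
    c.push_filtered(&[1, 2, 3, 4], &[true, false, true, false]);
    c.push_filtered(&[5, 6, 7, 8], &[false, true, true, true]);
    // Same result as filter + concat, but with one copy and no buffering
    // of intermediate filtered batches.
    assert_eq!(c.output, vec![1, 3, 6, 7, 8]);
}
```

The real implementation would have to handle variable length types (offsets for Utf8/Binary, views for Utf8View) rather than fixed-width values, which is where most of the potential savings lie.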
Describe alternatives you've considered
Additional context