Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It is a common pattern in DataFusion to take some subset of rows from an input stream of RecordBatches and produce evenly sized (e.g. 8192 bytes) output RecordBatches. The subset is either described by a BooleanArray (`filter`) or a set of indices (`take`). This is explained here:
I have hit the need for the same low-level primitive while working on improving parquet filtering performance.
The current way to accomplish this is to apply the `filter` kernel to make several smaller RecordBatches and then `concat` to put them all together. This mechanism has two disadvantages:
- It copies the data twice (which is especially slow for Utf8/Binary and other variable length arrays), with multiple allocations
- It requires buffering all the input batches until enough are ready to form the output (which is especially problematic for Utf8View and other view types, where the result of filtering may still hold significant amounts of memory, and sometimes requires copying the views more than once; see here)
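The double copy above can be sketched with plain Vecs standing in for RecordBatches (a simplified illustration, not arrow's actual `filter`/`concat` API):

```rust
// Minimal sketch of the filter + concat pattern. Plain Vec<i32> stands in
// for a RecordBatch column; the real kernels are arrow's `filter` and
// `concat`, this just makes the two copies visible.

/// First copy: keep only the rows selected by the boolean mask.
fn filter(values: &[i32], mask: &[bool]) -> Vec<i32> {
    values
        .iter()
        .zip(mask)
        .filter_map(|(v, &keep)| keep.then_some(*v))
        .collect()
}

/// Second copy: concatenate the buffered filtered pieces into one output.
fn concat(batches: &[Vec<i32>]) -> Vec<i32> {
    batches.iter().flatten().copied().collect()
}

fn main() {
    // Two small input "batches", each with its own selection mask.
    let b1 = filter(&[1, 2, 3, 4], &[true, false, true, false]);
    let b2 = filter(&[5, 6, 7, 8], &[false, true, true, true]);
    // All filtered batches must be buffered before concat can run.
    let out = concat(&[b1, b2]);
    assert_eq!(out, vec![1, 3, 6, 7, 8]);
    println!("{:?}", out);
}
```

Every selected row is copied once by `filter` and again by `concat`, and the intermediate filtered batches must all stay alive until the final concatenation.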
However, it turns out that the `filter` and `concat` kernels are very highly optimized, so getting better performance is actually quite challenging, as described in #7513.
Describe the solution you'd like
I would like to optimize the filter + concat operation, especially for variable length data arrays.
To do so, let's start with a benchmark.
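One way to picture the desired single-pass primitive (hypothetical names, plain Vecs again, not arrow's actual API): append the selected rows of each incoming batch directly into one growing output buffer, so the data is copied once and no intermediate filtered batches are materialized.

```rust
// Hypothetical single-pass alternative to filter + concat: filtered rows
// are written straight into the output buffer as each batch arrives.
struct Coalescer {
    output: Vec<i32>, // grows until a target output batch size is reached
}

impl Coalescer {
    fn new(capacity: usize) -> Self {
        Self {
            output: Vec::with_capacity(capacity),
        }
    }

    /// Copy the rows selected by `mask` directly into the output (one copy).
    fn push_filtered(&mut self, values: &[i32], mask: &[bool]) {
        for (v, &keep) in values.iter().zip(mask) {
            if keep {
                self.output.push(*v);
            }
        }
    }
}

fn main() {
    let mut c = Coalescer::new(8192);
    c.push_filtered(&[1, 2, 3, 4], &[true, false, true, false]);
    c.push_filtered(&[5, 6, 7, 8], &[false, true, true, true]);
    // Same result as filter + concat, but with one copy and no buffering
    // of intermediate filtered batches.
    assert_eq!(c.output, vec![1, 3, 6, 7, 8]);
}
```

The real implementation would have to handle variable length types (offsets for Utf8/Binary, views for Utf8View) rather than fixed-width values, which is where most of the potential savings lie.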
Describe alternatives you've considered
Additional context