**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
- Related to "Enable parquet filter pushdown (`filter_pushdown`) by default" (datafusion#3463) in DataFusion.
When evaluating filters on data stored in parquet, you can:
- Use the `with_row_filter` API to apply predicates during the scan
- Read the data and apply the predicate using the `filter` kernel afterwards
Currently, `with_row_filter` is faster for some predicates and `filter` is faster for others. In DataFusion we have a configuration setting to choose between the strategies (`filter_pushdown`, see apache/datafusion#3463), but that is a bad UX: it means the user must somehow know which strategy to choose, and the best choice changes with the query and data.
In general, queries are slower with `with_row_filter` when:
- The predicates are not very selective (e.g. they pass more than 1% of the rows)
- The filters are applied to columns that also appear in the query result (e.g. a filter column is also in the projection)
More background:
The predicates are provided as a `RowFilter` (see the docs for more details). A `RowFilter` applies its predicates in order, after decoding only the columns each predicate requires. As predicates eliminate rows, fewer rows of subsequent columns may need to be decoded, potentially reducing both IO and decode work.
**Describe the solution you'd like**
I would like the evaluation of predicates in a `RowFilter` (aka pushed-down predicates) to never be worse than decoding the columns first and then filtering them with the `filter` kernel.
We have added a benchmark (#7401), which can be run with `cargo bench --all-features --bench arrow_reader_row_filter`.
**Describe alternatives you've considered**
This goal will likely require several changes to the codebase. Here are some options:
- Add benchmark for parquet reader with row_filter and project settings #7401
- arrow_reader_row_filter benchmark doesn't capture page cache improvements #7460
- Benchmark for filter+concat and take+concat into even sized record batches #7589
- Parquet decoder / decoded predicate / page / results Cache #7363
- Adaptive Parquet Predicate Pushdown Evaluation #5523
- Consider removing `skip` from `RowSelector` #7450
- `RowSelection::and_then` is slow #7458
- improve: reuse `Arc<dyn Array>` in parquet record batch reader. #4864