
Conversation

@rich-t-kid-datadog commented Jun 19, 2025

Which issue does this PR close?

This PR contributes towards the larger epic Implement RunArray operations

Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache Arrow. RunEndEncoded is a compression format that stores consecutive runs of equal values efficiently, but previously lacked casting functionality. This change enables:

  1. Casting FROM RunEndEncoded arrays - Converting the values within a RunEndEncoded array to different data types while preserving the run structure
  2. Casting TO RunEndEncoded arrays - Converting regular arrays into RunEndEncoded format by performing run-end encoding
  3. Full integration with Arrow's casting system - Making RunEndEncoded arrays work with the existing cast() and can_cast_types() functions

Run-End Encoded Array Casting: Tradeoffs and Implementation
The implementation of REE array casting introduced a critical tradeoff between user flexibility and data integrity.

Unlike most Arrow types, REE arrays have a fundamental monotonicity constraint: their run-end indices must be strictly increasing to preserve logical correctness. Silent truncation or wrapping during downcasts (e.g., Int64 → Int16) could produce invalid sequences like:

[1000, -15536, 14464] // due to overflow
Such sequences break the REE invariant and could cause panics or silent data corruption downstream.

Arrow's CastOptions normally lets callers choose between erroring on values that cannot be represented in the target type and replacing them with nulls. For REE run-end indices, neither silent wrapping nor null substitution is acceptable, because both break the strictly increasing invariant.

We chose to hard-code strict, checked behavior for run-end casting, regardless of the CastOptions the user supplies.

This ensures that:

  • Narrowing conversions of run-end indices (e.g., Int64 → Int16) fail immediately if any value would exceed the target type's bounds, even when the user explicitly sets safe = false
  • Widening conversions (e.g., Int16 → Int32 → Int64) are allowed, as they are inherently lossless

This policy protects the logical soundness of REE arrays and maintains integrity across the Arrow ecosystem.
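To make the policy concrete, here is a rough sketch of how run-end index casting could be exercised through cast_with_options once this PR's REE cast paths are in place. This is illustrative only, not code from the PR; the "run_ends"/"values" field names and the exact error behavior for narrowing are assumptions based on the description above.

use std::sync::Arc;

use arrow_array::{types::Int32Type, Array, Int32Array, RunArray, StringArray};
use arrow_cast::{cast_with_options, CastOptions};
use arrow_schema::{DataType, Field};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A small REE array: logical values ["a", "a", "b"], stored as
    // run ends [2, 3] and values ["a", "b"]
    let run_ends = Int32Array::from(vec![2, 3]);
    let values = StringArray::from(vec!["a", "b"]);
    let ree: RunArray<Int32Type> = RunArray::try_new(&run_ends, &values)?;

    // Widening the run-end type (Int32 -> Int64) is lossless, so it is allowed
    let widened = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int64, false)),
        Arc::new(Field::new("values", DataType::Utf8, true)),
    );
    let casted = cast_with_options(&ree, &widened, &CastOptions::default())?;
    assert_eq!(casted.data_type(), &widened);

    // Narrowing the run-end type (Int32 -> Int16) is expected to be refused,
    // even with `safe: false`, because overflowed run ends would violate the
    // strictly increasing invariant
    let narrowed = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int16, false)),
        Arc::new(Field::new("values", DataType::Utf8, true)),
    );
    let unsafe_options = CastOptions { safe: false, ..Default::default() };
    let result = cast_with_options(&ree, &narrowed, &unsafe_options);
    // Per the policy described above, this should be an error rather than a wrap
    assert!(result.is_err(), "downcasting run ends should be refused");

    Ok(())
}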

What changes are included in this PR?

  1. run_end_encoded_cast() - Casts values within existing RunEndEncoded arrays to different types

  2. cast_to_run_end_encoded() - Converts regular arrays to RunEndEncoded format with run-length encoding

  3. Comprehensive test suite covering various data types and edge cases

  4. Updated can_cast_types() to support RunEndEncoded compatibility rules

  5. Run-end downcasting is not allowed.

  6. Users can now cast RunEndEncoded arrays using the standard arrow_cast::cast() function (a rough usage sketch follows this list)

  7. All existing APIs remain unchanged.
    There are no breaking changes in this PR; it is a purely additive change.
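For the cast-to-RunEndEncoded direction (items 2 and 6 above), a minimal usage sketch, assuming this PR's support is merged and the conventional "run_ends"/"values" field names; it is not code from the PR itself.

use std::sync::Arc;

use arrow_array::{Array, StringArray};
use arrow_cast::cast;
use arrow_schema::{DataType, Field};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A flat array with consecutive duplicates
    let flat = StringArray::from(vec!["a", "a", "a", "b", "b", "c"]);

    // Target type: run ends as Int32, values as nullable Utf8
    let ree_type = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int32, false)),
        Arc::new(Field::new("values", DataType::Utf8, true)),
    );

    // Runs of equal consecutive values collapse to (value, run_end) pairs:
    // values ["a", "b", "c"] with run ends [3, 5, 6]
    let encoded = cast(&flat, &ree_type)?;
    assert_eq!(encoded.data_type(), &ree_type);
    assert_eq!(encoded.len(), flat.len()); // logical length is preserved

    Ok(())
}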

@alamb (Contributor) commented Jun 19, 2025

FYI @brancz and @alexanderbianchi

@alexanderbianchi commented Jun 19, 2025

FYI @brancz and @alexanderbianchi

Yup! This is my team's intern 😄
Sorry if the draft PR is noisy; after we iterate a bit we'll fill out the description and comment on what tradeoffs we should make (for example, we can encode the values then cast, or we can cast the full array then encode the values. Casting while encoding is also possible, but maybe a bit less "plug and play").

@alamb (Contributor) commented Jun 20, 2025

FYI @brancz and @alexanderbianchi

Yup! This is my team's intern 😄 Sorry if the draft PR is noisy; after we iterate a bit we'll fill out the description and comment on what tradeoffs we should make (for example, we can encode the values then cast, or we can cast the full array then encode the values. Casting while encoding is also possible, but maybe a bit less "plug and play").

No worries -- let me know when it is ready for a review and I'll take a closer look

Thanks for helping drive this forward

Comment on lines 130 to 183
// For simplicity, we'll use a basic comparison approach
// In practice, you'd want more sophisticated comparison based on data type
let values_equal = match (cast_array.is_null(i), cast_array.is_null(i - 1)) {
(true, true) => true, // Both null
(false, false) => {
// Both non-null - use slice comparison as a basic approach
// This is a simplified implementation
cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
}
_ => false, // One null, one not null
};

It's simple, but is it correct? Couldn't this be problematic for certain DataTypes? I wonder if there are any comparison kernels already built that we could use here.

Author

I'm not sure what you mean. There's a check for null values, and then the actual value at that index is compared. Using the test I wrote, I can confirm it works as intended. can_cast_run_end_encoded also restricts what types can be passed to this function.

What I meant is that in the comment you mention:

        // For simplicity, we'll use a basic comparison approach
        // In practice, you'd want more sophisticated comparison based on data type

As this is intended to be merged to the main branch of https://github.com/apache/arrow-rs, we are pretty much already in the "In practice" case here.

If you think it's enough to rely on can_cast_run_end_encoded and just do the simple byte-based comparison here, then let's go with that 👍 I'd just change the "In practice" comment to something like:

        // We can afford to perform the simple comparison here as we already validated the type in [can_cast_run_end_encoded]

Implement casting between REE arrays and other Arrow types. REE-to-REE casting
validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent
invalid sequences.
@gabotechs left a comment

Really nice! Left some comments, but none are blocking, so +1

@rich-t-kid-datadog marked this pull request as ready for review July 3, 2025 16:37
@rich-t-kid-datadog changed the title from [Draft] Implemented casting for RunEnd Encoding to Implemented casting for RunEnd Encoding on Jul 3, 2025
@alamb (Contributor) commented Jul 24, 2025

@brancz as you have been working with the REE code, do you have time to help review this PR to potentially get it into

?

@rich-t-kid-datadog (Author) commented

It seems I forgot to run cargo fmt before pushing this up, which is why the CI is failing.

Implement casting between REE arrays and other Arrow types. REE-to-REE casting
validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent
invalid sequences.

rebased changes
(false, false) => {
// Both non-null - use slice comparison as a basic approach
// This is a simplified implementation
cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
Contributor

I realize this is simple to do this way, but it causes every element to be copied twice for every cast. Much like with dictionary casting, I think it's unrealistic to support arbitrarily nested types (which can_cast_to_run_end_encoded already correctly rules out), but for all non-nested types we should have specific implementations for primitive/decimals/byte/utf types. Then it's also trivial to do zero-copy comparisons for those types.

Okay, that makes sense. The solution to this would be to implement different functions for each of the accepted DataTypes? I'm a little confused about why this causes each element to be copied twice for each cast. Doesn't slice return a zero-copy slice of the underlying array, and to_data return the raw data?
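For reference, a rough sketch of the kind of typed, zero-copy comparison being suggested here. The helper below is hypothetical (not part of this PR or of arrow-rs); it only illustrates reading the values buffer directly instead of slicing and materialising ArrayData for each element.

use arrow_array::types::ArrowPrimitiveType;
use arrow_array::{Array, PrimitiveArray};

/// Hypothetical helper: compare two slots of a primitive array without
/// slicing or converting to ArrayData, by reading the values buffer directly.
fn primitive_slots_equal<T: ArrowPrimitiveType>(
    array: &PrimitiveArray<T>,
    i: usize,
    j: usize,
) -> bool
where
    T::Native: PartialEq,
{
    match (array.is_null(i), array.is_null(j)) {
        (true, true) => true,                                // both null
        (false, false) => array.value(i) == array.value(j), // direct buffer read
        _ => false,                                          // one null, one not
    }
}

// Usage sketch: dispatch on DataType and call the typed helper, e.g.
// primitive_slots_equal::<arrow_array::types::Int64Type>(typed_array, i, i - 1)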

(true, true) => true, // Both null
(false, false) => {
// Both non-null - use slice comparison as a basic approach
// This is a simplified implementation
Contributor

AIs like Claude Code tend to produce comments like this one (and, frankly, various other comments and the PR description) and simplified code like the block below, which really makes me think this entire PR was AI-generated. If this is the case, please at least state this in the PR description.

Yes, I wanted to have lots of documentation for the decisions I was making, so I asked AI for lots of comments to make the code self-documenting. The entire PR was not AI generated, but AI was used.

// Step 1: Cast the input array to the target value type if necessary
let cast_array = if array.data_type() == value_type {
// No casting needed, use the array as-is
make_array(array.to_data())
Contributor

This copy is not necessary; you can replace this with just array and then use &cast_with_options(array, value_type, cast_options)? in the other branch
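A minimal sketch of this suggestion, assuming the surrounding PR code's array, value_type, and cast_options variables (the function name below is made up for illustration):

use std::sync::Arc;

use arrow_array::{Array, ArrayRef};
use arrow_cast::{cast_with_options, CastOptions};
use arrow_schema::{ArrowError, DataType};

// Reuse the input when no cast is needed instead of rebuilding it from ArrayData
fn cast_values_if_needed(
    array: &ArrayRef,
    value_type: &DataType,
    cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
    if array.data_type() == value_type {
        Ok(Arc::clone(array)) // zero-copy: only the refcount is bumped
    } else {
        cast_with_options(array.as_ref(), value_type, cast_options)
    }
}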

@brancz (Contributor) commented Aug 22, 2025

@alexanderbianchi @gabotechs If I'm not mistaken, I think @rich-t-kid-datadog's internship at DataDog has ended. What does that mean for this PR? Are you going to finish it, or should someone from the community pick it up and get it over the finish line?

I'm just asking, as if no one is going to work on this soon, we'd like to pick it up as it's required for apache/datafusion#16011.

@vegarsti commented

@alexanderbianchi @gabotechs If I'm not mistaken, I think @rich-t-kid-datadog's internship at DataDog has ended. What does that mean for this PR? Are you going to finish it, or should someone from the community pick it up and get it over the finish line?

I'm just asking, as if no one is going to work on this soon, we'd like to pick it up as it's required for apache/datafusion#16011.

I would be happy to help if you want, @brancz! But fair if it's faster for you to do it

@brancz (Contributor) commented Aug 22, 2025

I'd happily help with guidance, @vegarsti. We're not in a huge hurry, and I still have to clean up the datafusion side of the code anyway, so if someone else is interested in working on this other than me/us, I'll happily take them up on the offer! :)

@Rich-T-kid commented

Yes, I'll continue the REE work after my internship. I will lose access to my Datadog GitHub account for the time being, but I'll work on it from my personal account!

@brancz (Contributor) commented Aug 26, 2025

Awesome! Do you have an ETA for when you think you'll get the next iteration out?

@Rich-T-kid commented

Yeah, I'd say about a week at most. I'm still waiting to get my new MacBook, and I can't open VS Code on my current laptop. Hopefully by Friday morning, but at the latest within a week. This also applies to my other PRs: most haven't received comments yet, but for the ones that do, I'll make sure to respond as well.

@alamb (Contributor) commented Aug 27, 2025

Yeah, I'd say about a week at most. I'm still waiting to get my new MacBook, and I can't open VS Code on my current laptop. Hopefully by Friday morning, but at the latest within a week. This also applies to my other PRs: most haven't received comments yet, but for the ones that do, I'll make sure to respond as well.

Sadly, our most limited resource on these projects is reviewer bandwidth. It would be most helpful if you could pick one or two of your outstanding PRs so we can focus on getting those reviewed and merged.

Also, something that helps with review speed / turnaround is making smaller PRs -- as it takes less contiguous time to review them

@alamb (Contributor) commented Aug 27, 2025

Thank you for helping push this project forward @Rich-T-kid

@brancz (Contributor) commented Aug 27, 2025

I'm also spread very thin, but very happy to help review anything that enables REE arrays in the group-by of aggregations (like this PR)!

Yeah, I'd say about a week at most. I'm still waiting to get my new MacBook, and I can't open VS Code on my current laptop. Hopefully by Friday morning, but at the latest within a week. This also applies to my other PRs: most haven't received comments yet, but for the ones that do, I'll make sure to respond as well.

Awesome, looking forward to it!

github-merge-queue bot pushed a commit to open-telemetry/otel-arrow that referenced this pull request Aug 28, 2025
part of #863 

Because some OTAP fields are optional, in a stream of record batches we
may receive subsequent batches with different schemas. Parquet doesn't
support row groups with different sets of column chunks, which means we
need to know the schema a priori when the writer is created.

This PR adds code to normalize the schema of the record batch before
writing by:
- putting all the fields in the same order
- creating all null/default value columns for any missing column

The missing columns should add only a small overhead when written to disk:
parquet will either write an entirely empty column chunk for an all-null
column (just a null count, no data), or, for default-value columns, use
dictionary and RLE encoding by default, leading to a small column chunk
with a single value in the dictionary and a single run for the key.

What's unfortunate is that we still materialize an all-null column
before writing with the length of the record batch. This can be
optimized when run-end encoded arrays are supported in parquet, because
we could just create a run array with a single run of null/default
value. The arrow community is currently working on adding support (see
apache/arrow-rs#7713 &
apache/arrow-rs#8069).

---------

Co-authored-by: Laurent Quérel <[email protected]>
@alamb marked this pull request as draft September 4, 2025 18:27
@alamb (Contributor) commented Sep 4, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@brancz (Contributor) commented Sep 12, 2025

@Rich-T-kid are you still planning on getting back to this?

@Rich-T-kid commented Sep 12, 2025

Yes, I removed the double copying for each cast. I'm implementing the cast operation for each supported type in groups. I'm just a bit busy, as I'm taking 20 credits this semester along with a job, so I'm working on this when I get a chance. I know this PR has been in review for a while, so I want to make sure this final iteration is good. Sorry about the delay!

Labels: arrow (Changes to the arrow crate)