Implemented casting for RunEnd Encoding #7713
Conversation
FYI @brancz and @alexanderbianchi
Yup! This is my team's intern 😄
No worries -- let me know when it is ready for a review and I'll take a closer look. Thanks for helping drive this forward!
Force-pushed: ad373e9 → 1c28b08
Force-pushed: 1c28b08 → 5307851
arrow-cast/src/cast/run_array.rs
Outdated
// For simplicity, we'll use a basic comparison approach
// In practice, you'd want more sophisticated comparison based on data type
let values_equal = match (cast_array.is_null(i), cast_array.is_null(i - 1)) {
    (true, true) => true, // Both null
    (false, false) => {
        // Both non-null - use slice comparison as a basic approach
        // This is a simplified implementation
        cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
    }
    _ => false, // One null, one not null
};
It's simple, but is it correct? Couldn't this be problematic for certain DataTypes? I wonder if there are any comparison kernels already built that we could use here.
I'm not sure what you mean. There's a check for null values, and then the actual value at that index is compared. Using the test I wrote, I can confirm it works as intended. can_cast_run_end_encoded also restricts what types can be passed to this function.
What I meant is that in the comment you mention:
// For simplicity, we'll use a basic comparison approach
// In practice, you'd want more sophisticated comparison based on data type
As this is intended to be merged into the main branch of https://github.com/apache/arrow-rs, we are pretty much already in the "In practice" case here.
If you think it's enough to rely on can_cast_run_end_encoded
and just do the simple byte-based comparison here, then let's go with that 👍 I'd just change the "In practice" comment to something like:
// We can afford to perform the simple comparison here as we already validated the type in [can_cast_run_end_encoded]
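The null-handling comparison being discussed can be distilled into plain Rust (arrow types omitted; `values_equal` here is a hypothetical stand-alone helper over `Option` values, not the PR's actual code):

```rust
/// Compare two nullable elements the same way the quoted match does:
/// two nulls are equal, a null never equals a non-null, and two
/// non-null values are compared directly.
fn values_equal<T: PartialEq>(a: &Option<T>, b: &Option<T>) -> bool {
    match (a, b) {
        (None, None) => true,         // both null
        (Some(x), Some(y)) => x == y, // both non-null
        _ => false,                   // one null, one not null
    }
}

fn main() {
    let v = [Some(1), Some(1), None, None, Some(2)];
    // Compare each adjacent pair, as the casting loop does with i and i - 1.
    let eq: Vec<bool> = v.windows(2).map(|w| values_equal(&w[0], &w[1])).collect();
    println!("{:?}", eq); // [true, false, true, false]
}
```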
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences.
Force-pushed: 7d6df4f → 0452360
Really nice! Left some comments, but none are blocking, so +1.
@brancz as you have been working with the REE code, do you have time to help review this PR to potentially get it into ?
It seems I forgot to run cargo fmt before pushing this up, which is why the CI is failing.
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences. rebased changes
Force-pushed: fed30f0 → 2642ecd
    (false, false) => {
        // Both non-null - use slice comparison as a basic approach
        // This is a simplified implementation
        cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
I realize this is simple to do this way, but it causes every element to be copied twice for every cast. I think, much like with dictionary casting, it's unrealistic to support arbitrarily nested types (just as can_cast_to_run_end_encoded
already correctly reports), but for all non-nested types we should have specific implementations for primitive/decimal/byte/utf8 types. Then it's also trivial to do zero-copy comparisons for those types.
Okay, that makes sense. The solution would be to implement different functions for each of the accepted DataTypes? I'm a little confused about why this causes each element to be copied twice per cast. Doesn't slice
return a zero-copy slice of the underlying array and to_data
return the raw data?
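The zero-copy, typed approach suggested above can be sketched in plain Rust: once the values are available as a typed buffer, adjacent elements are compared in place with no per-element slice/`to_data` materialization (arrow types omitted; `run_ends` is a hypothetical helper, not the PR's code):

```rust
/// Compute 1-based run-end positions by comparing adjacent elements
/// directly. With a typed buffer this is a plain value comparison;
/// no per-element copies are made.
fn run_ends(values: &[Option<i64>]) -> Vec<usize> {
    let mut ends = Vec::new();
    for i in 1..values.len() {
        if values[i] != values[i - 1] {
            ends.push(i); // the previous run ends just before index i
        }
    }
    if !values.is_empty() {
        ends.push(values.len()); // close the final run
    }
    ends
}

fn main() {
    let v = [Some(5), Some(5), None, None, None, Some(7)];
    println!("{:?}", run_ends(&v)); // [2, 5, 6]
}
```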
    (true, true) => true, // Both null
    (false, false) => {
        // Both non-null - use slice comparison as a basic approach
        // This is a simplified implementation
As AI tools like Claude Code tend to produce comments like this one (and, frankly, various other comments and the PR description) and the simplified code below, it really makes me think this entire PR was AI-generated. If that is the case, please at least state it in the PR description.
Yes, I wanted to document the decisions I was making, so I asked AI to add lots of comments to make the code self-documenting. The entire PR was not AI-generated, but AI was used.
// Step 1: Cast the input array to the target value type if necessary
let cast_array = if array.data_type() == value_type {
    // No casting needed, use the array as-is
    make_array(array.to_data())
This copy is not necessary; you can replace it with just array,
and then use &cast_with_options(array, value_type, cast_options)?
in the other branch.
@alexanderbianchi @gabotechs If I'm not mistaken, I think @rich-t-kid-datadog's internship at DataDog has ended. What does that mean for this PR? Are you going to finish it, or should someone from the community pick it up and get it over the finish line? I'm just asking because, if no one is going to work on this soon, we'd like to pick it up, as it's required for apache/datafusion#16011.
I would be happy to help if you want, @brancz! But fair if it's faster for you to do it.
I'd happily help with guidance @vegarsti; we're not in a huge hurry, and I still have to clean up the datafusion side of the code anyway, so if someone else is interested in working on this other than me/us, then I'll happily take them up on the offer! :)
Yes, I'll continue the REE work after my internship. I will lose access to my DataDog GitHub account for the time being, but I'll work on it from my personal account!
Awesome! Do you have an ETA for when you think you'll get the next iteration out?
Yeah, I'd say about a week at most. I'm still waiting to get my new MacBook, and I can't open VS Code on my current laptop. Hopefully by Friday morning, but at the latest within a week. This also applies to my other PRs; most haven't received comments yet, but for the ones that do, I'll make sure to respond as well.
Sadly, our most limited resource on these projects is reviewer bandwidth. It would be most helpful if you could pick one or two of your outstanding PRs so we can focus on getting them reviewed and merged. Also, something that helps with review speed and turnaround is making smaller PRs, as they take less contiguous time to review.
Thank you for helping push this project forward @Rich-T-kid
I'm also spread very thin, but very happy to help review anything that enables REE arrays in group-bys of aggregations (like this PR)!
Awesome, looking forward to it!
part of #863

Because some OTAP fields are optional, in a stream of record batches we may receive subsequent batches with different schemas. Parquet doesn't support row groups with different sets of column chunks, which means we need to know the schema a priori when the writer is created.

This PR adds code to normalize the schema of the record batch before writing by:
- putting all the fields in the same order
- creating all-null/default-value columns for any missing column

The missing columns should have a small overhead when written to disk: Parquet will either write an entirely empty column chunk for an all-null column (just a null count, no data) or, for default-value columns, use dictionary and RLE encoding by default, leading to a small column chunk with a single value in the dictionary and a single run for the key.

What's unfortunate is that we still materialize an all-null column, with the length of the record batch, before writing. This can be optimized once run-end encoded arrays are supported in Parquet, because we could then create a run array with a single run of null/default values. The Arrow community is currently working on adding support (see apache/arrow-rs#7713 & apache/arrow-rs#8069).

---------
Co-authored-by: Laurent Quérel <[email protected]>
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
@Rich-T-kid are you still planning on getting back to this?
Yes, I removed the double copying for each cast. I'm implementing the cast operation for each supported type in groups. I'm just a bit busy, as I'm taking 20 credits this semester along with a job, so I'm working on this when I get a chance. I know this PR has been in review for a while, so I want to make sure this final iteration is good. Sorry about the delay!
Which issue does this PR close?
This PR contributes towards the larger epic Implement RunArray operations
Rationale for this change
This PR implements casting support for RunEndEncoded arrays in Apache Arrow. RunEndEncoded is a compression format that stores consecutive runs of equal values efficiently, but it previously lacked casting functionality.
Run-End Encoded Array Casting: Tradeoffs and Implementation
The implementation of REE array casting introduced a critical tradeoff between user flexibility and data integrity.
Unlike most Arrow types, REE arrays have a fundamental monotonicity constraint: their run-end indices must be strictly increasing to preserve logical correctness. Silent truncation or wrapping during downcasts (e.g., Int64 → Int16) could produce invalid sequences like [1000, -15536, 14464] due to overflow. Such sequences break the REE invariant and could cause panics or silent data corruption downstream.
Arrow's CastOptions normally let callers relax overflow handling so that overflowing values are replaced with nulls rather than raising errors. For REE run-end indices, however, that behavior is incompatible with the strict monotonicity invariant.
We chose to hard-code safe, overflow-checked behavior for run-end casting.
This ensures that:
- Any attempt to cast run-end indices to a narrower integer type fails immediately on overflow, even when the user sets safe = false.
- Upcasts (e.g., Int16 → Int32 → Int64) are allowed, as they are lossless.
This policy protects the logical soundness of REE arrays and maintains integrity across the Arrow ecosystem.
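The overflow scenario above can be reproduced with plain integer casts (stdlib only; this is an illustration of the invariant violation, not the PR's code): a wrapping `as` cast silently corrupts the run ends, while a checked conversion fails loudly, which is the behavior the PR enforces.

```rust
fn main() {
    // Strictly increasing Int64 run ends.
    let run_ends_i64: [i64; 3] = [1_000, 50_000, 80_000];

    // A wrapping cast breaks the strictly-increasing invariant:
    // 50_000 and 80_000 do not fit in i16 and wrap around.
    let wrapped: Vec<i16> = run_ends_i64.iter().map(|&e| e as i16).collect();
    println!("{:?}", wrapped); // [1000, -15536, 14464]

    // A checked conversion reports the overflow instead of producing
    // an invalid run-end sequence.
    let checked: Result<Vec<i16>, _> =
        run_ends_i64.iter().map(|&e| i16::try_from(e)).collect();
    assert!(checked.is_err()); // the downcast is rejected
}
```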
What changes are included in this PR?
- run_end_encoded_cast(): casts values within existing RunEndEncoded arrays to different types
- cast_to_run_end_encoded(): converts regular arrays to RunEndEncoded format with run-length encoding
- A comprehensive test suite covering various data types and edge cases
- Updated can_cast_types() to support RunEndEncoded compatibility rules
- Run-end downcasting is not allowed
- Users can now cast RunEndEncoded arrays using the standard arrow_cast::cast() function
- All existing APIs remain unchanged
There are no breaking changes in this PR; it is a purely additive change.