
Conversation

@rich-t-kid-datadog commented Jun 19, 2025

Which issue does this PR close?

This PR contributes towards the larger epic Implement RunArray operations

Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache Arrow. RunEndEncoded is a compression format that stores consecutive runs of equal values efficiently, but previously lacked casting functionality. This change enables:

  1. Casting FROM RunEndEncoded arrays - Converting the values within a RunEndEncoded array to different data types while preserving the run structure
  2. Casting TO RunEndEncoded arrays - Converting regular arrays into RunEndEncoded format by performing run-end encoding
  3. Full integration with Arrow's casting system - Making RunEndEncoded arrays work with the existing cast() and can_cast_types() functions

Run-End Encoded Array Casting: Tradeoffs and Implementation
The implementation of REE array casting introduced a critical tradeoff between user flexibility and data integrity.

Unlike most Arrow types, REE arrays have a fundamental monotonicity constraint: their run-end indices must be strictly increasing to preserve logical correctness. Silent truncation or wrapping during downcasts (e.g., Int64 → Int16) could produce invalid sequences like:

[1000, -15536, 14464] // due to overflow
Such sequences break the REE invariant and could cause panics or silent data corruption downstream.

Arrow's CastOptions normally lets callers choose between erroring on values that cannot be represented in the target type and replacing them with nulls. For REE run-end indices, neither silent wrapping nor null substitution is acceptable, because both break the strictly increasing invariant.

We chose to hard-code strict, checked behavior for run-end casting, regardless of the CastOptions the user supplies.

This ensures that:

  • Narrowing conversions of run-end indices (e.g., Int64 → Int16) fail immediately if any value would exceed the target type's bounds, even when the user explicitly sets safe = false
  • Widening conversions (e.g., Int16 → Int32 → Int64) are allowed, as they are inherently lossless

This policy protects the logical soundness of REE arrays and maintains integrity across the Arrow ecosystem.
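To make the policy concrete, here is a rough sketch of how run-end index casting could be exercised through cast_with_options once this PR's REE cast paths are in place. This is illustrative only, not code from the PR; the "run_ends"/"values" field names and the exact error behavior for narrowing are assumptions based on the description above.

use std::sync::Arc;

use arrow_array::{types::Int32Type, Array, Int32Array, RunArray, StringArray};
use arrow_cast::{cast_with_options, CastOptions};
use arrow_schema::{DataType, Field};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A small REE array: logical values ["a", "a", "b"], stored as
    // run ends [2, 3] and values ["a", "b"]
    let run_ends = Int32Array::from(vec![2, 3]);
    let values = StringArray::from(vec!["a", "b"]);
    let ree: RunArray<Int32Type> = RunArray::try_new(&run_ends, &values)?;

    // Widening the run-end type (Int32 -> Int64) is lossless, so it is allowed
    let widened = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int64, false)),
        Arc::new(Field::new("values", DataType::Utf8, true)),
    );
    let casted = cast_with_options(&ree, &widened, &CastOptions::default())?;
    assert_eq!(casted.data_type(), &widened);

    // Narrowing the run-end type (Int32 -> Int16) is expected to be refused,
    // even with `safe: false`, because overflowed run ends would violate the
    // strictly increasing invariant
    let narrowed = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int16, false)),
        Arc::new(Field::new("values", DataType::Utf8, true)),
    );
    let unsafe_options = CastOptions { safe: false, ..Default::default() };
    let result = cast_with_options(&ree, &narrowed, &unsafe_options);
    // Per the policy described above, this should be an error rather than a wrap
    assert!(result.is_err(), "downcasting run ends should be refused");

    Ok(())
}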

What changes are included in this PR?

  1. run_end_encoded_cast() - Casts values within existing RunEndEncoded arrays to different types

  2. cast_to_run_end_encoded() - Converts regular arrays to RunEndEncoded format with run-length encoding

  3. Comprehensive test suite covering various data types and edge cases

  4. Updated can_cast_types() to support RunEndEncoded compatibility rules

  5. Run-end downcasting is not allowed.

  6. Users can now cast RunEndEncoded arrays using the standard arrow_cast::cast() function (a rough usage sketch follows this list)

  7. All existing APIs remain unchanged.
    There are no breaking changes in this PR; it is a purely additive change.
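For the cast-to-RunEndEncoded direction (items 2 and 6 above), a minimal usage sketch, assuming this PR's support is merged and the conventional "run_ends"/"values" field names; it is not code from the PR itself.

use std::sync::Arc;

use arrow_array::{Array, StringArray};
use arrow_cast::cast;
use arrow_schema::{DataType, Field};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A flat array with consecutive duplicates
    let flat = StringArray::from(vec!["a", "a", "a", "b", "b", "c"]);

    // Target type: run ends as Int32, values as nullable Utf8
    let ree_type = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int32, false)),
        Arc::new(Field::new("values", DataType::Utf8, true)),
    );

    // Runs of equal consecutive values collapse to (value, run_end) pairs:
    // values ["a", "b", "c"] with run ends [3, 5, 6]
    let encoded = cast(&flat, &ree_type)?;
    assert_eq!(encoded.data_type(), &ree_type);
    assert_eq!(encoded.len(), flat.len()); // logical length is preserved

    Ok(())
}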

@alamb (Contributor) commented Jun 19, 2025

FYI @brancz and @alexanderbianchi

@alexanderbianchi commented Jun 19, 2025

FYI @brancz and @alexanderbianchi

Yup! This is my team's intern 😄
Sorry if the draft PR is noisy; after we iterate a bit we'll fill out the description and comment on what tradeoffs we should make (for example, we can encode the values then cast, or we can cast the full array then encode the values. Casting while encoding is also possible, but maybe a bit less "plug and play").

@alamb (Contributor) commented Jun 20, 2025

FYI @brancz and @alexanderbianchi

Yup! This is my team's intern 😄 Sorry if the draft PR is noisy; after we iterate a bit we'll fill out the description and comment on what tradeoffs we should make (for example, we can encode the values then cast, or we can cast the full array then encode the values. Casting while encoding is also possible, but maybe a bit less "plug and play").

No worries -- let me know when it is ready for a review and I'll take a closer look

Thanks for helping drive this forward

Comment on lines 130 to 183
// For simplicity, we'll use a basic comparison approach
// In practice, you'd want more sophisticated comparison based on data type
let values_equal = match (cast_array.is_null(i), cast_array.is_null(i - 1)) {
(true, true) => true, // Both null
(false, false) => {
// Both non-null - use slice comparison as a basic approach
// This is a simplified implementation
cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
}
_ => false, // One null, one not null
};

It's simple, but is it correct? Couldn't this be problematic for certain DataTypes? I wonder if there are any comparison kernels already built that we could use here.

Author

I'm not sure what you mean. There's a check for null values, and then the actual value at that index is compared. Using the test I wrote, I can confirm it works as intended. can_cast_run_end_encoded also restricts what types can be passed to this function.

What I meant is that in the comment you mention:

        // For simplicity, we'll use a basic comparison approach
        // In practice, you'd want more sophisticated comparison based on data type

As this is intended to be merged to the main branch of https://github.com/apache/arrow-rs, we are pretty much already in the "In practice" case here.

If you think it's enough to rely on can_cast_run_end_encoded and just do the simple byte-based comparison here, then let's go with that 👍 I'd just change the "In practice" comment to something like:

        // We can afford to perform the simple comparison here as we already validated the type in [can_cast_run_end_encoded]

Implement casting between REE arrays and other Arrow types. REE-to-REE casting
validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent
invalid sequences.
@gabotechs left a comment

Really nice! Left some comments, but none are blocking, so +1

@rich-t-kid-datadog marked this pull request as ready for review July 3, 2025 16:37
@rich-t-kid-datadog changed the title from [Draft] Implemented casting for RunEnd Encoding to Implemented casting for RunEnd Encoding on Jul 3, 2025
@alamb (Contributor) commented Jul 24, 2025

@brancz as you have been working with the REE code, do you have time to help review this PR to potentially get it into

?

@rich-t-kid-datadog (Author) commented

It seems I forgot to run cargo fmt before pushing this up, which is why the CI is failing.

Implement casting between REE arrays and other Arrow types. REE-to-REE casting
validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent
invalid sequences.

rebased changes
(false, false) => {
// Both non-null - use slice comparison as a basic approach
// This is a simplified implementation
cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
Contributor

I realize this is simple to do this way, but it causes every element to be copied twice for every cast. Much like with dictionary casting, I think it's unrealistic to support arbitrarily nested types (which can_cast_to_run_end_encoded already correctly rules out), but for all non-nested types we should have specific implementations for primitive/decimals/byte/utf types. Then it's also trivial to do zero-copy comparisons for those types.

Okay, that makes sense. The solution to this would be to implement different functions for each of the accepted DataTypes? I'm a little confused about why this causes each element to be copied twice for each cast. Doesn't slice return a zero-copy slice of the underlying array, and to_data return the raw data?
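For reference, a rough sketch of the kind of typed, zero-copy comparison being suggested here. The helper below is hypothetical (not part of this PR or of arrow-rs); it only illustrates reading the values buffer directly instead of slicing and materialising ArrayData for each element.

use arrow_array::types::ArrowPrimitiveType;
use arrow_array::{Array, PrimitiveArray};

/// Hypothetical helper: compare two slots of a primitive array without
/// slicing or converting to ArrayData, by reading the values buffer directly.
fn primitive_slots_equal<T: ArrowPrimitiveType>(
    array: &PrimitiveArray<T>,
    i: usize,
    j: usize,
) -> bool
where
    T::Native: PartialEq,
{
    match (array.is_null(i), array.is_null(j)) {
        (true, true) => true,                                // both null
        (false, false) => array.value(i) == array.value(j), // direct buffer read
        _ => false,                                          // one null, one not
    }
}

// Usage sketch: dispatch on DataType and call the typed helper, e.g.
// primitive_slots_equal::<arrow_array::types::Int64Type>(typed_array, i, i - 1)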

(true, true) => true, // Both null
(false, false) => {
// Both non-null - use slice comparison as a basic approach
// This is a simplified implementation
Contributor

AIs like Claude Code tend to produce comments like this one (and, frankly, various other comments and the PR description) and simplified code like the block below, which really makes me think this entire PR was AI-generated. If this is the case, please at least state this in the PR description.

Yes, I wanted to have lots of documentation for the decisions I was making, so I asked AI for lots of comments to make the code self-documenting. The entire PR was not AI generated, but AI was used.

// Step 1: Cast the input array to the target value type if necessary
let cast_array = if array.data_type() == value_type {
// No casting needed, use the array as-is
make_array(array.to_data())
Contributor

This copy is not necessary; you can replace this with just array and then use &cast_with_options(array, value_type, cast_options)? in the other branch
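A minimal sketch of this suggestion, assuming the surrounding PR code's array, value_type, and cast_options variables (the function name below is made up for illustration):

use std::sync::Arc;

use arrow_array::{Array, ArrayRef};
use arrow_cast::{cast_with_options, CastOptions};
use arrow_schema::{ArrowError, DataType};

// Reuse the input when no cast is needed instead of rebuilding it from ArrayData
fn cast_values_if_needed(
    array: &ArrayRef,
    value_type: &DataType,
    cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
    if array.data_type() == value_type {
        Ok(Arc::clone(array)) // zero-copy: only the refcount is bumped
    } else {
        cast_with_options(array.as_ref(), value_type, cast_options)
    }
}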

@brancz (Contributor) commented Aug 22, 2025

@alexanderbianchi @gabotechs If I'm not mistaken, I think @rich-t-kid-datadog's internship at DataDog has ended. What does that mean for this PR? Are you going to finish it, or should someone from the community pick it up and get it over the finish line?

I'm just asking, as if no one is going to work on this soon, we'd like to pick it up as it's required for apache/datafusion#16011.

@vegarsti commented

@alexanderbianchi @gabotechs If I'm not mistaken, I think @rich-t-kid-datadog's internship at DataDog has ended. What does that mean for this PR? Are you going to finish it, or should someone from the community pick it up and get it over the finish line?

I'm just asking, as if no one is going to work on this soon, we'd like to pick it up as it's required for apache/datafusion#16011.

I would be happy to help if you want, @brancz! But fair if it's faster for you to do it

@brancz (Contributor) commented Aug 22, 2025

I'd happily help with guidance, @vegarsti. We're not in a huge hurry, and I still have to clean up the datafusion side of the code anyway, so if someone else is interested in working on this other than me/us, I'll happily take them up on the offer! :)

@Rich-T-kid commented

Yes, I'll continue the REE work after my internship. I will lose access to my Datadog GitHub account for the time being, but I'll work on it from my personal account!

@brancz (Contributor) commented Aug 26, 2025

Awesome! Do you have an ETA for when you think you'll get the next iteration out?

@Rich-T-kid commented

Yeah, I'd say about a week at most. I'm still waiting to get my new MacBook, and I can't open VS Code on my current laptop. Hopefully by Friday morning, but at the latest within a week. This also applies to my other PRs: most haven't received comments yet, but for the ones that do, I'll make sure to respond as well.

@alamb (Contributor) commented Aug 27, 2025

Yeah, I'd say about a week at most. I'm still waiting to get my new MacBook, and I can't open VS Code on my current laptop. Hopefully by Friday morning, but at the latest within a week. This also applies to my other PRs: most haven't received comments yet, but for the ones that do, I'll make sure to respond as well.

Sadly, our most limited resource on these projects is reviewer bandwidth. It would be most helpful if you could pick one or two of your outstanding PRs so we can focus on getting those reviewed and merged.

Also, something that helps with review speed / turnaround is making smaller PRs -- as it takes less contiguous time to review them

@alamb (Contributor) commented Aug 27, 2025

Thank you for helping push this project forward @Rich-T-kid

@brancz (Contributor) commented Aug 27, 2025

I'm also spread very thin, but very happy to help review anything that enables REE arrays in the group-by of aggregations (like this PR)!

Yeah, I'd say about a week at most. I'm still waiting to get my new MacBook, and I can't open VS Code on my current laptop. Hopefully by Friday morning, but at the latest within a week. This also applies to my other PRs: most haven't received comments yet, but for the ones that do, I'll make sure to respond as well.

Awesome, looking forward to it!

github-merge-queue bot pushed a commit to open-telemetry/otel-arrow that referenced this pull request Aug 28, 2025
part of #863 

Because some OTAP fields are optional, in a stream of record batches we
may receive subsequent batches with different schemas. Parquet doesn't
support row groups with different sets of column chunks, which means we
need to know the schema a priori when the writer is created.

This PR adds code to normalize the schema of the record batch before
writing by:
- putting all the fields in the same order
- creating all null/default value columns for any missing column

The missing columns should add only a small overhead when written to disk:
parquet will either write an entirely empty column chunk for an all-null
column (just a null count, no data), or, for default-value columns, use
dictionary and RLE encoding by default, leading to a small column chunk
with a single value in the dictionary and a single run for the key.

What's unfortunate is that we still materialize an all-null column
before writing with the length of the record batch. This can be
optimized when run-end encoded arrays are supported in parquet, because
we could just create a run array with a single run of null/default
value. The arrow community is currently working on adding support (see
apache/arrow-rs#7713 &
apache/arrow-rs#8069).

---------

Co-authored-by: Laurent Quérel <[email protected]>
@alamb marked this pull request as draft September 4, 2025 18:27
@alamb (Contributor) commented Sep 4, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@brancz (Contributor) commented Sep 12, 2025

@Rich-T-kid are you still planning on getting back to this?

@Rich-T-kid commented Sep 12, 2025

Yes, I removed the double copying for each cast. I'm implementing the cast operation for each supported type in groups. I'm just a bit busy, as I'm taking 20 credits this semester along with a job, so I'm working on this when I get a chance. I know this PR has been in review for a while, so I want to make sure this final iteration is good. Sorry about the delay!

Labels: arrow (Changes to the arrow crate)