@wankunde wankunde commented Sep 3, 2025

What changes were proposed in this pull request?

For large input rows, a stripe may grow excessively large, requiring more memory to both read and write a single stripe. We can check the tree writer's size in bytes and flush the stripe even when the input row count is less than 5000.

Stripe without this change:

Stripes:
  Stripe: offset: 3 data: 347494188 rows: 5120 tail: 244 index: 2304
    Stream: column 0 section ROW_INDEX start: 3 length 12
    Stream: column 1 section ROW_INDEX start: 15 length 110
    Stream: column 2 section ROW_INDEX start: 125 length 893
    Stream: column 3 section ROW_INDEX start: 1018 length 31
    Stream: column 4 section ROW_INDEX start: 1049 length 65
    Stream: column 5 section ROW_INDEX start: 1114 length 923
    Stream: column 6 section ROW_INDEX start: 2037 length 25
    Stream: column 7 section ROW_INDEX start: 2062 length 155
    Stream: column 8 section ROW_INDEX start: 2217 length 28
    Stream: column 9 section ROW_INDEX start: 2245 length 31
    Stream: column 10 section ROW_INDEX start: 2276 length 31
    Stream: column 1 section DATA start: 2307 length 81853
    Stream: column 1 section LENGTH start: 84160 length 2191
    Stream: column 2 section DATA start: 86351 length 345862763
    Stream: column 2 section LENGTH start: 345949114 length 13736
    Stream: column 3 section DATA start: 345962850 length 22
    Stream: column 3 section LENGTH start: 345962872 length 6
    Stream: column 3 section DICTIONARY_DATA start: 345962878 length 5
    Stream: column 4 section PRESENT start: 345962883 length 200
    Stream: column 4 section DATA start: 345963083 length 6322
    Stream: column 4 section LENGTH start: 345969405 length 495
    Stream: column 4 section DICTIONARY_DATA start: 345969900 length 2919
    Stream: column 5 section DATA start: 345972819 length 1507883
    Stream: column 5 section LENGTH start: 347480702 length 7346
    Stream: column 6 section DATA start: 347488048 length 22
    Stream: column 6 section LENGTH start: 347488070 length 6
    Stream: column 6 section DICTIONARY_DATA start: 347488076 length 0
    Stream: column 7 section DATA start: 347488076 length 5795
    Stream: column 7 section LENGTH start: 347493871 length 301
    Stream: column 7 section DICTIONARY_DATA start: 347494172 length 2187
    Stream: column 8 section DATA start: 347496359 length 22
    Stream: column 8 section LENGTH start: 347496381 length 6
    Stream: column 8 section DICTIONARY_DATA start: 347496387 length 4
    Stream: column 9 section DATA start: 347496391 length 58
    Stream: column 9 section LENGTH start: 347496449 length 6
    Stream: column 9 section DICTIONARY_DATA start: 347496455 length 7
    Stream: column 10 section DATA start: 347496462 length 22
    Stream: column 10 section LENGTH start: 347496484 length 6
    Stream: column 10 section DICTIONARY_DATA start: 347496490 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DICTIONARY_V2[1]
    Encoding column 4: DICTIONARY_V2[661]
    Encoding column 5: DIRECT_V2
    Encoding column 6: DICTIONARY_V2[1]
    Encoding column 7: DICTIONARY_V2[682]
    Encoding column 8: DICTIONARY_V2[1]
    Encoding column 9: DICTIONARY_V2[2]
    Encoding column 10: DICTIONARY_V2[1]
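The flush condition described above can be sketched as follows. This is a minimal illustration, not the actual patch; the constant names and the 128 MB threshold are assumptions drawn from the discussion below:

```java
// Hypothetical sketch of the proposed check: flush the stripe once the
// tree writer's in-memory size passes a byte threshold, even if fewer than
// ROWS_PER_CHECK (5000) rows have been buffered since the last check.
final class StripeSizeCheck {
  static final int ROWS_PER_CHECK = 5000;            // existing row-count trigger
  static final long STRIPE_SIZE_CHECK = 128L << 20;  // assumed byte trigger (128 MB)

  // Returns true when the writer should consider flushing the current stripe.
  // A non-positive byte threshold disables the size-based check.
  static boolean shouldCheck(int rowsSinceCheck, long estimatedBytes) {
    return rowsSinceCheck >= ROWS_PER_CHECK
        || (STRIPE_SIZE_CHECK > 0 && estimatedBytes >= STRIPE_SIZE_CHECK);
  }

  public static void main(String[] args) {
    // The ~347 MB, 5120-row stripe in the dump above trips the byte check
    // long before the row-count check would.
    System.out.println(shouldCheck(1024, 347_494_188L)); // true
    // Small rows: neither trigger fires yet.
    System.out.println(shouldCheck(1024, 64L << 20));    // false
  }
}
```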

Why are the changes needed?

To reduce memory usage when reading and writing large stripes.

How was this patch tested?

Local test

Stripe with this change:

  Stripe: offset: 3 data: 69573620 rows: 1024 tail: 227 index: 2245
    Stream: column 0 section ROW_INDEX start: 3 length 12
    Stream: column 1 section ROW_INDEX start: 15 length 111
    Stream: column 2 section ROW_INDEX start: 126 length 914
    Stream: column 3 section ROW_INDEX start: 1040 length 30
    Stream: column 4 section ROW_INDEX start: 1070 length 62
    Stream: column 5 section ROW_INDEX start: 1132 length 848
    Stream: column 6 section ROW_INDEX start: 1980 length 25
    Stream: column 7 section ROW_INDEX start: 2005 length 155
    Stream: column 8 section ROW_INDEX start: 2160 length 28
    Stream: column 9 section ROW_INDEX start: 2188 length 30
    Stream: column 10 section ROW_INDEX start: 2218 length 30
    Stream: column 1 section DATA start: 2248 length 15899
    Stream: column 1 section LENGTH start: 18147 length 478
    Stream: column 2 section DATA start: 18625 length 69245402
    Stream: column 2 section LENGTH start: 69264027 length 2795
    Stream: column 3 section DATA start: 69266822 length 11
    Stream: column 3 section LENGTH start: 69266833 length 6
    Stream: column 3 section DICTIONARY_DATA start: 69266839 length 5
    Stream: column 4 section PRESENT start: 69266844 length 55
    Stream: column 4 section DATA start: 69266899 length 1269
    Stream: column 4 section LENGTH start: 69268168 length 231
    Stream: column 4 section DICTIONARY_DATA start: 69268399 length 1261
    Stream: column 5 section DATA start: 69269660 length 302251
    Stream: column 5 section LENGTH start: 69571911 length 1548
    Stream: column 6 section DATA start: 69573459 length 11
    Stream: column 6 section LENGTH start: 69573470 length 6
    Stream: column 6 section DICTIONARY_DATA start: 69573476 length 0
    Stream: column 7 section DATA start: 69573476 length 1129
    Stream: column 7 section LENGTH start: 69574605 length 168
    Stream: column 7 section DICTIONARY_DATA start: 69574773 length 1030
    Stream: column 8 section DATA start: 69575803 length 11
    Stream: column 8 section LENGTH start: 69575814 length 6
    Stream: column 8 section DICTIONARY_DATA start: 69575820 length 4
    Stream: column 9 section DATA start: 69575824 length 11
    Stream: column 9 section LENGTH start: 69575835 length 6
    Stream: column 9 section DICTIONARY_DATA start: 69575841 length 5
    Stream: column 10 section DATA start: 69575846 length 11
    Stream: column 10 section LENGTH start: 69575857 length 6
    Stream: column 10 section DICTIONARY_DATA start: 69575863 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DICTIONARY_V2[1]
    Encoding column 4: DICTIONARY_V2[266]
    Encoding column 5: DIRECT_V2
    Encoding column 6: DICTIONARY_V2[1]
    Encoding column 7: DICTIONARY_V2[297]
    Encoding column 8: DICTIONARY_V2[1]
    Encoding column 9: DICTIONARY_V2[1]
    Encoding column 10: DICTIONARY_V2[1]

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the JAVA label Sep 3, 2025
wankunde commented Sep 3, 2025

Hi @dongjoon-hyun, could you help review this PR? Thanks.

@@ -325,9 +327,9 @@ public boolean checkMemory(double newScale) throws IOException {
  }

  private boolean checkMemory() throws IOException {
    if (rowsSinceCheck >= ROWS_PER_CHECK) {
      long size = treeWriter.estimateMemory();
Contributor:
If estimateMemory is called for each batch write, will the performance of the write be degraded?

@wankunde wankunde Sep 4, 2025
  • If rowsSinceCheck >= ROWS_PER_CHECK, estimateMemory is called exactly as often as before.
  • If rowsSinceCheck < ROWS_PER_CHECK, estimateMemory is called ROWS_PER_CHECK / VectorizedRowBatch.size times, i.e. 5000 / 1024 = 4 times by default; I think this is a much smaller overhead in the overall computation.
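The call-frequency arithmetic above can be made concrete with a small helper. This is an illustrative calculation only, assuming the default ROWS_PER_CHECK of 5000 and the default VectorizedRowBatch size of 1024:

```java
// Back-of-the-envelope count of extra estimateMemory() calls: with the byte
// check enabled, estimateMemory() can run once per written batch within a
// ROWS_PER_CHECK window, instead of once per window.
final class EstimateCallCount {
  // Extra estimateMemory() calls per ROWS_PER_CHECK window of rows
  // (integer division, matching the 5000 / 1024 = 4 estimate above).
  static int extraCallsPerWindow(int rowsPerCheck, int batchSize) {
    return rowsPerCheck / batchSize;
  }

  public static void main(String[] args) {
    System.out.println(extraCallsPerWindow(5000, 1024)); // 4
  }
}
```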

@@ -118,6 +118,11 @@ public enum OrcConf {
        "If the number of distinct keys in a dictionary is greater than this\n" +
        "fraction of the total number of non-null rows, turn off \n" +
        "dictionary encoding. Use 1 to always use dictionary encoding."),
    DICTIONARY_MAX_SIZE_IN_BYTES("orc.dictionary.maxSizeInBytes",
        "hive.exec.orc.dictionary.maxSizeInBytes",
        16 * 1024 * 1024,
Member:
May I ask why this value is the default, @wankunde?

Author:

dictionary.size = 5120, row size = 5120, dictionary SizeInBytes = 360448, rows SizeInBytes = 32768
dictionary.size = 5120, row size = 5120, dictionary SizeInBytes = 1452277760, rows SizeInBytes = 32768
dictionary.size = 1, row size = 5120, dictionary SizeInBytes = 81920, rows SizeInBytes = 32768
dictionary.size = 661, row size = 5012, dictionary SizeInBytes = 81920, rows SizeInBytes = 32768
dictionary.size = 5120, row size = 5120, dictionary SizeInBytes = 10223616, rows SizeInBytes = 32768
dictionary.size = 1, row size = 5120, dictionary SizeInBytes = 81920, rows SizeInBytes = 32768
dictionary.size = 682, row size = 5120, dictionary SizeInBytes = 147456, rows SizeInBytes = 32768
dictionary.size = 1, row size = 5120, dictionary SizeInBytes = 81920, rows SizeInBytes = 32768
dictionary.size = 2, row size = 5120, dictionary SizeInBytes = 81920, rows SizeInBytes = 32768
dictionary.size = 1, row size = 5120, dictionary SizeInBytes = 81920, rows SizeInBytes = 32768

The dictionary size in bytes is usually less than 10 MB.
If the dictionary size (> 16 MB) divided by the stripe size (< 128 MB) exceeds 12.5%, I think the dictionary is large enough to justify disabling dictionary encoding.
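A byte-cap fallback like the one argued for above could look as follows. This is a sketch under stated assumptions (the constant name mirrors the new config, the decision helper is hypothetical), not the patch itself:

```java
// Sketch: fall back to direct encoding once the dictionary's in-memory
// footprint crosses a byte cap. The 16 MB default matches the proposed
// orc.dictionary.maxSizeInBytes value; useDictionary() is illustrative.
final class DictionarySizeCheck {
  static final long DICTIONARY_MAX_SIZE_IN_BYTES = 16L << 20; // 16 MB

  // A non-positive cap disables the byte check entirely.
  static boolean useDictionary(long dictionarySizeInBytes) {
    return DICTIONARY_MAX_SIZE_IN_BYTES <= 0
        || dictionarySizeInBytes < DICTIONARY_MAX_SIZE_IN_BYTES;
  }

  public static void main(String[] args) {
    System.out.println(useDictionary(360_448L));       // typical case from the log: true
    System.out.println(useDictionary(1_452_277_760L)); // pathological 1.4 GB dictionary: false
  }
}
```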

Contributor:
The default dictionary max memory in Presto and Trino is also 16 MB:

https://github.com/prestodb/presto/blob/master/presto-orc/src/main/java/com/facebook/presto/orc/OrcWriterOptions.java#L37

public static final DataSize DEFAULT_DICTIONARY_MAX_MEMORY = new DataSize(16, MEGABYTE);

https://github.com/trinodb/trino/blob/master/lib/trino-orc/src/main/java/io/trino/orc/OrcWriterOptions.java#L52

private static final DataSize DEFAULT_DICTIONARY_MAX_MEMORY = DataSize.of(16, MEGABYTE);

@@ -182,6 +187,9 @@ public enum OrcConf {
        "added to all of the writers. Valid range is [1,10000] and is primarily meant for" +
        "testing. Setting this too low may negatively affect performance."
        + " Use orc.stripe.row.count instead if the value larger than orc.stripe.row.count."),
    STRIPE_SIZE_CHECK("orc.stripe.size.check", "hive.exec.orc.default.stripe.size.check",
        128L * 1024 * 1024,
Member:
ditto.

Author:

The default stripe size is 64 MB, so I think 128 MB is large enough to flush the stripe.

  STRIPE_SIZE("orc.stripe.size", "hive.exec.orc.default.stripe.size",
      64L * 1024 * 1024,
      "Define the default ORC stripe size, in bytes."),

Tested with our production jobs: our Spark jobs can run with 6 GB executors when STRIPE_SIZE_CHECK = 128 MB, but need 8 GB executors when STRIPE_SIZE_CHECK = 256 MB.

Contributor:

Can you consider the configuration according to the scale of stripe size?

orc.stripe.size*ratio
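The reviewer's suggestion amounts to deriving the byte threshold from the configured stripe size and a ratio rather than an absolute value. A minimal sketch, with illustrative names rather than the final config keys:

```java
// Sketch: compute the size-check threshold as stripeSize * ratio, so the
// check scales with orc.stripe.size instead of being a fixed byte count.
final class StripeCheckThreshold {
  // A non-positive ratio disables the size-based flush check entirely.
  static long checkThresholdBytes(long stripeSizeBytes, double ratio) {
    return ratio <= 0 ? 0L : (long) (stripeSizeBytes * ratio);
  }

  public static void main(String[] args) {
    long stripeSize = 64L << 20; // orc.stripe.size default, 64 MB
    System.out.println(checkThresholdBytes(stripeSize, 2.0)); // 134217728 (128 MB)
    System.out.println(checkThresholdBytes(stripeSize, 0.0)); // 0, i.e. disabled
  }
}
```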

wankunde commented Sep 8, 2025

Hi, @cxzl25 @dongjoon-hyun

  • Allow using 0 to disable this new check.
  • Added a local benchmark, 733834d

[benchmark results image]


Author:
@cxzl25 Thanks for your review. Changed STRIPE_SIZE_CHECK to OrcConf.STRIPE_SIZE_CHECK_RATIO * stripeSize.

@wankunde wankunde force-pushed the force_spill_stripe branch 2 times, most recently from 82fc656 to 64b2998 Compare September 15, 2025 03:04
Author:
Hi @dongjoon-hyun, do you have any thoughts on this PR?
