Skip to content

Conversation

Erigara
Copy link
Contributor

@Erigara Erigara commented May 30, 2025

Which issue does this PR close?

What changes are included in this PR?

  • compare duplicates by loading manifest files and taking file_path from it
  • use direct calls instead of scan

Are these changes tested?

  • work for me local experiments
  • fixed existing tests
  • added a new test to showcase behavior

@Erigara Erigara force-pushed the fix/append_files_check_duplicates branch 2 times, most recently from 9813024 to 4dd9e78 Compare May 30, 2025 18:36
@@ -379,6 +364,7 @@ mod tests {

// Attempt to add the existing Parquet files with fast append.
let new_tx = fast_append_action
.with_check_duplicate(false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what was initial purpose of the test.
But it looks like to check ability of adding existing parquet files to the table (physically existing not in table).
In which case adding with_check_duplicate(false) here is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a wrong test, we should remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@Erigara Erigara force-pushed the fix/append_files_check_duplicates branch 3 times, most recently from 08f5af5 to a90174a Compare May 30, 2025 19:07
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Erigara for this pr, generally looks good!

Comment on lines 135 to 138
let mut append_action = tx
.fast_append(None, vec![])
.unwrap()
.with_check_duplicate(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need this? I think if there exists duplicated files in this test, we should remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By just looking at the test idk why we are trying to commit the same files twice.
So i agree let's remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@@ -379,6 +364,7 @@ mod tests {

// Attempt to add the existing Parquet files with fast append.
let new_tx = fast_append_action
.with_check_duplicate(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a wrong test, we should remove it.

@Erigara Erigara force-pushed the fix/append_files_check_duplicates branch from a90174a to 00b7b87 Compare June 6, 2025 11:38
@Erigara Erigara requested a review from liurenjie1024 June 6, 2025 11:39
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Erigara for this pr!

@liurenjie1024 liurenjie1024 merged commit bddffa1 into apache:main Jun 7, 2025
17 checks passed
@Erigara Erigara deleted the fix/append_files_check_duplicates branch June 8, 2025 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(iceberg): add_files doesn't actually check duplicated files
2 participants