joechenrh
Contributor

@joechenrh joechenrh commented Feb 11, 2025

What problem does this PR solve?

Issue Number: close #56104, close #60224

Problem Summary:

What changed and how does it work?

Construct file infos in parallel for both Lightning and IMPORT INTO.

  • For Lightning, the concurrency is RegionConcurrency * 2.
  • For IMPORT INTO, the concurrency is the task's ThreadCnt * 2.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

The following results were measured with Lightning on a 16-core machine (so the concurrency is 32), with data stored on S3. IMPORT INTO should show similar results.

  • Importing one table with 1000 * 10M parquet files: the time is reduced from 6.8 min to 16.7 s.
  • Importing one table with 1000 * 7M compressed CSV files: the time is reduced from 5 min to 10 s.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 11, 2025

tiprow bot commented Feb 11, 2025

Hi @joechenrh. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


codecov bot commented Feb 11, 2025

Codecov Report

Attention: Patch coverage is 82.48175% with 24 lines in your changes missing coverage. Please review.

Project coverage is 75.8478%. Comparing base (1cd3943) to head (67db2eb).
Report is 529 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #59382        +/-   ##
================================================
+ Coverage   73.1105%   75.8478%   +2.7373%     
================================================
  Files          1711       1760        +49     
  Lines        473248     492946     +19698     
================================================
+ Hits         345994     373889     +27895     
+ Misses       105966      96595      -9371     
- Partials      21288      22462      +1174     
Flag          Coverage Δ
integration   49.7259% <60.5839%> (?)
unit          73.3164% <66.6666%> (+0.9863%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components   Coverage Δ
dumpling     52.6553% <ø> (ø)
parser       ∅ <ø> (∅)
br           62.3206% <ø> (+15.0838%) ⬆️

@joechenrh
Contributor Author

/cc @kennytm

@ti-chi-bot ti-chi-bot bot requested a review from kennytm February 12, 2025 01:44
Contributor

@D3Hunter D3Hunter left a comment

7/9

}

var err error
if dataFiles, err = mydump.ParallelProcess(ctx, allFiles, e.ThreadCnt,

Suggested change
- if dataFiles, err = mydump.ParallelProcess(ctx, allFiles, e.ThreadCnt,
+ if dataFiles, err = mydump.ParallelProcess(ctx, allFiles, e.ThreadCnt*2,

just like lightning does, we can double it

Contributor

@kennytm kennytm left a comment

rest lgtm

Contributor

@D3Hunter D3Hunter left a comment

6/11

@@ -142,6 +144,10 @@ type MDLoaderSetupConfig struct {
// MaxScanFiles specifies the maximum number of files to scan.
// If the value is <= 0, as many data source files as possible will be scanned.
MaxScanFiles int

// ScanFileConcurrency specifies the concurrency of scanning source files.
ScanFileConcurrency int

maybe just move it to LoaderConfig, it's always needed

Contributor Author

Since it's only used by mdLoaderSetup, I added it only to MDLoaderSetupConfig and set it through WithScanFileConcurrency, just like WithMaxScanFiles.
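
For readers unfamiliar with this style, the reply above describes what is commonly called the functional options pattern. The sketch below is a generic illustration only: `setupConfig`, `setupOption`, `newSetupConfig`, and the lowercase `with*` helpers are hypothetical stand-ins, and the default value is an assumption. The actual signatures of `WithScanFileConcurrency` and `WithMaxScanFiles` in TiDB may differ.

```go
package main

import "fmt"

// setupConfig is a hypothetical, trimmed-down stand-in for MDLoaderSetupConfig.
type setupConfig struct {
	MaxScanFiles        int
	ScanFileConcurrency int
}

// setupOption mutates a setupConfig; each with* helper returns one.
type setupOption func(*setupConfig)

// withMaxScanFiles caps the number of files to scan.
func withMaxScanFiles(n int) setupOption {
	return func(c *setupConfig) { c.MaxScanFiles = n }
}

// withScanFileConcurrency sets how many files are scanned at once.
func withScanFileConcurrency(n int) setupOption {
	return func(c *setupConfig) { c.ScanFileConcurrency = n }
}

// newSetupConfig applies the options in order on top of the defaults.
func newSetupConfig(opts ...setupOption) *setupConfig {
	cfg := &setupConfig{ScanFileConcurrency: 1} // assumed default for this sketch
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}

func main() {
	regionConcurrency := 16 // assumed value for illustration
	cfg := newSetupConfig(withScanFileConcurrency(regionConcurrency * 2))
	fmt.Println(cfg.MaxScanFiles, cfg.ScanFileConcurrency)
}
```

Keeping the field on the setup config means callers that never scan files do not have to supply it, which is the trade-off being discussed in this thread.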

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2025
@joechenrh
Contributor Author

/retest


tiprow bot commented Apr 10, 2025

@joechenrh: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

@joechenrh
Contributor Author

/retest


tiprow bot commented Apr 11, 2025

@joechenrh: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

@kennytm
Contributor

kennytm commented Apr 11, 2025

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Apr 11, 2025
@joechenrh
Contributor Author

/retest

@ti-chi-bot ti-chi-bot bot merged commit cc8d9cb into pingcap:master Apr 11, 2025
25 checks passed
@joechenrh
Contributor Author

/cherry-pick release-8.5

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Apr 11, 2025
@ti-chi-bot
Member

@joechenrh: new pull request created to branch release-8.5: #60508.
But this PR has conflicts, please resolve them!

In response to this:

/cherry-pick release-8.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@joechenrh joechenrh deleted the parallel-read branch May 27, 2025 10:20
@Benjamin2037 Benjamin2037 added needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. labels Jun 23, 2025
@ti-chi-bot
Member

In response to a cherrypick label: failed to apply #59382 on top of branch "release-8.5":

failed to git commit: exit status 1

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Jun 23, 2025
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #61920.
But this PR has conflicts, please resolve them!

@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Jul 3, 2025
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Jul 3, 2025
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #62171.
But this PR has conflicts, please resolve them!

Labels
approved lgtm needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.