periklis
Collaborator

What this PR does / why we need it:
This pull request makes the ingest-limits readiness handler consider partition state before signaling ready. That is, the service is considered ready to consume traffic only when both of the following conditions hold:

  • The partition manager has been assigned partitions (at most 10 retries).
  • The partition lifecycler has moved all partitions from the pending/replaying state to the ready state.

Which issue(s) this PR fixes:
Fixes #grafana/loki-private/issues/1722

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@periklis periklis requested a review from a team as a code owner June 12, 2025 08:22
@periklis periklis self-assigned this Jun 12, 2025
@grobinson-grafana grobinson-grafana force-pushed the partition-aware-readiness branch 3 times, most recently from e2c8706 to f5a3e24 Compare June 18, 2025 13:51
This commit moves the partition readiness check after the service
and lifecycler readiness checks. It does this because the
partition readiness check cannot pass until the consumer is running,
and the consumer.Run() goroutine is not started until running()
is called. That means it makes sense to check the service state and
the lifecycler before checking the partitions.
@grobinson-grafana grobinson-grafana force-pushed the partition-aware-readiness branch from f5a3e24 to 6e0276f Compare June 18, 2025 13:52
This commit fixes a race condition where the maximum number of
attempts is never satisfied. The race arises because two (or more)
goroutines can load and increment the atomic concurrently, which
can cause the next Load() to return a value greater than the
maximum number of attempts, so the equality check never matches.
@pull-request-size pull-request-size bot added size/M and removed size/L labels Jun 18, 2025
@grobinson-grafana grobinson-grafana force-pushed the partition-aware-readiness branch from 60ee2da to e962ba3 Compare June 18, 2025 14:15
This commit fixes the race condition that exists between two
(or more) goroutines concurrently executing CheckReady. What can
happen is both goroutines call Load at the same time, do the check,
and then increment the number of attempts without using check-and-set.
This allowed the number of attempts to exceed the maximum number
of attempts.
@grobinson-grafana grobinson-grafana force-pushed the partition-aware-readiness branch from e962ba3 to a6c2ffa Compare June 18, 2025 14:15
@pull-request-size pull-request-size bot added size/L and removed size/M labels Jun 18, 2025
@grobinson-grafana grobinson-grafana force-pushed the partition-aware-readiness branch from 414fff0 to 4a8dabf Compare June 18, 2025 15:44
@grobinson-grafana grobinson-grafana force-pushed the partition-aware-readiness branch from 4a8dabf to 865bc62 Compare June 18, 2025 15:44
@grobinson-grafana grobinson-grafana force-pushed the partition-aware-readiness branch from c12dffd to c83e653 Compare June 18, 2025 15:46
Collaborator Author

@periklis periklis left a comment

LGTM

// The maximum amount of time to wait to join the consumer group and be
// assigned some partitions before giving up and going ready in
// [Service.CheckReady].
partitionReadinessWaitAssignPeriod = 30 * time.Second
Contributor

I changed this so now the time that we wait is independent of:

  1. The readiness check interval
  2. The number of clients calling /ready

Collaborator Author

@periklis periklis left a comment

LGTM

@grobinson-grafana grobinson-grafana merged commit a07bee5 into main Jun 20, 2025
65 checks passed
@grobinson-grafana grobinson-grafana deleted the partition-aware-readiness branch June 20, 2025 08:37