grobinson-grafana
Contributor

@grobinson-grafana grobinson-grafana commented Apr 23, 2025

What this PR does / why we need it:

This pull request updates the limits-frontend to fail over to other zones for stream hashes that cannot be queried in the current zone.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@grobinson-grafana grobinson-grafana requested a review from a team as a code owner April 23, 2025 11:24
// can use sort.Search to subtract the two slices.
slices.Sort(streamHashesToDelete)
streamHashesToQuery = slices.DeleteFunc(streamHashesToQuery, func(streamHashToQuery uint64) bool {
// see https://pkg.go.dev/sort#Search
Contributor Author

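This should be O(N log M): one binary search over the M sorted hashes to delete for each of the N hashes to query, after an O(M log M) sort.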
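
For readers following the thread, here is a minimal sketch of the slice subtraction being discussed. It assumes Go 1.21 or later for the `slices` package; the function name `subtract` and its parameter names are illustrative and are not taken from the PR. It sorts the hashes to delete once, then runs one binary search per hash to query.

```go
package main

import (
	"slices"
	"sort"
)

// subtract removes every hash in toDelete from toQuery and returns the
// shortened slice. It sorts toDelete in place and modifies toQuery in place.
// The name and signature are illustrative, not taken from the PR.
func subtract(toQuery, toDelete []uint64) []uint64 {
	// Sorting lets each membership check be a binary search.
	slices.Sort(toDelete)
	return slices.DeleteFunc(toQuery, func(h uint64) bool {
		// sort.Search returns the smallest index for which the predicate
		// is true, or len(toDelete) if there is no such index.
		i := sort.Search(len(toDelete), func(i int) bool {
			return toDelete[i] >= h
		})
		// h is present only if the index is in range and the values match.
		return i < len(toDelete) && toDelete[i] == h
	})
}
```

The bounds check on the index matters: `sort.Search` reports where the value would be inserted, not whether it was found.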

@grobinson-grafana grobinson-grafana force-pushed the grobinson/failover-to-other-zones branch 3 times, most recently from 66c02a0 to 57ccf97 Compare April 23, 2025 16:25
return responses, nil
}
}
// Treat remaining stream hashes as unknown streams.
Contributor Author

There is an interesting question here: is this the correct behavior, or should it be an error?

Contributor Author

@grobinson-grafana grobinson-grafana Apr 23, 2025

Some subset of stream hashes might not be queryable because either:

  1. There are no pods assigned as consumers for the partition that owns the stream hashes
  2. All pods across all zones that consume this partition were unavailable

This subset of stream hashes will be treated as unknown streams. Is this the correct thing to do?

Collaborator

I believe the correct behavior here is to return an error because:

  1. We classify only two unambiguous types of unknown streams: exceeds_max_streams and exceeds_rate_limit. Adding a third one for "could-not-query" is too ambiguous.
  2. Even if we return the streams as unknown, it is not easy to tell the distributor clients what to do. Should they just retry? Or bubble up an error to the agents?
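
To make the alternative concrete, here is a small sketch of returning an error for unanswered stream hashes instead of adding a third reason. Everything in it is hypothetical: the `Reason` type, the `ErrUnanswered` sentinel, and the `checkAnswered` helper are names invented for illustration. Only the two reason strings come from the comment above.

```go
package main

import (
	"errors"
	"fmt"
)

// Reason is a hypothetical type holding the two reasons named in the review.
type Reason string

const (
	ReasonExceedsMaxStreams Reason = "exceeds_max_streams"
	ReasonExceedsRateLimit  Reason = "exceeds_rate_limit"
)

// ErrUnanswered is a hypothetical sentinel error. Callers could match it
// with errors.Is to decide whether to retry or surface the failure.
var ErrUnanswered = errors.New("stream hashes could not be queried in any zone")

// checkAnswered returns nil when every stream hash was answered, and an
// error wrapping ErrUnanswered otherwise. No third Reason is introduced.
func checkAnswered(unanswered []uint64) error {
	if len(unanswered) == 0 {
		return nil
	}
	return fmt.Errorf("%w: %d unanswered", ErrUnanswered, len(unanswered))
}
```

With this shape the set of reasons stays at two, and the retry-or-propagate decision is left to whoever receives the error.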

@grobinson-grafana grobinson-grafana force-pushed the grobinson/failover-to-other-zones branch 3 times, most recently from e2f6226 to 37b2513 Compare April 23, 2025 20:22
@grobinson-grafana grobinson-grafana force-pushed the grobinson/failover-to-other-zones branch 7 times, most recently from 4072cd3 to 71eff46 Compare May 27, 2025 14:28
break
}
}
// TODO(grobinson): In a subsequent change, I will figure out what to do
Contributor Author

At present, unanswered streams are permitted. First, I want to emit a metric to track this; I will do that in a follow-up PR to make the change easier to review.

@grobinson-grafana grobinson-grafana force-pushed the grobinson/failover-to-other-zones branch from 71eff46 to b5553fb Compare May 27, 2025 18:26
@grobinson-grafana grobinson-grafana force-pushed the grobinson/failover-to-other-zones branch 2 times, most recently from 620e79a to b9480bb Compare May 27, 2025 18:36
This commit updates the limits-frontend to fail over to other zones
for stream hashes that cannot be queried in the current zone.
@grobinson-grafana grobinson-grafana force-pushed the grobinson/failover-to-other-zones branch from b9480bb to c0ab884 Compare May 28, 2025 08:27
Comment on lines 56 to 57
numAssignedPartitionsRequests int
numExceedsLimitsRequests int
Collaborator

Double-checking: is there no need for a mutex for these? Are there no concurrent test cases?

Contributor Author

No mutex required right now. Each test case uses a separate mock, and requests are executed in sequence.
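
If test cases ever did call the mock concurrently, the counters could be guarded roughly as follows. This is only a sketch: the `mockClient` type and its method names are hypothetical, and only the two counter field names come from the diff under review.

```go
package main

import "sync"

// mockClient is a hypothetical stand-in for the test mock discussed above.
type mockClient struct {
	mu                            sync.Mutex
	numAssignedPartitionsRequests int
	numExceedsLimitsRequests      int
}

// GetAssignedPartitions records one call while holding the mutex.
func (m *mockClient) GetAssignedPartitions() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.numAssignedPartitionsRequests++
}

// ExceedsLimits records one call while holding the mutex.
func (m *mockClient) ExceedsLimits() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.numExceedsLimitsRequests++
}

// counts returns a consistent snapshot of both counters.
func (m *mockClient) counts() (assigned, exceeds int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.numAssignedPartitionsRequests, m.numExceedsLimitsRequests
}
```

As long as each test case owns its mock and issues requests in sequence, the plain `int` fields without a mutex are sufficient.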

if err != nil {
continue
}
responses = append(responses, resps...)
Collaborator

nit.

Suggested change
responses = append(responses, resps...)
responses = append(responses, resps...)
if len(streams) == len(answered) {
break
}

Contributor Author

We can't just break here. We also need to delete the answered streams, because we check len(streams) later to see whether any streams went unanswered.

Contributor Author

Going to look at this in the next PR 👍
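
One way the two ideas in this thread could fit together is sketched below: delete answered streams after each zone, and exit early once none remain, so that the remaining length still tells the caller whether anything went unanswered. It assumes Go 1.21 or later. The `queryFunc` type, `errZoneUnavailable`, and `queryWithFailover` are hypothetical names, not the PR's actual code.

```go
package main

import (
	"errors"
	"slices"
)

// errZoneUnavailable is a hypothetical error a zone query might return.
var errZoneUnavailable = errors.New("zone unavailable")

// queryFunc is a hypothetical signature: it asks one zone about the given
// stream hashes and returns the hashes that zone was able to answer.
type queryFunc func(zone string, streams []uint64) ([]uint64, error)

// queryWithFailover tries each zone in order. It returns the answered
// hashes and the hashes that no zone could answer.
func queryWithFailover(zones []string, streams []uint64, query queryFunc) (answered, unanswered []uint64) {
	// Work on a copy so the caller's slice is left untouched.
	unanswered = slices.Clone(streams)
	for _, zone := range zones {
		if len(unanswered) == 0 {
			break // Every stream has been answered, skip remaining zones.
		}
		resps, err := query(zone, unanswered)
		if err != nil {
			continue // Fail over to the next zone.
		}
		answered = append(answered, resps...)
		// Delete answered streams so len(unanswered) stays meaningful.
		unanswered = slices.DeleteFunc(unanswered, func(h uint64) bool {
			return slices.Contains(resps, h)
		})
	}
	return answered, unanswered
}
```

Checking the remaining length at the top of the loop gives the early exit without skipping the deletion step.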

@grobinson-grafana grobinson-grafana merged commit 93e829a into main May 28, 2025
65 checks passed
@grobinson-grafana grobinson-grafana deleted the grobinson/failover-to-other-zones branch May 28, 2025 09:58