-
Notifications
You must be signed in to change notification settings - Fork 3.8k
feat: limits-frontend failover to other zones #17408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
// can use sort.Search to subtract the two slices. | ||
slices.Sort(streamHashesToDelete) | ||
streamHashesToQuery = slices.DeleteFunc(streamHashesToQuery, func(streamHashToQuery uint64) bool { | ||
// see https://pkg.go.dev/sort#Search |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be O(N + logN)
66c02a0
to
57ccf97
Compare
pkg/limits/frontend/ring.go
Outdated
return responses, nil | ||
} | ||
} | ||
// Treat remaining stream hashes as unknown streams. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is an interesting question here of is this the correct behavior, or should this be an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case where some subset of stream hashes cannot be queried because either:
- There are no pods assigned as consumers for the partition that owns the stream hashes
- All pods across all zones that consume this partition were unavailable
This subset of stream hashes will be treated as unknown streams. Is this the correct thing to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the correct behavior here is to return an error because:
- We classify only two non ambiguous types or unknown streams:
exceeds_max_streams
andexceeds_rate_limit
. Adding a third one for "could-not-query` is too ambiguous. - Even if we return the streams as unknown back it is not easy to tell the distributor clients what to do. Should they just retry? Or bubble up an error to the agents?
e2f6226
to
37b2513
Compare
4072cd3
to
71eff46
Compare
break | ||
} | ||
} | ||
// TODO(grobinson): In a subsequent change, I will figure out what to do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At present, unanswered streams are permitted. First I want to emit a metric to track this, and I will do this in a follow up PR to make the change easier to see.
71eff46
to
b5553fb
Compare
620e79a
to
b9480bb
Compare
This commit updates the limits-frontend to failover to other zones for stream hashes that cannot be queried in the current zone.
b9480bb
to
c0ab884
Compare
numAssignedPartitionsRequests int | ||
numExceedsLimitsRequests int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double checking, no need for a mutex for these? no concurrent test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No mutex required right now. Each test case uses a separate mock, and requests are executed in sequence.
if err != nil { | ||
continue | ||
} | ||
responses = append(responses, resps...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit.
responses = append(responses, resps...) | |
responses = append(responses, resps...) | |
if len(streams) == len(answered) { | |
break | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to delete the streams, can't just break
, as we will check len(streams)
later to see if some streams went unanswered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to look at this in the next PR 👍
What this PR does / why we need it:
This pull request updates the limits-frontend to failover to other zones for stream hashes that cannot be queried in the current zone.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Checklist
CONTRIBUTING.md
guide (required)feat
PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.docs/sources/setup/upgrade/_index.md
deprecated-config.yaml
anddeleted-config.yaml
files respectively in thetools/deprecated-config-checker
directory. Example PR