feat: check for failed reason in distributors #18128
Conversation
	level.Error(d.logger).Log("msg", "failed to check if request exceeds limits, request has been accepted", "err", err)
} else if len(streamsAfterLimits) == 0 {
	// All streams have been dropped.
	level.Debug(d.logger).Log("msg", "request exceeded limits, all streams will be dropped", "tenant", tenantID)
Got rid of these log lines; they would generate far too much volume.
			break
		}
	}
	if !found || reason == uint32(limits.ReasonFailed) {
Since we stopped using iota, no valid reason can ever be 0, meaning we can check against the default value of uint32.
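A minimal sketch of that idea, with assumed constant values (the real limits package may differ):

```go
// Hypothetical sketch: because the reason constants no longer start
// at iota's zero value, they begin at 1, so the uint32 zero value
// can stand in for "no reason recorded for this stream".
const (
	ReasonFailed     uint32 = 1 // the limits check itself failed
	ReasonMaxStreams uint32 = 2 // per-tenant stream limit exceeded
)

// reasonFor returns 0 when the stream hash has no recorded reason,
// which is unambiguous precisely because no valid reason is ever 0.
func reasonFor(reasons map[uint64]uint32, hash uint64) uint32 {
	return reasons[hash]
}
```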
	Reason: uint32(limits.ReasonMaxStreams),
}},
},
expectedErr: "rpc error: code = Code(429) desc = request exceeded limits: max streams",
I got rid of the reason for now; I'm not sure if it's useful, as it doesn't mention which streams were affected. I want to think about how to better communicate this data back to the user, given that some requests can be really large.
Force-pushed from 56bfab8 to 9c11c0b
	// All streams were rejected, the request should be failed.
	return nil, httpgrpc.Error(http.StatusTooManyRequests, "request exceeded limits")
}
streams = accepted
Since we "shink" the streams slice to the accepted ones, I believe the next check of the ingestionRateLimiter is operating on the wrong calculated value validationContext.validationMetrics.aggregatedPushStats.lineSize
and we need to redo it here, right?
Imagine accepting a subset of the incoming streams but rate limiting on the total incoming.
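A rough sketch of the fix being suggested, assuming KeyedStream wraps a logproto-style stream with Entries and Line fields (the field names are assumptions, not the actual distributor code):

```go
// Hypothetical sketch: after shrinking streams to the accepted
// subset, recompute the aggregated line size so the rate limiter
// charges only for bytes that will actually be ingested.
func recomputeLineSize(accepted []KeyedStream) int {
	var size int
	for _, s := range accepted {
		for _, e := range s.Stream.Entries {
			size += len(e.Line)
		}
	}
	return size
}
```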
I will think about how to solve this 😢 We had the same behavior before this PR, where we rate limit on streams discarded for exceeding the stream limit, so let's keep this PR scoped to the feature and I will open a second PR to address this problem.
// the original backing array. See "Filtering without allocation" from
// https://go.dev/wiki/SliceTricks.
-withinLimits := make([]KeyedStream, 0, len(streams))
+accepted := make([]KeyedStream, 0, len(streams))
Maybe we need to consider doing the "filtering without allocation" trick here, if other validation stats are updated by this later on; see the sketch below. OTOH, the backing array includes shards and not streams, right? So the trick might not be possible at all. WDYT?
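For reference, a minimal sketch of the trick being discussed (acceptStream is a hypothetical predicate, not a function from the PR):

```go
// "Filtering without allocation" from https://go.dev/wiki/SliceTricks:
// reuse the input slice's backing array instead of allocating a new
// one. Note this overwrites the original elements as it goes, which
// is why it breaks if anything else (e.g. per-shard views) still
// reads the old backing array.
accepted := streams[:0]
for _, s := range streams {
	if acceptStream(s) { // hypothetical predicate
		accepted = append(accepted, s)
	}
}
```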
// TODO(grobinson): We have an O(N*M) loop here. Need to benchmark if
// it's faster to do this or if we should create a map instead.
var (
	found  bool
	reason uint32
)
for _, res := range results {
	if res.StreamHash == s.HashKeyNoShard {
		found = true
		reason = res.Reason
		break
	}
}
We should measure whether this adds any latency in case of ingest-limits degradation.
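A short sketch of the map alternative mentioned in the TODO, which trades one O(M) pass for O(1) lookups per stream (the Result type is assumed; field names are taken from the snippet above):

```go
// Hypothetical sketch: build the lookup once in O(M), then each
// stream's reason check is an O(1) map access instead of a scan
// of results.
func reasonLookup(results []Result) map[uint64]uint32 {
	reasons := make(map[uint64]uint32, len(results))
	for _, res := range results {
		reasons[res.StreamHash] = res.Reason
	}
	return reasons
}

// Per stream, instead of the O(N*M) loop:
//   reason, found := reasons[s.HashKeyNoShard]
```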
What this PR does / why we need it:
This pull request checks for the new ReasonFailed in the distributors. Stacked on top of #18055 and #18123.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Checklist
- Reviewed the CONTRIBUTING.md guide (required)
- feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
- Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
- If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR