grobinson-grafana (Contributor) commented Jun 17, 2025

What this PR does / why we need it:

This pull request checks for the new ReasonFailed in the distributors.

Stacked on top of #18055 and #18123.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@grobinson-grafana grobinson-grafana requested a review from a team as a code owner June 17, 2025 11:13
level.Error(d.logger).Log("msg", "failed to check if request exceeds limits, request has been accepted", "err", err)
} else if len(streamsAfterLimits) == 0 {
// All streams have been dropped.
level.Debug(d.logger).Log("msg", "request exceeded limits, all streams will be dropped", "tenant", tenantID)
Contributor Author

I got rid of these log lines; they would produce far too much volume.

break
}
}
if !found || reason == uint32(limits.ReasonFailed) {
Contributor Author

Since we stopped using iota, no valid reason can ever be 0, meaning we can do a check against the default value of uint32.
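
To illustrate the point about zero values, here is a minimal sketch. The `Reason` type and the numeric values below are hypothetical stand-ins, not the actual definitions in the `limits` package; the only property it assumes is the one stated above, that no valid reason is 0.

```go
package main

import "fmt"

// Reason is a hypothetical stand-in for the reason codes in the limits
// package. The numeric values below are assumptions for illustration.
type Reason uint32

const (
	// Values are assigned explicitly and start at 1, so the zero value
	// of uint32 never collides with a valid reason.
	ReasonFailed     Reason = 1
	ReasonMaxStreams Reason = 2
)

// isValidReason reports whether r is one of the known reasons.
// An unset (zero) uint32 is never valid.
func isValidReason(r uint32) bool {
	switch Reason(r) {
	case ReasonFailed, ReasonMaxStreams:
		return true
	}
	return false
}

func main() {
	var unset uint32 // zero value, as left behind when no result was found
	fmt.Println(isValidReason(unset))
	fmt.Println(isValidReason(uint32(ReasonMaxStreams)))
}
```

With this layout, a field that was never populated can be told apart from any real reason without a separate "found" flag, although the loop in this PR still tracks `found` explicitly.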

Reason: uint32(limits.ReasonMaxStreams),
}},
},
expectedErr: "rpc error: code = Code(429) desc = request exceeded limits: max streams",
Contributor Author

I got rid of the reason for now. I'm not sure if it's useful, as it doesn't mention which streams exceeded the limit. I want to think about how to better communicate this data back to the user, given that some requests can be really large.

@grobinson-grafana grobinson-grafana force-pushed the grobinson/check-failed-reason-distributors branch from 56bfab8 to 9c11c0b Compare June 17, 2025 14:39
// All streams were rejected, the request should be failed.
return nil, httpgrpc.Error(http.StatusTooManyRequests, "request exceeded limits")
}
streams = accepted
Collaborator

Since we "shrink" the streams slice to the accepted ones, I believe the next check of the ingestionRateLimiter operates on a stale value of validationContext.validationMetrics.aggregatedPushStats.lineSize, which was calculated over all incoming streams, and we need to recalculate it here, right?

Imagine accepting a subset of the incoming streams but rate limiting on the total incoming.

Contributor Author

I will think about how to solve this 😢 We had the same behavior before this PR, where we rate limit on streams that were discarded for exceeding the stream limit, so I think we should keep this PR scoped to the feature, and I will open a second PR to address this problem.
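
One possible direction for that follow-up, as a rough sketch only: recompute the line size over the accepted streams before consulting the rate limiter. The `Entry` and `KeyedStream` types and the `totalLineSize` helper here are simplified, hypothetical stand-ins, not the distributor's actual types or the fix that was eventually made.

```go
package main

import "fmt"

// Entry and KeyedStream are simplified, hypothetical stand-ins for the
// distributor's types.
type Entry struct{ Line string }

type KeyedStream struct {
	HashKeyNoShard uint64
	Entries        []Entry
}

// totalLineSize sums the byte length of all log lines in the given streams.
func totalLineSize(streams []KeyedStream) int {
	total := 0
	for _, s := range streams {
		for _, e := range s.Entries {
			total += len(e.Line)
		}
	}
	return total
}

func main() {
	incoming := []KeyedStream{
		{HashKeyNoShard: 1, Entries: []Entry{{Line: "hello"}}},
		{HashKeyNoShard: 2, Entries: []Entry{{Line: "rejected line"}}},
	}
	// Pretend only the first stream was accepted by the limits check.
	accepted := incoming[:1]

	// Rate limiting on the incoming size would count bytes that are
	// about to be dropped; the accepted size does not.
	fmt.Println("incoming bytes:", totalLineSize(incoming))
	fmt.Println("accepted bytes:", totalLineSize(accepted))
}
```

The trade-off is a second pass over the accepted streams, which may or may not matter for large requests.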

// the original backing array. See "Filtering without allocation" from
// https://go.dev/wiki/SliceTricks.
withinLimits := make([]KeyedStream, 0, len(streams))
accepted := make([]KeyedStream, 0, len(streams))
Collaborator

Maybe we need to consider doing the "filtering without allocation" trick here if other validation stats are subsequently updated by this. OTOH, the backing array includes shards and not streams, right? So the trick might not be possible at all. WDYT?
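
For reference, the "filtering without allocation" idiom from the Go wiki looks roughly like the sketch below. `KeyedStream` is reduced to a single field and `filterInPlace` is a hypothetical helper name; whether the idiom is applicable to the distributor's slices is exactly the open question above.

```go
package main

import "fmt"

// KeyedStream is a simplified, hypothetical stand-in for the real type.
type KeyedStream struct{ HashKeyNoShard uint64 }

// filterInPlace keeps the streams for which keep returns true. The
// result reuses the backing array of the input, so no new slice is
// allocated, but the contents of the input slice are overwritten.
func filterInPlace(streams []KeyedStream, keep func(KeyedStream) bool) []KeyedStream {
	accepted := streams[:0] // zero length, same backing array
	for _, s := range streams {
		if keep(s) {
			accepted = append(accepted, s)
		}
	}
	return accepted
}

func main() {
	streams := []KeyedStream{{1}, {2}, {3}, {4}}
	accepted := filterInPlace(streams, func(s KeyedStream) bool {
		return s.HashKeyNoShard%2 == 0
	})
	fmt.Println(len(accepted), accepted)
}
```

The caller must not use the original slice afterwards, since its leading elements now hold the accepted values. That is the reason the idiom is only safe when nothing else still depends on the original contents.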

Comment on lines +109 to +121
// TODO(grobinson): We have an O(N*M) loop here. Need to benchmark if
// its faster to do this or if we should create a map instead.
var (
found bool
reason uint32
)
for _, res := range results {
if res.StreamHash == s.HashKeyNoShard {
found = true
reason = res.Reason
break
}
}
Collaborator

We should measure whether this adds any latency in case of ingest-limits degradation.
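
The map-based alternative mentioned in the TODO could look something like this sketch. `Result` and `indexReasons` are hypothetical names with simplified fields; whether it is actually faster than the nested loop would still need benchmarking, since building the map has its own cost for small inputs.

```go
package main

import "fmt"

// Result is a simplified, hypothetical stand-in for one entry in the
// limits response.
type Result struct {
	StreamHash uint64
	Reason     uint32
}

// indexReasons builds a lookup table from stream hash to reason in one
// pass over the results. The first result for a hash wins, matching
// the break-on-first-match behavior of the nested loop.
func indexReasons(results []Result) map[uint64]uint32 {
	byHash := make(map[uint64]uint32, len(results))
	for _, res := range results {
		if _, seen := byHash[res.StreamHash]; !seen {
			byHash[res.StreamHash] = res.Reason
		}
	}
	return byHash
}

func main() {
	results := []Result{{StreamHash: 10, Reason: 2}, {StreamHash: 20, Reason: 1}}
	byHash := indexReasons(results)

	// Each stream lookup is now a single map access instead of a scan.
	for _, hash := range []uint64{10, 30} {
		reason, found := byHash[hash]
		fmt.Println(hash, found, reason)
	}
}
```

This turns the O(N*M) scan into O(N+M) at the price of one map allocation per request.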

@grobinson-grafana grobinson-grafana enabled auto-merge (squash) June 18, 2025 10:18
@grobinson-grafana grobinson-grafana merged commit 104c457 into main Jun 18, 2025
65 checks passed
@grobinson-grafana grobinson-grafana deleted the grobinson/check-failed-reason-distributors branch June 18, 2025 10:20