feat(ingest-limits): Enforce synchronously max stream limit per partition #17527
Conversation
partitionID := int32(stream.StreamHash % uint64(s.cfg.NumPartitions))

if !s.partitionManager.Has(partitionID) {
	continue
We should reject this case; at least that's how I've described the expected behavior in the issue. For example, suppose we send a stream to the wrong instance because we are working from an outdated GetAssignedPartitions response: the stream will be accepted, but we can't know whether it should have been accepted, since we sent it to a pod that doesn't own this partition.
	return
}
for _, stream := range req.Streams {
	partitionID := int32(stream.StreamHash % uint64(s.cfg.NumPartitions))
I think we are missing a check if the partitionID is also assigned?
Yes, indeed. Do you think we can silently skip the streams here then? Or should we report this as an error to the frontend so it can retry on the right partition consumer?
I think for now we need to drop the streams (i.e. reject them). If we skip, then the streams are accepted, and this will allow limits to be exceeded.
Sorry, I need to be more precise with my question here. We already drop these streams; we just write a warn log message in addition. Do we want to bubble up an error to the frontend to retry on other partition consumers?
Ah OK, to be clear, the streams also need to be rejected (not just dropped).
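For concreteness, a small self-contained sketch of the outcome of this thread: streams that land on a partition this instance does not own are surfaced as rejection results the frontend can see, rather than only producing a warn log. The `ExceedsLimitsResult` shape mirrors the result type used elsewhere in this PR; the reason constant and helper function are made up for illustration.

```go
// Hypothetical sketch: surface unowned-partition streams as rejection results
// so the frontend sees them, instead of only writing a warn log.
package main

import "fmt"

// ExceedsLimitsResult mirrors the result shape used elsewhere in this PR.
type ExceedsLimitsResult struct {
	StreamHash uint64
	Reason     uint32
}

// ReasonUnownedPartition is a made-up reason value for illustration.
const ReasonUnownedPartition uint32 = 99

func rejectUnowned(streamHashes []uint64, numPartitions int, owns func(int32) bool) []*ExceedsLimitsResult {
	var results []*ExceedsLimitsResult
	for _, h := range streamHashes {
		// Same partitioning scheme as in the diff: hash modulo partition count.
		partitionID := int32(h % uint64(numPartitions))
		if !owns(partitionID) {
			results = append(results, &ExceedsLimitsResult{StreamHash: h, Reason: ReasonUnownedPartition})
		}
	}
	return results
}

func main() {
	owns := func(id int32) bool { return id == 0 } // this instance owns only partition 0
	for _, r := range rejectUnowned([]uint64{1, 2, 3}, 2, owns) {
		fmt.Printf("rejected stream %d, reason %d\n", r.StreamHash, r.Reason)
	}
}
```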
pkg/limits/stream_metadata.go
	s.stripes[i][tenant][partitionID] = make([]Stream, 0)
}
var (
	exceedLimits = make(map[Reason][]uint64)
I think this can be a slice of len(streams), as it doesn't look like we really use the result as a map in ingest_limits.go; instead we just iterate over it and append to another slice?
Actually, we are using the key (i.e. Reason) to feed the results:
var results []*logproto.ExceedsLimitsResult
for reason, streamHashes := range exceedLimits {
for _, streamHash := range streamHashes {
results = append(results, &logproto.ExceedsLimitsResult{
StreamHash: streamHash,
Reason: uint32(reason),
})
}
}
Also, we will need the key as a differentiator when building the rate limits in a follow-up PR.
You misunderstood my question, I think! 😄 But I think the last sentence explains why this needs to be a map instead of a slice of structs? You want to count all streams that are rate limited, or something like that?
Yes counting rate limited streams was what I had in mind. That's what we did in the frontend previously.
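For reference, a self-contained sketch of why the Reason key is useful: the same map feeds the per-stream results (as in the snippet quoted above) and a per-reason count for the rate-limit follow-up. Type and constant names here are stubs for illustration, not the PR's code.

```go
package main

import "fmt"

type Reason uint32

const (
	ReasonExceedsMaxStreams Reason = iota
	ReasonExceedsRateLimit
)

type ExceedsLimitsResult struct {
	StreamHash uint64
	Reason     uint32
}

func main() {
	exceedLimits := map[Reason][]uint64{
		ReasonExceedsMaxStreams: {0xaa, 0xbb},
		ReasonExceedsRateLimit:  {0xcc},
	}

	// Per-stream results, mirroring the loop quoted above.
	var results []*ExceedsLimitsResult
	for reason, streamHashes := range exceedLimits {
		for _, streamHash := range streamHashes {
			results = append(results, &ExceedsLimitsResult{StreamHash: streamHash, Reason: uint32(reason)})
		}
	}

	// Per-reason counts, e.g. how many streams were rate limited.
	counts := make(map[Reason]int, len(exceedLimits))
	for reason, streamHashes := range exceedLimits {
		counts[reason] = len(streamHashes)
	}

	fmt.Println(len(results), counts[ReasonExceedsRateLimit]) // 3 1
}
```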
)

// Count as active streams all streams that are not expired.
for _, stored := range s.stripes[i][tenant][partitionID] {
I have a feeling this will be a 🔥 path, we might need to optimize it later 👍
I noticed this too, because on average we have ~20-30k streams per partition. Let's address this with a counter per partition, stored alongside the metadata, in a follow-up PR. I think this will immediately solve our problems.
How do you intend to keep the count up to date over time?
Never mind, I was thinking out loud about an active []map[string]map[int32]uint64 next to our stripes that could be counted up in StoreCond and down in Evict, but I totally forgot that evict is periodic. Sorry, this turns out to be a bad idea.
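For reference, a self-contained sketch of the "counter per partition stored alongside the metadata" idea from the earlier comment, with made-up names: the counter is bumped on store and rebuilt during the periodic evict sweep, so it can drift high between sweeps, which is the caveat about periodic eviction raised above.

```go
package main

import "fmt"

type Stream struct {
	Hash       uint64
	LastSeenAt int64
}

type partitionState struct {
	streams []Stream
	active  uint64 // cached count of non-expired streams
}

func (p *partitionState) store(s Stream) {
	for i := range p.streams {
		if p.streams[i].Hash == s.Hash {
			// Known stream: refresh its timestamp, the count is unchanged.
			p.streams[i].LastSeenAt = s.LastSeenAt
			return
		}
	}
	p.streams = append(p.streams, s)
	p.active++
}

// evict runs periodically: it drops expired streams and rebuilds the counter,
// so between sweeps the counter may still include already-expired streams.
func (p *partitionState) evict(cutoff int64) {
	kept := p.streams[:0]
	for _, s := range p.streams {
		if s.LastSeenAt >= cutoff {
			kept = append(kept, s)
		}
	}
	p.streams = kept
	p.active = uint64(len(kept))
}

func main() {
	p := &partitionState{}
	p.store(Stream{Hash: 1, LastSeenAt: 100})
	p.store(Stream{Hash: 2, LastSeenAt: 50})
	fmt.Println(p.active) // 2, even though stream 2 may already be expired
	p.evict(75)
	fmt.Println(p.active) // 1 after the periodic sweep
}
```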

// Calculate rate using only data from within the rate window
rate := float64(totalSize) / s.cfg.RateWindow.Seconds()
s.metrics.tenantIngestedBytesTotal.WithLabelValues(req.Tenant).Add(float64(ingestedBytes))
I think it's OK to move this metric to StoreIf
My preference is to not overload the stream-metadata implementation with more dependencies than needed. Metrics are relevant to the hosting service. WDYT?
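A rough, self-contained sketch of the caller-side placement being discussed: the store has no metrics dependency and only returns a byte count, while the hosting service records the metric. Everything except the Prometheus client API is made up for illustration.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// fakeStore stands in for the stream-metadata store; it has no metrics
// dependency and just reports how many bytes were stored.
type fakeStore struct{}

func (fakeStore) Store(tenant string, sizes []uint64) (ingestedBytes uint64) {
	for _, s := range sizes {
		ingestedBytes += s
	}
	return ingestedBytes
}

type service struct {
	store                    fakeStore
	tenantIngestedBytesTotal *prometheus.CounterVec
}

func (s *service) handle(tenant string, sizes []uint64) {
	ingestedBytes := s.store.Store(tenant, sizes)
	// The metric lives with the service, not inside the store.
	s.tenantIngestedBytesTotal.WithLabelValues(tenant).Add(float64(ingestedBytes))
}

func main() {
	svc := &service{
		tenantIngestedBytesTotal: prometheus.NewCounterVec(
			prometheus.CounterOpts{Name: "tenant_ingested_bytes_total", Help: "Total ingested bytes per tenant."},
			[]string{"tenant"},
		),
	}
	svc.handle("tenant-a", []uint64{100, 250})
}
```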
What this PR does / why we need it:
This pull request adds the implementation to synchronously enforce the max stream limit per partition on each ingest-limits pod. This ensures that we can check the limits locally per partition without suffering consumption lag from the queue.
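For readers skimming the diff, a simplified, self-contained sketch of the idea (hypothetical names, not the PR's actual types): each instance keeps the active streams for the partitions it owns and can answer the limit check at request time, without waiting on queue consumption.

```go
package main

import "fmt"

type limiter struct {
	numPartitions      int
	maxStreamsPerPart  int
	activePerPartition map[int32]map[uint64]struct{} // partition -> known stream hashes
}

// exceedsLimit reports whether accepting streamHash would push its partition
// over the per-partition max stream limit.
func (l *limiter) exceedsLimit(streamHash uint64) bool {
	partitionID := int32(streamHash % uint64(l.numPartitions))
	active := l.activePerPartition[partitionID]
	if _, known := active[streamHash]; known {
		return false // already-active streams are not rejected
	}
	return len(active) >= l.maxStreamsPerPart
}

func main() {
	l := &limiter{
		numPartitions:     2,
		maxStreamsPerPart: 1,
		activePerPartition: map[int32]map[uint64]struct{}{
			0: {42: {}},
		},
	}
	fmt.Println(l.exceedsLimit(42)) // false: already active
	fmt.Println(l.exceedsLimit(44)) // true: partition 0 is at its limit
}
```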
Which issue(s) this PR fixes:
Fixes grafana/loki-private#1632
Special notes for your reviewer:
Checklist
- Reviewed the CONTRIBUTING.md guide (required)
- Note: feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
- Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
- If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR