
Conversation

someStrangerFromTheAbyss
Contributor

@someStrangerFromTheAbyss someStrangerFromTheAbyss commented Apr 8, 2025

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #15191
Special notes for your reviewer:
This mainly adds a livenessProbe for the read pod. It may help with some of the issues seen on the read pods when deploying Loki in Simple Scalable mode.
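
For context, here is a minimal sketch of the values needed to turn the probe on. It assumes the read.livenessProbe key and the endpoint used in the values file shared later in this thread; the chart's shipped defaults may differ.

read:
  # -- liveness probe settings for read pods. If empty, applies no livenessProbe
  livenessProbe:
    httpGet:
      # Any cheap read-path endpoint works; the labels endpoint is what the values below use.
      path: "/loki/api/v1/labels?since=1h"
      port: 3100
    initialDelaySeconds: 30
    periodSeconds: 30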

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@pull-request-size pull-request-size bot added size/M and removed size/S labels Apr 8, 2025
@github-actions github-actions bot added the type/docs label Apr 8, 2025
@benjaminlebigot benjaminlebigot force-pushed the fix-read-liveness-probe branch from e9a1758 to d431034 on April 16, 2025 14:37
@benjaminlebigot
Contributor

Just rebased on top of grafana/loki:main branch.

@benjaminlebigot
Contributor

lint: amended the commit to run helm-docs

@benjaminlebigot benjaminlebigot force-pushed the fix-read-liveness-probe branch from 139af1a to 9746718 on April 17, 2025 16:39
@benjaminlebigot
Contributor

Rebased on top of grafana/loki:main branch.

@benjaminlebigot benjaminlebigot force-pushed the fix-read-liveness-probe branch 2 times, most recently from ea3e463 to b9b00ae on April 18, 2025 15:11
@benjaminlebigot benjaminlebigot force-pushed the fix-read-liveness-probe branch from b9b00ae to f813039 on April 22, 2025 13:05
@someStrangerFromTheAbyss
Contributor Author

Missed the automatic weekly PR pickup because of conflicts. Will fix them by next Monday, before the next weekly pickup.

@pull-request-size pull-request-size bot added size/S and removed size/M labels May 5, 2025
@CLAassistant

CLAassistant commented May 6, 2025

CLA assistant check
All committers have signed the CLA.

@Jayclifford345 Jayclifford345 self-assigned this May 27, 2025
@Jayclifford345
Contributor

Hey @someStrangerFromTheAbyss, would you mind running:

make helm-docs

from the top of the loki directory; this will deal with the failing documentation check.

@someStrangerFromTheAbyss
Contributor Author

@Jayclifford345 Will do right away

@Jayclifford345
Contributor

Hi @someStrangerFromTheAbyss, I am just in the process of reviewing this PR; it's completely understandable that you want to use a liveness probe. Would you mind supplying the Loki values file you used to test this PR? I will take a quick spin myself and see if we can get this unstuck for you.

@pull-request-size pull-request-size bot added size/M and removed size/S labels May 27, 2025
@someStrangerFromTheAbyss
Contributor Author

someStrangerFromTheAbyss commented May 27, 2025

@Jayclifford345

nameOverride: loki
fullnameOverride: loki

# Source code for the chart is here: https://github.com/grafana/loki/tree/main/production/helm/loki

loki:
  schemaConfig:
    configs:
      - from: 2024-07-29
        store: tsdb 
        object_store: azure
        schema: v13
        index:
          prefix: index_
          period: 24h
  auth_enabled: false
  configStorageType: Secret
  image:
    tag: 3.4.3
    pullPolicy: IfNotPresent
  storage:
    type: azure
    bucketNames:
      chunks: loki
      ruler: loki
      admin: loki
    azure:
      accountKey: FAKE
      accountName: FAKE
      useManagedIdentity: false
      requestTimeout: 30s
  server:
    http_listen_port: 3100
    grpc_listen_port: 9095
    grpc_server_max_recv_msg_size: 60000000
    grpc_server_max_send_msg_size: 60000000
    http_server_idle_timeout: 600s
    log_level: "warn"
  limits_config:
    # max_query_length: how many days of logs you can query at a time
    # NEVER CHANGE THIS!
    max_query_length: "14d1h"
    # max_query_lookback: how far in the past you can query.
    # Note that this is the same as the retention_period plus 1 hour, that's on purpose!
    max_query_lookback: "14d1h"
    # max_query_range: how many days to include in a range.
    # Ex. sum by (level) (count_over_time({job="your_job"}[5m])) <-- the range is 5m here, 6d is the max we allow
    # NEVER CHANGE THIS!
    max_query_range: "6d"
    max_line_size_truncate: False
    ingestion_rate_strategy: local
    per_stream_rate_limit: 20Mb
    per_stream_rate_limit_burst: 30Mb
    ingestion_burst_size_mb: 30
    ingestion_rate_mb: 30
    max_chunks_per_query: 100
    max_query_series: 50
    max_query_parallelism: 128
    max_streams_matchers_per_query: 100
    max_entries_limit_per_query: 1000
    max_global_streams_per_user: 50000
    # Max client can assign is 10, but the max should be 17.
    # So it comes to a total of 17
    max_label_names_per_series: 17
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    max_cache_freshness_per_query: 10m
    split_queries_by_interval: 1h
    max_concurrent_tail_requests: 100
    tsdb_max_query_parallelism: 2048
    tsdb_max_bytes_per_shard: 2GB
    allow_structured_metadata: true
    discover_service_name: []
    shard_streams:
      enabled: True
      desired_rate: 3000000
    # retention_period: how many days of logs are kept.
    # the logs older than that are deleted
    # Also see max_query_lookback above
    retention_period: "2d"
    otlp_config:
      resource_attributes:
        ignore_defaults: true
        attributes_config:
          - action: index_label
            regex: rome.project.id
          - action: index_label
            regex: resources."service.name"
          - action: index_label
            regex: service.name
  storage_config:
    boltdb_shipper: null
  compactor:  
    working_directory: /tmp
    delete_request_store: azure
    compaction_interval: 10m
    retention_enabled: true 
    retention_delete_delay: 2h  
    retention_delete_worker_count: 150
    delete_request_cancel_period: 24h
  ingester:
    chunk_target_size: 1572864
    chunk_idle_period: 30m
    wal:
      replay_memory_ceiling: 2048MB
    flush_check_period: 15s  
  commonConfig:
    replication_factor: 1
  rulerConfig:
    wal:
      dir: /var/loki/ruler-wal
    alertmanager_url: 'http://alertmanager:9093'
    alertmanager_client:
      basic_auth_username: ''
      basic_auth_password: ''
    enable_api: true
    enable_alertmanager_v2: true

read:
  replicas: 3
  extraEnv:
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          resource: limits.memory
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    size: 16Gi
    storageClass: csi-cinder-sc-delete
    enableStatefulSetAutoDeletePVC: true
  podAnnotations:
    mon.FAKE.org/port: "3100"
    mon.FAKE.org/type: "loki-read"
  podLabels:
    mon.FAKE.org/scrape: "true"
  # -- liveness probe settings for read pods. If empty, applies no livenessProbe
  livenessProbe:
    httpGet:
      path: "/loki/api/v1/labels?since=1h"
      port: 3100
    initialDelaySeconds: 30
    periodSeconds: 30
  resources:
    requests:
      cpu: 625m
      memory: 2474Mi
    limits:
      cpu: 3
      memory: 2474Mi
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: rome.FAKE.org/projectId
                operator: In
                values:
                - "TEST"
            topologyKey: kubernetes.io/hostname
    podAntiAffinity: null
  tolerations:
    - key: dedicated
      operator: Equal
      value: logs-instance
      effect: NoSchedule

write:
  replicas: 1
  extraEnv:
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          resource: limits.memory
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    size: 24Gi
    storageClass: csi-cinder-sc-delete
    enableStatefulSetAutoDeletePVC: true
  podAnnotations: 
    mon.FAKE.org/port: "3100"
    mon.FAKE.org/type: "loki-write"
  podLabels: 
    mon.FAKE.org/scrape: "true"
    rome.FAKE.org/projectId: "83ebddba-0f99-4965-9148-29e6d077f95e"
    FAKE: logs
    FAKE: mops
  resources: 
    requests:
      cpu: 625m
      memory: 2474Mi
    limits:
      memory: 2474Mi
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: rome.FAKE.org/projectId
                operator: In
                values:
                - "83ebddba-0f99-4965-9148-29e6d077f95e"
            topologyKey: kubernetes.io/hostname
    podAntiAffinity: null
  tolerations:
    - key: dedicated
      operator: Equal
      value: logs-instance
      effect: NoSchedule
  autoscaling:
    # Following the recommendation to autoscale on CPU rather than on custom metrics.
    # Since the write path can also be queried, we cannot rely on incoming traffic alone to scale.
    # https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml#L1356
    enabled: true
    minReplicas: 1
    maxReplicas: 25
    targetCPUUtilizationPercentage: 60
    targetMemoryUtilizationPercentage: 80
    behavior:
      # -- see https://github.com/grafana/loki/blob/main/docs/sources/operations/storage/wal.md#how-to-scale-updown for scaledown details
      # https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml#L1447
      scaleUp:
        policies:
          - type: Pods
            value: 1
            # By default this is set to 900 seconds, meaning Kubernetes only evaluates a scale-up every 900 seconds.
            # We decrease it to 60 seconds to better accommodate spikes in logs. Since each stack is small, this is necessary.
            periodSeconds: 60
      scaleDown:
        policies:
          - type: Pods
            value: 1
            periodSeconds: 1800
        stabilizationWindowSeconds: 3600

backend:
  # Backend replicas SHOULD NEVER BE 1! A single replica makes upgrades, deployments and other operations unstable. Keep it at a minimum of 2.
  replicas: 2
  extraEnv:
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          resource: limits.memory
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    size: 16Gi
    storageClass: csi-cinder-sc-delete
    enableStatefulSetAutoDeletePVC: true
  podAnnotations:
    mon.FAKE.org/port: "3100"
    mon.FAKE.org/type: "loki-backend"
  podLabels: 
    mon.FAKE.org/scrape: "true"
    rome.FAKE.org/projectId: "83ebddba-0f99-4965-9148-29e6d077f95e"
    FAKE: logs
    FAKE: mops
  resources: 
    requests:
      cpu: 500m
      memory: 640Mi
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: rome.FAKE.org/projectId
                operator: In
                values:
                - "83ebddba-0f99-4965-9148-29e6d077f95e"
            topologyKey: kubernetes.io/hostname
    podAntiAffinity: null
  tolerations:
    - key: dedicated
      operator: Equal
      value: logs-instance
      effect: NoSchedule
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 25
    targetCPUUtilizationPercentage: 60
    targetMemoryUtilizationPercentage: 80
    behavior:
      # -- see https://github.com/grafana/loki/blob/main/docs/sources/operations/storage/wal.md#how-to-scale-updown for scaledown details
      scaleUp:
        policies:
          - type: Pods
            value: 1
            periodSeconds: 900
      scaleDown:
        policies:
          - type: Pods
            value: 1
            periodSeconds: 1800
        stabilizationWindowSeconds: 3600

querier:
  max_concurrent: 6
  extra_query_delay: 500ms    
  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 60000000
  
query_scheduler:
  max_outstanding_requests_per_tenant: 1000

distributor:
  receivers:
    otlp:
      grpc:
        max_recv_msg_size_mib: 60000000

frontend:
  max_outstanding_per_tenant: 1000
  scheduler_worker_concurrency: 15

query_range:
  parallelise_shardable_queries: true
  split_queries_by_interval: 1h
  max_concurrent: 6
  results_cache:
    cache_results: true
    cache_validity: 5m

index:
  prefix: index_
  period: 168h
  in-memory-sorted-index:
    retention_period: 24h

gateway:
  enabled: true
  nodeSelector:
    dedicated: logs-instance
  ingress:
    enabled: false
  tolerations:
    - key: dedicated
      operator: Equal
      value: logs-instance
      effect: NoSchedule

It's a big values file, but it's a real one we use. Of course, credentials are replaced with FAKE.
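
As a rough illustration of what the read.livenessProbe block above should render to on the read pods (an assumed sketch, not verbatim helm template output; the container name and port form in the actual chart may differ):

spec:
  template:
    spec:
      containers:
        - name: loki  # assumed container name, for illustration only
          # Rendered from read.livenessProbe; the kubelet restarts the container
          # if the labels endpoint stops responding.
          livenessProbe:
            httpGet:
              path: /loki/api/v1/labels?since=1h
              port: 3100
            initialDelaySeconds: 30
            periodSeconds: 30

Checking the rendered read StatefulSet for this block is a quick way to confirm the values were picked up.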

@Jayclifford345 Jayclifford345 merged commit 68d9395 into grafana:main May 28, 2025
75 checks passed