
Conversation

someStrangerFromTheAbyss
Contributor

@someStrangerFromTheAbyss someStrangerFromTheAbyss commented Apr 8, 2025

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #15191
Special notes for your reviewer:
This mainly adds a livenessProbe for the read pod. It may help with some of the issues seen on the read pods when deploying Loki in Simple Scalable mode.
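
For context, here is a minimal sketch of the values needed to turn the probe on. It assumes the read.livenessProbe key and the endpoint used in the values file shared later in this thread; the chart's shipped defaults may differ.

read:
  # -- liveness probe settings for read pods. If empty, applies no livenessProbe
  livenessProbe:
    httpGet:
      # Any cheap read-path endpoint works; the labels endpoint is what the values below use.
      path: "/loki/api/v1/labels?since=1h"
      port: 3100
    initialDelaySeconds: 30
    periodSeconds: 30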

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@pull-request-size pull-request-size bot added size/M and removed size/S labels Apr 8, 2025
@github-actions github-actions bot added the type/docs label Apr 8, 2025
@benjaminlebigot benjaminlebigot force-pushed the fix-read-liveness-probe branch from e9a1758 to d431034 on April 16, 2025 14:37
@benjaminlebigot
Contributor

Just rebased on top of grafana/loki:main branch.

@benjaminlebigot
Contributor

lint: amended the commit to run helm-docs

@benjaminlebigot benjaminlebigot force-pushed the fix-read-liveness-probe branch from 139af1a to 9746718 on April 17, 2025 16:39
@benjaminlebigot
Contributor

Rebased on top of grafana/loki:main branch.

@benjaminlebigot benjaminlebigot force-pushed the fix-read-liveness-probe branch 2 times, most recently from ea3e463 to b9b00ae on April 18, 2025 15:11
@benjaminlebigot benjaminlebigot force-pushed the fix-read-liveness-probe branch from b9b00ae to f813039 on April 22, 2025 13:05
@someStrangerFromTheAbyss
Contributor Author

Missed the automatic weekly PR pickup because of conflicts. Will fix them by next Monday, before the next weekly pickup.

@pull-request-size pull-request-size bot added size/S and removed size/M labels May 5, 2025
@CLAassistant

CLAassistant commented May 6, 2025

CLA assistant check
All committers have signed the CLA.

@Jayclifford345 Jayclifford345 self-assigned this May 27, 2025
@Jayclifford345
Contributor

Hey @someStrangerFromTheAbyss, would you mind running:

make helm-docs

from the top of the loki directory; this will deal with the failing documentation check.

@someStrangerFromTheAbyss
Contributor Author

@Jayclifford345 Will do right away

@Jayclifford345
Contributor

Hi @someStrangerFromTheAbyss, I am just in the process of reviewing this PR; it's completely understandable that you want to use a liveness probe. Would you mind supplying the Loki values file you used to test this PR? I will take a quick spin myself and see if we can get this unstuck for you.

@pull-request-size pull-request-size bot added size/M and removed size/S labels May 27, 2025
@someStrangerFromTheAbyss
Contributor Author

someStrangerFromTheAbyss commented May 27, 2025

@Jayclifford345

nameOverride: loki
fullnameOverride: loki

# Source code for the chart is here: https://github.com/grafana/loki/tree/main/production/helm/loki

loki:
  schemaConfig:
    configs:
      - from: 2024-07-29
        store: tsdb 
        object_store: azure
        schema: v13
        index:
          prefix: index_
          period: 24h
  auth_enabled: false
  configStorageType: Secret
  image:
    tag: 3.4.3
    pullPolicy: IfNotPresent
  storage:
    type: azure
    bucketNames:
      chunks: loki
      ruler: loki
      admin: loki
    azure:
      accountKey: FAKE
      accountName: FAKE
      useManagedIdentity: false
      requestTimeout: 30s
  server:
    http_listen_port: 3100
    grpc_listen_port: 9095
    grpc_server_max_recv_msg_size: 60000000
    grpc_server_max_send_msg_size: 60000000
    http_server_idle_timeout: 600s
    log_level: "warn"
  limits_config:
    # max_query_length: how many days of logs you can query at a time
    # NEVER CHANGE THIS!
    max_query_length: "14d1h"
    # max_query_lookback: how far in the past you can query.
    # Note that this is the same as the retention_period plus 1 hour, that's on purpose!
    max_query_lookback: "14d1h"
    # max_query_range: how many days to include in a range.
    # Ex. sum by (level) (count_over_time({job="your_job"}[5m])) <-- the range is 5m here, 6d is the max we allow
    # NEVER CHANGE THIS!
    max_query_range: "6d"
    max_line_size_truncate: False
    ingestion_rate_strategy: local
    per_stream_rate_limit: 20Mb
    per_stream_rate_limit_burst: 30Mb
    ingestion_burst_size_mb: 30
    ingestion_rate_mb: 30
    max_chunks_per_query: 100
    max_query_series: 50
    max_query_parallelism: 128
    max_streams_matchers_per_query: 100
    max_entries_limit_per_query: 1000
    max_global_streams_per_user: 50000
    # Max client can assign is 10, but the max should be 17.
    # So it comes to a total of 17
    max_label_names_per_series: 17
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    max_cache_freshness_per_query: 10m
    split_queries_by_interval: 1h
    max_concurrent_tail_requests: 100
    tsdb_max_query_parallelism: 2048
    tsdb_max_bytes_per_shard: 2GB
    allow_structured_metadata: true
    discover_service_name: []
    shard_streams:
      enabled: True
      desired_rate: 3000000
    # retention_period: how many days of logs are kept.
    # the logs older than that are deleted
    # Also see max_query_lookback above
    retention_period: "2d"
    otlp_config:
      resource_attributes:
        ignore_defaults: true
        attributes_config:
          - action: index_label
            regex: rome.project.id
          - action: index_label
            regex: resources."service.name"
          - action: index_label
            regex: service.name
  storage_config:
    boltdb_shipper: null
  compactor:  
    working_directory: /tmp
    delete_request_store: azure
    compaction_interval: 10m
    retention_enabled: true 
    retention_delete_delay: 2h  
    retention_delete_worker_count: 150
    delete_request_cancel_period: 24h
  ingester:
    chunk_target_size: 1572864
    chunk_idle_period: 30m
    wal:
      replay_memory_ceiling: 2048MB
    flush_check_period: 15s  
  commonConfig:
    replication_factor: 1
  rulerConfig:
    wal:
      dir: /var/loki/ruler-wal
    alertmanager_url: 'http://alertmanager:9093'
    alertmanager_client:
      basic_auth_username: ''
      basic_auth_password: ''
    enable_api: true
    enable_alertmanager_v2: true

read:
  replicas: 3
  extraEnv:
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          resource: limits.memory
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    size: 16Gi
    storageClass: csi-cinder-sc-delete
    enableStatefulSetAutoDeletePVC: true
  podAnnotations:
    mon.FAKE.org/port: "3100"
    mon.FAKE.org/type: "loki-read"
  podLabels:
    mon.FAKE.org/scrape: "true"
  # -- liveness probe settings for read pods. If empty, applies no livenessProbe
  livenessProbe:
    httpGet:
      path: "/loki/api/v1/labels?since=1h"
      port: 3100
    initialDelaySeconds: 30
    periodSeconds: 30
  resources:
    requests:
      cpu: 625m
      memory: 2474Mi
    limits:
      cpu: 3
      memory: 2474Mi
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: rome.FAKE.org/projectId
                operator: In
                values:
                - "TEST"
            topologyKey: kubernetes.io/hostname
    podAntiAffinity: null
  tolerations:
    - key: dedicated
      operator: Equal
      value: logs-instance
      effect: NoSchedule

write:
  replicas: 1
  extraEnv:
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          resource: limits.memory
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    size: 24Gi
    storageClass: csi-cinder-sc-delete
    enableStatefulSetAutoDeletePVC: true
  podAnnotations: 
    mon.FAKE.org/port: "3100"
    mon.FAKE.org/type: "loki-write"
  podLabels: 
    mon.FAKE.org/scrape: "true"
    rome.FAKE.org/projectId: "83ebddba-0f99-4965-9148-29e6d077f95e"
    FAKE: logs
    FAKE: mops
  resources: 
    requests:
      cpu: 625m
      memory: 2474Mi
    limits:
      memory: 2474Mi
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: rome.FAKE.org/projectId
                operator: In
                values:
                - "83ebddba-0f99-4965-9148-29e6d077f95e"
            topologyKey: kubernetes.io/hostname
    podAntiAffinity: null
  tolerations:
    - key: dedicated
      operator: Equal
      value: logs-instance
      effect: NoSchedule
  autoscaling:
    # Following the recommendation to autoscale on CPU rather than on custom metrics.
    # Since the write path can also be queried, we cannot rely on incoming traffic alone to scale.
    # https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml#L1356
    enabled: true
    minReplicas: 1
    maxReplicas: 25
    targetCPUUtilizationPercentage: 60
    targetMemoryUtilizationPercentage: 80
    behavior:
      # -- see https://github.com/grafana/loki/blob/main/docs/sources/operations/storage/wal.md#how-to-scale-updown for scaledown details
      # https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml#L1447
      scaleUp:
        policies:
          - type: Pods
            value: 1
            # By default this is set to 900 seconds, meaning Kubernetes only evaluates a scale-up every 900 seconds.
            # We decrease it to 60 seconds to better accommodate spikes in logs. Since each stack is small, this is necessary.
            periodSeconds: 60
      scaleDown:
        policies:
          - type: Pods
            value: 1
            periodSeconds: 1800
        stabilizationWindowSeconds: 3600

backend:
  # Backend replicas SHOULD NEVER BE 1! A single replica makes upgrades, deployments and other operations unstable. Keep it at a minimum of 2.
  replicas: 2
  extraEnv:
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          resource: limits.memory
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    size: 16Gi
    storageClass: csi-cinder-sc-delete
    enableStatefulSetAutoDeletePVC: true
  podAnnotations:
    mon.FAKE.org/port: "3100"
    mon.FAKE.org/type: "loki-backend"
  podLabels: 
    mon.FAKE.org/scrape: "true"
    rome.FAKE.org/projectId: "83ebddba-0f99-4965-9148-29e6d077f95e"
    FAKE: logs
    FAKE: mops
  resources: 
    requests:
      cpu: 500m
      memory: 640Mi
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: rome.FAKE.org/projectId
                operator: In
                values:
                - "83ebddba-0f99-4965-9148-29e6d077f95e"
            topologyKey: kubernetes.io/hostname
    podAntiAffinity: null
  tolerations:
    - key: dedicated
      operator: Equal
      value: logs-instance
      effect: NoSchedule
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 25
    targetCPUUtilizationPercentage: 60
    targetMemoryUtilizationPercentage: 80
    behavior:
      # -- see https://github.com/grafana/loki/blob/main/docs/sources/operations/storage/wal.md#how-to-scale-updown for scaledown details
      scaleUp:
        policies:
          - type: Pods
            value: 1
            periodSeconds: 900
      scaleDown:
        policies:
          - type: Pods
            value: 1
            periodSeconds: 1800
        stabilizationWindowSeconds: 3600

querier:
  max_concurrent: 6
  extra_query_delay: 500ms    
  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 60000000
  
query_scheduler:
  max_outstanding_requests_per_tenant: 1000

distributor:
  receivers:
    otlp:
      grpc:
        max_recv_msg_size_mib: 60000000

frontend:
  max_outstanding_per_tenant: 1000
  scheduler_worker_concurrency: 15

query_range:
  parallelise_shardable_queries: true
  split_queries_by_interval: 1h
  max_concurrent: 6
  results_cache:
    cache_results: true
    cache_validity: 5m

index:
  prefix: index_
  period: 168h
  in-memory-sorted-index:
    retention_period: 24h

gateway:
  enabled: true
  nodeSelector:
    dedicated: logs-instance
  ingress:
    enabled: false
  tolerations:
    - key: dedicated
      operator: Equal
      value: logs-instance
      effect: NoSchedule

It's a big values file, but it's a real one we use. Of course, credentials are replaced with FAKE.
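
As a rough illustration of what the read.livenessProbe block above should render to on the read pods (an assumed sketch, not verbatim helm template output; the container name and port form in the actual chart may differ):

spec:
  template:
    spec:
      containers:
        - name: loki  # assumed container name, for illustration only
          # Rendered from read.livenessProbe; the kubelet restarts the container
          # if the labels endpoint stops responding.
          livenessProbe:
            httpGet:
              path: /loki/api/v1/labels?since=1h
              port: 3100
            initialDelaySeconds: 30
            periodSeconds: 30

Checking the rendered read StatefulSet for this block is a quick way to confirm the values were picked up.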

@Jayclifford345 Jayclifford345 merged commit 68d9395 into grafana:main May 28, 2025
75 checks passed