bug: lua_shared_dict prometheus-metrics overflow #11948

@DrJSAnD

Description

Current Behavior

We run APISIX in a Kubernetes cluster and have a problem with Prometheus metrics.
We noticed that the lua_shared_dict prometheus-metrics overflows. Once it is full, the apisix_nginx_metric_errors_total counter starts to grow and all metrics stop displaying correctly.

We tried increasing the prometheus-metrics size to 40m in the ConfigMap (config.yaml), but after 2 months the lua_shared_dict was full on all pods and the errors started to occur again.
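
To see which metric accounts for most of the entries in the shared dict, it can help to count the series per metric name in the scraped output. The following is a minimal sketch, not part of APISIX: `count_series` is a hypothetical helper, and the sample text with its label values is made up for illustration. It assumes the input is in the Prometheus text exposition format, as returned by the APISIX metrics endpoint.

```python
from collections import Counter


def count_series(exposition_text: str) -> Counter:
    """Count sample lines per metric name in Prometheus text exposition format."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or the first space (no labels).
        brace = line.find("{")
        space = line.find(" ")
        if brace != -1 and (space == -1 or brace < space):
            end = brace
        else:
            end = space
        name = line if end == -1 else line[:end]
        counts[name] += 1
    return counts


# Made-up sample; real output would come from the metrics endpoint.
SAMPLE = """\
# HELP apisix_http_status HTTP status codes per service in APISIX
# TYPE apisix_http_status counter
apisix_http_status{code="200",route="1",consumer="alice"} 10
apisix_http_status{code="404",route="1",consumer="alice"} 2
apisix_http_status{code="200",route="2",consumer="bob"} 7
apisix_nginx_metric_errors_total 0
"""

if __name__ == "__main__":
    # Print metric names ordered by number of series, largest first.
    for name, n in count_series(SAMPLE).most_common():
        print(name, n)
```

Running this against two scrapes taken some days apart would show which metric names keep gaining series, which is one plausible reason for a dict that fills up over weeks.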

nginx_config:    # config for rendering the template to generate nginx.conf
  error_log: "/dev/stderr"
  error_log_level: "warn"    # warn,error
  worker_processes: "auto"
  enable_cpu_affinity: true
  worker_rlimit_nofile: 20480  # the number of files a worker process can open, should be larger than worker_connections
  event:
    worker_connections: 10620
  http:
    enable_access_log: true
    access_log: "/dev/stdout"
    access_log_format: '$remote_addr - $remote_user [$time_local] $http_host \"$request\" $status $body_bytes_sent $request_time \"$http_referer\" \"$http_user_agent\" $upstream_addr $upstream_status $upstream_response_time \"$upstream_scheme://$upstream_host$upstream_uri\"'
    access_log_format_escape: default
    keepalive_timeout: "60s"
    client_header_timeout: 60s     # timeout for reading client request header, then 408 (Request Time-out) error is returned to the client
    client_body_timeout: 60s       # timeout for reading client request body, then 408 (Request Time-out) error is returned to the client
    send_timeout: 10s              # timeout for transmitting a response to the client, then the connection is closed
    underscores_in_headers: "on"   # default enables the use of underscores in client request header fields
    real_ip_header: "X-Real-IP"    # http://nginx.org/en/docs/http/ngx_http_realip_module.html#real_ip_header
    real_ip_from:                  # http://nginx.org/en/docs/http/ngx_http_realip_module.html#set_real_ip_from
      - 127.0.0.1
      - 'unix:'
    lua_shared_dict:
      prometheus-metrics: 40m

Current APISIX state

  • Deployment via Helm chart: https://github.com/apache/apisix-helm-chart
  • Helm Chart version: 2.10.0
  • K8s pods: 3
  • Pod CPU limits: 15 (usage 4%)
  • Pod Memory limits: 60Gb (usage 35 GiB)
  • Total requests per second: 2500 - 3000
  • Active connections: 2000+
  • Upstreams: 100+
  • Routes: 120+
  • Consumers: 60+
  • Plugins: basic-auth and kafka-logger on all routes

Expected Behavior

No response

Error Logs

No response

Steps to Reproduce

  1. Run APISIX with the default lua_shared_dict size for prometheus-metrics
  2. After 2-3 weeks prometheus-metrics overflows, apisix_nginx_metric_errors_total starts to grow, and all metrics stop displaying correctly
  3. Change lua_shared_dict: prometheus-metrics to 40m
  4. After 2-3 months the lua_shared_dict overflows again and we get the same problem with displaying metrics
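
The timings in the steps above allow a rough estimate of how fast the dict fills. This is a back-of-the-envelope sketch only: it assumes the dict starts empty and grows linearly, reads "2-3 months" as 60 to 90 days, and the helper names are hypothetical.

```python
def parse_size_mib(size: str) -> float:
    """Convert an nginx-style size such as '40m' or '512k' to MiB."""
    units = {"k": 1 / 1024, "m": 1.0}
    return float(size[:-1]) * units[size[-1].lower()]


def growth_mib_per_day(size: str, days_to_full: float) -> float:
    """Average fill rate, assuming linear growth from an empty dict."""
    return parse_size_mib(size) / days_to_full


def days_until_full(size: str, rate_mib_per_day: float) -> float:
    """Days a dict of the given size would last at a constant fill rate."""
    return parse_size_mib(size) / rate_mib_per_day


if __name__ == "__main__":
    # 40m filling in 60 to 90 days, as described in step 4.
    fast = growth_mib_per_day("40m", 60)
    slow = growth_mib_per_day("40m", 90)
    print(f"estimated fill rate: {slow:.2f} to {fast:.2f} MiB/day")
    # How long a larger, purely illustrative size would last at those rates.
    print(f"128m would last {days_until_full('128m', fast):.0f} "
          f"to {days_until_full('128m', slow):.0f} days")
```

If growth really is roughly linear, a larger dict only delays the overflow rather than preventing it.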

Environment

  • APISIX version (run apisix version): 3.10.0
  • Operating system (run uname -a): Linux apisix-69cfdc5fbf-m7k27 5.14.0-362.13.1.el9_3.x86_64 SMP PREEMPT_DYNAMIC Fri Nov 24 01:57:57 EST 2023 x86_64 GNU/Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V): openresty/1.25.3.2
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): 3.5.0
  • APISIX Dashboard version, if relevant: 3.0.0
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
