Skip to content

Extra telemetry on policy evaluation failure #661

@the-nando

Description

@the-nando

We recently run into two separate issue where the Nomad autoscaler failed to describe AWS autoscaling groups due to an expired AWS token or failed to evaluate a scaling policy because of an issue reaching the APM (Prometheus).

{"@level":"warn","@message":"failed to get target status","@module":"policy_manager.policy_handler","@timestamp":"2023-07-06T16:21:26.029652Z","error":"failed to describe AWS Autoscaling Group: operation error Auto Scaling: DescribeAutoScalingGroups, https response error StatusCode: 403, RequestID: c674bc86-1234-4fb1-5678-b264741176bc, api error ExpiredToken: The security token included in the request is expired","policy_id":"613aeb80-xs23-8f4e-1234-ef2ca2748d8a"}

It would be great to have a couple of extra Prometheus metrics exported by the autoscaler to be monitored to detect simple failures.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions