Open
Description
We recently ran into two separate issues with the Nomad autoscaler: it failed to describe AWS autoscaling groups because of an expired AWS token, and it failed to evaluate a scaling policy because it could not reach the APM (Prometheus). The first failure produced this log entry:
{"@level":"warn","@message":"failed to get target status","@module":"policy_manager.policy_handler","@timestamp":"2023-07-06T16:21:26.029652Z","error":"failed to describe AWS Autoscaling Group: operation error Auto Scaling: DescribeAutoScalingGroups, https response error StatusCode: 403, RequestID: c674bc86-1234-4fb1-5678-b264741176bc, api error ExpiredToken: The security token included in the request is expired","policy_id":"613aeb80-xs23-8f4e-1234-ef2ca2748d8a"}
It would be great if the autoscaler exported a couple of extra Prometheus metrics that we could monitor and alert on to detect simple failures like these.
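To illustrate the kind of signal we are after, here is a rough sketch that counts warn/error entries in the autoscaler's JSON logs and renders them in the Prometheus text exposition format. This is only an assumption about what such a metric could look like: the metric name `autoscaler_policy_handler_failures_total`, the label names, and the helper functions are all hypothetical and not something the autoscaler exports today. The sample entry is abbreviated from the log line above.

```python
import json
from collections import Counter

# Hypothetical metric name, used for illustration only.
METRIC = "autoscaler_policy_handler_failures_total"

# Abbreviated copy of the log entry quoted in this issue, plus one non-JSON line.
SAMPLE = [
    '{"@level":"warn","@message":"failed to get target status",'
    '"@module":"policy_manager.policy_handler"}',
    '{"@level":"warn","@message":"failed to get target status",'
    '"@module":"policy_manager.policy_handler"}',
    "not a JSON line",
]


def count_failures(log_lines):
    """Count warn/error log entries, keyed by (module, message)."""
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(entry, dict):
            continue  # ignore JSON values that are not objects
        if entry.get("@level") in ("warn", "error"):
            key = (entry.get("@module", ""), entry.get("@message", ""))
            counts[key] += 1
    return counts


def escape_label(value):
    """Escape backslash, double quote and newline in a label value."""
    return (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )


def render(counts):
    """Render the counts as Prometheus text exposition lines."""
    lines = [f"# TYPE {METRIC} counter"]
    for (module, message), n in sorted(counts.items()):
        labels = (
            f'module="{escape_label(module)}",'
            f'message="{escape_label(message)}"'
        )
        lines.append(f"{METRIC}{{{labels}}} {n}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render(count_failures(SAMPLE)), end="")
```

A counter like this, exported natively by the autoscaler, would let us alert on a non-zero rate of target-status or policy-evaluation failures instead of having to scrape logs.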