Extra telemetry on policy evaluation failure

We recently ran into two separate issues with the Nomad Autoscaler. In one, it failed to describe AWS Auto Scaling groups because of an expired AWS token. In the other, it failed to evaluate a scaling policy because it could not reach the APM (Prometheus).

```
{"@level":"warn","@message":"failed to get target status","@module":"policy_manager.policy_handler","@timestamp":"2023-07-06T16:21:26.029652Z","error":"failed to describe AWS Autoscaling Group: operation error Auto Scaling: DescribeAutoScalingGroups, https response error StatusCode: 403, RequestID: c674bc86-1234-4fb1-5678-b264741176bc, api error ExpiredToken: The security token included in the request is expired","policy_id":"613aeb80-xs23-8f4e-1234-ef2ca2748d8a"}
```

It would be great if the autoscaler exported a couple of extra Prometheus metrics that we could monitor and alert on to detect simple failures like these.
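
To illustrate the kind of signal being asked for, here is a minimal sketch that derives failure counts from the autoscaler's JSON logs. It is not how the autoscaler works today and not a proposed implementation: the helper name `count_failures` and the choice of grouping by `@module` and `@message` are assumptions made for this example, and the sample entry only reuses fields from the log line above.

```python
import json
from collections import Counter


def count_failures(log_lines):
    """Count warn/error log entries per (module, message) pair.

    Hypothetical helper: the grouping key is an assumption, chosen to
    resemble the labels a failure counter metric might carry.
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log entry
        if not isinstance(entry, dict):
            continue
        if entry.get("@level") not in ("warn", "error"):
            continue
        key = (entry.get("@module", "unknown"), entry.get("@message", "unknown"))
        counts[key] += 1
    return counts


# Sample entry built from the fields of the log line quoted in this issue.
SAMPLE = json.dumps({
    "@level": "warn",
    "@message": "failed to get target status",
    "@module": "policy_manager.policy_handler",
})

if __name__ == "__main__":
    for (module, message), n in count_failures([SAMPLE]).items():
        print(f"{module}: {message}: {n}")
```

A counter exported natively by the autoscaler, broken down along similar lines, would let us alert on these failures without having to scrape logs.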

Extra telemetry on policy evaluation failure #661

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Extra telemetry on policy evaluation failure #661

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions