Description
🧭 Epic
Title: Prometheus Metrics Instrumentation
Goal: Every FastAPI service publishes a rich set of Prometheus‑compatible metrics (request count, latency, size, in‑progress, custom labels) at /metrics for unified observability across the platform.
Why now: Enables SLO dashboards, proactive alerting, and capacity planning before the traffic ramp in Q4.
🧭 Type of Feature
- Observability / Monitoring
🙋♂️ User Story 1 — Expose metrics endpoint
As a: Site‑Reliability Engineer
I want: each container to expose Prometheus metrics at /metrics
So that: the platform Prometheus server can scrape, store and alert on service behaviour.
✅ Acceptance Criteria
Scenario: Prometheus scrapes metrics
Given a running service container
When Prometheus sends GET /metrics
Then the response status is 200 OK
And the payload contains "http_requests_total" and "http_request_duration_seconds" metrics
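The scenario above could be checked with a small helper along these lines. This is a minimal sketch that assumes the endpoint returns the standard Prometheus text format; the function names `metric_names` and `missing_metrics` are hypothetical and not part of any existing module.

```python
import re

# Metric families named in the acceptance criteria.
REQUIRED_METRICS = ("http_requests_total", "http_request_duration_seconds")

# A sample line looks like: name{label="value"} 1.0  (the label set is optional)
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(payload: str) -> set[str]:
    """Return the sample names found in a Prometheus text-format payload."""
    names = set()
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE comments
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(payload: str, required=REQUIRED_METRICS) -> list[str]:
    """List required metric families that have no sample in the payload.

    Histogram families are exposed as <name>_bucket, <name>_sum and
    <name>_count, so a family counts as present if any sample name equals
    it or starts with it followed by an underscore.
    """
    names = metric_names(payload)
    return [
        r for r in required
        if not any(n == r or n.startswith(r + "_") for n in names)
    ]
```

A test would fetch `/metrics`, assert a 200 status, and then assert that `missing_metrics(response_text)` is empty.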
🙋♂️ User Story 2 — Standard HTTP request metrics
As a: Backend Developer
I want: automatic instrumentation of request count, latency, and payload sizes broken down by handler, method, and status code
So that: I can track performance regressions without writing boilerplate.
✅ Acceptance Criteria
- Counter http_requests_total{handler,method,status} increments on every request.
- Histogram http_request_duration_seconds{handler,method} uses buckets 0.05, 0.1, 0.3, 1, 3, 5.
- Summary http_request_size_bytes & http_response_size_bytes aggregate payload sizes per handler.
- Metrics appear within one scrape interval (≤ 15 s) after the first request.
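To illustrate what the bucket list means in practice, here is a minimal stdlib-only sketch of how a Prometheus-style histogram counts observations into cumulative `le` buckets. The helper name `cumulative_bucket_counts` is hypothetical; the real counting is done by the metrics library, not by application code.

```python
from bisect import bisect_left

# Upper bounds in seconds, taken from the acceptance criteria above.
BUCKETS = (0.05, 0.1, 0.3, 1, 3, 5)


def cumulative_bucket_counts(durations, buckets=BUCKETS):
    """Count observations per `le` bucket the way a Prometheus histogram does.

    Buckets are cumulative: an observation is counted in every bucket whose
    upper bound is >= the observed value, and always in the implicit +Inf
    bucket.
    """
    per_bucket = [0] * (len(buckets) + 1)  # last slot is the +Inf bucket
    for d in durations:
        # Index of the first bucket whose bound is >= d (len(buckets) if none).
        per_bucket[bisect_left(buckets, d)] += 1

    counts, running = {}, 0
    for bound, n in zip([*buckets, float("inf")], per_bucket):
        running += n
        counts[bound] = running
    return counts
```

With these bounds, any request slower than 5 s lands only in the +Inf bucket, so latencies beyond that point cannot be distinguished from one another.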
🙋♂️ User Story 3 — Configurable & performant
As a: Platform Engineer
I want: to toggle instrumentation via an env var and exclude noisy paths
So that: the overhead stays below 3 % CPU and metric cardinality remains manageable.
✅ Acceptance Criteria
- Setting ENABLE_METRICS=false disables both instrumentation and the /metrics route.
- Regex list METRICS_EXCLUDED_HANDLERS prevents instrumentation of matching paths (e.g. .*admin.*).
- P99 latency of a no‑op endpoint increases by < 1 ms with instrumentation enabled.
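The toggle and exclusion rules above might be evaluated roughly as follows. This is a sketch under assumptions: the issue does not specify the format of METRICS_EXCLUDED_HANDLERS, so a comma-separated list of regexes is assumed here, and the helper names are hypothetical.

```python
import os
import re


def metrics_enabled(env=None) -> bool:
    """Return False only when ENABLE_METRICS is set to a false-like value."""
    env = os.environ if env is None else env
    value = env.get("ENABLE_METRICS", "true").strip().lower()
    return value not in {"false", "0", "no", "off"}


def excluded_patterns(env=None):
    """Compile METRICS_EXCLUDED_HANDLERS (assumed comma-separated regexes)."""
    env = os.environ if env is None else env
    raw = env.get("METRICS_EXCLUDED_HANDLERS", "")
    return [re.compile(p.strip()) for p in raw.split(",") if p.strip()]


def should_instrument(path: str, env=None) -> bool:
    """Instrument a path only if metrics are on and no regex matches it."""
    if not metrics_enabled(env):
        return False
    return not any(p.search(path) for p in excluded_patterns(env))
```

One limitation of the assumed format: a regex that itself contains a comma (such as a `{1,3}` quantifier) would be split incorrectly, so a different delimiter or a JSON list may be a better choice.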
🗺️ High‑Level Implementation Notes
Area / Component | Change (what/where) |
---|---|
Application code | Create singleton Instrumentator() in service/metrics.py ; configure with should_group_status_codes=False , should_ignore_untemplated=True , should_respect_env_var=True , excluded_handlers=[…] . |
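App lifespan | In service/main.py , when metrics are enabled, call instrumentator.instrument(app) at app creation (Starlette does not allow adding middleware once the app has started), then call instrumentator.expose(app, include_in_schema=False, should_gzip=True) from @app.on_event("startup") . |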
Environment | New env vars: ENABLE_METRICS="true" (default), METRICS_EXCLUDED_HANDLERS , METRICS_NAMESPACE , METRICS_SUBSYSTEM . |
Helm chart | values.yaml : metrics.enabled , metrics.port , metrics.serviceMonitor.enabled , metrics.customLabels . Template Deployment/ServiceMonitor resources. |
Dockerfile | Add prometheus-fastapi-instrumentator* to pip install layer. No extra port—metrics served on same container port (e.g. 8000). |
CI / Tests | Add job make metrics-test : start container → probe /metrics → assert required metric names exist. |
Docs | docs/docs/manage/observability.md : how to enable metrics locally, dashboards link, common pitfalls (high cardinality, gzip vs CPU). |
🛠 Required Code Changes (proposed)
File(s) / Module (from codebase root) | Change Type | Detail |
---|---|---|
service/metrics.py | NEW | Instrumentator factory with helper def setup(app): … |
service/main.py | FastAPI entry‑point | Call setup_metrics(app) in startup; add metrics_router if separated |
service/settings.py | Pydantic config | Add ENABLE_METRICS , METRICS_* fields with defaults |
charts/values.yaml | Helm values | New key metrics: block |
charts/templates/deployment.tpl | K8s Deployment template | Conditional container env + port + annotations for ServiceMonitor |
tests/test_metrics.py | Pytest | Integration test to assert /metrics endpoint presence and sample label set |
docs/docs/manage/observability.md | Docs | Usage guide and troubleshooting |
📐 Design Sketch
sequenceDiagram
participant Prometheus
participant Service
Prometheus->>Service: GET /metrics (scrape)
Service-->>Prometheus: 200 OK + text/plain (metrics)
🔄 Alternatives Considered
Option | Pros | Cons |
---|---|---|
Built‑in prometheus_client middleware | Full control, battle‑tested | More boilerplate, manual handler mapping |
OpenTelemetry + OTEL Collector exporter | Vendor‑neutral, traces + metrics | Extra infra (collector), multi‑hop latency, slightly higher CPU |
Chosen: prometheus-fastapi-instrumentator | Minimal code, rich defaults, modular | Slight overhead, fewer power‑user knobs than raw client |
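To give a sense of the boilerplate the hand-rolled option involves, here is a minimal sketch of a plain ASGI middleware that tracks request count, latency, and in-progress requests. It is a stdlib-only stand-in, not prometheus_client code: the class name `MetricsMiddleware` is hypothetical, and plain in-memory counters take the place of real metric objects.

```python
import time
from collections import Counter


class MetricsMiddleware:
    """Hand-rolled ASGI middleware recording request count and latency."""

    def __init__(self, app):
        self.app = app
        self.requests_total = Counter()    # keyed by (handler, method, status)
        self.duration_seconds = Counter()  # summed latency per (handler, method)
        self.in_progress = 0

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # pass lifespan/websocket traffic through
            await self.app(scope, receive, send)
            return

        status = {"code": 500}  # assume failure until a response start is seen

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        # The raw path is used as the handler label here. A real setup would
        # map it to the route template to keep label cardinality bounded.
        handler, method = scope["path"], scope["method"]
        self.in_progress += 1
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            self.in_progress -= 1
            elapsed = time.perf_counter() - start
            self.duration_seconds[(handler, method)] += elapsed
            self.requests_total[(handler, method, str(status["code"]))] += 1
```

Even this stripped-down version leaves out histogram buckets, payload sizes, path templating, and the /metrics route itself, which is the work the chosen library is meant to take over.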
📓 Additional Context / Checklist
- /metrics must not require auth inside the cluster; use NetworkPolicy/Ingress to limit external access.
- Enable gzip compression by default; measure CPU impact.
- Use CUSTOM_LABELS to add service & environment tags for multi‑cluster federation.
- Alert rules: HTTP 5xx > 1 % for 5 min, high latency > 1 s P99.
- Dashboard widgets: request rate, error rate, latency histogram, in‑progress gauge.
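The two alert thresholds in the checklist can be reasoned about with a small sketch like the one below. It approximates the linear interpolation that Prometheus's histogram_quantile applies to cumulative buckets. The function names and the alert names `HighErrorRate` and `HighLatencyP99` are hypothetical, and the "for 5 min" hold period is not modelled; in production these would be Prometheus alerting rules rather than Python.

```python
import math


def error_ratio(requests_by_status):
    """Share of requests whose status code starts with 5."""
    total = sum(requests_by_status.values())
    errors = sum(
        n for status, n in requests_by_status.items()
        if str(status).startswith("5")
    )
    return errors / total if total else 0.0


def estimate_quantile(q, cumulative_counts):
    """Estimate a quantile from {upper_bound: cumulative_count} buckets."""
    bounds = sorted(cumulative_counts)
    total = cumulative_counts[bounds[-1]]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound in bounds:
        count = cumulative_counts[bound]
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            if count == prev_count:
                return bound  # only reachable when rank is 0
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def firing_alerts(requests_by_status, cumulative_counts):
    """Evaluate the two thresholds from the checklist on one snapshot."""
    alerts = []
    if error_ratio(requests_by_status) > 0.01:  # HTTP 5xx > 1 %
        alerts.append("HighErrorRate")
    if estimate_quantile(0.99, cumulative_counts) > 1.0:  # P99 > 1 s
        alerts.append("HighLatencyP99")
    return alerts
```

Because the estimate is interpolated within a bucket, the 1 s threshold is only as precise as the bucket boundaries around it; with the buckets proposed in this issue, a P99 between 1 s and 3 s is resolved only coarsely.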