
[Feature Request]: Prometheus Metrics Instrumentation using prometheus-fastapi-instrumentator #218


Description

@crivetimihai

🧭 Epic

Title: Prometheus Metrics Instrumentation
Goal: Every FastAPI service publishes a rich set of Prometheus‑compatible metrics (request count, latency, size, in‑progress, custom labels) at /metrics for unified observability across the platform.
Why now: Enables SLO dashboards, proactive alerting, and capacity planning before the traffic ramp in Q4.


🧭 Type of Feature

  • Observability / Monitoring

🙋‍♂️ User Story 1 — Expose metrics endpoint

As a: Site‑Reliability Engineer
I want: each container to expose Prometheus metrics at /metrics
So that: the platform Prometheus server can scrape, store and alert on service behaviour.

✅ Acceptance Criteria

Scenario: Prometheus scrapes metrics
Given a running service container
When Prometheus sends GET /metrics
Then the response status is 200 OK
And the payload contains "http_requests_total" and "http_request_duration_seconds" metrics
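
For illustration, a minimal sketch of what such a scrape target could look like with prometheus-fastapi-instrumentator (the route and file names below are placeholders, not part of the proposal):

```python
# minimal sketch: expose /metrics on a FastAPI app
# assumes `fastapi` and `prometheus-fastapi-instrumentator` are installed
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

@app.get("/healthz")
def healthz():
    return {"status": "ok"}

# Instrument all routes and register the /metrics endpoint.
Instrumentator().instrument(app).expose(app)
```

Once the service has handled at least one request, a GET /metrics scrape should return 200 with a text/plain exposition containing http_requests_total and http_request_duration_seconds samples.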

🙋‍♂️ User Story 2 — Standard HTTP request metrics

As a: Backend Developer
I want: automatic instrumentation of request count, latency, and payload sizes broken down by handler, method, and status code
So that: I can track performance regressions without writing boilerplate.

✅ Acceptance Criteria

  • Counter http_requests_total{handler,method,status} increments on every request.
  • Histogram http_request_duration_seconds{handler,method} uses buckets 0.05, 0.1, 0.3, 1, 3, 5 (seconds).
  • Summaries http_request_size_bytes and http_response_size_bytes aggregate payload sizes per handler (see the sketch after this list).
  • Metrics appear within one scrape interval (≤ 15 s) after the first request.
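
The counters, histogram buckets, and summaries listed above map onto the library's `metrics` module roughly as in this sketch; it is intended configuration, not final code:

```python
# sketch: per-metric configuration matching the acceptance criteria above
from prometheus_fastapi_instrumentator import Instrumentator, metrics

instrumentator = Instrumentator()

# Counter http_requests_total{handler,method,status}
instrumentator.add(metrics.requests())

# Histogram http_request_duration_seconds{handler,method} with the proposed buckets
instrumentator.add(
    metrics.latency(
        should_include_status=False,
        buckets=(0.05, 0.1, 0.3, 1, 3, 5),
    )
)

# Summaries http_request_size_bytes / http_response_size_bytes per handler
instrumentator.add(metrics.request_size())
instrumentator.add(metrics.response_size())
```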

🙋‍♂️ User Story 3 — Configurable & performant

As a: Platform Engineer
I want: to toggle instrumentation via an env var and exclude noisy paths
So that: the overhead stays below 3 % CPU and metric cardinality remains manageable.

✅ Acceptance Criteria

  • Setting ENABLE_METRICS=false disables both instrumentation and the /metrics route.
  • Regex list METRICS_EXCLUDED_HANDLERS prevents instrumentation of matching paths (e.g. .*admin.*).
  • P99 latency of a no‑op endpoint increases by < 1 ms with instrumentation enabled.
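
A sketch of how the toggle and exclusion list could be wired through the instrumentator (the env var names follow this proposal; the comma-separated parsing is illustrative):

```python
# sketch: env-var driven toggle and handler exclusion
import os

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# Comma-separated regex list, e.g. METRICS_EXCLUDED_HANDLERS=".*admin.*,/healthz"
excluded = [p for p in os.getenv("METRICS_EXCLUDED_HANDLERS", "").split(",") if p]

instrumentator = Instrumentator(
    should_group_status_codes=False,
    should_ignore_untemplated=True,
    # With should_respect_env_var=True the instrumentator stays inactive
    # unless the named env var is set to "true".
    should_respect_env_var=True,
    env_var_name="ENABLE_METRICS",
    excluded_handlers=excluded,
)

instrumentator.instrument(app).expose(app, include_in_schema=False, should_gzip=True)
```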

🗺️ High‑Level Implementation Notes

| Area / Component | Change (what / where) |
| --- | --- |
| Application code | Create singleton `Instrumentator()` in `service/metrics.py`; configure with `should_group_status_codes=False`, `should_ignore_untemplated=True`, `should_respect_env_var=True`, `excluded_handlers=[…]`. |
| App lifespan | In `service/main.py` add `@app.on_event("startup")` to call `instrumentator.instrument(app).expose(app, include_in_schema=False, should_gzip=True)` when metrics are enabled (see the sketch after this table). |
| Environment | New env vars: `ENABLE_METRICS="true"` (default), `METRICS_EXCLUDED_HANDLERS`, `METRICS_NAMESPACE`, `METRICS_SUBSYSTEM`. |
| Helm chart | `values.yaml`: `metrics.enabled`, `metrics.port`, `metrics.serviceMonitor.enabled`, `metrics.customLabels`. Template Deployment/ServiceMonitor resources. |
| Dockerfile | Add `prometheus-fastapi-instrumentator` to the pip install layer. No extra port; metrics are served on the same container port (e.g. 8000). |
| CI / Tests | Add job `make metrics-test`: start container → probe `/metrics` → assert required metric names exist. |
| Docs | `docs/docs/manage/observability.md`: how to enable metrics locally, dashboards link, common pitfalls (high cardinality, gzip vs CPU). |
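
For the first two rows, a sketch of how the `service/metrics.py` singleton and the startup hook in `service/main.py` could fit together (module paths follow the table above; treat them as placeholders):

```python
# sketch of the proposed wiring; shown as one file, with the intended split in comments

# --- service/metrics.py: singleton instrumentator ---
from prometheus_fastapi_instrumentator import Instrumentator

instrumentator = Instrumentator(
    should_group_status_codes=False,
    should_ignore_untemplated=True,
    should_respect_env_var=True,  # honours ENABLE_METRICS
    excluded_handlers=[],         # to be filled from METRICS_EXCLUDED_HANDLERS
)

# --- service/main.py: startup hook ---
from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
async def _setup_metrics() -> None:
    # /metrics is hidden from the OpenAPI schema and served gzip-compressed
    # on the same container port as the application itself.
    instrumentator.instrument(app).expose(
        app, include_in_schema=False, should_gzip=True
    )
```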

🛠 Required Code Changes (proposed)

| File(s) / Module | Change Type | Detail |
| --- | --- | --- |
| `service/metrics.py` | NEW | `Instrumentator` factory with helper `def setup(app): …` |
| `service/main.py` | FastAPI entry point | Call `setup_metrics(app)` on startup; add `metrics_router` if separated |
| `service/settings.py` | Pydantic config | Add `ENABLE_METRICS`, `METRICS_*` fields with defaults |
| `charts/values.yaml` | Helm values | New `metrics:` block |
| `charts/templates/deployment.tpl` | K8s Deployment template | Conditional container env + port + annotations for ServiceMonitor |
| `tests/test_metrics.py` | Pytest | Integration test asserting the `/metrics` endpoint and a sample label set (see the sketch after this table) |
| `docs/docs/manage/observability.md` | Docs | Usage guide and troubleshooting |
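
As a lighter-weight counterpart to the container-based CI job, `tests/test_metrics.py` could start with an in-process check along these lines (a sketch; the route name is illustrative):

```python
# tests/test_metrics.py - sketch of an in-process /metrics integration test
from fastapi import FastAPI
from fastapi.testclient import TestClient
from prometheus_fastapi_instrumentator import Instrumentator


def _build_app() -> FastAPI:
    app = FastAPI()

    @app.get("/ping")
    def ping():
        return {"pong": True}

    Instrumentator().instrument(app).expose(app)
    return app


def test_metrics_endpoint_exposes_expected_names():
    client = TestClient(_build_app())

    # Drive at least one request so the request metrics have samples.
    assert client.get("/ping").status_code == 200

    response = client.get("/metrics")
    assert response.status_code == 200
    assert "http_requests_total" in response.text
    assert "http_request_duration_seconds" in response.text
```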

📐 Design Sketch

```mermaid
sequenceDiagram
    participant Prometheus
    participant Service
    Prometheus->>Service: GET /metrics (scrape)
    Service-->>Prometheus: 200 OK + text/plain (metrics)
```

🔄 Alternatives Considered

| Option | Pros | Cons |
| --- | --- | --- |
| Built-in `prometheus_client` middleware | Full control, battle-tested | More boilerplate, manual handler mapping |
| OpenTelemetry + OTEL Collector exporter | Vendor-neutral, traces + metrics | Extra infra (collector), multi-hop latency, slightly higher CPU |
| **Chosen:** `prometheus-fastapi-instrumentator` | Minimal code, rich defaults, modular | Slight overhead, fewer power-user knobs than raw client |

📓 Additional Context / Checklist

  • /metrics must not require auth inside the cluster; use NetworkPolicy/Ingress to limit external access.
  • Enable gzip compression by default; measure CPU impact.
  • Use CUSTOM_LABELS to add service & environment tags for multi‑cluster federation.
  • Alert rules: HTTP 5xx rate > 1 % for 5 min; P99 latency > 1 s.
  • Dashboard widgets: request rate, error rate, latency histogram, in‑progress gauge.

Labels: enhancement (New feature or request), observability (Observability, logging, monitoring), python (Python / backend development (FastAPI)), triage (Issues / Features awaiting triage)
