# Grafana Observability Stack
Production observability built on three pillars: metrics (Prometheus), logs (Loki or structured JSON), and traces (Tempo, OpenTelemetry, or Sentry). Grafana (v12+) unifies the view and adds observability-as-code tools (Git Sync, CLI, Terraform provider) and managed alerting.
## When to use
- Monitoring API latency, error rates, and throughput in production
- Setting up alerting rules for SLA/SLO violations (use managed alerts or alerting-as-code for repeatability)
- Debugging distributed request flows across multiple services
- Correlating errors with deployment events
- Tracking business metrics alongside infrastructure health
## When NOT to use
- Application performance profiling at the code level (use a profiler or OpenTelemetry Profiles for continuous profiling when available)
- Log analysis at petabyte scale (consider a dedicated SIEM)
- When only a single monolith exists with minimal traffic (a simple health check suffices)
- Real-time product analytics (use PostHog, Amplitude, etc.)
- Compliance audit logging with legal retention requirements (use append-only audit stores)
## Core concepts
| Pillar | Backend | Question it answers |
|---------|-------------|----------------------------------------|
| Metrics | Prometheus | "How is the system performing right now?" |
| Logs | Loki / JSON | "What happened during this request?" |
| Traces | Tempo / OpenTelemetry / Sentry | "How did the request flow across services?" |

Note: Grafana v12 (2025–2026) adds tighter integrations for managing dashboards and alerts as code (Git Sync, Terraform provider, CLI) and improved visualization suggestions; prefer these features for reproducible observability configurations (see Grafana docs: What's new in v12).
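
The as-code workflow can be sketched with the Grafana Terraform provider. This is a minimal illustration, not a complete setup: the Grafana URL, the token variable, and the dashboard JSON path are placeholders you would replace with your own.

```hcl
terraform {
  required_providers {
    grafana = {
      source = "grafana/grafana"
    }
  }
}

provider "grafana" {
  url  = "https://grafana.example.com"        # placeholder
  auth = var.grafana_service_account_token    # placeholder variable
}

# Dashboard JSON lives in version control next to this config.
resource "grafana_dashboard" "api_overview" {
  config_json = file("${path.module}/dashboards/api-overview.json")
}
```

Keeping the dashboard JSON in the same repository means reviews, rollbacks, and environment promotion go through the normal Git workflow instead of the Grafana UI.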
### Prometheus data model
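In Prometheus, a time series is identified by a metric name plus a set of key/value labels; each distinct label combination is its own series. The sketch below illustrates that model with a toy counter (assumed shapes for illustration, not the prom-client API):

```typescript
// A time series is a metric name plus a sorted label set; samples are
// numeric values attached to that series over time.
type Labels = Record<string, string>;

function seriesKey(name: string, labels: Labels): string {
  const pairs = Object.keys(labels)
    .sort()
    .map((k) => `${k}="${labels[k]}"`)
    .join(",");
  return `${name}{${pairs}}`;
}

class Counter {
  private series = new Map<string, number>();
  constructor(private name: string) {}

  // Counters only go up; each distinct label set is its own series.
  inc(labels: Labels, by = 1): void {
    const key = seriesKey(this.name, labels);
    this.series.set(key, (this.series.get(key) ?? 0) + by);
  }

  // Render in the Prometheus text exposition format.
  expose(): string {
    return [...this.series.entries()]
      .map(([key, value]) => `${key} ${value}`)
      .join("\n");
  }
}

const requests = new Counter("http_requests_total");
requests.inc({ method: "GET", status: "200" });
requests.inc({ method: "GET", status: "200" });
requests.inc({ method: "POST", status: "500" });
console.log(requests.expose());
// http_requests_total{method="GET",status="200"} 2
// http_requests_total{method="POST",status="500"} 1
```

Because labels multiply series, keep label values low-cardinality (method, status code, route template) and never use unbounded values like user IDs.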
### Structured logging
Structured logs are JSON objects with consistent fields. They replace `console.log` strings with queryable data.
```json
{
  "level": "error",
  "timestamp": "2026-03-02T14:05:33.412Z",
  "message": "payment authorization failed",
  "service": "checkout-api",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "duration_ms": 187
}
```
### Distributed tracing
A trace is a tree of spans. Each span represents a unit of work (HTTP request, DB query, external API call) with timing, status, and metadata.

Note: OpenTelemetry made several platform-level changes in 2025–2026: declarative collector configuration is now stable (use it for Collector pipelines), Kubernetes semantic conventions are moving toward RC/stable attributes, and the Span Events API was deprecated (March 2026). OpenTelemetry Profiles (continuous profiling) also entered public alpha in 2026; watch for a production-ready profiling option in the OTel ecosystem.
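
The span-tree structure can be sketched in a few lines (an assumed shape for illustration, not the OpenTelemetry API). A common triage question is which leaf span a slow request actually spent its time in:

```typescript
// Each span has a name, start/end timestamps, and child spans.
interface Span {
  name: string;
  startMs: number;
  endMs: number;
  children: Span[];
}

const durationMs = (s: Span): number => s.endMs - s.startMs;

// Walk the tree and return the longest-running leaf span, a quick way
// to see where a slow request actually spent its time.
function slowestLeaf(span: Span): Span {
  if (span.children.length === 0) return span;
  return span.children
    .map(slowestLeaf)
    .reduce((a, b) => (durationMs(b) > durationMs(a) ? b : a));
}

// Hypothetical trace for a slow endpoint.
const trace: Span = {
  name: "GET /api/orders",
  startMs: 0,
  endMs: 120,
  children: [
    { name: "auth.check", startMs: 2, endMs: 10, children: [] },
    {
      name: "db.query",
      startMs: 12,
      endMs: 110,
      children: [{ name: "pg.connect", startMs: 12, endMs: 80, children: [] }],
    },
  ],
};

console.log(slowestLeaf(trace).name); // pg.connect
```

Real tracing backends answer this same question across service boundaries by propagating the trace ID in request headers.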
## Workflow

### Step 0 — Choose your collector

- Production recommendation: use a maintained OpenTelemetry Collector distribution (Grafana Alloy or upstream collector-contrib) to receive OTLP, run transforms, and forward metrics/logs/traces to backends.
- Note: Grafana Agent reached End-of-Life on 2025-11-01. Migrate Agent deployments to Grafana Alloy or another Collector distribution; Alloy provides Prometheus pipeline compatibility and remote_write support.
- For Prometheus ingestion at scale, prefer remote_write pipelines to long-term storage (Remote-Write v2 adoption is increasing). Ensure labels and external_labels are applied consistently.
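
The pipeline above can be sketched as a minimal Collector configuration. Backend endpoints are placeholders (Mimir and Tempo are used here as example targets); processor and exporter choices depend on your stack:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}  # batch telemetry before export to reduce request volume

exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push  # placeholder metrics backend
  otlphttp:
    endpoint: http://tempo:4318              # placeholder traces backend

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Applications then export OTLP to one local endpoint, and routing, batching, and backend credentials stay in the Collector rather than in every service.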

### Step 1 — Instrument with OpenTelemetry (SDK + Collector)

Use the OpenTelemetry SDK for application-level instrumentation and export via OTLP to your Collector.
```typescript
// lib/telemetry.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Minimal sketch: adjust exporters and instrumentations to your stack.
const sdk = new NodeSDK({
  serviceName: 'checkout-api', // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // your Collector's OTLP/HTTP endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Notes:
- Prefer sending telemetry to an intermediate Collector (Alloy / collector-contrib) rather than directly to storage backends. Declarative collector configuration (stable) simplifies pipeline management.
- Follow current OpenTelemetry semantic conventions for attribute names; monitor the K8s semantic conventions changes if you rely on pod/node attributes.
- Because the Span Events API was deprecated in 2026, prefer attributes and structured logs for per-span annotations where possible (see OpenTelemetry blog posts on deprecations and declarative config).
### Step 2 — Add structured logging