Tags: observability, vllm, otel, slo, inference, dcgm, envoy

Overnight Observability Run — 2026-04-15

15-task observability run spanning the DCGM, tracing (OTel), llm-d, SLO, and failure-simulation tracks on DGX Spark GB10

Date: 2026-04-15
Hardware: DGX Spark GB10 (B200)
Tasks: 15 across 5 tracks
SLO Tier: Tier 1 (99.5%)

Task Completion Matrix

| Task | Description | Status | Key Output / Finding |
|------|-------------|--------|----------------------|
| DCGM-1 | Metric inventory | ✓ Complete | 11 DCGM metric families reachable; FB_USED/FB_FREE absent (field IDs 252/253 not in exporter ConfigMap) |
| DCGM-2 | Causal chain correlations | ✓ Complete | gpu_power_w ↔ gpu_util r=0.998; sm_clock ↔ gpu_util r=0.938; sglang e2e latency histogram 100% NaN |
| DCGM-3 | OTel correlation dataset | ✓ Complete | 73 records at 15s interval; vLLM idle during window, so TTFT/TPOT histograms sparse |
| TRACE-1 | Boundary instrumentation audit | ✓ Complete | Zero OTel configured pre-patch; traceparent silently dropped by vLLM; Envoy/vLLM/SGLang all silent |
| TRACE-2 | Tempo deployment | ✓ Complete | Tempo deployed to tempo.monitoring; added as Grafana datasource; deepseek pod YAML patched with OTLP env vars |
| TRACE-3 | Envoy Gateway OTel patch | ⚠ Partial | Envoy creates spans (100% sampling, 716 buffered), but an h2c protocol mismatch blocks export to the OTel Collector; 0 spans ingested by Tempo |
| LLMD-1 | Boundary instrumentation | ✓ Complete | No llm-d EPP deployed; routing is pure Envoy HTTPRoutes; 35 vLLM metric series confirmed, including TTFT, TPOT, KV cache % |
| LLMD-2 | Queue metrics | ✓ Complete | TTFT bimodal: p50=375ms cold, 30–40s for DeepSeek-R1 reasoning chains; TPOT stable at 125ms/tok; queue depth 0 at concurrency=6 |
| LLMD-3 | Pod failure simulation | ✓ Complete | Bare pod force-deleted → immediate HTTP 500; zero alerts fired; metric series went stale (no 0-value transition); pod permanently absent |
| SLO-1 | Baseline measurement | ✓ Complete | TTFT p95: Bucket A=44.8s, B=39.1s, C=37.0s across 167 requests with 0 errors |
| SLO-2 | SLO specification | ✓ Complete | 3 tiers defined: Dev 99.5%, Interactive 99.9%, Prod 99.95%; Tier 1 achievable today; Tier 2 requires a 7–15s TTFT reduction |
| SLO-3 | PrometheusRules | ✓ Complete | token-labs-slos PrometheusRules applied; multi-burn-rate availability alerts live for Tier 2 across 1h/6h/1d windows |
| FAILURE-1 | KV pressure cascade | ✓ Complete | Peak KV cache 1.98% at concurrency=12 with 700-token prompts; 95% cascade unreachable on GB10 — hardware too fast for pressure buildup |
| FAILURE-2 | NVLink failure | ✗ Blocked | NVLink metrics not available on GB10 SoC; hardware limitation, not a configuration issue |
| FAILURE-3 | Silent truncation | ✓ Complete | 0% server-side truncation across 50 requests; the real risk is client connector-pool exhaustion (45/50 requests silently dropped in the initial test) |
| SYNTHESIS-1 | Gap closure report | ✓ Complete | overnight_run_summary.md generated |
| SYNTHESIS-2 | Next run scope / this page | ✓ Complete | HTML report published to tokenlabs.run/overnight-run-2026-04-15.html |

Top 3 Findings

Finding 01
Bare pods = silent operational risk: no controller, no auto-restart, zero alerts fired on force-delete
The deepseek pod ran as a bare pod with no Deployment, StatefulSet, or ownerReference. Force-delete produced 100% HTTP 500 with zero alert events — no PodAbsent rule, no EndpointDown rule, no vLLM metric staleness alert. Metric series did not transition to zero; they disappeared entirely, making threshold alerts blind. This pattern likely applies to all inference pods and is the highest single operational risk in the cluster.
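The immediate remediation is to give the pod a controller so force-delete triggers a reschedule instead of permanent absence. A minimal sketch of a Deployment wrapper follows; the image, namespace, port, and labels are illustrative assumptions, not the actual pod spec from this cluster.

```yaml
# Hypothetical Deployment wrapper for the bare deepseek pod.
# Image, namespace, labels, and port are assumptions — copy the real
# container spec from the existing pod YAML before applying.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek
  namespace: token-labs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek
  template:
    metadata:
      labels:
        app: deepseek
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed image tag
          ports:
            - containerPort: 8000          # assumed serving port
```

With an ownerReference in place, deleting the pod causes the ReplicaSet to recreate it, and standard KubeDeploymentReplicasMismatch-style alerts become meaningful.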
Finding 02
GB10 B200 defeats KV pressure tests: peak 1.98% at concurrency=12 — target of 95% unreachable at this scale
FAILURE-1 targeted 95% KV cache saturation with 12 concurrent 700-token prompts and peaked at 1.98%. The B200 unified memory architecture processes and frees KV blocks faster than the 15s Prometheus scrape window can capture saturation. Reaching real KV pressure requires 32k-token contexts, 50+ concurrent sessions, or artificial read throttling — none of which the current workload generator supports.
Finding 03
TTFT p95 tail driven by prefill blocking: p50=7.7s vs p95=44.8s at concurrency=6 — chunked prefill is the fix
At concurrency=6, queue depth held at 0 and TPOT was stable at 125ms/tok — the GPU was not saturated. Yet p95 TTFT reaches 44.8s (Bucket A) while p50 is 7.7s. The spread indicates a single large-prompt request monopolizing the prefill phase and stalling TTFT for all concurrent requests. --enable-chunked-prefill is not enabled on any pod and is the highest-leverage single config change to close the Tier 2 gap.
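A sketch of the container-args change, assuming a stock vLLM OpenAI-compatible server; the model name and token budget are placeholders to illustrate the shape of the flags, not measured-optimal values.

```yaml
# vLLM container args sketch — model name and max-num-batched-tokens
# value are assumptions; tune the budget against observed TTFT.
containers:
  - name: vllm
    args:
      - --model=deepseek-ai/DeepSeek-R1   # placeholder model id
      - --enable-chunked-prefill          # split large prefills into chunks
      - --max-num-batched-tokens=2048     # caps per-step prefill work so one
                                          # long prompt cannot stall TTFT for
                                          # concurrent requests
```

Chunked prefill interleaves prefill chunks with decode steps, which is exactly the mechanism that should compress the p50-to-p95 TTFT spread observed here.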

Key Numbers

| Metric | Value | Notes |
|--------|-------|-------|
| TTFT p50 — Bucket A (Interactive) | 7.7s | n=104, 0 errors |
| TTFT p95 — Bucket A (Interactive) | 44.8s | 14.8s gap vs Tier 2 target of 30s |
| TTFT p50 — Bucket B (Batch-Light) | 10.5s | n=42 |
| TTFT p95 — Bucket B (Batch-Light) | 39.1s | 9.1s gap vs Tier 2 target of 30s |
| TTFT p95 — Bucket C (Long-Context) | 37.0s | 7.0s gap vs Tier 2 target of 30s (low n; p95 noisy) |
| TPOT p50 (DeepSeek-R1, concurrency=6) | 125ms/tok | ~8 tok/s per request, stable throughout |
| KV cache peak (concurrency=12) | 1.98% | Never approached the 95% cascade threshold |
| GPU util peak (spark-02, sglang) | 96% | During the Llama-3.1-8B sglang load run |
| gpu_power_w ↔ gpu_util correlation | r=0.998 | Power is a reliable GPU-utilization proxy |
| Trace coverage end-to-end (client → vLLM) | 0% | h2c protocol bug blocks Envoy → OTel Collector export |
| Envoy spans created | 716 | 100% sampling, all buffered — 0 ingested by Tempo |
| vLLM metric series total | 91 | Includes prefill/decode time, KV cache %, queue depth |
| SLO tier achieved today | Tier 1 (99.5%) | Tier 2 (99.9%) requires chunked prefill + prefix cache |

Top 3 Gaps Open

Gap 01
E2E tracing unverified: Envoy → OTel Collector h2c protocol mismatch blocks export
Envoy Gateway is configured with backendRef: otelcol.token-labs:4317 but sends over h2c (HTTP/2 cleartext). The OTel Collector rejects the connection — 716 spans buffered in Envoy, 0 ingested by Tempo. Fix: configure gRPC over TLS on the Collector listener, or switch Envoy to HTTP/1.1 OTLP export on port 4318. vLLM tracing is additionally blocked pending a pod restart to activate the patched OTEL_EXPORTER_OTLP_ENDPOINT env vars.
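One way to close this is on the Collector side: expose both OTLP transports so either Envoy export path works. A sketch of the receiver/pipeline config follows; the Tempo endpoint and pipeline layout are assumptions based on the deployment names mentioned above.

```yaml
# OTel Collector config sketch — accept OTLP over both gRPC (4317) and
# HTTP (4318) so Envoy can export without h2c negotiation. The Tempo
# endpoint and pipeline wiring are assumptions for this cluster.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlp:
    endpoint: tempo.monitoring:4317   # assumed Tempo OTLP/gRPC port
    tls:
      insecure: true                  # cluster-internal traffic
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

If the Collector change is easier than re-pointing Envoy at 4318, this single ConfigMap edit plus a Collector restart should let the 716 buffered spans' successors flow through to Tempo.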
Gap 02
FB_USED/FB_FREE absent from Prometheus: DCGM field IDs 252/253 not in exporter ConfigMap
HBM utilization metrics are completely absent from the Prometheus scrape. This severs the causal chain from GPU memory pressure to vLLM KV cache behavior — there is no way to correlate approaching memory limits with queue preemptions. The DCGM ServiceMonitor was created this run, but enabling fields 252/253 requires a ConfigMap edit and exporter restart (two-minute procedure).
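The dcgm-exporter counters ConfigMap uses a CSV of `field, prometheus-type, help-text` lines; adding the two framebuffer fields is two lines. Sketch below — the ConfigMap and key names are assumptions and must match whatever the exporter's collectors flag actually points at in this cluster.

```yaml
# dcgm-exporter ConfigMap sketch. ConfigMap name, namespace, and CSV key
# are assumptions; only the two DCGM_FI_DEV_FB_* lines are the actual fix
# (field IDs 252/253). Restart the exporter DaemonSet after applying.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-metrics
  namespace: monitoring
data:
  default-counters.csv: |
    # DCGM field, Prometheus metric type, help text
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
```

Once scraped, `DCGM_FI_DEV_FB_USED` can be joined against the vLLM KV cache gauge to restore the memory-pressure leg of the causal chain.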
Gap 03
No alert for pod absence or metric staleness: SLO burn-rate rules are blind to gateway-layer failures
No PrometheusRule fires on pod absence, zero-endpoint services, or stale vLLM metric series. The SLO burn-rate rules added this run fire on vllm:request_failure_total rate — which never increments when the pod is gone and requests fail at the Envoy layer before reaching vLLM. LLMD-3 proved this explicitly: full pod death produced zero alert events.
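Absence is the condition to alert on, so the rules must use `absent()` rather than thresholds on series that vanish. A PrometheusRule sketch follows; the metric names and job label are assumptions and should be swapped for series confirmed to exist in this cluster.

```yaml
# PrometheusRule sketch for pod-absence / metric-staleness alerting.
# vllm:num_requests_running and the "deepseek" job label are assumptions;
# substitute series that actually exist in this Prometheus.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-staleness
  namespace: monitoring
spec:
  groups:
    - name: inference.staleness
      rules:
        - alert: VllmMetricsAbsent
          # Fires when the series disappears entirely — the exact failure
          # mode LLMD-3 demonstrated, which threshold rules cannot see.
          expr: absent(vllm:num_requests_running)
          for: 2m
          labels:
            severity: critical
        - alert: InferenceScrapeTargetAbsent
          # Catches the scrape target itself vanishing from discovery.
          expr: absent(up{job="deepseek"})
          for: 2m
          labels:
            severity: critical
```

Because Envoy terminates the failed requests, an Envoy-side 5xx-rate rule would complement these; the pair above closes only the staleness blind spot.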

Infrastructure Changes Applied This Run

Next Run Scope

Raw Artifacts

- overnight_run_summary.md
- trace_coverage_report.md
- dcgm_metric_inventory.md
- causal_chain_stats.md
- dcgm_otel_correlation.json
- llmd_boundary_instrumentation.md
- llmd_queue_metrics.md
- llmd_failure_sim.md
- baseline_latency_distributions.json
- slo_spec.md
- failure_kv_pressure.md
- failure_silent_truncation.md

Artifacts are local to the controller node at /home/nvidia/overnight-run/2026-04-15/