Tags: observability, vllm, otel, slo, inference, dcgm, envoy

Overnight Observability Run — 2026-04-15

15-task observability run spanning the DCGM, tracing (OTel), llm-d, SLO, and failure-simulation tracks on DGX Spark GB10

Date: 2026-04-15
Hardware: DGX Spark GB10 (B200)
Tasks: 15 across 5 tracks
SLO Tier: Tier 1 (99.5%)

Task Completion Matrix

| Task | Description | Status | Key Output / Finding |
|------|-------------|--------|----------------------|
| DCGM-1 | Metric inventory | ✓ Complete | 11 DCGM metric families reachable; FB_USED/FB_FREE absent (field IDs 252/253 not in exporter ConfigMap) |
| DCGM-2 | Causal chain correlations | ✓ Complete | gpu_power_w ↔ gpu_util r=0.998; sm_clock ↔ gpu_util r=0.938; sglang e2e latency histogram 100% NaN |
| DCGM-3 | OTel correlation dataset | ✓ Complete | 73 records at 15s interval; vLLM idle during window, so TTFT/TPOT histograms sparse |
| TRACE-1 | Boundary instrumentation audit | ✓ Complete | Zero OTel configured pre-patch; traceparent silently dropped by vLLM; Envoy/vLLM/SGLang all silent |
| TRACE-2 | Tempo deployment | ✓ Complete | Tempo deployed to tempo.monitoring; added as Grafana datasource; deepseek pod YAML patched with OTLP env vars |
| TRACE-3 | Envoy Gateway OTel patch | ⚠ Partial | Envoy creates spans (100% sampling, 716 buffered), but an h2c protocol mismatch blocks export to the OTel Collector; 0 spans ingested by Tempo |
| LLMD-1 | Boundary instrumentation | ✓ Complete | No llm-d EPP deployed; routing is pure Envoy HTTPRoutes; 35 vLLM metric series confirmed, including TTFT, TPOT, KV cache % |
| LLMD-2 | Queue metrics | ✓ Complete | TTFT bimodal: p50=375ms cold, 30–40s for DeepSeek-R1 reasoning chains; TPOT stable at 125ms/tok; queue depth 0 at concurrency=6 |
| LLMD-3 | Pod failure simulation | ✓ Complete | Bare pod force-deleted → immediate HTTP 500; zero alerts fired; metric series went stale (no 0-value transition); pod permanently absent |
| SLO-1 | Baseline measurement | ✓ Complete | TTFT p95: Bucket A=44.8s, B=39.1s, C=37.0s across 167 requests with 0 errors |
| SLO-2 | SLO specification | ✓ Complete | 3 tiers defined: Dev 99.5%, Interactive 99.9%, Prod 99.95%; Tier 1 achievable today; Tier 2 requires a 7–15s TTFT reduction |
| SLO-3 | PrometheusRules | ✓ Complete | token-labs-slos PrometheusRules applied; multi-burn-rate availability alerts live for Tier 2 across 1h/6h/1d windows |
| FAILURE-1 | KV pressure cascade | ✓ Complete | Peak KV cache 1.98% at concurrency=12 with 700-token prompts; 95% cascade unreachable on GB10 — hardware too fast for pressure buildup |
| FAILURE-2 | NVLink failure | ✗ Blocked | NVLink metrics not available on GB10 SoC; hardware limitation, not a configuration issue |
| FAILURE-3 | Silent truncation | ✓ Complete | 0% server-side truncation across 50 requests; the real risk is client connector-pool exhaustion (45/50 requests silently dropped in the initial test) |
| SYNTHESIS-1 | Gap closure report | ✓ Complete | overnight_run_summary.md generated |
| SYNTHESIS-2 | Next run scope / this page | ✓ Complete | HTML report published to tokenlabs.run/overnight-run-2026-04-15.html |

Top 3 Findings

Finding 01
Bare pods = silent operational risk: no controller, no auto-restart, zero alerts fired on force-delete
The deepseek pod ran as a bare pod with no Deployment, StatefulSet, or ownerReference. Force-delete produced 100% HTTP 500 with zero alert events — no PodAbsent rule, no EndpointDown rule, no vLLM metric staleness alert. Metric series did not transition to zero; they disappeared entirely, making threshold alerts blind. This pattern likely applies to all inference pods and is the highest single operational risk in the cluster.
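The immediate remediation is to give the pod a controller so force-delete triggers a reschedule instead of permanent absence. A minimal sketch of a Deployment wrapper follows; the image, namespace, port, and labels are illustrative assumptions, not the actual pod spec from this cluster.

```yaml
# Hypothetical Deployment wrapper for the bare deepseek pod.
# Image, namespace, labels, and port are assumptions — copy the real
# container spec from the existing pod YAML before applying.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek
  namespace: token-labs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek
  template:
    metadata:
      labels:
        app: deepseek
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed image tag
          ports:
            - containerPort: 8000          # assumed serving port
```

With an ownerReference in place, deleting the pod causes the ReplicaSet to recreate it, and standard KubeDeploymentReplicasMismatch-style alerts become meaningful.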
Finding 02
GB10 B200 defeats KV pressure tests: peak 1.98% at concurrency=12 — target of 95% unreachable at this scale
FAILURE-1 targeted 95% KV cache saturation with 12 concurrent 700-token prompts and peaked at 1.98%. The B200 unified memory architecture processes and frees KV blocks faster than the 15s Prometheus scrape window can capture saturation. Reaching real KV pressure requires 32k-token contexts, 50+ concurrent sessions, or artificial read throttling — none of which the current workload generator supports.
Finding 03
TTFT p95 tail driven by prefill blocking: p50=7.7s vs p95=44.8s at concurrency=6 — chunked prefill is the fix
At concurrency=6, queue depth held at 0 and TPOT was stable at 125ms/tok — the GPU was not saturated. Yet p95 TTFT reaches 44.8s (Bucket A) while p50 is 7.7s. The spread indicates a single large-prompt request monopolizing the prefill phase and stalling TTFT for all concurrent requests. --enable-chunked-prefill is not enabled on any pod and is the highest-leverage single config change to close the Tier 2 gap.
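A sketch of the container-args change, assuming a stock vLLM OpenAI-compatible server; the model name and token budget are placeholders to illustrate the shape of the flags, not measured-optimal values.

```yaml
# vLLM container args sketch — model name and max-num-batched-tokens
# value are assumptions; tune the budget against observed TTFT.
containers:
  - name: vllm
    args:
      - --model=deepseek-ai/DeepSeek-R1   # placeholder model id
      - --enable-chunked-prefill          # split large prefills into chunks
      - --max-num-batched-tokens=2048     # caps per-step prefill work so one
                                          # long prompt cannot stall TTFT for
                                          # concurrent requests
```

Chunked prefill interleaves prefill chunks with decode steps, which is exactly the mechanism that should compress the p50-to-p95 TTFT spread observed here.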

Key Numbers

| Metric | Value | Notes |
|--------|-------|-------|
| TTFT p50 — Bucket A (Interactive) | 7.7s | n=104, 0 errors |
| TTFT p95 — Bucket A (Interactive) | 44.8s | 14.8s gap vs Tier 2 target of 30s |
| TTFT p50 — Bucket B (Batch-Light) | 10.5s | n=42 |
| TTFT p95 — Bucket B (Batch-Light) | 39.1s | 9.1s gap vs Tier 2 target of 30s |
| TTFT p95 — Bucket C (Long-Context) | 37.0s | 7.0s gap vs Tier 2 target of 30s (low n; p95 noisy) |
| TPOT p50 (DeepSeek-R1, concurrency=6) | 125ms/tok | ~8 tok/s per request, stable throughout |
| KV cache peak (concurrency=12) | 1.98% | Never approached the 95% cascade threshold |
| GPU util peak (spark-02, sglang) | 96% | During the Llama-3.1-8B sglang load run |
| gpu_power_w ↔ gpu_util correlation | r=0.998 | Power is a reliable GPU-utilization proxy |
| Trace coverage end-to-end (client → vLLM) | 0% | h2c protocol bug blocks Envoy → OTel Collector export |
| Envoy spans created | 716 | 100% sampling, all buffered — 0 ingested by Tempo |
| vLLM metric series total | 91 | Includes prefill/decode time, KV cache %, queue depth |
| SLO tier achieved today | Tier 1 (99.5%) | Tier 2 (99.9%) requires chunked prefill + prefix cache |

Top 3 Gaps Open

Gap 01
E2E tracing unverified: Envoy → OTel Collector h2c protocol mismatch blocks export
Envoy Gateway is configured with backendRef: otelcol.token-labs:4317 but sends over h2c (HTTP/2 cleartext). The OTel Collector rejects the connection — 716 spans buffered in Envoy, 0 ingested by Tempo. Fix: configure gRPC over TLS on the Collector listener, or switch Envoy to HTTP/1.1 OTLP export on port 4318. vLLM tracing is additionally blocked pending a pod restart to activate the patched OTEL_EXPORTER_OTLP_ENDPOINT env vars.
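One way to close this is on the Collector side: expose both OTLP transports so either Envoy export path works. A sketch of the receiver/pipeline config follows; the Tempo endpoint and pipeline layout are assumptions based on the deployment names mentioned above.

```yaml
# OTel Collector config sketch — accept OTLP over both gRPC (4317) and
# HTTP (4318) so Envoy can export without h2c negotiation. The Tempo
# endpoint and pipeline wiring are assumptions for this cluster.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlp:
    endpoint: tempo.monitoring:4317   # assumed Tempo OTLP/gRPC port
    tls:
      insecure: true                  # cluster-internal traffic
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

If the Collector change is easier than re-pointing Envoy at 4318, this single ConfigMap edit plus a Collector restart should let the 716 buffered spans' successors flow through to Tempo.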
Gap 02
FB_USED/FB_FREE absent from Prometheus: DCGM field IDs 252/253 not in exporter ConfigMap
HBM utilization metrics are completely absent from the Prometheus scrape. This severs the causal chain from GPU memory pressure to vLLM KV cache behavior — there is no way to correlate approaching memory limits with queue preemptions. The DCGM ServiceMonitor was created this run, but enabling fields 252/253 requires a ConfigMap edit and exporter restart (two-minute procedure).
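The dcgm-exporter counters ConfigMap uses a CSV of `field, prometheus-type, help-text` lines; adding the two framebuffer fields is two lines. Sketch below — the ConfigMap and key names are assumptions and must match whatever the exporter's collectors flag actually points at in this cluster.

```yaml
# dcgm-exporter ConfigMap sketch. ConfigMap name, namespace, and CSV key
# are assumptions; only the two DCGM_FI_DEV_FB_* lines are the actual fix
# (field IDs 252/253). Restart the exporter DaemonSet after applying.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-metrics
  namespace: monitoring
data:
  default-counters.csv: |
    # DCGM field, Prometheus metric type, help text
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
```

Once scraped, `DCGM_FI_DEV_FB_USED` can be joined against the vLLM KV cache gauge to restore the memory-pressure leg of the causal chain.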
Gap 03
No alert for pod absence or metric staleness: SLO burn-rate rules are blind to gateway-layer failures
No PrometheusRule fires on pod absence, zero-endpoint services, or stale vLLM metric series. The SLO burn-rate rules added this run fire on vllm:request_failure_total rate — which never increments when the pod is gone and requests fail at the Envoy layer before reaching vLLM. LLMD-3 proved this explicitly: full pod death produced zero alert events.
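Absence is the condition to alert on, so the rules must use `absent()` rather than thresholds on series that vanish. A PrometheusRule sketch follows; the metric names and job label are assumptions and should be swapped for series confirmed to exist in this cluster.

```yaml
# PrometheusRule sketch for pod-absence / metric-staleness alerting.
# vllm:num_requests_running and the "deepseek" job label are assumptions;
# substitute series that actually exist in this Prometheus.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-staleness
  namespace: monitoring
spec:
  groups:
    - name: inference.staleness
      rules:
        - alert: VllmMetricsAbsent
          # Fires when the series disappears entirely — the exact failure
          # mode LLMD-3 demonstrated, which threshold rules cannot see.
          expr: absent(vllm:num_requests_running)
          for: 2m
          labels:
            severity: critical
        - alert: InferenceScrapeTargetAbsent
          # Catches the scrape target itself vanishing from discovery.
          expr: absent(up{job="deepseek"})
          for: 2m
          labels:
            severity: critical
```

Because Envoy terminates the failed requests, an Envoy-side 5xx-rate rule would complement these; the pair above closes only the staleness blind spot.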

Infrastructure Changes Applied This Run

Next Run Scope

Raw Artifacts

- overnight_run_summary.md
- trace_coverage_report.md
- dcgm_metric_inventory.md
- causal_chain_stats.md
- dcgm_otel_correlation.json
- llmd_boundary_instrumentation.md
- llmd_queue_metrics.md
- llmd_failure_sim.md
- baseline_latency_distributions.json
- slo_spec.md
- failure_kv_pressure.md
- failure_silent_truncation.md

Artifacts are local to the controller node at /home/nvidia/overnight-run/2026-04-15/