# Overnight Observability Run — 2026-04-15

A 15-task observability run across DCGM, OTel, SLO, and failure simulation on a DGX Spark GB10.
## Task Completion Matrix

| Task | Status | Key Output / Finding |
|---|---|---|
| DCGM-1: Metric inventory | ✓ Complete | 11 DCGM metric families reachable; FB_USED/FB_FREE absent (field IDs 252/253 not in exporter ConfigMap) |
| DCGM-2: Causal chain correlations | ✓ Complete | gpu_power_w ↔ gpu_util r=0.998; sm_clock ↔ gpu_util r=0.938; sglang e2e latency histogram 100% NaN |
| DCGM-3: OTel correlation dataset | ✓ Complete | 73 records at 15s interval; vLLM idle during window, so TTFT/TPOT histograms sparse |
| TRACE-1: Boundary instrumentation audit | ✓ Complete | Zero OTel configured pre-patch; traceparent silently dropped by vLLM; Envoy/vLLM/SGLang all silent |
| TRACE-2: Tempo deployment | ✓ Complete | Tempo deployed to tempo.monitoring; added as Grafana datasource; deepseek pod YAML patched with OTLP env vars |
| TRACE-3: Envoy Gateway OTel patch | ⚠ Partial | Envoy creates spans (100% sampling, 716 buffered), but an h2c protocol mismatch blocks export to the OTel Collector; 0 spans ingested by Tempo |
| LLMD-1: Boundary instrumentation | ✓ Complete | No llm-d EPP deployed; routing is pure Envoy HTTPRoutes; 35 vLLM metric series confirmed, including TTFT, TPOT, and KV cache % |
| LLMD-2: Queue metrics | ✓ Complete | TTFT bimodal: p50=375ms cold, 30–40s for DeepSeek-R1 reasoning chains; TPOT stable at 125ms/tok; queue depth 0 at concurrency=6 |
| LLMD-3: Pod failure simulation | ✓ Complete | Bare pod force-deleted → immediate HTTP 500; zero alerts fired; metric series went stale (no 0-value transition); pod permanently absent |
| SLO-1: Baseline measurement | ✓ Complete | TTFT p95: Bucket A=44.8s, B=39.1s, C=37.0s across 167 requests with 0 errors |
| SLO-2: SLO specification | ✓ Complete | 3 tiers defined: Dev 99.5%, Interactive 99.9%, Prod 99.95%; Tier 1 achievable today; Tier 2 requires a 7–15s TTFT reduction |
| SLO-3: PrometheusRules | ✓ Complete | token-labs-slos PrometheusRules applied; multi-burn-rate availability alerts live for Tier 2 across 1h/6h/1d windows |
| FAILURE-1: KV pressure cascade | ✓ Complete | Peak KV cache 1.98% at concurrency=12 with 700-token prompts; 95% cascade unreachable on GB10 — hardware too fast for pressure buildup |
| FAILURE-2: NVLink failure | ✗ Blocked | NVLink metrics not available on the GB10 SoC; hardware limitation, not a configuration issue |
| FAILURE-3: Silent truncation | ✓ Complete | 0% server-side truncation across 50 requests; the real risk is client connector pool exhaustion (45/50 requests silently dropped in initial test) |
| SYNTHESIS-1: Gap closure report | ✓ Complete | overnight_run_summary.md generated |
| SYNTHESIS-2: Next run scope / this page | ✓ Complete | HTML report published to tokenlabs.run/overnight-run-2026-04-15.html |
## Top 3 Findings
### Finding 01: Bare pods = silent operational risk

No controller, no auto-restart, and zero alerts fired on force-delete.

The deepseek pod ran as a bare pod with no Deployment, StatefulSet, or ownerReference. Force-delete produced 100% HTTP 500 with zero alert events — no PodAbsent rule, no EndpointDown rule, no vLLM metric-staleness alert. Metric series did not transition to zero; they disappeared entirely, making threshold alerts blind. This pattern likely applies to all inference pods and is the single highest operational risk in the cluster.
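Wrapping the bare pod in a Deployment restores controller-managed restarts. A minimal sketch, with hypothetical names and a placeholder image — the real deepseek pod spec would be copied into `.spec.template` before applying:

```yaml
# Hypothetical Deployment wrapper for the bare deepseek pod.
# Name, labels, and image are placeholders; copy the real pod
# spec into .spec.template before applying.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-7b
  namespace: token-labs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1-7b
  template:
    metadata:
      labels:
        app: deepseek-r1-7b
    spec:
      restartPolicy: Always   # replaces the bare pod's restartPolicy: Never
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # placeholder image
```

With an ownerReference in place, a force-delete triggers an automatic replacement pod instead of permanent absence.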
### Finding 02: GB10 B200 defeats KV-pressure tests

Peak KV cache 1.98% at concurrency=12; the 95% target is unreachable at this scale.

FAILURE-1 targeted 95% KV cache saturation with 12 concurrent 700-token prompts and peaked at 1.98%. The B200 unified memory architecture processes and frees KV blocks faster than the 15s Prometheus scrape window can capture saturation. Reaching real KV pressure requires 32k-token contexts, 50+ concurrent sessions, or artificial read throttling — none of which the current workload generator supports.
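For the re-test, the peak can be read back directly from Prometheus. A hedged sketch, assuming the standard vLLM gauge name (verify against the metric series inventoried this run):

```promql
# Peak KV-cache utilization (percent) across the load window.
# vllm:gpu_cache_usage_perc is the standard vLLM gauge name;
# confirm it matches this cluster's series before relying on it.
max_over_time(vllm:gpu_cache_usage_perc[30m]) * 100
```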
### Finding 03: TTFT p95 tail driven by prefill blocking

p50=7.7s vs p95=44.8s at concurrency=6; chunked prefill is the fix.

At concurrency=6, queue depth held at 0 and TPOT was stable at 125ms/tok — the GPU was not saturated. Yet p95 TTFT reached 44.8s (Bucket A) while p50 was 7.7s. The spread indicates single large-prompt requests monopolizing the prefill phase and stalling TTFT for all concurrent requests. `--enable-chunked-prefill` is not enabled on any pod and is the highest-leverage single config change for closing the Tier 2 gap.
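A sketch of the corresponding container-args change (flag names follow vLLM's CLI; the model name is a placeholder and the surrounding pod spec is elided):

```yaml
# Sketch: enable chunked prefill on the vLLM serving container.
# --max-num-batched-tokens caps each scheduling step so a single
# long prompt cannot monopolize prefill; 2048 is a starting point.
containers:
- name: vllm
  args:
  - --model=deepseek-r1-7b        # placeholder model reference
  - --enable-chunked-prefill
  - --max-num-batched-tokens=2048
```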
## Key Numbers
| Metric | Value | Notes |
|---|---|---|
| TTFT p50 — Bucket A (Interactive) | 7.7s | n=104, 0 errors |
| TTFT p95 — Bucket A (Interactive) | 44.8s | 14.8s gap vs Tier 2 target of 30s |
| TTFT p50 — Bucket B (Batch-Light) | 10.5s | n=42 |
| TTFT p95 — Bucket B (Batch-Light) | 39.1s | 9.1s gap vs Tier 2 target of 30s |
| TTFT p95 — Bucket C (Long-Context) | 37.0s | 7.0s gap vs Tier 2 target of 30s (low-n; p95 noisy) |
| TPOT p50 (DeepSeek-R1, concurrency=6) | 125ms/tok | ~8 tok/s per request, stable throughout |
| KV cache peak (concurrency=12) | 1.98% | Never approached 95% cascade threshold |
| GPU util peak (spark-02, sglang) | 96% | During Llama-3.1-8B sglang load run |
| gpu_power_w ↔ gpu_util correlation | r=0.998 | Power is a reliable GPU utilization proxy |
| Trace coverage end-to-end (client → vLLM) | 0% | h2c protocol bug blocks Envoy → OTel Collector export |
| Envoy spans created | 716 | 100% sampling rate, all buffered — 0 ingested by Tempo |
| vLLM metric series total | 91 | Includes prefill/decode time, KV cache %, queue depth |
| SLO tier achieved today | Tier 1 | 99.5% — Tier 2 (99.9%) requires chunked prefill + prefix cache |
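The TTFT percentiles above come from the vLLM latency histograms; a hedged reproduction query, assuming the standard vLLM histogram name:

```promql
# TTFT p95 over a 20m window from the vLLM histogram.
# vllm:time_to_first_token_seconds_bucket is the standard series name;
# add a pod or model label matcher to split by bucket.
histogram_quantile(
  0.95,
  sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[20m]))
)
```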
## Top 3 Open Gaps
### Gap 01: End-to-end tracing unverified

The Envoy → OTel Collector h2c protocol mismatch blocks export.

Envoy Gateway is configured with `backendRef: otelcol.token-labs:4317` but sends over h2c (HTTP/2 cleartext). The OTel Collector rejects the connection: 716 spans buffered in Envoy, 0 ingested by Tempo. Fix: configure gRPC over TLS on the Collector listener, or switch Envoy to HTTP/1.1 OTLP export on port 4318. vLLM tracing is additionally blocked pending a pod restart to activate the patched `OTEL_EXPORTER_OTLP_ENDPOINT` env vars.
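One side of the fix is confirming the Collector exposes both OTLP transports. A minimal sketch of the relevant Collector config (standard otelcol schema; the Tempo endpoint is a placeholder to be matched against the actual Tempo service):

```yaml
# OTel Collector: accept OTLP over gRPC (4317) and HTTP (4318).
# If Envoy stays on h2c, the gRPC listener must not require TLS;
# otherwise point Envoy at the HTTP listener on 4318 instead.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlp:
    endpoint: tempo.monitoring:4317   # placeholder; match the Tempo service
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```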
### Gap 02: FB_USED/FB_FREE absent from Prometheus

DCGM field IDs 252/253 are not in the exporter ConfigMap.

HBM utilization metrics are completely absent from the Prometheus scrape. This severs the causal chain from GPU memory pressure to vLLM KV cache behavior: there is no way to correlate approaching memory limits with queue preemptions. The DCGM ServiceMonitor was created this run, but enabling fields 252/253 requires a ConfigMap edit and an exporter restart (a two-minute procedure).
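The dcgm-exporter reads its field list from a CSV file in its ConfigMap, so the edit is two added lines. A sketch using the standard DCGM field identifiers (the ConfigMap name and existing contents vary per install):

```csv
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
```

After the edit, restart the exporter pods so the new fields appear in the next scrape.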
### Gap 03: No alert for pod absence or metric staleness

SLO burn-rate rules are blind to gateway-layer failures.

No PrometheusRule fires on pod absence, zero-endpoint services, or stale vLLM metric series. The SLO burn-rate rules added this run fire on the `vllm:request_failure_total` rate, which never increments when the pod is gone and requests fail at the Envoy layer before reaching vLLM. LLMD-3 proved this explicitly: full pod death produced zero alert events.
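A staleness rule built on `absent()` covers exactly this failure mode. A sketch with placeholder rule and ConfigMap names (the gauge used in the expression should be matched against the cluster's actual vLLM series):

```yaml
# Sketch: fire when vLLM metric series disappear entirely, which is
# the failure mode LLMD-3 exposed (no 0-value transition, just absence).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: token-labs-staleness   # placeholder name
  namespace: monitoring
spec:
  groups:
  - name: pod-absence
    rules:
    - alert: VLLMMetricsAbsent
      expr: absent(vllm:num_requests_running)
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "vLLM metric series vanished; pod likely gone"
```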
## Infrastructure Changes Applied This Run

- DCGM ServiceMonitor created — Prometheus now scrapes `nvidia-dcgm-exporter` port 9400 in the `gpu-operator` namespace
- Tempo deployed to the `monitoring` namespace (`tempo.monitoring`) and added as a Grafana datasource; gRPC port 4317 and HTTP port 4318 active
- `token-labs-slos` PrometheusRules applied — multi-burn-rate availability alerts for Tier 2 (99.9%) across 1h/6h/1d windows
- Envoy Gateway OTel config patch applied — 100% sampling rate, OpenTelemetry backend configured (trace export verification pending the h2c fix)
- deepseek-r1-7b vLLM manifest patched — added `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_SERVICE_NAME`, and `--otlp-traces-endpoint` to the pod YAML; takes effect on the next pod restart
- deepseek-r1-7b-vllm-leader pod restarted after the LLMD-3 force-delete simulation (`restartPolicy: Never` required manual delete/recreate)
## Next Run Scope

1. Close the trace gap first. Verify a real waterfall trace in Grafana Tempo from Envoy ingress through vLLM to completion. Diagnose by checking `kubectl get envoyproxy -n token-labs -o yaml`, confirming that `OTEL_EXPORTER_OTLP_ENDPOINT` resolves to the correct Tempo OTLP port (4317 vs 4318), and sending a test request with a `traceparent` header while watching Tempo. Fix the h2c mismatch (switch to HTTP/1.1 on port 4318, or enable TLS on the gRPC listener).
2. Enable chunked prefill and measure the TTFT p95 delta. Add `--enable-chunked-prefill --max-num-batched-tokens 2048` to the deepseek and llama vLLM serving args, then re-run the 20-minute concurrent bucket load test and compare p95 TTFT per bucket against this baseline. Target: close the 14.8s Bucket A gap.
3. Deploy pod-absence and metric-staleness alert rules. Apply the four PromQL rules from `llmd_failure_sim.md`: `TokenLabsPodAbsent`, `VLLMMetricsAbsent`, `TokenLabsServiceNoEndpoints`, and `SLOBurnRateGatewayLayer`. Wrap all inference pods in Deployments. Validate that each rule fires correctly by re-running the LLMD-3 force-delete simulation.
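For the trace-gap verification step, a small shell sketch that builds a spec-valid W3C `traceparent` header; the gateway host in the trailing comment is a placeholder:

```shell
# Build a W3C traceparent: version 00, 16-byte trace-id,
# 8-byte parent-id, sampled flag 01.
TRACE_ID=$(head -c16 /dev/urandom | od -An -tx1 | tr -d ' \n')
SPAN_ID=$(head -c8 /dev/urandom | od -An -tx1 | tr -d ' \n')
TRACEPARENT="00-${TRACE_ID}-${SPAN_ID}-01"
echo "traceparent: ${TRACEPARENT}"
# Then send it through the gateway and search Tempo for TRACE_ID:
#   curl -s http://<gateway-host>/v1/completions \
#     -H "traceparent: ${TRACEPARENT}" -H "Content-Type: application/json" \
#     -d '{"model":"deepseek-r1-7b","prompt":"ping","max_tokens":1}'
```

Searching Tempo for the generated trace ID confirms end-to-end export once the h2c mismatch is fixed.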
## Raw Artifacts

- `overnight_run_summary.md`
- `trace_coverage_report.md`
- `dcgm_metric_inventory.md`
- `causal_chain_stats.md`
- `dcgm_otel_correlation.json`
- `llmd_boundary_instrumentation.md`
- `llmd_queue_metrics.md`
- `llmd_failure_sim.md`
- `baseline_latency_distributions.json`
- `slo_spec.md`
- `failure_kv_pressure.md`
- `failure_silent_truncation.md`

Artifacts are local to the controller node at `/home/nvidia/overnight-run/2026-04-15/`.