Token Labs
Benchmark Dashboard

Inference performance on NVIDIA DGX Spark — real data, open lab.

DGX Spark GB10 · Grace Blackwell · 128 GB Unified Memory · SM121 · Updated 2026-04-15
Total Experiments: 14
Completed: 2
In Progress / Pending: 4
Queued: 6
Failed: 2
Experiment Registry

All planned and completed experiments.

Experiment   Framework   Status   Throughput (tok/s, c=32)   TTFT p50 (ms)   ITL p50 (ms)
Optimization Journey — Ablation Study

Progressive optimization stack on Qwen/Qwen2.5-7B-Instruct · vLLM standalone → llm-d · DGX Spark GB10 · spark-01 · ~150 tok input / 256 tok output · all concurrency levels (c=1, c=8, c=32)

Throughput vs TTFT p50 — Optimization Frontier

Each point = one experiment. Path shows cumulative optimization journey. Bottom-right = optimal (high throughput, low TTFT).
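One way to read the frontier: an experiment sits on it when no other experiment is at least as good on both axes and strictly better on one. The sketch below illustrates that rule in Python. The experiment names and numbers are hypothetical placeholders, not measurements from this dashboard, and the `Point` type and `pareto_frontier` helper are introduced here only for illustration.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    throughput: float  # output tok/s, higher is better
    ttft_p50: float    # ms, lower is better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no other point dominates, sorted by throughput."""

    def dominates(a: Point, b: Point) -> bool:
        # a dominates b if it is at least as good on both axes
        # and strictly better on at least one of them.
        return (
            a.throughput >= b.throughput
            and a.ttft_p50 <= b.ttft_p50
            and (a.throughput > b.throughput or a.ttft_p50 < b.ttft_p50)
        )

    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: p.throughput)


# Hypothetical placeholder values, NOT results from the experiments above.
experiments = [
    Point("step-a", 100.0, 300.0),
    Point("step-b", 120.0, 450.0),
    Point("step-c", 110.0, 500.0),  # dominated by step-b (slower and higher TTFT)
    Point("step-d", 140.0, 600.0),
]

for p in pareto_frontier(experiments):
    print(f"{p.name}: {p.throughput:.1f} tok/s, TTFT p50 {p.ttft_p50:.0f} ms")
```

A step that falls off the frontier is not necessarily useless; it may only pay off in combination with a later step, which is why the chart also draws the cumulative path.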

Output Throughput (tok/s) — all concurrency levels
TTFT p50 (ms) — all concurrency levels — lower is better
Technique Impact — Incremental Gain per Step

Marginal throughput and TTFT improvement each optimization adds vs the previous step. Model: Qwen/Qwen2.5-7B-Instruct · spark-01 GB10 · c=32 · ~150 tok input / 256 tok output

Showing estimated deltas — actual results pending
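The marginal gain per step can be derived from the cumulative results by differencing consecutive entries. A minimal sketch, assuming the steps are stored in stack order as (name, throughput) pairs; the step names and values below are hypothetical placeholders, since actual results are still pending.

```python
def incremental_gains(
    steps: list[tuple[str, float]],
) -> list[tuple[str, float, float]]:
    """For each step after the first, return (name, absolute delta, percent delta)
    relative to the step immediately before it."""
    gains = []
    for (_, prev), (name, cur) in zip(steps, steps[1:]):
        delta = cur - prev
        gains.append((name, delta, 100.0 * delta / prev))
    return gains


# Hypothetical cumulative throughput in tok/s at c=32, NOT measured data.
cumulative = [("baseline", 200.0), ("opt-1", 250.0), ("opt-2", 240.0)]

for name, delta, pct in incremental_gains(cumulative):
    print(f"{name}: {delta:+.1f} tok/s ({pct:+.1f}% vs previous step)")
```

The same function applies to TTFT p50, with the sign read the other way: a negative delta is an improvement.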
ISL/OSL Performance Sweep

Throughput vs latency curves across concurrency levels (c=1,4,8,16,32) and input/output length combinations. Model: Qwen/Qwen2.5-7B-Instruct · Full-stack config (exp9) · DGX Spark GB10 · spark-02

Nemotron-120B NVFP4 — ISL/OSL Performance Sweep

Throughput vs latency across concurrency levels (c=1,2,4,8) and input/output length combinations. Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · NVFP4 (Marlin backend, fp8 KV cache) · vllm-cu130-nightly · DGX Spark GB10 · spark-01 · single GPU (GB10) · 2026-04-15

Peak Numbers at Max Concurrency (c=8)
ISL/OSL           Peak Throughput   TTFT p50 (c=1)   TTFT p99 (c=8)
ISL128/OSL128     79.7 tok/s        341 ms           1,127 ms
ISL512/OSL256     64.1 tok/s        538 ms           3,373 ms
ISL1024/OSL512    57.6 tok/s        813 ms           6,083 ms
ISL2048/OSL512    49.0 tok/s        1,663 ms         11,990 ms
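Two derived figures make the table easier to compare across rows: throughput relative to the shortest workload, and the ratio of loaded tail TTFT to unloaded median TTFT. The sketch below computes both from the rows above. Note that the ratio mixes two effects (p99 vs p50, and c=8 vs c=1), so it is a rough indicator of queueing under load, not a pure tail-latency measure. The `summarize` helper is introduced here for illustration only.

```python
# Rows copied from the peak-numbers table:
# (workload, peak tok/s at c=8, TTFT p50 ms at c=1, TTFT p99 ms at c=8)
ROWS = [
    ("ISL128/OSL128", 79.7, 341, 1127),
    ("ISL512/OSL256", 64.1, 538, 3373),
    ("ISL1024/OSL512", 57.6, 813, 6083),
    ("ISL2048/OSL512", 49.0, 1663, 11990),
]


def summarize(rows):
    """Return (workload, throughput relative to the first row, TTFT p99/p50 ratio)."""
    base_tput = rows[0][1]
    return [
        (name, tput / base_tput, p99 / p50)
        for name, tput, p50, p99 in rows
    ]


for name, rel_tput, ttft_ratio in summarize(ROWS):
    print(
        f"{name:<16} {rel_tput:6.1%} of shortest-workload throughput, "
        f"TTFT p99(c=8) / p50(c=1) = {ttft_ratio:.1f}x"
    )
```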
Qwen3.5-27B GPTQ-Int4 — ISL/OSL Performance Sweep

Throughput vs latency across concurrency levels (c=1,8,32) and input/output length combinations. Model: Qwen/Qwen3.5-27B-GPTQ-Int4 · GPTQ-Int4 (Marlin backend, enforce-eager) · vllm-cu130-nightly · DGX Spark GB10 · spark-01 · single GPU (GB10) · 2026-04-17

Peak Numbers at Max Concurrency (c=32)
ISL/OSL            Peak Throughput   TTFT p50 (c=1)   ITL p50 (c=32)
ISL1024/OSL1024    104.1 tok/s       1,197 ms         179 ms
ISL4096/OSL1024    83.1 tok/s        3,278 ms         207 ms
ISL1024/OSL4096    59.6 tok/s        459 ms           134 ms
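As a rough sanity check, ITL p50 implies a decode-only ceiling on aggregate throughput: if all 32 streams emitted one token every ITL with no prefill or idle time, the total rate would be c × 1000 / ITL. The sketch below compares that ceiling with the measured peak. It assumes "Peak Throughput" is output tokens per second at c=32; the `decode_bound_tok_s` helper is introduced here for illustration. Measured throughput is expected to fall below the bound, since prefill and scheduling gaps are ignored.

```python
def decode_bound_tok_s(itl_ms: float, concurrency: int) -> float:
    """Aggregate decode rate if every stream emitted one token per ITL,
    with no prefill or idle time (an optimistic upper bound)."""
    return concurrency * 1000.0 / itl_ms


# Rows copied from the table: (workload, measured peak tok/s, ITL p50 ms at c=32)
ROWS = [
    ("ISL1024/OSL1024", 104.1, 179),
    ("ISL4096/OSL1024", 83.1, 207),
    ("ISL1024/OSL4096", 59.6, 134),
]

for name, measured, itl in ROWS:
    bound = decode_bound_tok_s(itl, 32)
    print(
        f"{name:<16} measured {measured:6.1f} tok/s, "
        f"decode-only bound {bound:6.1f} tok/s "
        f"({measured / bound:.0%} of bound)"
    )
```

A large gap between the measured value and the bound suggests that time is going somewhere other than steady-state decode, which is a cue to inspect the per-concurrency curves for that workload.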
Serving Framework Comparison

SGLang vs TensorRT-LLM (PyTorch backend) vs llm-d (disaggregated vLLM) on identical workloads. Model: Qwen/Qwen2.5-7B-Instruct · TP=1 · DGX Spark GB10 · 5 ISL/OSL combos × 5 concurrency levels

vLLM Optimization Waterfall

Throughput delta relative to baseline for each optimization. Green = improvement, red = regression. Model: Qwen/Qwen2.5-7B-Instruct · DGX Spark GB10 · enforce_eager → CUDA graphs

Data pending — optimization experiments running