Inference performance on NVIDIA DGX Spark — real data, open lab.
All planned and completed experiments.
| Experiment | Framework | Status | Throughput (tok/s, c=32) | TTFT p50 (ms) | ITL p50 (ms) |
|---|---|---|---|---|---|
Progressive optimization stack on Qwen/Qwen2.5-7B-Instruct · vLLM standalone → llm-d · DGX Spark GB10 · spark-01 · ~150 tok input / 256 tok output · all concurrency levels (c=1, c=8, c=32)
Each point = one experiment. Path shows cumulative optimization journey. Bottom-right = optimal (high throughput, low TTFT).
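One way to make "bottom-right = optimal" precise is a Pareto filter: an experiment is kept only if no other experiment has both higher-or-equal throughput and lower-or-equal TTFT while being strictly better on at least one axis. The following is a minimal sketch of that idea, not the dashboard's own logic; the `Point` type, the step names, and all numbers are hypothetical placeholders rather than measured results.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    throughput: float  # tok/s, higher is better
    ttft_ms: float     # ms, lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.throughput >= b.throughput and a.ttft_ms <= b.ttft_ms
    strictly_better = a.throughput > b.throughput or a.ttft_ms < b.ttft_ms
    return no_worse and strictly_better


def pareto_front(points: list[Point]) -> list[Point]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical values for illustration only (not taken from the experiments).
points = [
    Point("step-a", 100.0, 400.0),
    Point("step-b", 150.0, 300.0),
    Point("step-c", 140.0, 250.0),
    Point("step-d", 120.0, 500.0),
]

front = pareto_front(points)
print([p.name for p in front])
```

With these placeholder values, `step-b` and `step-c` both survive: one has the higher throughput and the other the lower TTFT, so neither is strictly preferable without a stated priority.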
Marginal throughput and TTFT improvement each optimization adds vs the previous step. Model: Qwen/Qwen2.5-7B-Instruct · spark-01 GB10 · c=32 · ~150 tok input / 256 tok output
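The marginal view can be reproduced from an ordered list of per-step results by comparing each step with the one before it. The sketch below assumes a simple percentage-change definition, which may differ from how the chart computes its values; the function name, step labels, and numbers are hypothetical and are not measured data.

```python
def marginal_deltas(steps):
    """Percentage change of each step relative to the previous step.

    steps: list of (label, throughput_tok_s, ttft_ms), in the order
    the optimizations were applied. Returns a list of
    (label, throughput_change_pct, ttft_change_pct).
    """
    deltas = []
    for (_, prev_tp, prev_ttft), (label, tp, ttft) in zip(steps, steps[1:]):
        tp_pct = 100.0 * (tp - prev_tp) / prev_tp
        ttft_pct = 100.0 * (ttft - prev_ttft) / prev_ttft
        deltas.append((label, tp_pct, ttft_pct))
    return deltas


# Hypothetical numbers for illustration only.
steps = [
    ("baseline", 100.0, 400.0),
    ("step-1", 120.0, 300.0),
    ("step-2", 150.0, 330.0),
]

deltas = marginal_deltas(steps)
for label, tp_pct, ttft_pct in deltas:
    # Positive throughput change is an improvement; positive TTFT change is a regression.
    print(f"{label}: throughput {tp_pct:+.1f}%, TTFT {ttft_pct:+.1f}%")
```

Note the opposite sign conventions: a step that raises throughput and also raises TTFT (as the hypothetical `step-2` does) improves one metric while regressing the other.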
Throughput vs latency curves across concurrency levels (c=1,4,8,16,32) and input/output length combinations. Model: Qwen/Qwen2.5-7B-Instruct · Full-stack config (exp9) · DGX Spark GB10 · spark-02
Throughput vs latency across concurrency levels (c=1,2,4,8) and input/output length combinations. Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · NVFP4 (Marlin backend, fp8 KV cache) · vllm-cu130-nightly · DGX Spark GB10 · spark-01 · single B200 · 2026-04-15
| ISL/OSL | Peak Throughput | TTFT p50 (c=1) | TTFT p99 (c=8) |
|---|---|---|---|
| ISL128/OSL128 | 79.7 tok/s | 341 ms | 1,127 ms |
| ISL512/OSL256 | 64.1 tok/s | 538 ms | 3,373 ms |
| ISL1024/OSL512 | 57.6 tok/s | 813 ms | 6,083 ms |
| ISL2048/OSL512 | 49.0 tok/s | 1,663 ms | 11,990 ms |
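To read the table as a trend, peak throughput for each ISL/OSL combination can be expressed relative to the shortest configuration. The sketch below uses only the peak-throughput column above; the choice of ISL128/OSL128 as the reference point and the variable names are assumptions made for illustration.

```python
# Peak throughput (tok/s) copied from the table above.
peak_throughput = {
    "ISL128/OSL128": 79.7,
    "ISL512/OSL256": 64.1,
    "ISL1024/OSL512": 57.6,
    "ISL2048/OSL512": 49.0,
}

# Assumed reference: the shortest input/output combination.
reference = peak_throughput["ISL128/OSL128"]

# Fraction of the reference throughput retained by each configuration.
relative = {combo: tps / reference for combo, tps in peak_throughput.items()}

for combo, fraction in relative.items():
    print(f"{combo}: {fraction:.1%} of ISL128/OSL128 peak throughput")
```

On these figures, the ISL2048/OSL512 configuration retains roughly 61% of the peak throughput measured at ISL128/OSL128.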
Throughput vs latency across concurrency levels (c=1,8,32) and input/output length combinations. Model: Qwen/Qwen3.5-27B-GPTQ-Int4 · GPTQ-Int4 (Marlin backend, enforce-eager) · vllm-cu130-nightly · DGX Spark GB10 · spark-01 · single B200 · 2026-04-17
| ISL/OSL | Peak Throughput | TTFT p50 (c=1) | ITL p50 (c=32) |
|---|---|---|---|
| ISL1024/OSL1024 | 104.1 tok/s | 1,197 ms | 179 ms |
| ISL4096/OSL1024 | 83.1 tok/s | 3,278 ms | 207 ms |
| ISL1024/OSL4096 | 59.6 tok/s | 459 ms | 134 ms |
SGLang vs TensorRT-LLM (PyTorch backend) vs llm-d (disaggregated vLLM) on identical workloads. Model: Qwen/Qwen2.5-7B-Instruct · TP=1 · DGX Spark GB10 · 5 ISL/OSL combos × 5 concurrency levels
Throughput delta relative to baseline for each optimization. Green = improvement, red = regression. Model: Qwen/Qwen2.5-7B-Instruct · DGX Spark GB10 · enforce_eager → CUDA graphs