Benchmark Results
vLLM Performance on NVIDIA DGX Spark
Auto-updated via CI/CD
Methodology
Hardware
Platform:
NVIDIA DGX Spark
GPU:
GB10 Grace Blackwell Superchip (~120GB GPU memory)
Architecture:
ARM64
TDP:
~150W per node
Benchmark Configuration
Prefill Test:
100 prompts, 3072 input tokens, 1024 output tokens (3:1 ratio)
Cache Test:
Enabled (LMCache with CPU offload) - 100 prompts with prefix repetition
Decode Test:
100 prompts, 1024 input tokens, 3072 output tokens (1:3 ratio)
Request Rate:
10 requests/second
Tool:
vllm bench serve
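As a rough sketch of how the prefill and decode runs above could be launched, the snippet below assembles `vllm bench serve` argument lists for the two token shapes. It assumes the synthetic `random` dataset was used (the page does not say which dataset backs the runs) and that the flag names match the installed vLLM version; the model name `my-org/my-model` and the helper `build_bench_command` are hypothetical placeholders.

```python
from typing import List


def build_bench_command(
    model: str,
    input_len: int,
    output_len: int,
    num_prompts: int = 100,
    request_rate: float = 10.0,
) -> List[str]:
    """Build an argv list for `vllm bench serve` (flag names are assumptions)."""
    return [
        "vllm", "bench", "serve",
        "--model", model,
        # Assumption: synthetic random prompts; the page does not name a dataset.
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--request-rate", str(request_rate),
    ]


# Hypothetical model name; substitute the model actually being served.
MODEL = "my-org/my-model"

# Prefill test: 3072 input tokens, 1024 output tokens (3:1 ratio).
prefill_cmd = build_bench_command(MODEL, input_len=3072, output_len=1024)

# Decode test: 1024 input tokens, 3072 output tokens (1:3 ratio).
decode_cmd = build_bench_command(MODEL, input_len=1024, output_len=3072)

print(" ".join(prefill_cmd))
print(" ".join(decode_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once a vLLM server is running for the same model.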
Metrics
Output Speed (tok/s):
Decode throughput — total output tokens generated per second. Higher = faster responses for users.
Input/Output Cost:
Cost per 1M tokens = (hourly cost × 1,000,000) ÷ (tokens/sec × 3600). DGX Spark hourly cost: $4000 ÷ 26,280 hours (3yr @ 30% utilization) + electricity ≈ $0.07/hour
Quality (IFEval):
Instruction-following accuracy measured on 541 prompts from the IFEval benchmark. Higher = better at following user instructions.
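The Input/Output Cost formula above can be sketched in a few lines. This is a minimal illustration, not the page's actual pipeline: the function name `cost_per_million_tokens` is hypothetical, the hourly cost is the ≈ $0.07/hour figure quoted above, and the 100 tok/s throughput is an arbitrary example value rather than a measured result.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Dollars per 1M tokens: (hourly cost × 1,000,000) ÷ (tokens/sec × 3600)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd * 1_000_000 / tokens_per_hour


# Hourly cost quoted on the page; the throughput is a made-up example value.
HOURLY_COST_USD = 0.07
example_throughput = 100.0  # tok/s, illustrative only

example_cost = cost_per_million_tokens(HOURLY_COST_USD, example_throughput)
print(f"${example_cost:.4f} per 1M tokens at {example_throughput:.0f} tok/s")
```

Because cost is inversely proportional to throughput, doubling tokens per second halves the cost per million tokens at a fixed hourly rate.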