
Benchmark Results

vLLM Performance on NVIDIA DGX Spark

Auto-updated via CI/CD

Methodology

Hardware

  • Platform: NVIDIA DGX Spark
  • GPU: NVIDIA GB10 Grace Blackwell (~120GB GPU memory)
  • Architecture: ARM64
  • TDP: ~150W per node

Benchmark Configuration

  • Prefill Test: 100 prompts, 3072 input tokens, 1024 output tokens (3:1 ratio)
  • Cache Test: 100 prompts with prefix repetition, caching enabled (LMCache with CPU offload)
  • Decode Test: 100 prompts, 1024 input tokens, 3072 output tokens (1:3 ratio)
  • Request Rate: 10 requests/second
  • Tool: vllm bench serve
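
As a rough illustration of the configuration above, the sketch below assembles a `vllm bench serve` command line for the prefill and decode tests. It is a sketch under assumptions, not the exact invocation used for these results: the model name is a hypothetical placeholder, the use of the synthetic `random` dataset is assumed (the page does not say which dataset was used), and the flag names should be checked against `vllm bench serve --help` for the installed vLLM version.

```python
import shlex


def bench_command(model, input_len, output_len, num_prompts=100, request_rate=10):
    """Build an argv list for one `vllm bench serve` run.

    The synthetic "random" dataset is an assumption; it is not stated
    in the benchmark configuration.
    """
    return [
        "vllm", "bench", "serve",
        "--model", model,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--request-rate", str(request_rate),
    ]


# Hypothetical placeholder; substitute the model actually being served.
MODEL = "placeholder-org/placeholder-model"

# Prefill test: 3072 input tokens, 1024 output tokens (3:1 ratio).
prefill_cmd = bench_command(MODEL, input_len=3072, output_len=1024)
# Decode test: 1024 input tokens, 3072 output tokens (1:3 ratio).
decode_cmd = bench_command(MODEL, input_len=1024, output_len=3072)

print(shlex.join(prefill_cmd))
print(shlex.join(decode_cmd))
```

The commands are only printed here; passing a list to `subprocess.run` would execute one against a running vLLM server.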

Metrics

  • Output Speed (tok/s): Decode throughput, measured as total output tokens generated per second. Higher = faster responses for users.
  • Input/Output Cost ($ per 1M tokens): (DGX hourly cost × 1,000,000) ÷ (tokens/sec × 3600). DGX Spark hourly cost: $4,000 ÷ 26,280 hours (3 years @ 30% utilization) + electricity ≈ $0.07/hour.
  • Quality (IFEval): Instruction-following accuracy measured on 541 prompts from the IFEval benchmark. Higher = better at following user instructions.
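
The Output Speed and Cost formulas above can be sketched in a few lines of Python. The function names are illustrative, and the 200-second run duration is a hypothetical figure chosen for the example, not a measured result; only the $0.07/hour rate comes from the methodology above.

```python
def output_speed(total_output_tokens: int, duration_s: float) -> float:
    """Decode throughput: total output tokens generated per second."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return total_output_tokens / duration_s


def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_sec: float) -> float:
    """(hourly cost x 1,000,000) / (tokens/sec x 3600), in USD per 1M tokens."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return hourly_cost_usd * 1_000_000 / (tokens_per_sec * 3600)


# 100 prompts x 1024 output tokens, finished in a hypothetical 200 seconds.
speed = output_speed(total_output_tokens=100 * 1024, duration_s=200.0)

# $0.07/hour is the DGX Spark hourly cost quoted above.
cost = cost_per_million_tokens(hourly_cost_usd=0.07, tokens_per_sec=speed)

print(f"Output speed: {speed:.1f} tok/s")
print(f"Cost: ${cost:.4f} per 1M output tokens")
```

Because cost is inversely proportional to throughput, doubling tokens per second at the same hourly rate halves the cost per million tokens.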