Production-ready inference endpoints powered by NVIDIA DGX Spark. OpenAI-compatible API, pay-per-token pricing.
Select a model to view pricing, try the playground, or grab API code.
Prices derived from CI benchmarks (vLLM on DGX Spark). Free tier available.
Compatible with the `openai`, `litellm`, and `langchain` client libraries.
OpenAI-compatible chat completions endpoint. Enable streaming by setting `stream: true`.
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Full model identifier |
| `messages` | array | Array of `{role, content}` objects |
| `max_tokens` | integer | Max tokens to generate (default 512) |
| `temperature` | float | Sampling temperature, 0–2 (default 0.7) |
| `top_p` | float | Nucleus sampling (default 1.0) |
| `stream` | boolean | Stream via SSE (default false) |
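Putting the parameters together, a minimal request body looks like the sketch below. The host and model name are illustrative placeholders, and `post_chat` is a hypothetical helper built on the Python standard library:

```python
import json
from urllib import request

# Request body for the chat completions endpoint, using the
# parameters from the table above. Model name is a placeholder.
body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain nucleus sampling in one sentence."},
    ],
    "max_tokens": 256,   # default 512
    "temperature": 0.7,  # 0-2, default 0.7
    "top_p": 1.0,        # default 1.0
    "stream": False,     # set True for SSE streaming
}

def post_chat(base_url: str, api_key: str) -> dict:
    # POST the body with a bearer token; base_url is a placeholder
    # for the service's real endpoint.
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With `stream: true` the response arrives as server-sent events (`data:` lines), each carrying a JSON chunk, rather than a single JSON object.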