What is IFEval?
IFEval (Instruction Following Evaluation) is a benchmark that measures how well language models
follow explicit, verifiable instructions embedded in prompts. It tests models on a range of instruction types, including
formatting requirements, length constraints, keyword usage, and structural patterns.
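Each of these instruction types is mechanically checkable, which is what makes IFEval objective: a response either satisfies the constraint or it does not. As an illustration, here is a minimal Python sketch of checkers in that spirit (the function names and logic are hypothetical, not IFEval's actual implementation):

```python
import json
import re

def check_min_words(response: str, n: int) -> bool:
    # Length constraint: the response must contain at least n words.
    return len(response.split()) >= n

def check_keyword(response: str, keyword: str) -> bool:
    # Keyword usage: the keyword must appear, case-insensitively.
    return re.search(re.escape(keyword), response, re.IGNORECASE) is not None

def check_valid_json(response: str) -> bool:
    # Structural pattern: the entire response must parse as JSON.
    try:
        json.loads(response)
        return True
    except ValueError:
        return False
```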
Evaluation Metrics
- Prompt-level Accuracy: Percentage of prompts where all instructions were followed correctly
- Instruction-level Accuracy: Percentage of individual instructions followed correctly, pooled across all prompts (see the sketch after this list)
- Number of Samples: Total number of test prompts evaluated (50 for the quick test, 541 for the full evaluation)
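To make the distinction between the two accuracies concrete, here is a minimal sketch of how they could be computed from per-instruction pass/fail results (the `PromptResult` structure is illustrative, not the benchmark's actual code):

```python
from dataclasses import dataclass

@dataclass
class PromptResult:
    # One evaluated prompt: a pass/fail flag per instruction it contains.
    instruction_passes: list[bool]

def score(results: list[PromptResult]) -> tuple[float, float]:
    """Compute (prompt-level accuracy, instruction-level accuracy)."""
    # Prompt-level: a prompt counts only if *every* instruction passed.
    prompt_acc = sum(all(r.instruction_passes) for r in results) / len(results)
    # Instruction-level: pool every instruction across all prompts.
    flat = [p for r in results for p in r.instruction_passes]
    instr_acc = sum(flat) / len(flat)
    return prompt_acc, instr_acc

# Example: two prompts, four instructions total; one instruction fails.
results = [PromptResult([True, True]), PromptResult([True, False])]
print(score(results))  # (0.5, 0.75) -> 50% prompt-level, 75% instruction-level
```

Note how one failed instruction zeroes out its entire prompt at the prompt level, which is why prompt-level accuracy is always the stricter of the two numbers.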
Test Configuration
- Framework: IFEval benchmark suite
- Model Server: vLLM on NVIDIA DGX Spark (see the query sketch below)
- Quick Test: 50 samples (~5 minutes)
- Full Test: 541 samples (~30 minutes)
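Both test sizes exercise the same loop: send each prompt to the vLLM endpoint, then run the instruction checkers on the reply. A rough per-sample sketch, assuming vLLM's OpenAI-compatible server on its default local port and a placeholder model name:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the base URL and model name
# below are assumptions for a local DGX Spark setup, not fixed values.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompt = "Describe NVIDIA DGX Spark in exactly three sentences."
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # greedy decoding keeps the check deterministic
)
answer = resp.choices[0].message.content or ""

# Verify the single instruction: the reply must be exactly three sentences.
# (Naive sentence split; real IFEval checkers are more careful.)
sentences = [s for s in answer.replace("!", ".").replace("?", ".").split(".") if s.strip()]
print("followed instruction:", len(sentences) == 3)
```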