What is IFEval?
IFEval (Instruction Following Evaluation) is a benchmark that measures how well language models
follow explicit, verifiable instructions embedded in prompts. It tests models on a range of instruction types, including
formatting requirements, length constraints, keyword usage, and structural patterns.
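Each of these instruction types is mechanically checkable, which is what makes IFEval objective: a response either satisfies the constraint or it does not. As an illustration, here is a minimal Python sketch of checkers in that spirit (the function names and logic are hypothetical, not IFEval's actual implementation):

```python
import json
import re

def check_min_words(response: str, n: int) -> bool:
    # Length constraint: the response must contain at least n words.
    return len(response.split()) >= n

def check_keyword(response: str, keyword: str) -> bool:
    # Keyword usage: the keyword must appear, case-insensitively.
    return re.search(re.escape(keyword), response, re.IGNORECASE) is not None

def check_valid_json(response: str) -> bool:
    # Structural pattern: the entire response must parse as JSON.
    try:
        json.loads(response)
        return True
    except ValueError:
        return False
```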
Evaluation Metrics
- Prompt-level Accuracy: Percentage of prompts where all instructions were followed correctly
- Instruction-level Accuracy: Percentage of individual instructions followed correctly, pooled across all prompts (see the sketch after this list)
- Number of Samples: Total number of test prompts evaluated (50 for the quick test, 541 for the full evaluation)
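To make the distinction between the two accuracies concrete, here is a minimal sketch of how they could be computed from per-instruction pass/fail results (the `PromptResult` structure is illustrative, not the benchmark's actual code):

```python
from dataclasses import dataclass

@dataclass
class PromptResult:
    # One evaluated prompt: a pass/fail flag per instruction it contains.
    instruction_passes: list[bool]

def score(results: list[PromptResult]) -> tuple[float, float]:
    """Compute (prompt-level accuracy, instruction-level accuracy)."""
    # Prompt-level: a prompt counts only if *every* instruction passed.
    prompt_acc = sum(all(r.instruction_passes) for r in results) / len(results)
    # Instruction-level: pool every instruction across all prompts.
    flat = [p for r in results for p in r.instruction_passes]
    instr_acc = sum(flat) / len(flat)
    return prompt_acc, instr_acc

# Example: two prompts, four instructions total; one instruction fails.
results = [PromptResult([True, True]), PromptResult([True, False])]
print(score(results))  # (0.5, 0.75) -> 50% prompt-level, 75% instruction-level
```

Note how one failed instruction zeroes out its entire prompt at the prompt level, which is why prompt-level accuracy is always the stricter of the two numbers.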
Test Configuration
- Framework: IFEval benchmark suite
- Model Server: vLLM on NVIDIA DGX Spark (see the query sketch below)
- Quick Test: 50 samples (~5 minutes)
- Full Test: 541 samples (~30 minutes)
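Both test sizes exercise the same loop: send each prompt to the vLLM endpoint, then run the instruction checkers on the reply. A rough per-sample sketch, assuming vLLM's OpenAI-compatible server on its default local port and a placeholder model name:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the base URL and model name
# below are assumptions for a local DGX Spark setup, not fixed values.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompt = "Describe NVIDIA DGX Spark in exactly three sentences."
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # greedy decoding keeps the check deterministic
)
answer = resp.choices[0].message.content or ""

# Verify the single instruction: the reply must be exactly three sentences.
# (Naive sentence split; real IFEval checkers are more careful.)
sentences = [s for s in answer.replace("!", ".").replace("?", ".").split(".") if s.strip()]
print("followed instruction:", len(sentences) == 3)
```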