This repository is the official implementation of IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation.
IF-RewardBench is a comprehensive meta-evaluation benchmark to assess the capability of judge models in instruction-following evaluation. It comprises 842 diverse instructions covering single-turn interaction, multi-turn interaction, and system-prompt steerability scenarios. For each instruction, our benchmark constructs a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality, enabling a listwise evaluation paradigm to assess the ranking capabilities of judge models.
The data of IF-RewardBench is provided in data/. The format of an example is as follows:
id(integer): A unique identifier for this example.response_generation_model(string): The response generation model of this example.instruction_type(string): The instruction type of this example.messages(list): The user instruction, which may contain the system prompt and conversation history.checklist(list): The constraint checklist for the user instruction.constraint_type(list): The constraint categories and constraint composition types in each constraint of the checklist.responses(list): All responses and their corresponding instruction-following judgements for each constraint.preference_graph(list): All preference relations among these responses.
We provide the codes for the judge model inference on constraint assessment and overall assessment in inference/, which are based on the vLLM framework.
Constraint Assessment
cd inference/
python constraint_assessment_inference_vllm.py --model_name <model_name> --model_path <model_path>Overall Assessment
cd inference/
python overall_assessment_inference_vllm.py --model_name <model_name> --model_path <model_path>As we have randomly shuffled the positions of candidate responses in pairwise comparison, we place this position information in inference/position_maps.json.
We provide the codes for the evaluation metrics calculation on constraint assessment and overall assessment in metrics/.
Constraint Assessment
cd metrics/
python analysis_constraint_assessment.pyOverall Assessment
cd metrics/
python analysis_overall_assessment.py@article{wen2026if,
title={IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation},
author={Wen, Bosi and Niu, Yilin and Wang, Cunxiang and Ling, Xiaoying and Zhang, Ying and Ke, Pei and Wang, Hongning and Huang, Minlie},
journal={arXiv preprint arXiv:2603.04738},
year={2026}
}
Please kindly cite our paper if this paper and the codes are helpful.