Scientific Image Reasoning Benchmark

SciIR-Bench Leaderboard

SciIR-Bench evaluates scientific image generation across intrinsic reasoning and instruction following. The benchmark focuses on scientific laws, entity structures, scientific processes, and strict text rendering, using checklist-based Yes/No judgments to report accuracy scores.

Benchmark

SciIR-Bench is designed to evaluate whether image generation models can produce scientifically faithful figures, not just visually plausible illustrations. It decomposes scientific image generation into Scientific Law, Entity Structure, Scientific Process, and Text Rendering, then scores each sample with checklist-based judgments for intrinsic reasoning and instruction following.
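As a rough illustration of checklist-based scoring, the sketch below turns Yes/No judgments into an accuracy percentage. It assumes that accuracy is the share of "Yes" answers among a sample's checklist items, computed separately for intrinsic reasoning and instruction following. The function name `checklist_accuracy` and the input format are hypothetical, and the benchmark's actual aggregation may differ.

```python
from typing import Iterable


def checklist_accuracy(judgments: Iterable[str]) -> float:
    """Return the percentage of checklist items judged "Yes".

    Assumption: each judgment is the string "Yes" or "No" (case-insensitive).
    """
    answers = [j.strip().lower() for j in judgments]
    if not answers:
        raise ValueError("checklist must contain at least one judgment")
    unknown = [a for a in answers if a not in ("yes", "no")]
    if unknown:
        raise ValueError(f"unexpected judgment values: {unknown}")
    return 100.0 * sum(a == "yes" for a in answers) / len(answers)


# Hypothetical judgments for one generated figure.
ir_score = checklist_accuracy(["Yes", "No", "Yes", "Yes"])  # intrinsic reasoning
if_score = checklist_accuracy(["Yes", "Yes"])               # instruction following
print(ir_score, if_score)
```

Rejecting anything other than "Yes" or "No" keeps malformed judge output from silently counting as a failure.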

IR

Intrinsic Reasoning

Measures whether generated scientific figures preserve scientific constraints, structures, and causal logic.

IF

Instruction Following

Measures whether the image follows prompt-level visual, layout, label, and content requirements.

Avg

Track Average

Reports the average score for each track, grouped into Scientific Law (SL), Entity Structure (ES), Scientific Process (SP), and Text Rendering (Text) columns.

Final

Final Score

Ranks models by the overall accuracy score reported for the benchmark.

Task Classification

Each scientific sample can activate one or more scientific dimensions; text rendering is scored as a strict auxiliary track.
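Because one sample may count toward several dimensions, per-track aggregation could look like the sketch below. It assumes each sample result lists its active tracks together with IR and IF accuracy, and that a track's Avg. is the unweighted mean of its IR and IF scores. The record layout, the function name `track_scores`, and the numbers are hypothetical, not taken from the benchmark.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-sample results: active tracks plus IR/IF accuracy (%).
results = [
    {"tracks": ["SL", "SP"], "ir": 80.0, "if": 100.0},
    {"tracks": ["ES"], "ir": 50.0, "if": 75.0},
    {"tracks": ["SL"], "ir": 60.0, "if": 50.0},
]


def track_scores(samples):
    """Aggregate IR, IF, and their mean (Avg) for every active track."""
    by_track = defaultdict(lambda: {"ir": [], "if": []})
    for sample in samples:
        # A sample contributes to every track it activates.
        for track in sample["tracks"]:
            by_track[track]["ir"].append(sample["ir"])
            by_track[track]["if"].append(sample["if"])
    table = {}
    for track, scores in by_track.items():
        ir, if_ = mean(scores["ir"]), mean(scores["if"])
        # Assumption: Avg is the plain mean of the track's IR and IF scores.
        table[track] = {"IR": ir, "IF": if_, "Avg": (ir + if_) / 2}
    return table


scores = track_scores(results)
for track in sorted(scores):
    print(track, scores[track])
```

In this sketch the first sample contributes to both SL and SP, so its scores are counted once in each of those tracks.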

Model Leaderboard

All scores are accuracy percentages, reported for Intrinsic Reasoning (IR), Instruction Following (IF), the per-track average (Avg.), and the final score.

Rank | Model | Type | SL: IR, IF, Avg. (%) | ES: IR, IF, Avg. (%) | SP: IR, IF, Avg. (%) | Text: IR, IF, Avg. (%) | Final (%)

Dataset Samples

The released dataset can use the same schema as the evaluator: prompt text, active reasoning tracks, required rendered text, and checklist questions. The cards below show examples of the expected item shape.
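To make the item shape concrete, here is a minimal sketch of one record with the four parts listed above. All field names (`prompt`, `active_tracks`, `rendered_text`, `checklist`), the class `BenchmarkItem`, and the example content are illustrative assumptions, not the released schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

# Reasoning-track abbreviations as used in the leaderboard columns.
REASONING_TRACKS = {"SL", "ES", "SP"}


@dataclass
class BenchmarkItem:
    """Hypothetical shape of one dataset item."""

    prompt: str
    active_tracks: List[str]
    rendered_text: List[str] = field(default_factory=list)
    # Checklist questions, keyed by "IR" and "IF".
    checklist: Dict[str, List[str]] = field(default_factory=dict)

    def __post_init__(self):
        unknown = set(self.active_tracks) - REASONING_TRACKS
        if unknown:
            raise ValueError(f"unknown tracks: {sorted(unknown)}")


# Made-up example item, for illustration only.
item = BenchmarkItem(
    prompt="Draw a ray of light refracting as it passes from air into water.",
    active_tracks=["SL"],
    rendered_text=["Angle of incidence"],
    checklist={
        "IR": ["Does the ray bend toward the normal inside the water?"],
        "IF": ["Is the label 'Angle of incidence' rendered exactly?"],
    },
)
print(json.dumps(asdict(item), indent=2))
```

Validating track names at construction time is one way to catch mislabeled items before they reach the evaluator.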