Abstract
The increasing popularity of long Text-to-Image (T2I) generation has created an urgent need for automatic and interpretable models that can evaluate image-text alignment in long-prompt scenarios. However, existing T2I alignment benchmarks predominantly focus on short-prompt scenarios and only provide MOS or Likert-scale annotations. This inherent limitation hinders the development of long T2I evaluators, particularly in terms of the interpretability of alignment. In this study, we contribute LongT2IBench, which comprises 14K long text-image pairs accompanied by graph-structured human annotations. Given the detail-intensive nature of long prompts, we first design a Generate-Refine-Qualify annotation protocol to convert them into textual graph structures that encompass entities, attributes, and relations. Through this transformation, fine-grained alignment annotations are achieved based on these granular elements. Finally, the graph-structured annotations are converted into alignment scores and interpretations to facilitate the design of T2I evaluation models. Based on LongT2IBench, we further propose LongT2IExpert, a long T2I evaluator that enables multi-modal large language models (MLLMs) to provide both quantitative scores and structured interpretations through an instruction-tuning process with a Hierarchical Alignment Chain-of-Thought (CoT). Extensive experiments and comparisons demonstrate the superiority of the proposed LongT2IExpert in alignment evaluation and interpretation.
LongPrompt-3K
LongPrompt-3K. We collect long prompts from human-generated content (Human-Gen), AI-generated content (AI-Gen), and long image captions (Img-Cap), while controlling the length distribution across different word-count intervals. Given the detail-intensive nature of these prompts, we apply Textual Graph Transformation to the collected data. This establishes the foundation for more objective and accurate text-image alignment annotations.
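The Textual Graph Transformation decomposes a long prompt into entities, attributes, and relations. A minimal sketch of such a graph, assuming a simple schema with attribute lists per entity and subject-predicate-object relations (the class and field names below are illustrative, not the paper's actual data format):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a textual graph: entities carry attributes,
# and relations link entity pairs. Names are assumptions for illustration.

@dataclass
class Entity:
    name: str
    attributes: list[str] = field(default_factory=list)

@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str

@dataclass
class TextualGraph:
    entities: list[Entity] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

# Example fragment of a long prompt:
# "a red vintage car parked beside a tall oak tree"
graph = TextualGraph(
    entities=[
        Entity("car", ["red", "vintage"]),
        Entity("tree", ["tall", "oak"]),
    ],
    relations=[Relation("car", "parked beside", "tree")],
)

print(len(graph.entities), len(graph.relations))  # → 2 1
```

Each entity, attribute, and relation in the graph then becomes an individually checkable alignment element during annotation.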
LongT2IBench-14K
LongT2IBench-14K. The pipeline consists of three stages: (a) Data Preparation. Long prompts are collected from three sources and fed into various T2I models to generate images. (b) Data Annotation. Long prompts are converted into textual graph structures, and fine-grained image-textual graph alignment annotations are obtained. (c) Label Generation. Two categories of labels, quantitative alignment scores and alignment interpretations, are produced from the graph-structured human annotations.
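The Label Generation stage folds the per-element verdicts (each entity, attribute, and relation marked aligned or misaligned) into a quantitative score. The simple aligned-element ratio below is an assumption for illustration; the benchmark's actual scoring formula may differ:

```python
# Hypothetical label-generation step: aggregate per-element alignment
# verdicts into a single score. The ratio used here is an assumption,
# not necessarily the paper's formula.

def alignment_score(verdicts: dict[str, list[bool]]) -> float:
    """verdicts maps each element type to per-element aligned flags."""
    flags = [v for elems in verdicts.values() for v in elems]
    return sum(flags) / len(flags) if flags else 0.0

annotation = {
    "entity":    [True, True, False],   # 2 of 3 entities rendered
    "attribute": [True, False],         # one attribute matched, one missed
    "relation":  [True],                # spatial relation preserved
}
print(round(alignment_score(annotation), 3))  # → 0.667
```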
Visualization of Entity Alignment, Attribute Alignment, and Relation Alignment in a Long Prompt-Image Pair.
LongT2IExpert
Overall pipeline of the proposed LongT2IExpert. A Hierarchical Alignment Chain-of-Thought (CoT) is designed to instruct MLLMs in structured alignment reasoning. Numerical alignment scores and graph-structured interpretations are used to train MLLMs for alignment scoring and interpretation in a multi-task manner.
Performance comparison between the proposed LongT2IExpert and state-of-the-art methods for Long T2I Alignment Scoring. Here, '*' denotes the fine-tuned versions of the corresponding methods using the training data from LongT2IBench. The SRCC and PLCC across five word-count intervals are reported. Best results in bold, second-best underlined.
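The two reported metrics are standard correlation measures between predicted and human alignment scores: SRCC (Spearman rank-order correlation) captures monotonic agreement, while PLCC (Pearson linear correlation) captures linear agreement. A minimal sketch on toy data (the score values are invented for illustration):

```python
# SRCC and PLCC between model predictions and human scores,
# computed with scipy on toy data.
from scipy.stats import pearsonr, spearmanr

human = [0.9, 0.4, 0.7, 0.2, 0.6]   # illustrative human alignment scores
model = [0.85, 0.5, 0.45, 0.3, 0.55]  # illustrative model predictions

srcc, _ = spearmanr(human, model)  # rank-order agreement
plcc, _ = pearsonr(human, model)   # linear agreement
print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}")
```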
Comparison between the proposed LongT2IExpert and advanced multimodal large language models (MLLMs) for Long T2I Alignment Interpreting. Aligned and Misaligned accuracies across entity, attribute, and relation elements are reported.
BibTeX
@misc{yang2025longt2ibenchbenchmarkevaluatinglong,
title={LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations},
author={Zhichao Yang and Tianjiao Gu and Jianjie Wang and Feiyu Lin and Xiangfei Sheng and Pengfei Chen and Leida Li},
year={2025},
eprint={2512.09271},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.09271},
}