Despite advancements in text-to-image (T2I) generation, prior methods often suffer from text-image misalignment problems such as relation confusion in generated images. Existing solutions involve manipulating cross-attention for better compositional understanding or integrating large language models for improved layout planning. However, the inherent alignment capabilities of T2I models remain inadequate. By reviewing the link between generative and discriminative modeling, we posit that the discriminative abilities of T2I models may reflect their text-image alignment proficiency during generation. In this light, we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter, a self-correction mechanism can leverage discriminative gradients to better align generated images with text prompts during inference. Comprehensive evaluations across three benchmark datasets, covering both in-distribution and out-of-distribution scenarios, demonstrate our method's superior generation performance. Meanwhile, it achieves state-of-the-art discriminative performance on the two tasks compared to other generative models.
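To make the adapter idea concrete, below is a minimal sketch of a discriminative head that probes frozen diffusion features for the two representative tasks mentioned above: global image-text matching and local phrase grounding. All module names, feature dimensions, and the fusion scheme are hypothetical simplifications, not the authors' exact architecture.

```python
# Minimal sketch of a discriminative adapter over frozen T2I (e.g., SD) features.
# Module names, dimensions, and fusion are illustrative assumptions only.
import torch
import torch.nn as nn

class DiscriminativeAdapter(nn.Module):
    """Probes frozen diffusion features for (i) global image-text matching
    and (ii) local phrase grounding."""

    def __init__(self, img_dim=1280, txt_dim=1024, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.match_head = nn.Linear(hidden, 1)   # global: matched vs. mismatched pair
        self.box_head = nn.Linear(hidden, 4)     # local: (x, y, w, h) per text token

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N, img_dim) intermediate U-Net features of a (noised) image
        # txt_feats: (B, L, txt_dim) text-encoder token embeddings of the prompt
        img = self.img_proj(img_feats).mean(dim=1)             # (B, hidden) pooled image
        txt_tok = self.txt_proj(txt_feats)                     # (B, L, hidden)
        fused = img * txt_tok.mean(dim=1)                      # simple fusion for the sketch
        match_logit = self.match_head(fused)                   # (B, 1) alignment score
        boxes = self.box_head(txt_tok + img.unsqueeze(1)).sigmoid()  # (B, L, 4)
        return match_logit, boxes
```

In a sketch like this, the adapter would be trained on matching and grounding objectives while the underlying T2I backbone stays frozen, so the probing results reflect what the backbone's representations already encode.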
Schematic illustration of the proposed discriminative probing and tuning (DPT) framework. We first extract semantic representations from the frozen SD and then feed them to a discriminative adapter that performs discriminative probing of SD's global matching and local grounding abilities. Afterward, we perform parameter-efficient discriminative tuning by introducing LoRA parameters. During inference, a self-correction mechanism leverages discriminative gradients to guide the denoising-based text-to-image generation.
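The inference-time self-correction described above can be pictured as classifier-guidance-style updates: at each denoising step, the latent is nudged along the gradient of a discriminative alignment score. The sketch below illustrates this idea only; `sampler_step` and `matching_score` are assumed, hypothetical callables (e.g., a DDIM-style update and the adapter's global matching head), not a real library API.

```python
# Minimal sketch of gradient-based self-correction during denoising.
# `sampler_step` and `matching_score` are hypothetical callables, not a real API.
import torch

def self_correcting_denoise(latent, timesteps, prompt_emb,
                            sampler_step, matching_score, guidance_scale=1.0):
    for t in timesteps:
        # Ordinary denoising update proposed by the sampler.
        latent = sampler_step(latent, t, prompt_emb)

        # Discriminative gradient: push the latent toward a higher
        # image-text matching score from the discriminative adapter.
        latent_in = latent.detach().requires_grad_(True)
        score = matching_score(latent_in, t, prompt_emb)   # scalar alignment score
        grad = torch.autograd.grad(score.sum(), latent_in)[0]
        latent = (latent_in + guidance_scale * grad).detach()
    return latent
```

For the parameter-efficient tuning stage, one would analogously inject LoRA weights into the otherwise frozen U-Net (e.g., via a library such as peft) and optimize them jointly with the adapter on the discriminative losses; the exact placement and rank of the LoRA modules are implementation choices not specified here.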
We compare DPT with SD-v2.1 and two baselines, Attend-and-Excite (AaE) and HN-DiffusionITM (HN-DiffITM), regarding object appearance, counting, spatial relations, semantic relations, and compositional reasoning. Categories and the corresponding keywords in the prompts are highlighted in different colors.
Qualitative results on CC-500. We compare the proposed method with SD-v1.4 and two baselines, StructureDiffusion and Attend-and-Excite (AaE), regarding object appearance and attribute characterization.
Qualitative results on ABC-6K. We compare the proposed method with SD-v1.4 and two baselines, StructureDiffusion and Attend-and-Excite (AaE), regarding color attribute characterization.
@article{qu2024discriminative,
  title={Discriminative Probing and Tuning for Text-to-Image Generation},
  author={Qu, Leigang and Wang, Wenjie and Li, Yongqi and Zhang, Hanwang and Nie, Liqiang and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2403.04321},
  year={2024}
}