RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning

ICML 2025

1The Hong Kong University of Science and Technology (Guangzhou), 2Guangxi Zhuang Autonomous Region Big Data Research Institute, 3Shanghai Jiao Tong University, 4The Hong Kong University of Science and Technology
Corresponding Author
ylyu650@connect.hkust-gz.edu.cn

Abstract

Recent text-to-image generative models, e.g., Stable Diffusion V3 and Flux, have achieved notable progress. However, these models are limited to the knowledge stored in their own fixed parameters, which are trained on closed datasets. This leads to significant hallucinations or distortions when they face fine-grained and unseen novel real-world objects, e.g., the appearance of the Tesla Cybertruck.

To this end, we present RealRAG, the first real-object-based retrieval-augmented generation framework, which augments fine-grained and unseen novel object generation by learning and retrieving real-world images to overcome the knowledge gaps of generative models. Specifically, to integrate missing memory for unseen novel object generation, we train a reflective retriever with self-reflective contrastive learning, which injects the generator's knowledge into the self-reflective negatives, ensuring that the retrieved images compensate for the model's missing knowledge. Furthermore, the real-object-based framework supplies fine-grained visual knowledge to the generative models, tackling the distortion problem and improving realism in fine-grained object generation.

Our RealRAG is modular: it can be applied to all types of state-of-the-art text-to-image generative models and delivers remarkable performance boosts with all of them, such as a 16.18% gain in FID score with the auto-regressive model on the Stanford Car benchmark.

Method

RealRAG Framework

An overview of our RealRAG. (a) The pipeline of the real-object-based RAG. We propose the first real-object-based retrieval-augmented generation framework, which leverages real-world images to compensate for the knowledge gap inherent in generative models and to augment realistic image generation. (b) The framework of self-reflective contrastive learning, which injects the generator's knowledge into the self-reflective negatives, ensuring that the retrieved images compensate for the model's missing knowledge.
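
The two parts of the overview can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes prompts and images are already embedded as small vectors, uses a generic InfoNCE-style objective as a stand-in for the self-reflective contrastive loss, and treats the generator's own outputs as the negatives. The function names (cosine, contrastive_loss, retrieve), the temperature value, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_loss(text_emb, real_emb, generated_embs, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact form).

    Positive: embedding of a real image matching the prompt.
    Negatives: embeddings of images the generator itself produced,
    standing in for the "self-reflective negatives".
    """
    pos = cosine(text_emb, real_emb) / temperature
    negs = [cosine(text_emb, g) / temperature for g in generated_embs]
    logits = [pos] + negs
    # Numerically stable log-sum-exp over positive and negative logits.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos

def retrieve(text_emb, database, k=1):
    """Return the names of the k real images most similar to the prompt."""
    ranked = sorted(
        database,
        key=lambda name: cosine(text_emb, database[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-D embeddings, purely illustrative.
prompt = [1.0, 0.0]
real_image = [0.9, 0.1]
generated = [[0.0, 1.0], [0.1, 0.9]]

# The loss is small when the prompt sits close to the real image and far
# from the generator's outputs, and large when the roles are swapped.
aligned_loss = contrastive_loss(prompt, real_image, generated)
swapped_loss = contrastive_loss(prompt, generated[0], [real_image])

# Retrieval over a hypothetical database of real-image embeddings.
database = {"cybertruck.jpg": [0.9, 0.1], "sedan.jpg": [0.0, 1.0]}
top_match = retrieve(prompt, database, k=1)
```

In this sketch, training with such a loss would push the retriever to rank real images above whatever the generator already produces, so the top-ranked image carries information the generator lacks. The retrieved image would then be passed to the generator as additional conditioning, a step that is not shown here.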

Fine-grained Object Generation

RealRAG fine-grained object generation results

The visual results of fine-grained object generation. We visually compare the images generated by the original generators and our RealRAG. We also add real-world images for reference.

Unseen Novel Object Generation

RealRAG unseen novel object generation results

The visual results of unseen novel object generation.