ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

¹Beijing Jiaotong University
²ByteDance Inc.

Story of the given dog and sunglasses, showing how the dog wins the Nobel Prize in Literature, and the fate of the sunglasses.

Abstract

Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, these methods tend to overfit the concepts, resulting in failure to create the concept under multiple conditions (e.g., the headphone is missing when generating "a dog wearing a headphone"). Interestingly, we notice that the base model before fine-tuning exhibits the capability to compose the base concept with other elements (e.g., "a dog wearing a headphone"), implying that the compositional ability only disappears after personalization tuning. Inspired by this observation, we present ClassDiffusion, a simple technique that leverages a semantic preservation loss to explicitly regulate the concept space when learning the new concept. Despite its simplicity, this helps avoid semantic drift when fine-tuning on the target concepts. Extensive qualitative and quantitative experiments demonstrate that the semantic preservation loss effectively improves the compositional ability of the fine-tuned models. In response to the ineffective evaluation of the CLIP-T metric, we introduce BLIP2-T, a more equitable and effective evaluation metric for this domain. We also provide an in-depth empirical study and theoretical analysis to better understand the role of the proposed loss. Lastly, we extend ClassDiffusion to personalized video generation, demonstrating its flexibility.
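As a rough illustration of how a BLIP2-T style score can be computed, the sketch below measures prompt-image alignment in BLIP-2 feature space using the LAVIS blip2_feature_extractor. The model variant, the max-pooling over query tokens, and the function names are assumptions for illustration and may differ from the actual evaluation code.

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

@torch.no_grad()
def blip2_t_score(image_path, prompt):
    # Alignment between a generated image and its prompt, measured with BLIP-2 features.
    image = vis_processors["eval"](Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    text = txt_processors["eval"](prompt)
    img_feat = model.extract_features({"image": image}, mode="image").image_embeds_proj      # (1, 32, 256)
    txt_feat = model.extract_features({"text_input": [text]}, mode="text").text_embeds_proj  # (1, L, 256)
    # Similarity of the [CLS] text feature to each of the 32 query features,
    # max-pooled over queries (the pooling choice is an assumption).
    sim = (img_feat @ txt_feat[:, 0, :].unsqueeze(-1)).squeeze()
    return sim.max().item()

# Example: blip2_t_score("generated.png", "a dog wearing a headphone")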

Single Concept Comparison


Qualitative comparison between our method and baselines with a single given concept.

Multiple Concepts Comparison


Qualitative comparison between our method and Custom Diffusion (CD) with multiple given concepts.

Personalized Video

Empirical Analysis


(a) Each dot represents the CLIP text embedding of a phrase combining an adjective with "dog" (e.g., "a cute dog"). After fine-tuning, the customized concept (the blue dot represents the concept before fine-tuning, and the red dot the one after) moves far away from the center of the "dog" distribution in the text feature space. (b) Visualization of the cross-attention maps corresponding to the "dog" token when using the prompt "a photo of a dog swimming in the pool".
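A minimal sketch of the drift measurement behind panel (a), assuming the CLIP text encoder from Hugging Face transformers; the adjective list, prompt templates, placeholder token, and helper names are illustrative assumptions, not the exact analysis script.

import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def eos_embedding(text_encoder, phrase):
    # EOS-token feature used by CLIP as the sentence-level text feature.
    tokens = tokenizer(phrase, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    return text_encoder(**tokens).pooler_output.squeeze(0)

ADJECTIVES = ["cute", "fluffy", "small", "sleepy", "happy"]  # illustrative list

def distance_to_class_center(text_encoder, concept_phrase):
    # Cosine distance from a concept phrase to the centroid of "<adjective> dog" phrases.
    center = torch.stack([eos_embedding(text_encoder, f"a {adj} dog") for adj in ADJECTIVES]).mean(dim=0)
    concept = eos_embedding(text_encoder, concept_phrase)
    return (1.0 - F.cosine_similarity(concept, center, dim=0)).item()

# Compare the same phrase under the text encoder before and after personalization tuning:
# encoder_before = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
# encoder_after  = ...  # text encoder after fine-tuning on the target concept
# drift = distance_to_class_center(encoder_after, "a photo of a sks dog") \
#       - distance_to_class_center(encoder_before, "a photo of a sks dog")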

Theoretical Analysis


During the personalization tuning process, as the distribution of dogs shrinks, the conditional distribution of dogs and headphones also shrinks. This gradually increases the difficulty of sampling from this distribution, leading to a weakening of the compositional generation capability. Our ClassDiffusion mitigates this by incorporating a semantic preservation loss (SPL) to minimize the semantic drift of the personalized concept from its superclass.

Method Overview


Overview of our proposed ClassDiffusion. Our semantic preservation loss (SPL) is calculated as the cosine distance between text features extracted from the same text transformer (using EOS tokens as text features, following CLIP) for phrases containing personalized tokens and phrases containing only the superclass.
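A minimal sketch of how the SPL described above could be computed, assuming the CLIP text encoder from Hugging Face transformers; the prompt templates, the detached class target, and the weighting factor lambda_spl are illustrative assumptions rather than the authors' exact implementation.

import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def eos_feature(prompt):
    # EOS-token feature from the text transformer (CLIP-style text feature).
    tokens = tokenizer(prompt, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    return text_encoder(**tokens).pooler_output.squeeze(0)

def semantic_preservation_loss(personalized_prompt, superclass_prompt):
    # Cosine distance between the personalized phrase and its superclass phrase.
    f_personal = eos_feature(personalized_prompt)       # e.g. "a photo of a <new1> dog"
    f_class = eos_feature(superclass_prompt).detach()   # e.g. "a photo of a dog" (fixed target; assumption)
    return 1.0 - F.cosine_similarity(f_personal, f_class, dim=0)

# Combined with the usual denoising loss during tuning (weighting is an assumption):
# loss = diffusion_loss + lambda_spl * semantic_preservation_loss(p_prompt, c_prompt)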

BibTeX

@article{huang2024classdiffusion,
  title={ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance},
  author={Huang, Jiannan and Liew, Jun Hao and Yan, Hanshu and Yin, Yuyang and Zhao, Yao and Wei, Yunchao},
  journal={arXiv preprint arXiv:2405.17532},
  year={2024}
}