Abstract

The rapid advancement in image generation models has predominantly been driven by diffusion models, which have demonstrated unparalleled success in generating high-fidelity, diverse images from textual prompts. Despite their success, diffusion models encounter substantial challenges in the domain of image editing, particularly in executing disentangled edits—changes that target specific attributes of an image while leaving irrelevant parts untouched. In contrast, Generative Adversarial Networks (GANs) have been recognized for their success in disentangled edits through their interpretable latent spaces. We introduce GANTASTIC, a novel framework that takes existing directions from pre-trained GAN models—representative of specific, controllable attributes—and transfers these directions into diffusion-based models. This novel approach not only maintains the generative quality and diversity that diffusion models are known for but also significantly enhances their capability to perform precise, targeted image edits, thereby leveraging the best of both worlds.

Method

After generating a set of N images with StyleGAN, denoted G(s), and their edited versions, denoted G(s + Δs), our framework learns a latent direction $d$ for the pre-trained diffusion model that reproduces the edit introduced by Δs (e.g. adding a beard). To learn this direction, we use both the diffusion model's denoising network and the CLIP image encoder.
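
As a rough illustration of how two such signals could be combined, the sketch below adds a noise-prediction error to an alignment term that compares $d$ with the average embedding shift between the G(s) and G(s + Δs) pairs. This is a toy under stated assumptions, not the objective used by GANTASTIC: the function names, the weight `lam`, the cosine-based alignment term, and the assumption that $d$ lives in the same space as the image embeddings are all hypothetical, and random arrays stand in for real denoiser outputs and CLIP embeddings.

```python
import numpy as np

def denoising_loss(eps_pred, eps_true):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps_true) ** 2))

def alignment_loss(d, emb_src, emb_edit):
    """1 - cosine similarity between d and the mean embedding shift of the pairs."""
    shift = (emb_edit - emb_src).mean(axis=0)
    cos = shift @ d / (np.linalg.norm(shift) * np.linalg.norm(d) + 1e-8)
    return float(1.0 - cos)

def total_loss(eps_pred, eps_true, d, emb_src, emb_edit, lam=1.0):
    """Hypothetical combined objective: denoising term plus weighted alignment term."""
    return denoising_loss(eps_pred, eps_true) + lam * alignment_loss(d, emb_src, emb_edit)

rng = np.random.default_rng(0)
N, dim = 8, 16                                  # toy sizes, not the real ones
emb_src = rng.normal(size=(N, dim))             # stand-in for embeddings of G(s)
true_shift = rng.normal(size=dim)               # stand-in for the effect of Δs
emb_edit = emb_src + true_shift                 # stand-in for embeddings of G(s + Δs)
eps_true = rng.normal(size=(N, 4, 8, 8))        # stand-in for sampled noise
eps_pred = eps_true + 0.1 * rng.normal(size=eps_true.shape)  # imperfect prediction

aligned = total_loss(eps_pred, eps_true, true_shift, emb_src, emb_edit)
opposed = total_loss(eps_pred, eps_true, -true_shift, emb_src, emb_edit)
print(aligned < opposed)  # a direction matching the edit scores a lower loss
```

In this toy setting, a candidate direction that matches the shift between the paired embeddings is penalized less than one pointing the opposite way, which is the behavior a learned $d$ would need in order to reflect the edit.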

Directions Transferred by GANTASTIC

GANTASTIC successfully transfers editing directions that modify the overall look, such as race or age, as well as finer edits that target specific facial attributes, such as eyeglasses or a beard. It can also distinguish among several edits of the same feature, which underlines the versatility of our approach and gives users a wide selection of editing options for individual characteristics, such as multiple smile designs or styles of baldness.

Capabilities of GANTASTIC

The proposed framework can successfully learn latent directions from a variety of domains, including human faces and dog images. Additionally, GANTASTIC enables users to adjust the intensity of the editing effect through a scaling parameter. This gives users the flexibility to either tone down or intensify the impact of a given editing direction. For instance, in the case of the gender edit, users can lessen the effect for a more masculine appearance or enhance it for a more feminine look by applying a negative or positive scale, respectively.
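
The effect of the scaling parameter can be sketched as a signed step along the learned direction. The example below is a minimal illustration, assuming the direction is simply added to a conditioning embedding. The function name `apply_direction`, the embedding shape, and the additive form are assumptions made for illustration and are not confirmed by the description above.

```python
import numpy as np

def apply_direction(cond, d, scale):
    """Shift a conditioning embedding along direction d by a signed scale."""
    return cond + scale * d

rng = np.random.default_rng(0)
cond = rng.normal(size=(77, 768))   # hypothetical conditioning embedding (tokens x dim)
d = rng.normal(size=768)            # stand-in for a learned direction
d /= np.linalg.norm(d)              # unit norm, so scale equals the step length

for scale in (-2.0, 0.0, 2.0):
    edited = apply_direction(cond, d, scale)
    # Projection of the change onto d: negative scales move against the
    # direction, positive scales move along it, zero leaves cond unchanged.
    step = float(((edited - cond) @ d).mean())
    print(f"scale={scale:+.1f} -> mean step along d = {step:+.2f}")
```

Under this additive assumption, a scale of zero reproduces the unedited input, and the sign of the scale selects which way the attribute moves.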

Comparisons with StyleGAN edits

We demonstrate the beard, gender, race and baldness edits above, along with reference images from the training datasets constructed using StyleGAN. The StyleGAN-generated images are shown as G(s) and their edited counterparts as G(s + Δs). As our qualitative results show, the edits learned by GANTASTIC successfully translate the disentangled directions from StyleGAN to Stable Diffusion.

Editing Complex Scenes

We demonstrate edits performed by the GANTASTIC framework on full-body images (Row 1) and on images with multiple faces (Row 2). In the multi-face example (Row 2), editing the gender attribute amplifies the feminine traits in both faces.

BibTeX


@misc{dalva2024gantastic,
  title={GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models}, 
  author={Yusuf Dalva and Hidir Yesiltepe and Pinar Yanardag},
  year={2024},
  eprint={2403.19645},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
