Omni-ID: Holistic Identity Representation Designed for Generative Tasks


Abstract

We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. Omni-ID encodes holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation. It consolidates information from a varying number of unstructured input images into a structured representation, where each entry captures certain global or local identity features. Our approach leverages a few-to-many identity reconstruction training paradigm, in which a few images of an individual serve as input to reconstruct multiple target images of the same individual in varied poses and expressions. To train the Omni-ID encoder, we use a multi-decoder framework that leverages the complementary strengths of different decoders during the representation learning phase. Unlike conventional representations, such as CLIP and ArcFace, which are typically learned through discriminative or contrastive objectives, Omni-ID is optimized with a generative objective, resulting in a more comprehensive and nuanced identity capture for generative tasks. Trained on our MFHQ dataset, a multi-view facial image collection, Omni-ID demonstrates substantial improvements over conventional representations across various generative tasks.

TL;DR: Omni-ID is a novel facial representation tailored for generative tasks, encoding identity features from unstructured images into a fixed-size representation that captures diverse expressions and poses.

Motivation

Generating images that faithfully represent an individual’s identity requires a face encoding capable of depicting nuanced details across diverse poses and facial expressions. Existing facial representations fall short in generative tasks due to (1) their reliance on single-image encodings, which fundamentally lack comprehensive information about an individual’s appearance, and (2) their optimization for discriminative tasks, which fails to preserve the subtle nuances that define a person’s unique identity, particularly across varying poses and expressions.

Comparison of face generation representations
Face generation with different facial representations, using a single input image (top row) and two input images (bottom row).

Method

We introduce a new face representation named Omni-ID, featuring an Omni-ID Encoder and a novel few-to-many identity reconstruction training paradigm with a multi-decoder objective. Designed for generative tasks, this representation aims to enable high-fidelity face generation in diverse poses and expressions, supporting a wide array of generative applications.

Omni-ID Training Strategy

Omni-ID uses a few-to-many identity reconstruction training paradigm that reconstructs not only the input images but also a diverse range of other images of the same identity in various contexts, poses, and expressions. This strategy encourages the representation to capture essential identity features observed across different conditions while mitigating overfitting to specific attributes of any single input image.
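As a concrete illustration, a training example under this paradigm could be assembled as in the sketch below: a few images of one identity are drawn as encoder inputs, and a disjoint set of images of the same identity in other poses and expressions serves as reconstruction targets. The `identity_to_images` mapping and the split sizes are illustrative assumptions, not the paper's actual data pipeline.

```python
import random

def sample_few_to_many_batch(identity_to_images, num_inputs=3, num_targets=4):
    """Sample one few-to-many training example.

    identity_to_images: dict mapping an identity id to a list of image paths
    (a hypothetical structure; the MFHQ loader may differ).
    Returns a small set of input views and a disjoint set of target views of
    the same person, so the encoder cannot simply copy a single input image.
    """
    identity = random.choice(list(identity_to_images))
    images = identity_to_images[identity]
    assert len(images) >= num_inputs + num_targets
    picks = random.sample(images, num_inputs + num_targets)
    inputs, targets = picks[:num_inputs], picks[num_inputs:]
    return inputs, targets
```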

Omni-ID employs a multi-decoder training objective that combines the unique strengths of different decoders, such as improved fidelity or reduced identity leakage, while mitigating the limitations of any single decoder. This allows the encoder to exploit the detailed facial information present in the input images as fully as possible, resulting in a more robust encoding that generalizes effectively across various generative applications.

Few-to-many identity reconstruction
Omni-ID employs a multi-decoder few-to-many identity reconstruction training strategy.
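To make the multi-decoder objective concrete, the following sketch sums weighted reconstruction losses from several decoders that all condition on one shared identity representation. The `reconstruction_loss` method and the loss weights are hypothetical placeholders; the actual decoders and objectives follow the paper.

```python
import torch

def multi_decoder_loss(encoder, decoders, inputs, targets, weights=None):
    """Sketch of a multi-decoder training step (not the paper's exact losses).

    encoder : maps a set of input images to a fixed-size identity representation.
    decoders: list of generative decoders; each is assumed to expose a
              reconstruction_loss(identity_repr, targets) returning a scalar.
    """
    weights = weights or [1.0] * len(decoders)
    identity_repr = encoder(inputs)  # fixed-size Omni-ID representation
    total = torch.zeros((), device=identity_repr.device)
    for w, decoder in zip(weights, decoders):
        # Each decoder contributes its own reconstruction objective; their
        # weighted sum trains the single shared encoder.
        total = total + w * decoder.reconstruction_loss(identity_repr, targets)
    return total
```

Because the decoders have complementary strengths (e.g., fidelity versus identity leakage), their combined gradient signal keeps the encoder from inheriting any single decoder's failure modes.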

Omni-ID Encoder

The Omni-ID Encoder receives a set of images of an individual and projects them into keys and values for cross-attention layers. These layers let a set of learnable, semantic-aware queries attend to the image features, allowing the encoder to capture shared identity features across images. Self-attention layers further refine these interactions, producing a holistic representation.

Omni-ID encoder architecture
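Below is a minimal PyTorch sketch of the described design, assuming pre-extracted patch features for each input image; the layer counts, dimensions, and projection details are illustrative and may differ from the actual Omni-ID Encoder.

```python
import torch
import torch.nn as nn

class OmniIDEncoderSketch(nn.Module):
    """Minimal sketch: learnable queries attend to features from a variable
    number of input images (hyperparameters are illustrative)."""

    def __init__(self, num_queries=64, dim=768, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable, semantic-aware queries form the fixed-size output.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, image_tokens):
        # image_tokens: (batch, num_images * tokens_per_image, dim), e.g. patch
        # features from an image backbone, concatenated over the input images.
        x = self.queries.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        for ca, sa in zip(self.cross_attn, self.self_attn):
            # Queries gather shared identity features from all input images...
            x = x + ca(x, image_tokens, image_tokens)[0]
            # ...and self-attention refines interactions among the queries.
            x = x + sa(x, x, x)[0]
        return x  # fixed-size holistic identity representation
```

Stacking cross-attention and self-attention this way lets each learnable query specialize in a particular global or local identity feature, matching the structured representation described in the abstract.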

Experiments

Personalized Text-to-Image Generation (Representation Comparisons)

Our Omni-ID achieves better ID preservation than CLIP for both single and multiple input images.

Qualitative comparisons with different representations in personalized T2I generation. We show results of the same IP-Adapter trained with different representations.
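For context, a simplified sketch of how a fixed-size identity representation such as Omni-ID could be injected through an IP-Adapter-style decoupled cross-attention branch is shown below; the module names, dimensions, and scaling are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttentionSketch(nn.Module):
    """Sketch of IP-Adapter-style decoupled cross-attention conditioned on
    Omni-ID tokens (hypothetical dimensions and scale)."""

    def __init__(self, dim=768, num_heads=8, id_scale=1.0):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.id_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.id_scale = id_scale

    def forward(self, hidden_states, text_tokens, omni_id_tokens):
        # Standard text cross-attention plus an extra branch attending to the
        # fixed-size identity representation, added with a tunable scale.
        text_out = self.text_attn(hidden_states, text_tokens, text_tokens)[0]
        id_out = self.id_attn(hidden_states, omni_id_tokens, omni_id_tokens)[0]
        return hidden_states + text_out + self.id_scale * id_out
```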

Personalized Text-to-Image Generation (SOTA Comparisons)

Our Omni-ID achieves better ID preservation than other representations such as ArcFace and CLIP. An IP-Adapter using Omni-ID as the representation, without any additional regularization, outperforms the state-of-the-art PuLID.

Qualitative comparisons with the state-of-the-art in personalized T2I generation using FLUX dev as the base model. Our IP-Adapter with Omni-ID, without any additional regularization (LoRA, ID loss, alignment loss), achieves the highest ID preservation. Omni-ID also works well with the FLUX Schnell model, which generates each sample in 4 denoising steps.

Personalized Text-to-Image Generation (SD Base Models)

Omni-ID can be used with any diffusion model. Here we show that Omni-ID outperforms other representations and state-of-the-art personalization techniques when using a UNet-based model (Stable Diffusion) as the base model.

Qualitative comparisons to the state-of-the-art in personalized T2I generation using Stable Diffusion as the base model. Our IPA-Omni-ID, using SD 1.5 as the base model, outperforms other representations and state-of-the-art personalization techniques.

Controllable Face Generation

Our Omni-ID achieves superior identity preservation, captures nuanced details more faithfully, and adapts more readily to diverse poses and expressions.

Qualitative comparisons to state-of-the-art representations in controllable face generation. We compare Omni-ID with ArcFace and CLIP, each given 5 input images.

Citation

@article{qian2024omniid,
  title   = {Omni-ID: Holistic Identity Representation Designed for Generative Tasks},
  author  = {Guocheng Qian and Kuan-Chieh Wang and Or Patashnik and Negin Heravi and Daniil Ostashev and Sergey Tulyakov and Daniel Cohen-Or and Kfir Aberman},
  journal = {arXiv preprint},
  year    = {2024},
}