Omni-ID is a novel facial representation tailored for generative tasks: it encodes identity features from a set of unstructured images into a fixed-size representation that captures an individual's appearance across diverse expressions and poses.
We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. Omni-ID encodes holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation. It consolidates information from a variable number of unstructured input images into a structured representation, where each entry represents certain global or local identity features. Our approach leverages a few-to-many identity reconstruction training paradigm, in which a few images of an individual serve as input to reconstruct multiple target images of the same individual in varied poses and expressions. To train the Omni-ID encoder, we use a multi-decoder framework that leverages the complementary strengths of different decoders during the representation learning phase. Unlike conventional representations, such as CLIP and ArcFace, which are typically learned through discriminative or contrastive objectives, Omni-ID is optimized with a generative objective, resulting in a more comprehensive and nuanced identity capture for generative tasks. Trained on our MFHQ dataset, a multi-view facial image collection, Omni-ID demonstrates substantial improvements over conventional representations across various generative tasks.
We propose a multi-face generative encoder that consolidates information from multiple unstructured input images into a structured representation, capturing both global and local identity features.
We introduce a novel few-to-many identity reconstruction training paradigm with a multi-decoder framework that leverages complementary strengths of different decoders during representation learning.
Through comprehensive evaluations, we demonstrate that Omni-ID achieves state-of-the-art identity preservation and substantial improvements over conventional representations across various generative tasks.
Generating images that faithfully represent an individual's identity requires a face encoding capable of depicting nuanced details across diverse poses and facial expressions. Existing facial representations fall short in generative tasks due to (1) their reliance on single-image encodings, which fundamentally lack comprehensive information about an individual's appearance, and (2) their optimization for discriminative tasks, which fails to preserve the subtle nuances that define a person's unique identity, particularly across varying poses and expressions.
We introduce a new face representation named Omni-ID, featuring an Omni-ID Encoder and a novel few-to-many identity reconstruction training scheme with a multi-decoder objective. Designed for generative tasks, this representation aims to enable high-fidelity face generation in diverse poses and expressions, supporting a wide array of generative applications.
Omni-ID uses a few-to-many identity reconstruction training paradigm that reconstructs not only the input images but also a diverse range of other images of the same identity in various contexts, poses, and expressions. This strategy encourages the representation to capture essential identity features observed across different conditions while mitigating overfitting to specific attributes of any single input image.
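To make the objective concrete, below is a minimal PyTorch-style sketch of one few-to-many reconstruction step. The encoder/decoder call signatures, the per-target conditioning, and the plain MSE reconstruction loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def few_to_many_step(encoder, decoder, input_imgs, target_imgs, target_cond):
    """One few-to-many reconstruction step (illustrative signatures).

    input_imgs:  (B, K, C, H, W)  a few unstructured images per identity
    target_imgs: (B, M, C, H, W)  many other images of the same identity
    target_cond: (B, M, D)        per-target conditioning (e.g. pose/expression)
    """
    z_id = encoder(input_imgs)              # (B, N_tokens, D_id) fixed-size identity code
    B, M = target_imgs.shape[:2]

    # The same identity code must explain every target view, so it cannot
    # overfit to the pose or expression of any single input image.
    z_rep = z_id.unsqueeze(1).expand(B, M, *z_id.shape[1:]).flatten(0, 1)
    recon = decoder(z_rep, target_cond.flatten(0, 1))   # (B*M, C, H, W)

    return F.mse_loss(recon, target_imgs.flatten(0, 1))
```

The key point is that a single fixed-size code is reused to reconstruct many targets; in practice the decoder would also receive context such as background and lighting for each target.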
Omni-ID employs a multi-decoder training objective that combines the unique strengths of various decoders, such as improved fidelity or reduced identity leakage, while mitigating the limitations of any single decoder. This allows the encoder to exploit the detailed facial information in the input images as fully as possible and yields a more robust encoding that generalizes effectively across various generative applications.
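A minimal sketch of how such a combined objective could be formed is given below; the decoder interfaces, the fixed loss weights, and the use of a simple reconstruction loss per decoder are assumptions made for illustration.

```python
import torch.nn.functional as F

def multi_decoder_loss(z_id, decoders, weights, target_imgs, target_cond):
    """Combine reconstruction losses from several decoders (illustrative).

    Each decoder contributes its own strengths (e.g. higher fidelity or less
    identity leakage), so the encoder is not shaped by any single decoder's
    failure modes.
    """
    total = 0.0
    for dec, w in zip(decoders, weights):
        recon = dec(z_id, target_cond)                    # each decoder reconstructs the targets
        total = total + w * F.mse_loss(recon, target_imgs)
    return total
```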
The Omni-ID Encoder receives a set of images of an individual and projects them into keys and values for cross-attention layers. Learnable, semantics-aware queries attend to these keys and values, allowing the encoder to capture shared identity features across the images. Self-attention layers further refine these interactions, producing a holistic representation.
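This description maps naturally onto a query-based transformer encoder. The sketch below is an illustrative PyTorch module, not the released architecture: the dimensions, the number of queries and layers, and the use of plain multi-head attention without feed-forward blocks or normalization are all assumptions.

```python
import torch
import torch.nn as nn

class OmniIDEncoderSketch(nn.Module):
    """Illustrative encoder: learnable queries gather identity information
    from all input images via cross-attention, then self-attention refines
    the queries into a fixed-size identity representation."""

    def __init__(self, dim=768, num_queries=64, num_layers=4, num_heads=8):
        super().__init__()
        # One learnable query per entry of the structured identity representation.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)])
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)])

    def forward(self, img_feats):
        # img_feats: (B, K*T, dim) -- patch features from K input images,
        # concatenated so the queries can attend to evidence from every image at once.
        q = self.queries.unsqueeze(0).expand(img_feats.shape[0], -1, -1)
        for ca, sa in zip(self.cross_attn, self.self_attn):
            q = q + ca(q, img_feats, img_feats)[0]   # cross-attend to image features
            q = q + sa(q, q, q)[0]                   # refine interactions among queries
        return q                                     # (B, num_queries, dim) identity code
```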
SIGGRAPH Asia 2025
A human-centric generative model that enables disentangled control over multiple visual attributes — such as identity, hair, and garment — across multiple subjects, while also supporting text-based control.
arXiv 2025
LayerComposer enables Photoshop-like control for multi-subject text-to-image generation, allowing users to compose scenes by placing, resizing, and locking elements in a layered canvas with high fidelity.
@inproceedings{qian2025omni,
title={Omni-{ID}: Holistic identity representation designed for generative tasks},
author={Qian, Guocheng and Wang, Kuan-Chieh and Patashnik, Or and Heravi, Negin and Ostashev, Daniil and Tulyakov, Sergey and Cohen-Or, Daniel and Aberman, Kfir},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={8786--8795},
year={2025}
}