Given a reference image and a textual attribute description, Omni-Attribute encodes a high-fidelity, attribute-specific representation, while suppressing other visual concepts. This enables coherent synthesis of the user-specified attributes in new contexts in a fully feed-forward manner, without any test-time optimization.
We show the reference image on the left, the reference attributes in the colored boxes, and each generated image below the reference attribute and text prompt that conditioned it.
All images shown together were generated using the same reference image and the same text prompt.
The learned attribute embeddings are composable, enabling the seamless integration of multiple image attributes into one coherent generated image.
The reference images (left) and the reference attributes (top colored boxes) with the same border color are paired as conditional image–attribute inputs.
Click the image–attribute pairs below to enable or disable their conditioning effect. Multiple selections are supported.
"A vase is standing against a plain background."
We demonstrate the practical utility of Omni-Attribute in four real-world application scenarios.
Click the cards below for more details.
We compare Omni-Attribute with two groups of baselines for open-vocabulary attribute personalization:
image encoder–based approaches (middle-top) and image editing models (middle-bottom).
Omni-Attribute achieves the best balance between faithfully encoding the target attribute and coherently synthesizing it into new contexts aligned with the prompt.
For encoder-based approaches, we insert IP-Adapter modules between the encoder and the generator to support personalization; the generator is then conditioned on the prompt shown below.
For image-editing methods, we combine the reference attribute and the text prompt into a single editing instruction formatted as: "Preserve the [attribute] of the image and generate [prompt]."
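As an illustrative sketch (not the exact pipeline code), the instruction string for the editing baselines can be assembled as follows; the helper name and example inputs are hypothetical:

```python
def build_editing_instruction(attribute: str, prompt: str) -> str:
    """Combine a reference attribute and a text prompt into one editing instruction."""
    return f"Preserve the {attribute} of the image and generate {prompt}"

# Hypothetical example inputs.
instruction = build_editing_instruction(
    attribute="material",
    prompt="a vase standing against a plain background",
)
# -> "Preserve the material of the image and generate a vase standing against a plain background"
```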
More details can be found in the technical report.
We compare the models on the personalization of two types of attributes: concrete objects and abstract concepts.
We evaluate two metrics, image naturalness and conditioning fidelity (higher is better for both), using both MLLM and human evaluations.
Omni-Attribute consistently outperforms baselines in generating coherent and condition-aligned outputs.
Conditioning fidelity is computed as the average of the text fidelity score and the attribute fidelity score.
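In symbols, writing $S_{\text{text}}$ and $S_{\text{attr}}$ for the text and attribute fidelity scores:

$$\text{Conditioning Fidelity} = \tfrac{1}{2}\bigl(S_{\text{text}} + S_{\text{attr}}\bigr)$$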
The benchmark contains 15 reference attributes, including 5 concrete objects and 10 abstract concepts, with 25 test samples per attribute.
We visualize the embedding spaces of the same 60 animal images across three different attributes.
We show that the same set of images clusters differently and meaningfully under each attribute.
We use t-SNE to project the 4096-dimensional pooled attribute embeddings into two dimensions for visualization.
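A minimal sketch of this projection step, assuming the pooled attribute embeddings are already extracted as a NumPy array (the array contents below are random placeholders):

```python
import numpy as np
from sklearn.manifold import TSNE

# Pooled attribute embeddings for the 60 animal images under one attribute.
# Shape: (60, 4096); filled with random values here as a stand-in.
embeddings = np.random.randn(60, 4096).astype(np.float32)

# Project to 2D for visualization; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
points_2d = tsne.fit_transform(embeddings)  # shape: (60, 2)
```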
The attribute embeddings enable image retrieval based on a specified attribute.
Omni-Attribute surpasses a text-guided retrieval baseline built on GPT-4o and CLIP.
We retrieve the images based on the cosine similarity of the pooled attribute embeddings.
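A minimal sketch of this retrieval step, assuming pooled attribute embeddings for a query image and a gallery are available (all names, shapes, and data below are illustrative placeholders):

```python
import numpy as np

def retrieve_top_k(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k gallery embeddings most similar to the query,
    ranked by cosine similarity."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarities, shape: (N,)
    return np.argsort(-sims)[:k]      # indices of the top-k matches

# Placeholder data: one query embedding and a gallery of 100 pooled attribute embeddings.
query = np.random.randn(4096).astype(np.float32)
gallery = np.random.randn(100, 4096).astype(np.float32)
top_indices = retrieve_top_k(query, gallery, k=5)
```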