Given a set of images depicting user-specific concepts such as <you>, <your-dog>, and <your-friend> (left), we teach a pretrained vision-language model (VLM) to understand and reason over these concepts. First, we enable the model to generate personalized captions that incorporate the concept into its output text (middle). We further allow the user to ask subject-specific questions about these concepts, such as "What are <you> doing?" or "What is <your-friend> wearing?" (right).
LLMs offer users intuitive interfaces for interacting with textual information. The integration of vision into LLMs through VLMs has enabled these models to "see" and reason over visual content. However, such VLMs possess only generic knowledge, lacking a personal touch. With MyVLM, we equip these models with the ability to comprehend user-specific concepts, tailoring the model specifically to you. MyVLM allows users to obtain personalized responses that are no longer generic but instead communicate information about the target subject to the user.
We use a pretrained, frozen vision-language model to preserve its general capabilities.
Given an image, we extract the frozen image features from the VLM's vision encoder.
We utilize a set of concept heads, each designed to recognize the presence of a user-specific concept within the image.
We train a concept embedding that represents the concept and guides the LLM to incorporate it into its personalized response, as sketched below.
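To make these components concrete, below is a minimal, hypothetical PyTorch-style sketch of the two trainable pieces: a concept head (assumed here to be a linear probe over the frozen vision features) and a concept embedding appended to the visual tokens fed to the frozen LLM. The class names, feature dimensions, and detection threshold are illustrative assumptions, not the actual MyVLM implementation.

import torch
import torch.nn as nn

class ConceptHead(nn.Module):
    # Hypothetical concept head: a linear probe over frozen image features that
    # predicts whether a specific user concept appears in the image.
    def __init__(self, feature_dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, 1)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, feature_dim) pooled output of the frozen vision encoder
        return torch.sigmoid(self.classifier(image_features))

class ConceptEmbedding(nn.Module):
    # Hypothetical learned embedding appended to the visual tokens passed to the
    # frozen LLM, steering it to mention the concept in its response.
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.embedding = nn.Parameter(0.02 * torch.randn(1, 1, hidden_dim))

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, hidden_dim) projected vision features
        batch_size = visual_tokens.shape[0]
        return torch.cat([visual_tokens, self.embedding.expand(batch_size, -1, -1)], dim=1)

# Sketch of inference: the embedding is injected only when the concept head fires.
if __name__ == "__main__":
    head = ConceptHead()
    concept = ConceptEmbedding()
    image_features = torch.randn(1, 768)      # stand-in for frozen vision-encoder features
    visual_tokens = torch.randn(1, 32, 4096)  # stand-in for projected tokens fed to the LLM
    if head(image_features).item() > 0.5:     # 0.5 is an assumed detection threshold
        visual_tokens = concept(visual_tokens)
    print(visual_tokens.shape)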
Hover over the images to see the personalized captions!
This research was performed while Yuval Alaluf was at Snap.
We would like to thank Assaf Ben-Kish, Jackson Wang, Moran Yanuka, Morris Alper, Or Patashnik, Yonatan Biton, and Yuwei Fang for their fruitful discussions and valuable input which helped improve this work.
@misc{alaluf2024myvlm,
      title={MyVLM: Personalizing VLMs for User-Specific Queries},
      author={Yuval Alaluf and Elad Richardson and Sergey Tulyakov and Kfir Aberman and Daniel Cohen-Or},
      year={2024},
      eprint={2403.14599},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Template created by Yuval Alaluf, based on HTML5up Hyperspace. Feel free to reuse.