MyVLM

Personalizing VLMs for User-Specific Queries

Yuval Alaluf¹,², Elad Richardson², Sergey Tulyakov¹, Kfir Aberman¹, Daniel Cohen-Or¹,²

¹Snap Inc., ²Tel Aviv University


"<your-dog> and a black dog running on the grass."

Introducing MyVLM

Given a set of images depicting user-specific concepts such as <you>, <your-dog>, and <your-friend> (left), we teach a pretrained vision-language model (VLM) to understand and reason over these concepts. First, we enable the model to generate personalized captions that incorporate the concept into its output text (middle). We further allow the user to ask subject-specific questions about these concepts, querying the model with questions such as "What are <you> doing?" or "What is <your-friend> wearing?" (right).

Background

LLMs offer users intuitive interfaces for interacting with textual information. The integration of vision into LLMs through VLMs has enabled these models to "see" and reason over visual content. However, these VLMs possess generic knowledge, lacking a personal touch. With MyVLM we equip these models with the ability to comprehend user-specific concepts, tailoring the model specifically to you. MyVLM allows users to obtain personalized responses where outputs are no longer generic, but focus on communicating information about the target subject to the user.

"<you>, wearing sunglasses and a yellow strap, standing on a bustling street in a colorful city."

The Vision Language Models

We apply MyVLM to various VLM architectures for personalized captioning, visual question-answering, and referring expression comprehension.

BLIP-2

LLaVA 1.6

MiniGPT-v2

How Does It Work?

The VLM

We keep the pretrained vision-language model frozen to preserve its general capabilities.

Step 1: Feature Extraction

Given an image, we extract the frozen image features from the VLM's vision encoder.

Step 2: Recognizing the Concept

We utilize a set of concept heads, each designed to recognize the presence of a user-specific concept within the image.
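As a rough illustration of what a concept head could look like, the sketch below trains a small binary classifier on top of fixed feature vectors. This is a minimal sketch under stated assumptions, not the implementation used in MyVLM: the `ConceptHead` class, the logistic-regression training loop, and the 4-dimensional toy features standing in for frozen vision-encoder outputs are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ConceptHead:
    """Hypothetical binary linear classifier over frozen image features."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def score(self, feat):
        # Probability that the concept is present in the image.
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, feat)) + self.b)

    def fit(self, feats, labels, lr=0.5, epochs=200):
        # Plain stochastic gradient descent on the logistic loss.
        for _ in range(epochs):
            for x, y in zip(feats, labels):
                err = self.score(x) - y  # gradient of the loss w.r.t. the logit
                self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
                self.b -= lr * err

    def detects(self, feat, threshold=0.5):
        return self.score(feat) >= threshold


def toy_features(center, rng, noise=0.1):
    # Assumption: stands in for features from the frozen vision encoder.
    return [c + rng.gauss(0.0, noise) for c in center]


rng = random.Random(0)
concept_center = [1.0, 0.0, 1.0, 0.0]  # images that contain the concept
other_center = [0.0, 1.0, 0.0, 1.0]    # images that do not
positives = [toy_features(concept_center, rng) for _ in range(8)]
negatives = [toy_features(other_center, rng) for _ in range(8)]

head = ConceptHead(dim=4)
head.fit(positives + negatives, [1] * 8 + [0] * 8)

print(head.detects(concept_center))
print(head.detects(other_center))
```

Because the head only reads the features, the vision encoder itself stays untouched. One head per concept keeps concepts independent, so a new concept can be added without retraining the existing heads.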

Step 3: Communicating the Concept

We train a concept embedding to represent the concept and guide the LLM to incorporate the concept into its personalized response.
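The idea of training only a concept embedding while the rest of the model stays frozen can be sketched with a toy example. Everything below is an assumption made for illustration: the three-word vocabulary, the mean-pooling "language model" `frozen_lm_probs`, and the choice to append the embedding as one extra token next to the image tokens. A real VLM is far larger, but the optimization pattern is the same: gradients flow only into the embedding.

```python
import math

VOCAB = ["a", "dog", "S*"]  # toy vocabulary; "S*" is the concept identifier
W = [[1.0, 0.0, 0.0],       # frozen readout weights, one row per vocabulary token
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]


def frozen_lm_probs(tokens):
    """Toy stand-in for a frozen LLM: mean-pool the tokens, then softmax over VOCAB."""
    dim = len(tokens[0])
    pooled = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in W]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def train_concept_embedding(image_tokens, target_id, lr=1.0, steps=300):
    """Optimize only the appended embedding; W and image_tokens are never updated."""
    dim = len(image_tokens[0])
    emb = [0.0] * dim
    n = len(image_tokens) + 1  # number of tokens after appending the embedding
    for _ in range(steps):
        probs = frozen_lm_probs(image_tokens + [emb])
        # Cross-entropy gradient w.r.t. the logits: probs - one_hot(target).
        dlogits = [p - (1.0 if k == target_id else 0.0) for k, p in enumerate(probs)]
        # Chain rule through the readout and the mean pooling.
        grad = [sum(W[k][i] * dlogits[k] for k in range(len(W))) / n for i in range(dim)]
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb


image_tokens = [[0.2, 0.9, 0.0], [0.1, 0.8, 0.1]]  # frozen image features (toy values)
concept_embedding = train_concept_embedding(image_tokens, VOCAB.index("S*"))

print(VOCAB[argmax(frozen_lm_probs(image_tokens))])                        # dog
print(VOCAB[argmax(frozen_lm_probs(image_tokens + [concept_embedding]))])  # S*
```

Without the embedding, the toy model produces the generic word; with it, the same frozen weights produce the concept identifier. Since the embedding is a separate input, removing it restores the model's original behavior exactly.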

Results

  • For each row, we show sample images of the target concept on the left.
  • The remaining images are the input images passed to MyVLM.
  • The personalized responses generated by MyVLM are listed under each task below.
  • S* represents our concept's name.

Personalized Captioning

"S*, dressed in a blue jacket and a green sweater, takes a selfie with his friends, who are also bundled up against the chilly weather..."
"S*, in a blue suit, poses for a portrait at a scenic spot overlooking a river with a bridge in the distance"
"As S* takes a break from his day, S* takes a moment to capture the moment"
"S*, laughing and enjoying her drink, is wearing a white t-shirt with the word “Angels” and the year “1961” on it. She’s also wearing sunglasses"
"S* enjoying a warm beverage at a cafe, surrounded by the hustle and bustle of city life"
"S* and her friends are enjoying a sunny day outdoors. They are dressed in summer attire, with S* wearing a white top and sunglasses."
"S* and a friend at a restaurant, holding up their glasses of orange wine"
"Sitting at a table, S* and her friend smile at each other as they clink their wine glasses together"
"In her living room, S* and two friends are dressed in party hats and mustaches."
"S* stands on a dock, smiling, with boats in the background, wearing a white t-shirt and sunglasses, overlooking a lake, with mountains in the distance."
"In the forest, S* and his friend take a selfie, surrounded by the lush greenery. They are both smiling and wearing sunglasses"
"S* is sitting in the car, wearing sunglasses and a green shirt, giving a thumbs up."
"S* walking on a leash in a park near the beach with palm trees in the background."
"S* and another dog are lying down in a dog bed in the floor"
"A sad S* wearing a cone collar, sitting on a white tile floor"
"S* standing in front of a rainbow flag and a potted plant with a white wall in the background."
"S* standing next to a pink box with the words "Pink series" on it. The cat is striped with multiple colors and has a long, colorful tail."
"S* standing next to a whiteboard with mathematical equations written on it."
"S* sits comfortably on the second shelf of an open refrigerator, ready to be stocked with a variety of food and drink items"
"A S* sitting on a window sill with a bunch of cactus and a windmill in the background."
"S* standing on a white shelf next to two mugs and a glass."
"S* sitting next to a colorful children's book on a table."
"S* sitting in front of a laptop on a wooden table with a question about how to write papers fast and efficiently?"
"S* sitting on a bookshelf filled with books and yarn."

Personalized Visual Question-Answering

Q: Can you describe what S* is wearing?
A: "S* is wearing a brown sweater and has curly hair."

Q: Where is S* positioned in the image?
A: "In the image, S* is positioned on the right side of the table"
Q: Can you describe what S* is wearing?
A: "In the image, S* is wearing a white sleeveless top"

Q: On what side of the image is S*?
A: "S* is on the right side of the image"
Q: Can you describe what S* is wearing?
A: "S* is wearing a white tank top and black shorts."

Q: On what side of the image is S*?
A: "S* is on the right side of the image"
Q: From left to right, where is S* positioned in the image?
A: "S* is standing on the right side of the photo."

Q: What color pants is S* wearing?
A: "S* is wearing red pants."
Q: What is S* wearing?
A: "S* is wearing a denim jacket and a necklace"

Q: Where is S* in the image?
A: "S* is sitting at a table with the other two people, likely enjoying a meal or a drink together."
Q: Is S* small or large relative to the other objects?
A: "S* is small relative to the other objects."

Q: Where is S* positioned in image?
A: "S* is on top of the refrigerator, sitting on a shelf with food items and containers."
Q: What is S* sitting in?
A: "S* is sitting in a green bowl."

Q: From left to right, where is S* located?
A: "S* is located on the right side, sitting in a green bowl next to a wooden bowl containing oranges and pears."
Q: On what shelf is S* sitting?
A: "S* is sitting on the top shelf of the pantry."

Q: What color is S*?
A: "The color S* is white."

Personalized Referring Expression Comprehension

"S* sitting on a book shelf next to a stack of books."
"S* next to a cup of coffee that says coffee on it."
"S* and a black dog walking towards each other in a garden."
"S* and her friend sitting outside at a table with drinks."
"S* and his friend are standing on the balcony of their apartment in New York City..."
"S* and her dog, with another dog and its owner nearby."

Acknowledgements

This research was performed while Yuval Alaluf was at Snap.

We would like to thank Assaf Ben-Kish, Jackson Wang, Moran Yanuka, Morris Alper, Or Patashnik, Yonatan Biton, and Yuwei Fang for their fruitful discussions and valuable input, which helped improve this work.

BibTeX


@misc{alaluf2024myvlm,
      title={MyVLM: Personalizing VLMs for User-Specific Queries},
      author={Yuval Alaluf and Elad Richardson and Sergey Tulyakov and Kfir Aberman and Daniel Cohen-Or},
      year={2024},
      eprint={2403.14599},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
							
