
Visual Personalization Turing Test

Rameen Abdal   James Burgess   Sergey Tulyakov   Kuan-Chieh Jackson Wang  
Snap Research Stanford University

TL;DR - We propose the Visual Personalization Turing Test (VPTT), a new paradigm supported by a 10,000-persona benchmark (VPTT-Bench) and a retrieval-augmented generation method (VPRAG) designed to achieve authentic, privacy-safe contextual personalization. This framework is evaluated via the VPTT Score, a scalable proxy for contextual alignment that is calibrated against human judgment.

What is VPTT?

From Simulation to Evaluation. On the left, the complete VPTT Framework: moving from user data simulation to "deferred rendering" - structured, attribute-rich intermediates such as lighting, materials, environment, actions, foreground, background, and appearance that defer visual realization for a privacy-safe representation - to our VPRAG generation method, and finally to the "Visual Personalization Turing Test" evaluation triangle. On the right, sampled diverse, culturally rich synthetic personas from VPTT-Bench. These profiles serve as the "ground truth" for personalization, allowing us to test whether a model can generate images that feel like they belong to a specific person's asset gallery.
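To make the deferred-rendering idea concrete, here is a minimal, hypothetical Python sketch of such an intermediate: a structured record of lighting, materials, environment, action, foreground, background, and appearance cues that can be composed into a caption without ever storing user imagery. The field and method names are illustrative only, not the benchmark's actual schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DeferredRenderingIntermediate:
    """Attribute-rich, privacy-safe description of a visual asset.

    Visual realization (pixels) is deferred: only structured cues are stored,
    so no identifying imagery of the user is ever required.
    """
    lighting: str = ""                                   # e.g. "warm golden-hour light"
    materials: List[str] = field(default_factory=list)   # e.g. ["woven rattan", "terracotta"]
    environment: str = ""                                 # e.g. "rooftop garden over the old town"
    actions: List[str] = field(default_factory=list)      # e.g. ["watering plants"]
    foreground: str = ""
    background: str = ""
    appearance: str = ""                                   # style cues, not identity

    def to_caption(self) -> str:
        """Compose the structured cues into one caption for a text-to-image model."""
        parts = [
            self.environment,
            ", ".join(self.actions),
            f"foreground: {self.foreground}" if self.foreground else "",
            f"background: {self.background}" if self.background else "",
            f"materials: {', '.join(self.materials)}" if self.materials else "",
            self.lighting,
            self.appearance,
        ]
        return "; ".join(p for p in parts if p)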


VPTT-Bench Data Generation Pipeline

Overview of the deferred rendering pipeline used to construct VPTT-Bench. (1) Personas are sampled from PersonaHub with demographics. (2–3) Visual and scenario elements (lighting, actions, materials, etc.) are extracted. (4) These cues are composed into structured captions and embedded via an LLM. (5) 30 corresponding visual assets are generated per persona, forming privacy-safe, semantically grounded data for evaluating contextual personalization.
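For readers who want to see the flow end to end, below is a small, self-contained Python sketch of the five stages under stated assumptions: the element extractor, embedder, and renderer are passed in as callables, and the dummy defaults (hash-based embedding, placeholder renderer) are stand-ins, not the released pipeline.

import hashlib
from typing import Callable, Dict, List

ASSETS_PER_PERSONA = 30

def dummy_embed(text: str) -> List[int]:
    # Stand-in for the LLM embedding step (4): deterministic hash-based vector.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return list(digest[:8])

def build_benchmark(
    personas: List[Dict],                                  # (1) sampled from PersonaHub, with demographics
    extract_elements: Callable[[str], List[str]] = lambda text: [text],          # (2-3) cue extraction
    embed: Callable[[str], List[int]] = dummy_embed,
    render: Callable[[str], object] = lambda caption: f"<asset for: {caption}>",  # (5) text-to-image stand-in
) -> Dict[str, List[Dict]]:
    bench: Dict[str, List[Dict]] = {}
    for persona in personas:
        records = []
        for cues in extract_elements(persona["text"])[:ASSETS_PER_PERSONA]:
            caption = f"{persona['demographics']}; {cues}"   # (4) structured caption
            records.append({
                "caption": caption,
                "embedding": embed(caption),
                "asset": render(caption),
            })
        bench[persona["id"]] = records
    return bench

# Toy usage: one persona, one cue string.
print(build_benchmark([{"id": "p0", "text": "a ceramicist who gardens at dusk", "demographics": "30s, Lisbon"}]))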


How does VPRAG work?

Comparison between baseline retrieval-augmented generation (BRAG) and our proposed Visual Personalization RAG (VPRAG). Unlike BRAG, VPRAG introduces controllable and interpretable retrieval through: (a) post-level embedding and similarity scoring, (b) temperature-controlled attention, (c) entropy-guided post selection, (d) capacity-aware quota allocation, (e) category-level ranking, and (f) element-level composition. This multi-stage design yields a white-box, LLM-optional retrieval framework that produces visually and semantically aligned personalized generations and edits.
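The sketch below illustrates one plausible reading of stages (a)-(f) in NumPy; it is an assumption-laden approximation, not the authors' implementation. In particular, the function name vprag_retrieve, the perplexity-based interpretation of entropy-guided selection, and the quota heuristic are hypothetical.

import numpy as np

def vprag_retrieve(query_emb, post_embs, post_elements, temperature=0.5, capacity=8):
    # (a) post-level embedding similarity: cosine score between query and each post
    q = query_emb / np.linalg.norm(query_emb)
    P = post_embs / np.linalg.norm(post_embs, axis=1, keepdims=True)
    sims = P @ q

    # (b) temperature-controlled attention over posts
    attn = np.exp(sims / temperature)
    attn = attn / attn.sum()

    # (c) entropy-guided post selection: attention entropy sets how many posts
    #     to keep (perplexity = effective number of posts)
    entropy = -np.sum(attn * np.log(attn + 1e-12))
    k = max(1, int(round(np.exp(entropy))))
    selected = np.argsort(-attn)[:k]

    # (d) capacity-aware quota allocation: split a fixed cue budget across posts
    weights = attn[selected] / attn[selected].sum()
    quotas = np.maximum(1, np.round(weights * capacity)).astype(int)

    # (e) category-level ranking and (f) element-level composition: walk posts in
    #     rank order and pull elements per category up to each post's quota
    composed = {}
    for idx, quota in zip(selected, quotas):
        for category, elements in post_elements[idx].items():
            bucket = composed.setdefault(category, [])
            for element in elements:
                if quota <= 0:
                    break
                if element not in bucket:
                    bucket.append(element)
                    quota -= 1
    return composed  # category -> retrieved elements, appended to the prompt

# Toy usage with random embeddings and identical cue sets per post.
cues = vprag_retrieve(
    query_emb=np.random.randn(64),
    post_embs=np.random.randn(20, 64),
    post_elements=[{"lighting": ["soft dusk"], "materials": ["linen"]}] * 20,
)
print(cues)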


Contextual Image Generation and Editing using VPTT-Bench

Contextual Image Generation and Editing using VPTT-Bench. Each row shows a distinct user profile: assets and style cues (left), personalized generations (social post, cultural site), and edits (garden, living room) guided by the same persona identity. All images are synthesized from text via our Visual Personalization RAG (VPRAG), which retrieves persona-aligned cues. To demonstrate cross-model personalization, the assets here are generated by the QWEN-Image model, while the generations and edits are produced by Nano-Banana conditioned only on the first image.
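As a hypothetical illustration of how retrieved cues might condition different backends, the snippet below composes the same persona-aligned cues into a text-to-image prompt for new generations and into an edit instruction that preserves a source image. The prompt templates are invented for illustration and are not the prompts used with QWEN-Image or Nano-Banana.

def compose_generation_prompt(task: str, cues: dict) -> str:
    # New image: the cues describe the whole scene.
    cue_text = "; ".join(f"{k}: {', '.join(v)}" for k, v in cues.items())
    return f"{task}. Style and context: {cue_text}."

def compose_edit_instruction(edit: str, cues: dict) -> str:
    # Edit: keep the source image (e.g. the first persona asset) and inject cues
    # only as modifications, so context carries over across models.
    cue_text = "; ".join(f"{k}: {', '.join(v)}" for k, v in cues.items())
    return f"Edit the input image: {edit}, preserving its subject and adding {cue_text}."

cues = {"lighting": ["warm evening light"], "materials": ["terracotta", "rattan"]}
print(compose_generation_prompt("a social post at a cultural site", cues))
print(compose_edit_instruction("redecorate the living room", cues))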


Comparisons

Qualitative Comparison across Generation and Editing Tasks. (Refer to the paper for more details.) Representative examples from VPTT-Bench showing outputs from five methods: Baseline, Persona Only, BRAG, VPRAG (ours), and BRAG + VPRAG (ours). Each sample is evaluated with human ratings, VLM judgments (reasoning shown), and text-level VPTTscore-c scores, where higher indicates closer alignment to the persona's assets. Our methods achieve the highest perceptual and text-visual consistency, confirming effective contextual personalization. Notably, we observe a high rank correlation between VLM judgments, human evaluations, and our text-based proxy score.
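The agreement claim can be checked with an ordinary rank-correlation computation; the snippet below shows the idea using SciPy's spearmanr on made-up scores (the numbers are illustrative, not results from the paper).

from scipy.stats import spearmanr

# Illustrative per-sample scores from the three judges for the same generations.
human = [4.5, 2.0, 3.5, 1.0, 4.0]     # human ratings
vlm   = [4.0, 2.5, 3.0, 1.5, 4.5]     # VLM-judge scores
proxy = [0.82, 0.41, 0.67, 0.22, 0.78]  # text-level VPTTscore-c proxy

print("human vs VLM:  ", spearmanr(human, vlm).correlation)
print("human vs proxy:", spearmanr(human, proxy).correlation)
print("VLM   vs proxy:", spearmanr(vlm, proxy).correlation)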


BibTeX Citation

@misc{abdal2026vptt,
  title={Visual Personalization Turing Test},
  author={Rameen Abdal and James Burgess and Sergey Tulyakov and Kuan-Chieh Jackson Wang},
  year={2026},
  eprint={2601.22680},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}