CVPR 2023 Tutorial on

Efficient Neural Networks: From Algorithm Design to Practical Mobile Deployments

Date: Sun 18 Jun 8:30 a.m. PDT — noon PDT.

Location: West 212

Recorded Video


Recent breakthroughs in computer vision unlock a number of applications previously unavailable to users. Discriminative tasks, such as detection, segmentation, and pose and depth estimation, reach remarkable accuracy. Generative approaches, including adversarial networks, autoencoders, diffusion models, image-to-image translation, video synthesis, and animation methods, produce outputs of such high fidelity that even a human observer can struggle to distinguish them from real content. Neural Radiance Fields promise democratized object reconstruction and rendering for a variety of applications.

These breakthroughs, however, come at the cost of high computational requirements. For example, transformer models have complexity quadratic in the token length, image-to-image translation demands large FLOP counts, and neural rendering approaches require sophisticated rendering pipelines. This translates into the need to run such models server-side, elevated service cost, and a suboptimal user experience due to the latency incurred in transmitting the data and receiving the result.
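The quadratic scaling of self-attention can be illustrated with a rough, back-of-the-envelope FLOP estimate (a minimal sketch; the function name and example sizes are illustrative, not from the tutorial):

```python
# Rough FLOP estimate for the attention score matrix Q K^T in self-attention.
# Hypothetical helper for illustration: n is the token count, d the head dimension.
def attention_score_flops(n: int, d: int) -> int:
    # Q K^T is an (n x d) by (d x n) matmul: ~2 * n^2 * d multiply-adds.
    return 2 * n * n * d

base = attention_score_flops(256, 64)
doubled = attention_score_flops(512, 64)
print(doubled / base)  # prints 4.0: doubling the tokens quadruples this cost
```

This is why high-resolution inputs, which produce long token sequences, are particularly expensive for transformer-based vision models on mobile hardware.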

Bringing such methods to edge devices is notoriously difficult. Existing works target the compression of large models, which does not necessarily lead to on-device acceleration; doing so requires not only a background in neural network compression but also domain-specific expertise. For instance, to run neural rendering on a device, one needs experience in graphics; to run generative models, one needs to know how to reduce model size without sacrificing output quality. Speeding up neural networks for server-side inference will not necessarily improve edge inference, as deploying them on edge devices requires considerable model size reductions, sometimes by a couple of orders of magnitude.

The benefits of edge inference, on the other hand, are clear: (1) the cost of service is greatly reduced; (2) user privacy is protected by design, as all processing runs on the device without data transmission between the server and the device; (3) user interaction improves, as the absence of transmission reduces latency; and (4) the cycle time for building novel CV-based applications shrinks, as expensive infrastructure is not required.

This tutorial will introduce effective methodologies for re-designing algorithms for efficient content understanding, image generation, and neural rendering. Most importantly, we show how these algorithms can be efficiently deployed on mobile devices, ultimately achieving real-time interaction between users and mobile devices.




Opening Remarks

Sergey Tulyakov, 08:30

Efficient Transformer-Based Vision Models

An introduction to the main bottleneck of state-of-the-art transformer-based vision models, namely their high computation cost on mobile devices, and to designing efficient networks that address this problem.

Jian Ren, 08:40

Efficient Text-to-Image Generation

An introduction to recent advances in large-scale text-to-image generation models and how to optimize them to enable mobile applications.

Jian Ren, 09:10

Efficient Object Rendering and the Mobile Applications

An introduction to emerging efforts in neural rendering, especially Neural Radiance Fields (NeRF), the challenges of deploying NeRF for real-time use, and how to alleviate these issues to build on-device real-time neural rendering.

Jian Ren, Sergey Tulyakov, 09:55
Slides: PDF, Keynote

Efficient deployment on mobile devices

Methods for efficiently deploying the algorithms on mobile devices to build computer-vision-based applications.

Ju (Eric) Hu, 11:05

Closing Remarks

Sergey Tulyakov, 11:50

About the Speakers

Jian Ren is a Lead Research Scientist in the Creative Vision team at Snap Research. His research focuses on content understanding, image and video generation and manipulation, and methods of designing efficient neural networks for the former two areas. His work has resulted in 20+ papers published in top-tier conferences (CVPR, ICCV, ECCV, ICLR, NeurIPS, ICML) and patents contributing to multiple products. He received his Ph.D. in Computer Engineering from Rutgers University in 2019 and a B.S. from the University of Science and Technology of China in 2014. Before joining Snap Inc., Jian interned at Adobe, Snap, and ByteDance.

Sergey Tulyakov is a Principal Research Scientist at Snap Inc., where he leads the Creative Vision team. His work focuses on creating methods for manipulating the world via computer vision and machine learning. This includes human and object understanding, photorealistic manipulation and animation, video synthesis, prediction, and retargeting. He pioneered the unsupervised image animation domain with MonkeyNet and the First Order Motion Model, which sparked a number of startups in the domain. His work on Interactive Video Stylization received the Best in Show Award at SIGGRAPH Real-Time Live! 2020. He has published 30+ papers in top conferences and journals and holds patents that have resulted in multiple innovative products, including Snapchat Pet Tracking, OurBaby, and Real-time Neural Lenses (gender swap, baby face, aging lens, face animation), among others. Before joining Snap Inc., Sergey was with Carnegie Mellon University, Microsoft, and NVIDIA. He holds a Ph.D. from the University of Trento, Italy.

Ju (Eric) Hu is a Machine Learning Engineer at Snap Inc. His work mainly focuses on supporting and optimizing Snap's in-house ML framework, SnapML. SnapML aims to provide fast and efficient inference by leveraging different hardware to achieve real-time performance on mobile devices. Before joining Snap, he worked at a medical imaging startup focusing on skin lesion detection and classification. He graduated from UCLA with a B.S. in Mathematics.

Please contact Jian Ren or Sergey Tulyakov if you have questions.
The webpage template is courtesy of the awesome Georgia.