AV-Link

Baseline comparison on VGGSounds with Text Guidance

Task description: Given an input audio clip and a text prompt, AV-Link generates a temporally aligned video. Input: audio + video text description -> video.

Baselines: We compare AV-Link with TempoToken. We observe that TempoToken displays poor temporal alignment.

Dataset: We selected samples from VGGSounds that exhibit clear temporal actions. Our method generates 36x64 videos at 6 fps. We use an upsampler to increase the generated video resolution to 144x256.
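The resolutions quoted above imply an integer upscaling factor, which the following minimal sketch works out. It is an illustration only: the page does not say which upsampler is used, so nearest-neighbour repetition stands in for it here, and the helper names `scale_factors` and `upsample_nearest`, the dummy frame, and the reading of 36x64 as height x width are all assumptions.

```python
def scale_factors(src_hw, dst_hw):
    """Return the integer (height, width) upscaling factors, or raise if non-integer."""
    (sh, sw), (dh, dw) = src_hw, dst_hw
    if dh % sh or dw % sw:
        raise ValueError("target size is not an integer multiple of the source size")
    return dh // sh, dw // sw


def upsample_nearest(frame, fy, fx):
    """Nearest-neighbour upsampling of one frame stored as a list of rows.

    This is a hypothetical stand-in; the actual upsampler is not specified.
    """
    out = []
    for row in frame:
        wide = [px for px in row for _ in range(fx)]  # repeat each pixel fx times
        out.extend(list(wide) for _ in range(fy))     # repeat each row fy times
    return out


# Resolutions quoted in the text, assumed to be (height, width).
LOW_RES = (36, 64)
HIGH_RES = (144, 256)

fy, fx = scale_factors(LOW_RES, HIGH_RES)

# A dummy 36x64 grayscale frame standing in for one generated video frame.
frame = [[(y + x) % 256 for x in range(LOW_RES[1])] for y in range(LOW_RES[0])]
upsampled = upsample_nearest(frame, fy, fx)

print(fy, fx)                             # prints "4 4"
print(len(upsampled), len(upsampled[0]))  # prints "144 256"
```

Both dimensions scale by the same factor of 4, so the aspect ratio of the generated video is preserved by the upsampling step.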

Columns: Prompt | Ours | TempoToken

It is daytime and there are many trees in the back. A shot of a large black gun being fired. Two light-skinned men in green uniforms stand next to the gun. There is white text that reads...

In the background is a wooden table with a wooden board on it, on the left is a wooden chair, on the right is a wooden cabinet. A white-skinned man is holding a gray planer and planing...

A well-lit room with a gray floor and a gray wall. There is a gray metal door on the left side of the wall. There is a white rectangular...

A room with a gray floor, a gray wall, and a blue door. There is a brown box on the floor. A light-skinned man in a black T-shirt, gray pants, and white gloves is...

A room with white walls and a green floor. There is a black and orange printer with a black and gray print head. The printer prints a picture on...

The background is black. A firework is launched into the sky.

The camera zooms out and tilts down. A well-lit room with a white wall and a brown table. A white sewing machine with a red thread on the table.

The camera tilts up and down and pans to the left and right. It is nighttime and there is a dark sky. There are several fireworks in the sky.

There is a dark room with blue lights. A light-skinned man with short dark hair plays the drums. He wears a white tank top and dark pants.

It is a sunny day. There is a street with gray buildings, gray ground, and green trees. A video game...

There is a green lawn with trees and bushes in the daytime and clear weather. A white man shoots a black rifle. He wears a black cap, a green T-shirt, and beige pants.

There is a gym with white and blue walls, a blue floor, a black punching bag, a black speaker, a black and red exercise machine, and a brown staircase. A light-skinned...

A room with white walls, a brown door, a brown cabinet, a black chair, a brown table with a metal can, and a brown barrel. A light-skinned man with short black hair, wearing...

There is a red fabric in the background. The room is bright. A light-skinned person holds a golden bowl in his hand.

It is a well-lit room with a white wall and a gray floor. On the left is a black door. A light-skinned man in a black T-shirt and black pants is standing in front of a black table...

There is a room with white walls, a white clock, a black bag, and a plastic container. A light-skinned man with gray hair and a beard is wearing a yellow T-shirt and glasses...

There is a dark room with a black curtain and a black floor. There is a drum set with a black chair in front of it. A light-skinned man with short black hair and a beard, wearing...

It is a well-lit room with a gray floor and a brown wooden wall. There is a gray metal table with a white wooden board on it. A person in black gloves and a black...

A room with white walls, a brown floor, a brown wooden chest of drawers, a brown wooden chair, a black bicycle, a gray carpet...

There is a night sky and a snowy field. There are several trees on the left. There are several yellow lights in the distance.

There is a blurred room with a beige wall, a beige table with a white scale, and a lot of different things. There is a metal xylophone with two beige balls. A light-skinned man...

A kitchen with a white wall and a black countertop. There is a white and purple box on the countertop. A silver blender with a transparent...

There is a green forest in the background. It's a sunny day. A person in black rubber boots and black pants walks along a stream.

A room with a brown floor, a brown door, and a brown wall. A white-skinned person in a black T-shirt turns on a black blender with a transparent bowl with a brown smoothie inside.

A dark sky at night. A crowd of people is gathered around a firework display, with some individuals holding up red flags.

There is a field with green and yellow grass, and trees in the background. It is daytime with a white sky. A light-skinned man is sitting on the left side of the frame, holding a black rifle...

A poorly lit room with a gray carpet and a brown wooden table. A beige dog with a black collar is lying under the table, then it gets up and runs to the left.

The camera tilts up and pans to the left. A gray concrete area surrounded by a gray fence with a black mesh. Behind the fence are green trees and bushes. It is a sunny day outside. A black and white dog...

The camera pans to the left and right, tilts down and up, and dollies forward. A bathroom with blue tiles on the walls and a black floor with a yellow rug. There is a white toilet with a beige lid...

A well-lit room with a gray floor and a white wall. There is a black table with a gray monitor on it. There is a gray printer with a black panel on the right side. The printer prints a white sheet of paper.

There is a room with black walls, a white ceiling, and a beige floor. There is a red carpet on the floor. There are musical instruments and microphones in the room. There are four light-skinned men in the room...

A room with a beige wall, a black speaker on the right, and a white shelf on the left. A light-skinned young man with short black hair in a blue shirt plays a brown cello. There is a black microphone...

White surface. The #camera-operator# stirs the white sauce in a gray pot with a black spoon, then places a piece of breaded meat in a black frying pan and fries it...

The camera pans to the right and left. It is nighttime and there is a crowd of people. There is a view of the Eiffel Tower. The camera films fireworks exploding in the sky.

It is daytime and there are red plants and a white building in the back. A dark-skinned man talks and gestures with his hands. He has long gray hair, a beard, and a white shirt.

A gray concrete floor in the daytime. A black cauldron with a black handle and a black metal frame on the bottom stands on the floor. The cauldron is filled with boiling oil...

A forest with many trees with green leaves on a sunny day. A man in a yellow helmet, green headphones, a red and green jacket, and red pants cuts down a tree with a chainsaw...

The video has a slow-motion effect. There is a room with gray walls and a large panoramic window. Outside the window is a street with...

A brown ground with a blue tarpaulin on it. Behind the ground is a brown hill. It is daytime and sunny outside. A light-skinned man is lying on a blue tarpaulin and shooting a black rifle...

A blue surface. The room is lit. A white printer with a black surface inside. The printer prints a white sheet of paper.

There is a room with a brown wooden floor, a white wall, a brown door, and a pair of black shoes on the right. A green parrot with a red head and blue tail is walking around the room.

The camera pans to the left and right, dollies forward, and tilts up and down. A dark room with gray walls, a gray floor, and a gray ceiling. There are gray boxes and a gray door in the room. The #camera-operator# is holding a gray gun...

A white table with a metal hinge on the right. A light-skinned man screws a bolt into a red metal part with a metal wrench.

A sea with waves and a rocky shore. A light-skinned young woman with dark hair, wearing a gray T-shirt, a brown necklace, and a brown stick in her right hand, smiles and looks at the camera.

There is a street with a green lawn, green trees, and a road. It's a sunny day. A light-skinned man with brown hair is playing with a skipping rope. He is wearing a black T-shirt and black shorts.

There is a forest with dry grass and trees without leaves. It's a day. There is a lioness lying on the ground and yawning.

A room with a white wall and a green table. There is a black speaker on the table. The room is lit. A light-skinned man is eating noodles from a red bowl. The man has brown hair and a brown mustache. He is wearing brown glasses and a black sweatshirt.

The camera shakes. It is a sunny day and there is a shooting range with a brown fence, a gray floor, and a gray roof. A light-skinned person in a black T-shirt holds a black and white gun in their hands and shoots at a target.

AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
