Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention



Figure A. Comparison against ViewCrafter on test images and trajectories from the RealEstate10K dataset.

ViewCrafter's output contains noticeable color and lighting artifacts as well as geometric distortions.
In comparison, Cavia's output is more consistent and natural.

ViewCrafter
Ours
Reference Video

Figure B. Comparison against the concurrent work CVD.

CVD suffers from severe morphing artifacts and unnatural object motion (e.g., water droplets frozen in the air). More importantly, it fails to follow the text prompt instructions (marked in red) and often ignores important details.
In comparison, Cavia's output is more geometrically consistent and exhibits more natural object motion.

CVD
Ours (View 0)
Ours (View 1)

"A small stone skipping across a still pond, creating ripples that glow with a faint magical light. The stone and ripples are clear, while the background of trees and sky is hazy and out of focus"

"Detailed illustration of a modern city street at dawn, with smooth pavement and tall glass buildings in the distance, no traffic or pedestrians, softly lit by the rising sun, cinematic composition, trending on Artstation"

"A panda lazily eating bamboo while sitting under a tree in a lush forest. Its black-and-white fur stands out against the greenery, and the dappled sunlight creates soft shadows around it, emphasizing the peaceful atmosphere"

"A fluffy cavia sitting in a small grassy patch, nibbling on a dandelion. The soft green grass surrounds it, and the sun casts a gentle light on its fur, creating a peaceful, natural scene"