How does it work?

Mathematically, it is impossible to uniquely reconstruct a 3D scene from a single image, because infinitely many different scenes can project to exactly the same picture. And yet when we humans look at a photograph, we see not just a plane filled with color and texture, but the world behind the image. How do we do it?
We believe that this amazing human ability comes from years of experience living in a highly structured world, in which most scenes consist of vertical objects resting on a ground plane. Our insight is that if we can just figure out which parts of the image correspond to the ground, to vertical surfaces, and to the sky, we can often construct a simple 3D model of the scene. Our approach is to learn the structure of the world and the appearance of geometric surfaces from a large set of training images. We can then apply that knowledge to new photographs. If we can determine where the vertical surfaces contact the ground in the image, we can recover the depth of those surfaces (up to a scale factor), giving us a 3D model.
To create the final result, we simply texture-map the original image onto the model.
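
To make the "depth from ground contact" idea concrete, here is a minimal sketch of the underlying geometry. It is not the actual implementation; it assumes an idealized pinhole camera held level above a perfectly flat ground plane, so that the horizon falls on a single image row. The function name and its parameters (relative_depth, contact_row, horizon_row, focal_px, camera_height) are hypothetical and chosen only for illustration. Under those assumptions, a ground point that appears d pixels below the horizon lies at depth f * h / d, where f is the focal length in pixels and h is the camera height. Since h is generally unknown, the result is only meaningful up to a scale factor.

```python
def relative_depth(contact_row, horizon_row, focal_px=1.0, camera_height=1.0):
    """Depth of a ground-contact point, up to scale.

    Assumes (for illustration only) a level pinhole camera over a flat
    ground plane. Image rows increase downward, so ground points lie
    below the horizon (larger row numbers).
    """
    # Number of pixels the contact point sits below the horizon row.
    offset = contact_row - horizon_row
    if offset <= 0:
        raise ValueError("a ground-contact point must lie below the horizon")
    # Similar triangles: offset / focal_px == camera_height / depth.
    return focal_px * camera_height / offset


# A wall touching the ground 100 px below the horizon is twice as far
# away as one touching the ground 200 px below the horizon.
near = relative_depth(contact_row=400, horizon_row=200)
far = relative_depth(contact_row=300, horizon_row=200)
ratio = far / near
```

With the default focal_px and camera_height of 1.0 the absolute numbers carry no physical meaning; only ratios between depths do, which is exactly what "up to a scale factor" means here.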
Photo Credit: Amtrak image by Todd Bliwise.
Photo Credit: CMU image by James Hays.
