What is wide multiple baseline stereo?

Imagine you have a nice photo you took in autumn and would like to take another one in summer, from the same spot. How would you achieve that? You go to the place and start comparing what you see on the camera screen with the printed photo. Specifically, you would probably try to locate the same objects, e.g., "that high lamppost" or "this wall clock". Then you would estimate how differently they are arranged in the old photo and on the camera screen, for example, by checking whether the lamppost is occluding the clock on the tower or not. That would give you an idea of how you should move your camera.

Now, what if you are not allowed to take that photo with you, because it is a museum photo and taking pictures is prohibited there? Instead, you can create a description of it. In that case, you would likely make a list of the features and objects in the photo, together with descriptions that are sufficient to distinguish them. For example, "a long staircase on the left side", "the nice building with a dark roof and two towers" or "the top of the lamppost". It would also be useful to describe where these objects and features appear in the photo: "The lampposts are on the left, the closest to the viewer is in front of the left tower with a clock. The clock tower is not part of the building and stands on its own". Then, when you arrive, you would try to find those objects, match them to the description you have, and estimate where you should go. You repeat the procedure until the camera screen shows a picture that fits both the description and the image you have in your memory.

Congratulations! You have just successfully registered two images that differ significantly in viewpoint, appearance, and illumination. In the process, you repeatedly solved the wide multiple baseline stereo (WxBS) problem -- estimating the relative camera pose from a pair of images that differ in many aspects, yet depict the same scene.

Let us write down the steps we took.

  1. Identify salient objects and features in the images -- "trees", "statues", "tip of the tower", etc.

  2. Describe the objects and features, taking into account their neighborhood: "statue with a blue left ear".

  3. Establish potential correspondences between features in the different images, based on their descriptors.

  4. Estimate in which direction the camera should move to align the objects and features.
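
To make steps 3 and 4 a bit more concrete, here is a toy sketch in plain Python. It is only an illustration under strong assumptions, not an actual WxBS implementation: the feature positions and descriptors are made up by hand (so steps 1 and 2 are skipped), the function names `match_features` and `estimate_shift` are hypothetical, and a simple 2D shift stands in for the relative camera pose that a real pipeline would estimate.

```python
import math
from statistics import median


def match_features(desc_a, desc_b, max_ratio=0.8):
    """Step 3: match each descriptor in desc_a to its nearest one in desc_b.

    A match is kept only if the best distance is clearly smaller than the
    second-best one, so ambiguous descriptions are discarded.
    """
    matches = []
    for i, da in enumerate(desc_a):
        # Distances from this descriptor to every descriptor in image B.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        if best <= max_ratio * second:
            matches.append((i, j))
    return matches


def estimate_shift(pts_a, pts_b, matches):
    """Step 4 (toy version): median displacement of the matched features.

    Assumption: a 2D shift replaces the relative camera pose here.
    """
    dx = median(pts_b[j][0] - pts_a[i][0] for i, j in matches)
    dy = median(pts_b[j][1] - pts_a[i][1] for i, j in matches)
    return dx, dy


# Hand-made features: (x, y) positions and 3-number descriptors.
pts_a = [(10, 20), (50, 60), (80, 30)]
desc_a = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Image B sees the same three features in a different order, shifted,
# plus one extra feature that has no counterpart in image A.
pts_b = [(55, 62), (15, 22), (85, 32), (0, 0)]
desc_b = [(0.1, 0.9, 0.0), (0.9, 0.1, 0.0), (0.0, 0.1, 0.9), (0.5, 0.5, 0.5)]

matches = match_features(desc_a, desc_b)
shift = estimate_shift(pts_a, pts_b, matches)
print(matches)  # [(0, 1), (1, 0), (2, 2)]
print(shift)    # (5, 2)
```

The extra feature in the second image is never picked, because every feature from the first image has a much closer candidate. The median is used so that a few wrong matches would not dominate the estimated shift.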

That is it! A detailed explanation of each of the steps is in this post. If you are interested in the formal definition, check here, and the history of WxBS is here.