Definition of WxBS

Let us denote the observations by $O_{i}$, $i=1..n$, each of which belongs to one of the views $V_{j}$, $j=1..m$, $m \leq n$.

An observation consists of spatial information and a "descriptor". A view contains the information that is shared by, and identical for, a group of observations.

For example, a single observation can be an RGB pixel. Its spatial information is the pixel coordinates and its "descriptor" is the RGB value. The view is then the image, together with information about the camera pose, camera intrinsics, sensor and the time the photo was taken. Some of this information can be unknown to the user, i.e. it is a hidden variable.

Another example is an event camera [EventCameraSurvey2020]. In that case the observation contains the pixel coordinates, and the descriptor is the sign of the intensity change. The view contains the information about the sensor and the camera pose, and holds only a single observation, because every event has a unique timestamp.

Observations and views can be of different nature and dimensionality. E.g. $V_1$, $V_2$ -- RGB images, $V_3$ -- a point cloud from a laser scanner, $V_4$ -- an image from a thermal camera, and so on.

An unordered pair of observations $(O_{i},O_{k})$ forms a correspondence $c_{ik}$ if they belong to different views. A group of observations is called a multiview correspondence $C_o$ when it contains exactly one observation $O_i$ per view $V_j$. Some of the observations $O_i$ can be empty, $\varnothing$, i.e. not observed in the specific view $V_j$.
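The definitions above can be sketched as a small data structure. This is only an illustration: the names `Observation`, `is_correspondence` and `is_multiview_correspondence` are hypothetical and not part of the thesis, and an empty observation $\varnothing$ is assumed to be represented by `None`.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Observation:
    """Spatial information plus a descriptor, tied to one view."""
    view_id: int                    # index j of the view V_j it belongs to
    position: Tuple[float, ...]     # e.g. pixel coordinates
    descriptor: Tuple[float, ...]   # e.g. an RGB value


def is_correspondence(o_i: Observation, o_k: Observation) -> bool:
    """A pair forms a correspondence only if it spans two different views."""
    return o_i.view_id != o_k.view_id


def is_multiview_correspondence(
    group: Dict[int, Optional[Observation]], view_ids: Sequence[int]
) -> bool:
    """Exactly one slot per view; a slot may be None (not observed there)."""
    if set(group) != set(view_ids):
        return False
    # Every non-empty observation must sit in the slot of its own view.
    return all(o is None or o.view_id == j for j, o in group.items())


# Two RGB pixels seen in views 1 and 2; nothing observed in view 3.
a = Observation(view_id=1, position=(10.0, 20.0), descriptor=(255.0, 0.0, 0.0))
b = Observation(view_id=2, position=(14.0, 21.0), descriptor=(250.0, 5.0, 3.0))
group = {1: a, 2: b, 3: None}
```

Keying the group by view index makes the "one observation per view" condition hold by construction, so only the coverage of all views has to be checked.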

The world model is the set of constraints on views, observations and correspondences. For example, popular models are epipolar geometry and the rigid motion assumption.

A correspondence is called ground truth or veridical if it satisfies the constraints posed by the world model.
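As an illustration of such a check, the sketch below tests a two-view correspondence against an epipolar-geometry world model by measuring how far the second point lies from the epipolar line of the first. The function names, the pixel tolerance `tol_px` and the example matrix `F_RECTIFIED` (an ideal sideways-translated camera pair with identity intrinsics) are assumptions made for this example only.

```python
import math
from typing import Sequence

Matrix3 = Sequence[Sequence[float]]


def epipolar_distance(x1: Sequence[float], x2: Sequence[float], F: Matrix3) -> float:
    """Distance from x2 to the epipolar line F * x1 in the second view."""
    p1 = (x1[0], x1[1], 1.0)  # homogeneous coordinates
    p2 = (x2[0], x2[1], 1.0)
    line = [sum(F[r][c] * p1[c] for c in range(3)) for r in range(3)]
    norm = math.hypot(line[0], line[1])
    if norm == 0.0:
        # x1 is the epipole: the line is undefined, so nothing can be verified.
        return math.inf
    return abs(sum(p2[r] * line[r] for r in range(3))) / norm


def is_veridical(x1, x2, F: Matrix3, tol_px: float = 1.0) -> bool:
    """True if the pair satisfies the epipolar constraint within tol_px."""
    return epipolar_distance(x1, x2, F) <= tol_px


# Assumed example: pure translation along the x axis, so matching
# points must share the same y coordinate.
F_RECTIFIED = ((0.0, 0.0, 0.0),
               (0.0, 0.0, -1.0),
               (0.0, 1.0, 0.0))

print(is_veridical((10.0, 20.0), (14.0, 20.0), F_RECTIFIED))
print(is_veridical((10.0, 20.0), (14.0, 25.0), F_RECTIFIED))
```

A geometric distance is used instead of the raw algebraic residual $x_2^\top F x_1$ so that the tolerance has a meaning in pixels.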

We can now define wide baseline stereo.

By wide baseline stereo we understand the process of establishing two-view or multi-view correspondences $C_o$ from observations $O_i$ and views $V_{j}$, recovering the missing information about the views, and estimating the unknown parameters of the world model.

Most often in the current thesis we will use the following world model. The scene consists of 3-dimensional elements and is rigid and static. The observations are 2D projections onto the camera plane by a projective pinhole camera. The relationship between observations in different views is either epipolar geometry or a projective transform [Hartley2004]. Any moving object does not satisfy the world model and is therefore considered an occlusion. We will call the distance between the camera centers the "baseline".
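A minimal sketch of this world model follows: one static 3D point is projected by two pinhole cameras, and the baseline is computed as the distance between their centers. The helper names, the focal length and the camera placement are illustrative assumptions, and lens distortion is ignored.

```python
import math
from typing import Sequence, Tuple

IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


def project(X: Sequence[float], R, C: Sequence[float], f: float,
            cx: float = 0.0, cy: float = 0.0) -> Tuple[float, float]:
    """Pinhole projection x ~ K R (X - C), with K built from f, cx, cy."""
    d = [X[i] - C[i] for i in range(3)]  # point relative to the camera center
    xc, yc, zc = (sum(R[r][i] * d[i] for i in range(3)) for r in range(3))
    if zc <= 0:
        raise ValueError("point is behind the camera")
    return (f * xc / zc + cx, f * yc / zc + cy)


def baseline(C1: Sequence[float], C2: Sequence[float]) -> float:
    """Distance between the two camera centers."""
    return math.dist(C1, C2)


# Assumed setup: a point 10 units in front of two cameras that look
# along the z axis and are 1 unit apart along x.
X = (0.0, 0.0, 10.0)
C1, C2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
x1 = project(X, IDENTITY, C1, f=100.0)
x2 = project(X, IDENTITY, C2, f=100.0)
print(x1, x2, baseline(C1, C2))
```

The two projections `x1` and `x2` are the observations of the same scene point in two views, so together they form a correspondence in the sense defined earlier.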

For example, in the image below, the observations $O_i$ are blue circles and the correspondences $c_{jk}$ are shown as lines. The assumed object $X_i$ is a red circle.

We will call the problem "wide multiple baseline stereo", or WxBS [Mishkin2015WXBS], if the observations are of different nature or the conditions under which the observations were made differ.

The difference between wide baseline stereo and short baseline stereo, or simply stereo, is the following. In stereo the baseline is small -- less than 1 meter -- and typically known and fixed. The task is to establish correspondences, which can be done by a 1D search along the known epipolar lines.
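The 1D search can be sketched as follows, assuming a rectified pair in which epipolar lines coincide with image rows of equal length. The function name, the sum-of-absolute-differences cost and the window and disparity limits are choices made for this example, not a method prescribed by the thesis.

```python
from typing import Sequence


def match_along_row(left_row: Sequence[float], right_row: Sequence[float],
                    x: int, half_window: int = 1, max_disparity: int = 4) -> int:
    """Return the shift d >= 0 for which right_row around x - d best
    matches left_row around x (sum of absolute differences)."""
    if not half_window <= x <= len(left_row) - half_window - 1:
        raise ValueError("window around x falls outside the row")
    patch = left_row[x - half_window: x + half_window + 1]
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        xr = x - d
        if xr < half_window:  # the window would leave the right row
            break
        candidate = right_row[xr - half_window: xr + half_window + 1]
        cost = sum(abs(p - q) for p, q in zip(patch, candidate))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Toy rows: the same intensity pattern, shifted by two pixels.
left_row = [0, 0, 0, 9, 5, 9, 0, 0]
right_row = [0, 9, 5, 9, 0, 0, 0, 0]
disparity = match_along_row(left_row, right_row, x=4)
```

Because the epipolar line is known, each pixel needs only one loop over candidate shifts. In the wide baseline case the line itself is unknown, so this shortcut is not available.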

In contrast, in wide baseline stereo the baseline is unknown and mostly unconstrained, and the viewpoints of the cameras can vary drastically.

Wide baseline stereo that also outputs an estimate of the latent objects, e.g. in the form of 3D point world coordinates, we will call rigid structure-from-motion (rigid SfM) or 3D reconstruction. We do not consider object shape approximation with voxels, meshes, etc. in the current thesis. Nor do we consider the recovery of scene albedo, illumination, and other appearance properties.

While the difference between SfM and WBS is often blurred and the terms are used interchangeably, we consider WBS to be the part of the SfM pipeline that precedes the recovery of the 3D point cloud.

Other correspondence problems, such as tracking, optical flow or establishing semantic correspondences, could be defined using the terminology we have established.

References

(Gallego, Delbruck et al., 2020) G. Gallego, T. Delbruck, G. M. Orchard et al., "Event-based Vision: A Survey", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

(Hartley and Zisserman, 2004) R. I. Hartley and A. Zisserman, "Multiple View Geometry in Computer Vision", 2004.

(Mishkin, Matas et al., 2015) D. Mishkin, J. Matas, M. Perdoch et al., "WxBS: Wide Baseline Stereo Generalizations", BMVC, 2015.