Notes - Computer Vision MT25, Multiple view geometry

Course - Computer Vision MT25^U
- Notes - Computer Vision MT25, Homogenous coordinates and homographies^U
- Notes - Computer Vision MT25, Camera models^U

Flashcards

What are multi-view geometry problems?

mvg-problem-definition

Given cameras and correspondences, find a 3D reconstruction of a scene.

Epipolar geometry

Suppose we have two cameras with centres $O$ and $O’$. @Define and @visualise the baseline.

epipolar-baseline-definition

The baseline is the line connecting the two origins.

Suppose we have two cameras with centres $O$ and $O’$ connected by their baseline.

@Define and @visualise the epipoles $e$ and $e$’. What does this look like when the cameras lie on the same line?

epipoles-and-special-case-parallel

The epipoles are where the baseline intersects with the image plane (equivalently, the projections of the other camera in each view).

When the baseline is parallel to the image planes (the “motion parallel to image plane” case, e.g. a lateral stereo rig), the epipoles are infinitely far away and the epipolar lines are parallel.

Suppose we have two cameras with centres $O$ and $O’$ with epipoles $e$ and $e’$, along with a point $X$ which projects onto $x$ and $x’$ respectively.

@Define and visualise the epipolar plane and the epipolar lines in this context.

epipolar-plane-and-lines

The epipolar plane is the plane formed by $X$, $O$ and $O’$.
The epipolar lines connect the epipoles to the projections of $X$, or equivalently the intersection of the epipolar plane with the image plane.

Suppose we observe a single point $x$ in some image taken by a camera with centre $O$.

Given another camera with centre $O’$, where can we find the $x’$ corresponding to the $x$ in the other image?

matching-point-on-epipolar-line

Along the epipolar line corresponding to $x$.

@Visualise what the epipolar lines would look like given a reference image and a target image.

epipolar-lines-visualisation

@State and @visualise the epipolar constraint.

epipolar-constraint-statement

Whenever two points $x$ and $x’$ lie on matching epipolar lines $l$ and $l’$, the visual rays corresponding to them meet in space at some point $X$ (although they do not have to be projections of the same 3D point)

Consider the following setup:

We have two cameras at points $O$ and $O’$, of which the intrinsic $K, K’$ and extrinsic parameters are known
The world coordinate system is set to that of the first camera
The rotation and translation from $O$ to $O’$ is given as above

What are the two projection matrices $P, P’$ from the world coordinates to the camera coordinates?

two-camera-projection-matrices

$K[I \mid 0]$
$K’[R \mid t]$

Consider the following setup:

We have two cameras at points $O$ and $O’$, of which the intrinsic $K, K’$ and extrinsic parameters are known
The world coordinate system is set to that of the first camera
The rotation and translation from $O$ to $O’$ is given as above, so that the projection matrices are given by $K[I \mid 0]$ and $K’[R \mid t]$

Given $X$, $x _ \text{pixel}$ and $x’ _ \text{pixel}$, where are $x$ and $x’$ in normalised image coordinates?

normalised-image-coordinates-conversion

$x _ \text{norm} = K^{-1} x _ \text{pixel} \cong [I \mid 0] X$
$x’ _ \text{norm} = {K’}^{-1} x’ _ \text{pixel} \cong [R \mid t]X$

Consider the following setup:

We have two cameras at points $O$ and $O’$, of which the intrinsic $K, K’$ and extrinsic parameters are known
The world coordinate system is set to that of the first camera
The rotation and translation from $O$ to $O’$ is given as above, so that the projection matrices are given by $K[I \mid 0]$ and $K’[R \mid t]$
To simplify, we work in normalised image coordinates: given $X$, $x _ \text{pixel}$ and $x’ _ \text{pixel}$, we have
- $x _ \text{norm} = K^{-1} x _ \text{pixel} \cong [I \mid 0] X$
- $x’ _ \text{norm} = {K’}^{-1} x’ _ \text{pixel} \cong [R \mid t]X$

@State four different ways of expressing the relationship between $x$ and $x’$:

By simply stating their exact relationship
Using the cross product
Using the “matrix form” of the cross product
Using the essential matrix

essential-matrix-four-derivations

Stating their exact relationship:

The direct transformation between $x$ and $x’$ gives

\[x' \cong Rx + t\]

This means that $x’$, $Rx$ and $t$ are linearly dependent, so:

Using the cross product:

\[x' \cdot [t \times (Rx)] = 0\]

Since

\[a \times b = \begin{bmatrix} 0 & -a _ 3 & a _ 2 \\ a _ 3 & 0 & -a _ 1 \\ -a _ 2 & a _ 1 & 0 \end{bmatrix} \begin{pmatrix} b _ 1 \\ b _ 2 \\ b _ 3 \end{pmatrix} := [a _ \times] b\]

we have:

Using the “matrix form” of the cross product:

\[x'^\top [t _ \times] Rx = 0\]

so defining $[t _ x] R = E$ as the “essential matrix”:

Using the essential matrix:

\[x'^\top E x = 0\]

Consider the following setup:

We have two cameras at points $O$ and $O’$, of which the intrinsic $K, K’$ and extrinsic parameters are known
The world coordinate system is set to that of the first camera
The rotation and translation from $O$ to $O’$ is given as above, so that the projection matrices are given by $K[I \mid 0]$ and $K’[R \mid t]$
To simplify, we work in normalised image coordinates: given $X$, $x _ \text{pixel}$ and $x’ _ \text{pixel}$, we have
- $x _ \text{norm} = K^{-1} x _ \text{pixel} \cong [I \mid 0] X$
- $x’ _ \text{norm} = {K’}^{-1} x’ _ \text{pixel} \cong [R \mid t]X$

Then the direct transformation between $x$ and $x’$ gives

\[x' \cong Rx + t\]

this means that $x’$, $Rx$ and $t$ are linearly dependent, so we have

\[x' \cdot [t \times (Rx)] = 0\]

Collecting this into matrix form and defining $[t _ x] R = E$ as the “essential matrix”, we have

\[x'^\top E x = 0\]

@State how you may:

Determine the epipolar line $l’$ corresponding to a point $x$
Determine the epipolar line $l$ corresponding to a point $x’$

essential-matrix-epipolar-lines

$l’ \cong Ex$
$l \cong E^\top x’$

Consider the following setup:

We have two cameras at points $O$ and $O’$, of which the intrinsic $K, K’$ and extrinsic parameters are known
The world coordinate system is set to that of the first camera
The rotation and translation from $O$ to $O’$ is given as above, so that the projection matrices are given by $K[I \mid 0]$ and $K’[R \mid t]$
To simplify, we work in normalised image coordinates: given $X$, $x _ \text{pixel}$ and $x’ _ \text{pixel}$, we have
- $x _ \text{norm} = K^{-1} x _ \text{pixel} \cong [I \mid 0] X$
- $x’ _ \text{norm} = {K’}^{-1} x’ _ \text{pixel} \cong [R \mid t]X$

Then the direct transformation between $x$ and $x’$ gives

\[x' \cong Rx + t\]

this means that $x’$, $Rx$ and $t$ are linearly dependent, so we have

\[x' \cdot [t \times (Rx)] = 0\]

Collecting this into matrix form and defining $[t _ x] R = E$ as the “essential matrix”, we have

\[x'^\top E x = 0\]

@State the relationship between the essential matrix $E$ and the epipolar points $e$ and $e’$.

essential-matrix-epipoles

$Ee = 0$
$E^\top e’ = 0$

Consider the following setup:

We have two cameras at points $O$ and $O’$, of which the intrinsic $K, K’$ and extrinsic parameters are known
The world coordinate system is set to that of the first camera
The rotation and translation from $O$ to $O’$ is given as above, so that the projection matrices are given by $K[I \mid 0]$ and $K’[R \mid t]$
To simplify, we work in normalised image coordinates: given $X$, $x _ \text{pixel}$ and $x’ _ \text{pixel}$, we have
- $x _ \text{norm} = K^{-1} x _ \text{pixel} \cong [I \mid 0] X$
- $x’ _ \text{norm} = {K’}^{-1} x’ _ \text{pixel} \cong [R \mid t]X$

Then the direct transformation between $x$ and $x’$ gives

\[x' \cong Rx + t\]

this means that $x’$, $Rx$ and $t$ are linearly dependent, so we have

\[x' \cdot [t \times (Rx)] = 0\]

Collecting this into matrix form and defining $[t _ x] R = E$ as the “essential matrix”, we have

\[x'^\top E x = 0\]

@State the rank and the degrees of freedom of $E$.

essential-matrix-rank-dof

Rank: $2$, so that $E$ is singular ($\det E = 0$)
- Why? $R$ is a rotation matrix, hence invertible, so $\text{rank}(E) = \text{rank}([t] _ \times)$. The cross-product matrix $[t] _ \times$ sends $v \mapsto t \times v$, so $t \times t = 0$, the vector $t$ lies in its null space. So $[t] _ \times$ has rank at most $2$, and exactly $2$ for any non-zero $t$.
Degrees of freedom: $5$.
- Why? $R$ contributes 3 degrees of freedom (a rotation has three parameters), and $t$ contributes 3 degrees of freedom. The epipolar constraint $x’^\top E x = 0$ is invariant to scaling $E$ by any scalar, so $E$ is only defined up to scale, removing 1 degree of freedom.

Consider the following setup:

We have two cameras at points $O$ and $O’$, of which the intrinsic $K, K’$ and extrinsic parameters are known
The world coordinate system is set to that of the first camera
The rotation and translation from $O$ to $O’$ is given as above, so that the projection matrices are given by $K[I \mid 0]$ and $K’[R \mid t]$
To simplify, we work in normalised image coordinates: given $X$, $x _ \text{pixel}$ and $x’ _ \text{pixel}$, we have
- $x _ \text{norm} = K^{-1} x _ \text{pixel} \cong [I \mid 0] X$
- $x’ _ \text{norm} = {K’}^{-1} x’ _ \text{pixel} \cong [R \mid t]X$

Then the direct transformation between $x$ and $x’$ gives

\[x' \cong Rx + t\]

this means that $x’$, $Rx$ and $t$ are linearly dependent, so we have

\[x' \cdot [t \times (Rx)] = 0\]

Collecting this into matrix form and defining $[t _ x] R = E$ as the “essential matrix”, we have

\[x'^\top E x = 0\]

What assumptions do we drop in order to @define the fundamental matrix?

fundamental-matrix-from-uncalibrated

We assume that the calibration matrices $K$ and $K’$ of the two cameras are unknown
This gives $x’^\top F x = 0$ where $F = (K’^{-1})^\top E K^{-1}$

Consider the following setup:

We have two cameras at points $O$ and $O’$, of which the intrinsic $K, K’$ and extrinsic parameters are known
The world coordinate system is set to that of the first camera
The rotation and translation from $O$ to $O’$ is given as above, so that the projection matrices are given by $K[I \mid 0]$ and $K’[R \mid t]$
To simplify, we work in normalised image coordinates: given $X$, $x _ \text{pixel}$ and $x’ _ \text{pixel}$, we have
- $x _ \text{norm} = K^{-1} x _ \text{pixel} \cong [I \mid 0] X$
- $x’ _ \text{norm} = {K’}^{-1} x’ _ \text{pixel} \cong [R \mid t]X$

Then the direct transformation between $x$ and $x’$ gives

\[x' \cong Rx + t\]

this means that $x’$, $Rx$ and $t$ are linearly dependent, so we have

\[x' \cdot [t \times (Rx)] = 0\]

Collecting this into matrix form and defining $[t _ x] R = E$ as the “essential matrix”, we have

\[x'^\top E x = 0\]

For the fundamental matrix, we drop some assumptions:

We assume that the calibration matrices $K$ and $K’$ of the two cameras are unknown
This gives $x’^\top F x = 0$ where $F = (K’^{-1})^\top E K^{-1}$

@State:

The relationship between the essential matrix $E$ and the epipolar points $e$ and $e’$.
How you would determine the epipolar line $l’$ corresponding to a point $x$
How you would determine the epipolar line $l$ corresponding to a point $x’$

fundamental-matrix-epipolar-lines-and-epipoles

$l’ \cong Fx$
$l \cong F^\top x’$
$Fe = 0$
$F^\top e’ = 0$

Consider the following setup:

We have two cameras at points $O$ and $O’$, of which the intrinsic $K, K’$ and extrinsic parameters are known
The world coordinate system is set to that of the first camera
The rotation and translation from $O$ to $O’$ is given as above, so that the projection matrices are given by $K[I \mid 0]$ and $K’[R \mid t]$
To simplify, we work in normalised image coordinates: given $X$, $x _ \text{pixel}$ and $x’ _ \text{pixel}$, we have
- $x _ \text{norm} = K^{-1} x _ \text{pixel} \cong [I \mid 0] X$
- $x’ _ \text{norm} = {K’}^{-1} x’ _ \text{pixel} \cong [R \mid t]X$

Then the direct transformation between $x$ and $x’$ gives

\[x' \cong Rx + t\]

this means that $x’$, $Rx$ and $t$ are linearly dependent, so we have

\[x' \cdot [t \times (Rx)] = 0\]

Collecting this into matrix form and defining $[t _ x] R = E$ as the “essential matrix”, we have

\[x'^\top E x = 0\]

For the fundamental matrix, we drop some assumptions:

We assume that the calibration matrices $K$ and $K’$ of the two cameras are unknown
This gives $x’^\top F x = 0$ where $F = (K’^{-1})^\top E K^{-1}$

@State the rank and degrees of freedom of $F$.

fundamental-matrix-rank-dof

Rank: $2$, so $F$ is singular ($\det F = 0$).
- Why? $K$, $L’$ are upper triangular with non-zero diagonal entries (focal length, pixel scaling factors), so they’re invertible. Multiplying by invertible matrices preserves rank, so $\text{rank}(F) = \text{rank}(E) = 2$.
Degrees of freedom: $7$.
- Why? A $3 \times 3$ matrix has $9$ entries. Subtract $1$ for scale (since $x’^\top F x = 0$ is invariant to scaling $F$). Subtract another $1$ for the rank-2 constraint $\det F = 0$. Hence we have rank $9 - 1 - 1 = 7$.
- Why more than $E$? $F$ absorbs the unknown calibration matrices.

Given that the fundamental matrix satisfies the equation

\[x'^\top F x = 0\]

Derive the eight-point algorithm for determining an estimate of $F$.

eight-point-algorithm-derivation

Explicitly, we require for each pair of points $\pmb x, \pmb x’$

\[\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}^\top \begin{bmatrix} f _ {11} & f _ {12} & f _ {13} \\ f _ {21} & f _ {22} & f _ {23} \\ f _ {31} & f _ {32} & f _ {33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = 0\]

which, vectorising $f$ gives

\[\begin{pmatrix} x'x & x'y & x' &y'x & y'y & y' & x & y & 1 \end{pmatrix} \begin{pmatrix} f _ {11} \\ f _ {12} \\ f _ {13} \\ f _ {21} \\ f _ {22} \\ f _ {23} \\ f _ {31} \\ f _ {32} \\ f _ {33} \\ \end{pmatrix} = 0\]

Collecting this into a matrix equation for each of the point correspondences, we have

\[Uf = 0\]

where

\[\begin{aligned} U &= \begin{pmatrix} x _ 1'x _ 1 & x _ 1'y _ 1 & x _ 1' & y _ 1'x _ 1 & y _ 1'y _ 1 & y _ 1' & x _ 1 & y _ 1 & 1 \\ & & & & \vdots & & & & \\ x _ n'x _ n & x _ n'y _ n & x _ n' & y _ n'x _ n & y _ n'y _ n & y _ n' & x _ n & y _ n & 1 \\ \end{pmatrix} \\ \\ \end{aligned}\]

which can be solved via least squares. To enforce the fact $F$ must be rank $2$, we can take the best rank-2 approximation of the found solution using truncated SVD.

Bite-sized

The essential matrix $E = [t] _ \times R$ was introduced by Longuet-Higgins, Nature 1981 in the paper “A computer algorithm for reconstructing a scene from two projections”. The eight-point algorithm for estimating $F$ from correspondences was popularised by Hartley in 1997.

Source: Lecture 15, Derivation slide (Longuet-Higgins reference); Hartley & Zisserman Multiple View Geometry.

@bite~

bite-essential-matrix-historical-attribution

The eight-point algorithm requires at minimum 8 point correspondences to estimate the fundamental matrix $F$ (which has 9 entries, defined up to scale, giving 8 unknowns). With more than 8 correspondences, solve via least squares.

Source: Lecture 15, The Eight Point Algorithm slide.

@bite~

bite-eight-point-minimum-correspondences

@Describe how rank-2 enforcement via SVD truncation works in the eight-point algorithm, and why it’s necessary.

bite-eight-point-rank2-enforcement

The eight-point algorithm gives an initial estimate $F _ \text{init}$ from a least-squares fit to $Uf = 0$. Because the constraint $\det F = 0$ is not directly enforced during the least-squares solve, $F _ \text{init}$ is typically full rank — but the true fundamental matrix must be rank 2.

Why rank-2 is required: $F$ has the geometric meaning that $Fx$ is the epipolar line of $x$; all epipolar lines must pass through the epipole $e’$, which means $F^\top e’ = 0$, so $F$ must have a non-trivial null space, i.e. rank $< 3$. A full-rank $F _ \text{init}$ would predict epipolar lines that don’t quite converge at the epipole — physically impossible.

The fix: take the SVD of the initial estimate

\[F _ \text{init} = U \Sigma V^\top, \quad \Sigma = \mathrm{diag}(\sigma _ 1, \sigma _ 2, \sigma _ 3),\]

zero out the smallest singular value,

\[\Sigma' = \mathrm{diag}(\sigma _ 1, \sigma _ 2, 0),\]

and reconstruct

\[F = U \Sigma' V^\top.\]

This gives the closest rank-2 matrix to $F _ \text{init}$ in Frobenius norm (Eckart-Young theorem), and produces an $F$ whose epipolar lines actually meet at a single epipole.

Source: Lecture 15, Enforcing Rank-2 Constraint slide.

@bite~

@Describe the three special configurations of epipolar geometry covered in the lecture.

bite-epipolar-geometry-three-cases

Converging cameras (general case): the baseline lies in front of both image planes, so the epipoles are finite points and may even be visible inside one of the images. Epipolar lines fan out from the visible epipole. Standard case.
Motion parallel to the image plane (e.g. a lateral stereo rig with both cameras facing forward): the baseline is parallel to both image planes, so the epipoles are at infinity, and the epipolar lines are parallel horizontal lines in both images. Stereo-rectified images are normalised to this case.
Motion perpendicular to the image plane (e.g. a camera moving directly forward): the epipoles coincide with the principal points of both cameras, and epipolar lines radiate outward from the principal point (the focus of expansion). Useful for monocular depth estimation from forward motion.

Source: Lecture 15, Example: Converging Cameras, Example: Motion Parallel to Image Plane, and Example: Motion Perpendicular to Image Plane slides.

@bite~

@Justify why the epipolar constraint $\pmb x’^\top E \pmb x = 0$ is necessary but not sufficient for two image points to be projections of the same 3D point.

bite-epipolar-constraint-necessary-not-sufficient

Necessary: If $\pmb x, \pmb x’$ are projections of the same 3D point $X$, then by construction $X, O, O’$ are coplanar (they define the epipolar plane), which translates algebraically to the epipolar constraint $\pmb x’^\top E \pmb x = 0$. So genuine correspondences always satisfy the constraint.

Not sufficient: The constraint only says $\pmb x’$ lies on the epipolar line $\ell’ = E \pmb x$ — a 1D locus in the second image. There are many 3D points whose rays from the second camera all lie in the same epipolar plane and so project to different points on $\ell’$. Picking any of them gives a different $\pmb x’$ that still satisfies the constraint but corresponds to a different 3D point.

Geometrically: the epipolar constraint reduces a 2D search (the whole second image) to a 1D search (the epipolar line), but you still need a matching step (intensity, descriptor, or local cost) to find the actual corresponding point on $\ell’$.

Source: Lecture 15, Epipolar Constraint: Key Results slide.

@bite~

@Justify the rank-2 property of the essential matrix $E = [t] _ \times R$.

bite-essential-matrix-rank2-justification

The cross-product matrix $[t] _ \times$ has rank at most 2 for any non-zero $t$: it sends $\pmb v \mapsto t \times \pmb v$, and $t \times t = \pmb 0$, so $t$ is always in its null space — making its kernel at least 1-dimensional. For a 3D vector $t \ne 0$, the kernel is exactly 1-dimensional, so $\mathrm{rank}([t] _ \times) = 2$.

$R$ is a rotation matrix, hence orthogonal and invertible (rank 3). Multiplying $[t] _ \times$ on the right by an invertible matrix doesn’t change the rank:

\[\mathrm{rank}(E) = \mathrm{rank}([t] _ \times R) = \mathrm{rank}([t] _ \times) = 2.\]

Equivalently: $E^\top e’ = 0$ for the epipole $e’$, so $E^\top$ has a non-trivial null space.

Source: Lecture 15, The Essential Matrix: Properties slide.

@bite~

The relationship $F = (K’)^{-\top} E K^{-1}$ shows that the fundamental matrix is essentially the essential matrix expressed in pixel rather than normalised coordinates. Since $K, K’$ are invertible (upper triangular with positive diagonal), $F$ has the same rank as $E$ (namely 2), but $F$ has 7 DoF rather than 5 because it also absorbs the unknown calibration.

Source: Lecture 15, Epipolar Constraint: Uncalibrated Case slide.

@bite~

bite-fundamental-from-essential

Flashcards

Epipolar geometry

Bite-sized

Related posts