# Stereo 3D reconstruction (OpenCV) and the impact of the ray-field

Goal: show how the **2D** improvement (ChArUco corner localization) translates into an improvement of **stereo calibration** and **3D triangulation** with “classic” OpenCV tools.

This page is intentionally separate from `docs/RAYFIELD_WORKED_EXAMPLE.md`: it focuses on the “traditional calibration + 3D reconstruction” pipeline.

## Why is this surprising on a *pinhole* dataset?

Even if images are generated from a pinhole model (with Brown distortion), the 2D measurements are not “perfect”:

- blur (including edge blur), texture interpolation, sensor noise,
- optional compression/quantization depending on the dataset,
- ArUco/ChArUco detection outliers.

In this regime, OpenCV calibration is often limited by **2D localization quality** (more than by the projection model itself).
The ray-field acts as a **geometric denoiser** on the board plane, so OpenCV is fed with more coherent 2D observations. A similarity alignment (Sim(3)) is used only when comparing reconstructed 3D points to world-referenced ground truth with an underconstrained gauge. Stereo scale stability is reported in the camera frame, without alignment, as the baseline error converted to disparity pixels at the GT point depths.

For a robustness study across board sizes/focal lengths/aberration levels, see: [Robustness sweep](ROBUSTNESS_SWEEP.md).

## Evaluated pipeline

For each frame (left/right pair):

1) Detect ArUco markers (marker corners).
2) Build two variants of ChArUco corners passed to OpenCV:
   - `raw`: OpenCV “raw” ChArUco corners,
   - `rayfield_tps_robust`: corners predicted by `H + TPS (λ) + IRLS (Huber)`.
3) Monocular calibration: `cv2.calibrateCamera` (left, then right).
4) Stereo calibration: `cv2.stereoCalibrate` with fixed intrinsics, estimating a **single** $(R,T)$ over **all** selected pairs.
5) Triangulation: `cv2.triangulatePoints` after `cv2.undistortPoints`.
6) Compare to the dataset 3D ground truth (`XYZ_world_mm` in `gt_charuco_corners.npz`).

### Used views (important)

The stereo rig $(R,T)$ is not estimated from a single pair: OpenCV minimizes the error over a **list of views** (a view = a left/right pair with enough corners).
The exported JSON contains:

- `n_views_left`, `n_views_right`: number of monocular views used by `calibrateCamera`,
- `n_views_stereo`: number of stereo views used by `stereoCalibrate`,
- `view_stats.*.frame_ids`: which `frame_id` actually contributed,
- `view_stats.*.n_corners`: number-of-corners statistics per view (mean/p50/p95/min/max).

### Runtime (order of magnitude)

- Planar refinement (homography + robust TPS) on CPU (Intel Core Ultra 5 228V): $\approx$0.38 s for $\sim$120 markers and $\sim$400 predicted corners (single core, Python/NumPy).
- Ray-field fitting (central, $n_{\max}{=}10$, 3 outer iters) on 5 frames: $\approx$8 s on the same CPU.
- Ray evaluation once fitted: a few μs per pixel (basis + dot products), comparable to a pinhole model.

## “Baseline error” in pixels (disparity-equivalent)

The baseline is in mm, but its error can be expressed as an equivalent disparity error (px) at depth $Z$:

```{math}
:label: eq-baseline-px
\Delta d\;(\mathrm{px}) \approx \frac{f_x\;(\mathrm{px})\;\Delta B\;(\mathrm{mm})}{Z\;(\mathrm{mm})}.
```

In the results below, we report a summary of $|\Delta d|$ over GT points (RMS/P95), which provides a more intuitive “image-domain” unit.
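
As a minimal sketch of this conversion (not the script's actual code), the snippet below applies the formula above per point and then summarizes the result. The helper names and all numeric values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math


def baseline_error_to_disparity_px(fx_px, delta_b_mm, depths_mm):
    """Per-point |delta_d| = |f_x * delta_B / Z| in pixels."""
    return [abs(fx_px * delta_b_mm / z) for z in depths_mm]


def rms(values):
    """Root mean square of a non-empty sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def percentile(values, q):
    """Percentile q in [0, 100], linearly interpolated between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Hypothetical values (not taken from the dataset).
fx_px = 1000.0                        # focal length in pixels
delta_b_mm = 0.5                      # baseline error in mm
depths_mm = [1000.0, 1500.0, 2000.0]  # depths of three GT points in mm

delta_d_px = baseline_error_to_disparity_px(fx_px, delta_b_mm, depths_mm)
print("per-point |delta_d| (px):", [round(v, 3) for v in delta_d_px])
print("RMS (px):", round(rms(delta_d_px), 3))
print("P95 (px):", round(percentile(delta_d_px, 95), 3))
```

Because the error scales as $1/Z$, the nearest points dominate the summary, which is why the statistic is taken over the GT points rather than at a single depth.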

## Reproducible script

The script:

- compares `raw` vs `rayfield_tps_robust`,
- produces calibration and triangulation metrics,
- also compares against the GT baseline (if present in `meta.json`).

Command:

```bash
.venv/bin/python paper/experiments/compare_opencv_calibration_rayfield.py dataset/v0_png \
  --split train --scene scene_0000 \
  --out docs/assets/stereo_reconstruction_example/scene_0000_calib.json
```

Output: `docs/assets/stereo_reconstruction_example/scene_0000_calib.json`.

## Results (example)

Extract (scene_0000, split `train`):

```{list-table} Stereo calibration and triangulation summary (scene_0000, train).
:name: tab-stereo-calib-example
:header-rows: 1

* - 2D method
  - Mono RMS L (px)
  - Mono RMS R (px)
  - Stereo RMS (px)
  - Baseline $\Delta B$ (mm)
  - Baseline $|\Delta d|$ RMS (px)
  - Triangulation RMS (mm)
* - raw
  - 0.306
  - 0.302
  - 0.381
  - 0.439
  - 0.424
  - 8.986
* - rayfield\_tps\_robust
  - 0.079
  - 0.061
  - 0.163
  - -0.212
  - 0.205
  - 7.161
```

The main result is Tab. {numref}`tab-stereo-calib-example`: the 2D method only changes the quality of the 2D points provided to OpenCV, and we then observe its impact on stereo calibration and 3D reconstruction.

### Intrinsics and distortion vs GT (%)

On synthetic data, we can also compare the estimated “physical” parameters (focal length and distortion) to ground truth. The script exports:

- `mono.percent_vs_gt.left.K.fx` / `fy`: relative error (%) on $f_x, f_y$,
- relative errors (%) for each coefficient $k_1,k_2,p_1,p_2,k_3$,
- `mono.distortion_displacement_vs_gt.*`: distortion-field comparison in pixels (more robust/interpretable than comparing coefficients directly).
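
The distortion-field comparison can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: the circles are sampled in normalized coordinates at hypothetical radii, a single focal length converts displacements to pixels, and the function names are invented for this example. The coefficient order follows OpenCV's $(k_1,k_2,p_1,p_2,k_3)$.

```python
import math


def brown_displacement(x, y, coeffs):
    """Displacement added by the Brown model at the normalized point (x, y)."""
    k1, k2, p1, p2, k3 = coeffs
    r2 = x * x + y * y
    radial = k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    dx = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    dy = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return dx, dy


def sample_circles(radii, n_per_circle=64):
    """Points on concentric circles around the principal point (normalized units)."""
    return [
        (r * math.cos(2.0 * math.pi * i / n_per_circle),
         r * math.sin(2.0 * math.pi * i / n_per_circle))
        for r in radii
        for i in range(n_per_circle)
    ]


def distortion_field_error(coeffs_est, coeffs_gt, f_px, radii):
    """Return (error RMS in px, GT displacement RMS in px, error in % of GT RMS)."""
    points = sample_circles(radii)
    err_sq = gt_sq = 0.0
    for x, y in points:
        ex, ey = brown_displacement(x, y, coeffs_est)
        gx, gy = brown_displacement(x, y, coeffs_gt)
        err_sq += (f_px * (ex - gx)) ** 2 + (f_px * (ey - gy)) ** 2
        gt_sq += (f_px * gx) ** 2 + (f_px * gy) ** 2
    err_rms = math.sqrt(err_sq / len(points))
    gt_rms = math.sqrt(gt_sq / len(points))  # must be non-zero for the ratio
    return err_rms, gt_rms, 100.0 * err_rms / gt_rms


# Hypothetical coefficients and focal length, for illustration only.
coeffs_gt = (-0.10, 0.02, 0.001, -0.001, 0.0)
coeffs_est = (-0.09, 0.01, 0.001, -0.001, 0.0)
print(distortion_field_error(coeffs_est, coeffs_gt, f_px=1000.0, radii=[0.1, 0.2, 0.3]))
```

Comparing displacement fields sidesteps the fact that different coefficient sets can produce nearly the same distortion over the sampled region.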

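In the example below, the RMS GT distortion displacement is $\approx 1.404\,\mathrm{px}$ (left) and $\approx 0.947\,\mathrm{px}$ (right) on the sampled circles. The “dist err (%)” columns are consistent with the error RMS expressed as a fraction of this GT RMS (e.g., $0.205/1.404 \approx 14.6\,\%$ for `raw`, left).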

```{list-table} Relative errors (%) on focal length and distortion field (scene_0000, train).
:name: tab-mono-percent-example
:header-rows: 1

* - 2D method
  - fx L (%)
  - fy L (%)
  - dist L err (%)
  - dist L err RMS (px)
  - fx R (%)
  - fy R (%)
  - dist R err (%)
  - dist R err RMS (px)
* - raw
  - 0.062
  - 0.010
  - 14.6
  - 0.205
  - 1.672
  - 1.544
  - 15.7
  - 0.149
* - rayfield\_tps\_robust
  - 0.251
  - 0.320
  - 22.6
  - 0.317
  - 0.688
  - 0.690
  - 16.9
  - 0.160
```

Note: these percentages must be interpreted carefully, because OpenCV can trade off “intrinsics vs distortion” while keeping a low reprojection RMS. For reconstruction, Tab. {numref}`tab-stereo-calib-example` (RMS + baseline in px) remains the most direct indicator.

### Rectification: epipolar stability (vertical disparity)

To quantify the impact on a dense-stereo pipeline, the script also computes **post-rectification** metrics from the estimated model $(K_L,d_L,K_R,d_R,R,T)$:

- `vertical_disparity_measured_px`: $|y_L^{rect}-y_R^{rect}|$ on detected points,
- `vertical_disparity_gt_px`: same on GT points (same estimated rectification, hence “model error”),
- `disparity_error_measured_px`: rectified disparity error $|(x_L^{rect}-x_R^{rect})-(x_{L,GT}^{rect}-x_{R,GT}^{rect})|$.
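
Given already-rectified point coordinates, the three quantities above reduce to a few lines. The sketch below assumes the rectified points are available as lists of `(x, y)` tuples (for instance obtained with `cv2.stereoRectify` followed by `cv2.undistortPoints` with the rectification `R` and `P` arguments); the function name and sample coordinates are hypothetical, and only the dictionary keys come from the list above.

```python
import math


def rms(values):
    """Root mean square of a non-empty sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rectification_metrics(left, right, left_gt, right_gt):
    """RMS summaries of the post-rectification quantities.

    Each argument is a list of rectified (x, y) pixel coordinates, with the
    same point order in all four lists.
    """
    dy_measured = [abs(l[1] - r[1]) for l, r in zip(left, right)]
    dy_gt = [abs(l[1] - r[1]) for l, r in zip(left_gt, right_gt)]
    disparity_error = [
        abs((l[0] - r[0]) - (lg[0] - rg[0]))
        for l, r, lg, rg in zip(left, right, left_gt, right_gt)
    ]
    return {
        "vertical_disparity_measured_px": rms(dy_measured),
        "vertical_disparity_gt_px": rms(dy_gt),
        "disparity_error_measured_px": rms(disparity_error),
    }


# Hypothetical rectified coordinates for two corners.
left = [(400.0, 300.2), (520.0, 410.0)]
right = [(350.0, 300.0), (465.5, 409.6)]
left_gt = [(400.1, 300.1), (520.0, 410.1)]
right_gt = [(350.0, 300.0), (465.0, 410.0)]

metrics = rectification_metrics(left, right, left_gt, right_gt)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```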

```{list-table} Rectification metrics (scene_0000, train).
:name: tab-rectification-example
:header-rows: 1

* - 2D method
  - |Δy| RMS (px)
  - |Δy| GT RMS (px)
  - |Δd| RMS (px)
  - ray skew RMS (mm)
* - raw
  - 0.379
  - 0.244
  - 0.369
  - 0.400
* - rayfield\_tps\_robust
  - 0.218
  - 0.195
  - 0.138
  - 0.250
```

Tab. {numref}`tab-rectification-example` makes the key trade-off explicit: even if some intrinsics/distortion parameters can drift, the **epipolar coherence** improves markedly (measured $|\Delta y|$ RMS from 0.379 to 0.218 px, disparity error RMS from 0.369 to 0.138 px), which is critical for stereo algorithms that assume **row-wise** correspondences.

### Discussion: epipolar stability vs “parameter truth”

With planar targets and a limited number of poses, OpenCV optimization is known to exhibit couplings between:

- intrinsics ($f_x,f_y,c_x,c_y$),
- distortion (e.g., Brown $k_1,k_2,p_1,p_2,k_3$),
- stereo relative pose ($R,T$).

The ray-field only changes the 2D observations, and can therefore shift the optimum towards a solution with more stable **epipolar geometry** (Tab. {numref}`tab-rectification-example`), without necessarily matching the GT Brown model coefficient-by-coefficient.

For reconstruction, the rectified stereo equation

```{math}
:label: eq-stereo-depth
Z = \frac{f_x\,B}{d}
```

shows that a relative error on $f_x$ (or $B$) mainly yields a global scale error on $Z$, whereas rectification errors (vertical disparity) and errors on the disparity $d$ directly affect matching quality and 3D noise.
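
As a first-order check of this statement (a sketch, assuming small perturbations $\delta f_x$, $\delta B$, $\delta d$, symbols introduced here for illustration only), taking the logarithmic differential of the rectified stereo equation gives:

```{math}
\frac{\delta Z}{Z} \approx \frac{\delta f_x}{f_x} + \frac{\delta B}{B} - \frac{\delta d}{d}.
```

The first two terms are identical for every point, hence a global scale factor, while the last term varies per point and grows as the disparity shrinks, i.e. with depth.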

## Theory: from baseline to ray intersection

In metric stereo vision (robotics, dense stereo) as well as in metrology (stereo-DIC), it is tempting to think that 3D accuracy depends only on 2D matching quality. In practice, accuracy also depends — and often primarily — on how well the **geometric model** makes the two optical rays associated with corresponding pixels **nearly intersect** in 3D.

### 1) Two 3D rays associated with a 2D correspondence

Let a 2D correspondence be $\mathbf u_L=(u_L,v_L)$ in the left image and $\mathbf u_R=(u_R,v_R)$ in the right image.
Define homogeneous coordinates $\tilde{\mathbf u}=(u,v,1)^\top$ and normalized coordinates:

```{math}
:label: eq-normalized-coords
\mathbf x_L \sim \mathbf K_L^{-1}\tilde{\mathbf u}_L,\qquad
\mathbf x_R \sim \mathbf K_R^{-1}\tilde{\mathbf u}_R.
```

In the left-camera frame, a ray can be written as a line:

```{math}
:label: eq-rays
\mathcal D_L(\lambda)=\mathbf C_L+\lambda\,\mathbf d_L,\qquad
\mathcal D_R(\mu)=\mathbf C_R+\mu\,\mathbf d_R,
```

where $\mathbf C_L=(0,0,0)^\top$, $\mathbf d_L=\mathbf x_L/\lVert\mathbf x_L\rVert$, and $\mathbf d_R=\mathbf R^\top\mathbf x_R/\lVert\mathbf x_R\rVert$ is the right-ray direction expressed in the left frame.
With OpenCV’s `stereoCalibrate` convention ($\mathbf X_R=\mathbf R\,\mathbf X_L+\mathbf T$), the right-camera center in the left frame is:

```{math}
:label: eq-right-center
\mathbf C_R = -\mathbf R^\top \mathbf T.
```

From each correspondence we thus obtain two rays $(\mathcal D_L,\mathcal D_R)$; the **ray skew** is defined as the minimum distance between these two lines in space. An analytical expression is:

```{math}
\delta(\mathcal D_L,\mathcal D_R)=\left|(\mathbf C_R-\mathbf C_L)\cdot\mathbf n\right|,
\quad \text{with }\mathbf n=\frac{\mathbf d_L\times\mathbf d_R}{\|\mathbf d_L\times\mathbf d_R\|},
```

where $\mathbf n$ is the unit normal to the plane spanned by the two directions (defined when the rays are not parallel). The **ray skew RMS** reported in Tab. {numref}`tab-rectification-example` is the root mean square of $\delta(\mathcal D_L,\mathcal D_R)$ over all correspondences; the script also exports P95 and max. It serves as a geometry-consistency diagnostic: perfect geometry yields a skew close to zero, whereas large skews indicate that the modeled rays no longer intersect because of pose/intrinsic errors or noisy observations.

### 2) The practical case: skew lines

In a perfect world, $\mathcal D_L$ and $\mathcal D_R$ intersect exactly at the 3D point $\mathbf X$.
In practice (imperfect calibration, residual 2D noise), the two lines are not intersecting: they are **skew**.

Triangulation algorithms then choose a best-fit point $\hat{\mathbf X}$, typically by solving a linear least-squares (DLT) system (as `cv2.triangulatePoints` does), by minimizing a reprojection criterion, or by finding the point closest to both lines.
A useful geometric quantity is the **minimum distance between the two lines**, which directly measures “how much the rays miss each other”.
For $\mathbf C_L=\mathbf 0$, this distance (per point) can be written:

```{math}
:label: eq-skew-distance
d_{\mathrm{skew}} = \frac{\left|(\mathbf C_R)\cdot(\mathbf d_L\times \mathbf d_R)\right|}{\lVert \mathbf d_L\times \mathbf d_R\rVert}.
```

The script exports this metric in mm: `stereo.ray_skew_distance_mm` (RMS/P95/max). It does not replace a GT error, but it explains *why* a calibration may yield unstable triangulation even if 2D correspondences look plausible.
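
The skew distance and the "point closest to both lines" can be sketched in plain Python as below. This is an illustration rather than the script's code: the function names are hypothetical, inputs are normalized homogeneous coordinates $(x, y, 1)$, and the rig (identity rotation, 100 mm baseline) is made up. The midpoint method shown here is a simple stand-in, not the algorithm used by `cv2.triangulatePoints`.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate_transpose(R, v):
    """Return R^T v for a 3x3 matrix R given as nested rows."""
    return tuple(sum(R[k][i] * v[k] for k in range(3)) for i in range(3))


def rays_in_left_frame(x_left, x_right, R, T):
    """Directions d_L, d_R and right centre C_R, with X_R = R X_L + T."""
    d_left = tuple(x_left)
    d_right = rotate_transpose(R, x_right)               # right ray in left frame
    c_right = tuple(-c for c in rotate_transpose(R, T))  # C_R = -R^T T
    return d_left, d_right, c_right


def ray_skew(x_left, x_right, R, T):
    """Minimum distance between the two rays, in the units of T."""
    d_left, d_right, c_right = rays_in_left_frame(x_left, x_right, R, T)
    n = cross(d_left, d_right)
    n_norm = math.sqrt(dot(n, n))
    if n_norm < 1e-12:
        raise ValueError("rays are (nearly) parallel; the formula is undefined")
    return abs(dot(c_right, n)) / n_norm


def midpoint_triangulation(x_left, x_right, R, T):
    """Midpoint of the shortest segment joining the two rays (left frame)."""
    a, b, c = rays_in_left_frame(x_left, x_right, R, T)
    aa, bb, ab = dot(a, a), dot(b, b), dot(a, b)
    ca, cb = dot(c, a), dot(c, b)
    denom = aa * bb - ab * ab
    if denom < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    lam = (ca * bb - ab * cb) / denom  # parameter along the left ray
    mu = (ab * ca - aa * cb) / denom   # parameter along the right ray
    p_left = tuple(lam * ai for ai in a)
    p_right = tuple(ci + mu * bi for ci, bi in zip(c, b))
    return tuple(0.5 * (p + q) for p, q in zip(p_left, p_right))


# Hypothetical rig: no rotation, right camera 100 mm along +x of the left frame.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T = (-100.0, 0.0, 0.0)
x_left = (0.05, 0.02, 1.0)          # projection of X = (50, 20, 1000) mm
x_right_exact = (-0.05, 0.02, 1.0)  # consistent observation
x_right_noisy = (-0.05, 0.021, 1.0)  # vertical offset of 0.001 (normalized)

print("skew, exact (mm):", ray_skew(x_left, x_right_exact, R, T))
print("skew, noisy (mm):", ray_skew(x_left, x_right_noisy, R, T))
print("midpoint (mm):", midpoint_triangulation(x_left, x_right_exact, R, T))
```

In this toy configuration, a vertical offset of 0.001 in normalized coordinates at a depth of about 1 m produces a skew of roughly 1 mm, which gives a feel for the magnitudes in Tab. {numref}`tab-rectification-example`.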

### 3) Epipolar constraint and the role of the baseline

The ideal condition for a pair $(\mathbf x_L,\mathbf x_R)$ to correspond to a common 3D point under a model $(\mathbf R,\mathbf T)$ is the **epipolar constraint**:

```{math}
:label: eq-epipolar
\mathbf x_R^\top \mathbf E\,\mathbf x_L = 0,
\qquad \mathbf E = [\mathbf T]_{\times}\mathbf R,
```

where $[\mathbf T]_{\times}$ is the skew-symmetric matrix associated with the cross product. For $\mathbf T=(t_x,t_y,t_z)^\top$:

```{math}
:label: eq-cross-matrix
[\mathbf T]_{\times} =
\begin{bmatrix}
0 & -t_z & t_y\\
t_z & 0 & -t_x\\
-t_y & t_x & 0
\end{bmatrix},
\qquad
[\mathbf T]_{\times}\,\mathbf a = \mathbf T \times \mathbf a.
```

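An error on the direction of $\mathbf T$ (or on the rotation) makes $\mathbf E$ inconsistent: observed pairs $(\mathbf x_L,\mathbf x_R)$ no longer satisfy the constraint, which manifests as more skew rays (higher $d_{\mathrm{skew}}$) and less reliable rectification (higher $|\Delta y|$ and disparity errors; see Tab. {numref}`tab-rectification-example`). A pure error on the baseline length $\lVert\mathbf T\rVert$ only rescales $\mathbf E$, so the constraint remains satisfied and the error appears instead as a scale error on depth.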
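
The constraint can be checked numerically with a short sketch. The function names and the toy rig (identity rotation, 100 mm baseline along $x$) are hypothetical; points are normalized homogeneous coordinates $(x, y, 1)$. The example evaluates the algebraic residual $\mathbf x_R^\top \mathbf E\,\mathbf x_L$ for the true translation, for a translation with a wrong length, and for a translation with a wrong direction.

```python
def skew_matrix(t):
    """Cross-product matrix [t]_x such that [t]_x a = t x a."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul3(A, B):
    """Product of two 3x3 matrices given as nested rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def essential_matrix(R, T):
    """E = [T]_x R for the convention X_R = R X_L + T."""
    return matmul3(skew_matrix(T), R)


def epipolar_residual(x_left, x_right, E):
    """Algebraic residual x_R^T E x_L for normalized homogeneous points."""
    e_x = [sum(E[i][j] * x_left[j] for j in range(3)) for i in range(3)]
    return sum(x_right[i] * e_x[i] for i in range(3))


# Hypothetical rig and one consistent correspondence.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x_left = (0.05, 0.02, 1.0)
x_right = (-0.05, 0.02, 1.0)

T_true = (-100.0, 0.0, 0.0)
T_wrong_length = (-101.0, 0.0, 0.0)    # same direction, 1 mm longer
T_wrong_direction = (-100.0, 1.0, 0.0)  # tilted by about 0.57 degrees

for name, T in [("true", T_true),
                ("wrong length", T_wrong_length),
                ("wrong direction", T_wrong_direction)]:
    residual = epipolar_residual(x_left, x_right, essential_matrix(R, T))
    print(f"{name}: residual = {residual:.6f}")
```

The residual stays at zero when only the length of $\mathbf T$ changes, since $\mathbf E$ is merely rescaled, and becomes non-zero as soon as the direction of $\mathbf T$ is off.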

### 4) Why this matters for robotics and stereo-DIC

- **Robotics / dense stereo**: rectification assumes nearly horizontal correspondences. Reducing $|\Delta y|$ and the post-rectification disparity error facilitates “row-wise” matching and reduces depth noise.
- **Metrology / stereo-DIC**: even if rectification is sometimes avoided (to limit interpolation), reconstruction still relies on triangulation with $(K,d,R,T)$. Stabilizing epipolar geometry reduces ray inconsistency, hence 3D bias/noise induced by calibration.

### Metric definitions (table columns)

- **2D method**: how 2D corners $(u,v)$ are produced before being passed to OpenCV.
  - `raw`: OpenCV raw ChArUco corners.
  - `rayfield_tps_robust`: corners predicted by `H + TPS (λ) + IRLS (Huber)` from ArUco corners.
- **Mono RMS L (px)**: reprojection RMS (px) returned by `cv2.calibrateCamera` on the left camera using corners from the given method.
- **Mono RMS R (px)**: same for the right camera.
- **Stereo RMS (px)**: reprojection RMS (px) returned by `cv2.stereoCalibrate` (fixed intrinsics), using corners from the given method on left/right pairs.
- **Baseline $\Delta B$ (mm)**: error on the norm of the estimated translation,

  ```{math}
  :label: eq-baseline-mm
  \Delta B = \lVert \mathbf T\rVert - B_{\mathrm{GT}}.
  ```

- **Baseline $|\Delta d|$ RMS (px)**: converts baseline error into an “equivalent disparity” error (px) at GT depths,

  ```{math}
  :label: eq-baseline-px-abs
  |\Delta d| = \left|\frac{f_x\,\Delta B}{Z}\right|,
  ```

  then summarizes $|\Delta d|$ over GT points (RMS/P95/max). This is the most intuitive unit for reconstruction impact.
- **Triangulation RMS (mm)**: RMS 3D error $\lVert \hat{\mathbf X}-\mathbf X_{\mathrm{GT}}\rVert$ (mm) after `cv2.triangulatePoints` (on undistorted points), summarized over all triangulated corners.

To make this value interpretable, the script also exports:

- **depth\_mm**: depth distribution $Z$ (mm) of used GT points (P05/P50/P95),
- **triangulation\_error\_rel\_z\_percent**: relative error $100\,\lVert \hat{\mathbf X}-\mathbf X_{\mathrm{GT}}\rVert/Z$ (RMS/P95/max).

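Thus, for `rayfield_tps_robust` in this scene, “RMS = 7.161 mm” can be read as “$\approx 0.52\%$ of $Z$ (RMS)” at a median depth of $Z \approx 1.54\,\mathrm{m}$ (see Tab. {numref}`tab-stereo-triang-interpret`).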
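
A minimal sketch of how the absolute and depth-normalized errors relate, assuming both point sets are expressed in the same camera frame with $Z$ along the optical axis. The function name, dictionary keys, and sample points are hypothetical, not the script's actual export names.

```python
import math


def rms(values):
    """Root mean square of a non-empty sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def triangulation_error_stats(points_est_mm, points_gt_mm):
    """Absolute (mm) and depth-normalized (% of Z) 3D error summaries.

    Both inputs are lists of (X, Y, Z) tuples in the same frame; the GT depth
    Z of each point is used for the normalization.
    """
    abs_err = [math.dist(p, q) for p, q in zip(points_est_mm, points_gt_mm)]
    rel_err = [100.0 * e / q[2] for e, q in zip(abs_err, points_gt_mm)]
    return {
        "rms_mm": rms(abs_err),
        "rms_rel_z_percent": rms(rel_err),
        "max_rel_z_percent": max(rel_err),
    }


# Hypothetical points: a 3 mm error at 1 m and a 4 mm error at 2 m.
points_gt_mm = [(0.0, 0.0, 1000.0), (0.0, 0.0, 2000.0)]
points_est_mm = [(0.0, 0.0, 1003.0), (0.0, 4.0, 2000.0)]

stats = triangulation_error_stats(points_est_mm, points_gt_mm)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Because the ratio is taken per point before summarizing, the relative RMS generally differs from the absolute RMS divided by the median depth.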

### Interpreting triangulation (mm) vs working distance

Absolute errors (mm) depend strongly on working distance: for a constant disparity error $\sigma_d$ (px), the classic stereo approximation yields:

```{math}
:label: eq-depth-error
\sigma_Z \approx \frac{Z^2}{f_x\,B}\,\sigma_d
\quad\Longrightarrow\quad
\frac{\sigma_Z}{Z} \approx \frac{Z}{f_x\,B}\,\sigma_d \approx \frac{\sigma_d}{d},
```

where $Z$ is depth, $B$ the baseline, $f_x$ the focal length (px), $d$ the disparity (px).
We therefore also report a relative metric (% of $Z$), which enables comparisons across scenes with different distances.
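
To get a feel for this scaling, the following sketch evaluates the approximation at a few depths. The focal length, baseline, and disparity noise are hypothetical round numbers, not values from the dataset, and the function names are illustrative.

```python
def disparity_px(z_mm, fx_px, baseline_mm):
    """Rectified disparity d = f_x * B / Z in pixels."""
    return fx_px * baseline_mm / z_mm


def depth_sigma_mm(z_mm, fx_px, baseline_mm, sigma_d_px):
    """First-order depth uncertainty: sigma_Z = Z^2 / (f_x * B) * sigma_d."""
    return z_mm ** 2 / (fx_px * baseline_mm) * sigma_d_px


# Hypothetical rig and disparity noise.
fx_px = 1000.0
baseline_mm = 100.0
sigma_d_px = 0.1

for z_mm in (1000.0, 1500.0, 2000.0):
    sigma_z = depth_sigma_mm(z_mm, fx_px, baseline_mm, sigma_d_px)
    d = disparity_px(z_mm, fx_px, baseline_mm)
    print(f"Z = {z_mm:.0f} mm: d = {d:.1f} px, "
          f"sigma_Z = {sigma_z:.2f} mm ({100.0 * sigma_z / z_mm:.3f} % of Z)")
```

The absolute error grows quadratically with depth while the relative error grows only linearly, which is why the percentage of $Z$ is the better basis for comparing scenes.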

On the example of Tab. {numref}`tab-stereo-calib-example`:

```{list-table} Depth and normalized 3D error (same example).
:name: tab-stereo-triang-interpret
:header-rows: 1

* - 2D method
  - Depth P50 (mm)
  - Depth [P05, P95] (mm)
  - Triang RMS (%Z)
  - Triang P95 (%Z)
* - raw
  - 1539
  - [909, 1612]
  - 0.615
  - 1.057
* - rayfield\_tps\_robust
  - 1539
  - [909, 1612]
  - 0.518
  - 0.589
```

Tab. {numref}`tab-stereo-triang-interpret` shows that, despite “visually large” mm errors, the RMS relative error is only about $0.5$ to $0.6\,\%$ of $Z$, and that the ray-field improvement is consistent with the strong drop in baseline error in pixel units.

Note: in the v0 synthetic datasets, `gt_charuco_corners.npz` provides `XYZ_world_mm`. The script assumes that these 3D coordinates are expressed in the triangulation frame (left-camera frame), which holds for `dataset/v0_png/train/scene_0000` (verified by reprojection).

Reading guide:

- The drop in reprojection RMS (mono: 0.306/0.302 px to 0.079/0.061 px; stereo: 0.381 px to 0.163 px) indicates that OpenCV absorbs much less localization error.
- The baseline error in pixels (equivalent disparity) is roughly halved (0.424 px to 0.205 px RMS), showing that stereo geometry (scale) becomes more stable.
- 3D triangulation improves as well (8.986 mm to 7.161 mm RMS), but it also depends on the quality of estimated intrinsics/distortion and on pose geometry.