3D ray-field (central) and ray-based calibration

This page introduces a first “ray-based 3D” building block targeting optical systems more complex than a global pinhole model (e.g., CMO). The current implementation starts validation on synthetic pinhole data to provide a clear “oracle” reference and controlled ablations.

We compare this ray-field pipeline against Pycaso-style polynomial stereo (direct mapping + Soloff/LM) because Pycaso represents a canonical “non-parametric” stereo baseline: it learns a non-pinhole mapping between correspondences and 3D points, so it already captures some of the flexibility needed for complex optics. The new PYCASO_Z_SWEEP benchmark (and the compression stress test documented in docs/ROBUSTNESS_SWEEP.md) makes it easy to reproduce side-by-side comparisons between Pycaso and the ray-field 3D bundle adjustment.

What this enables (and what it does not)

Goal: reconstruct 3D by intersecting two rays (left/right) without assuming a global pinhole model.
Current prototype: a central ray-field (single origin per camera) where only the ray direction varies with the pixel. The origin is constant (\(o(u,v)=C\)).
Runtime: once calibrated, evaluating a ray has a “pinhole-like” cost (a small basis evaluation + a normalization), and triangulation is then a closed-form two-line least-squares problem.

This chapter is organized in three parts:

GT baseline (ray-field 3D regression): learn a compact central ray-field from perfect GT correspondences.
From images (GT-assisted): detect corners in images (optionally denoised by the 2D ray-field) and evaluate reconstruction; the 3D ray-field fit can still use GT 3D to isolate measurement effects.
From images (no GT 3D): calibrate a central 3D ray-field and stereo rig from multi-pose planar observations with a point↔ray bundle adjustment (no solvePnP, no known K).

The ChArUco denoising pipeline (2D ray-field) and its impact on OpenCV stereo calibration are described in: Stereo 3D reconstruction (OpenCV).

On synthetic pinhole data, we compare two reconstructions:

an “oracle” reconstruction using pinhole + Brown distortion (exact synthesis parameters),
a reconstruction using a central 3D ray-field represented with a Zernike basis.

Conventions and data

We use dataset v0 (see DATASET_SPEC.md). For a given scene, we load gt_points.npz (or gt_charuco_corners.npz):

\(P_i = (X_i, Y_i, Z_i)^\top\): 3D coordinates (mm) in the left-camera frame,
\(p^L_i = (u^L_i, v^L_i)^\top\) and \(p^R_i = (u^R_i, v^R_i)^\top\): distorted pixel projections (left/right),
baseline \(B\) (mm) from meta.json → sim_params.baseline_mm.

We adopt the synthetic convention: the left camera center is \(C_L=(0,0,0)^\top\) and the right camera center expressed in the left frame is:

\[C_R = (B,0,0)^\top.\]

Central 3D ray-field model

A pixel \(p=(u,v)\) defines a 3D ray:

\[\ell_p(t) = C + t\,\hat d(u,v), \quad t\ge 0\]

where:

\(C\) is a constant origin (central model),
\(\hat d(u,v)\in\mathbb{R}^3\) is a unit direction.

We parametrize direction via “normalized” coordinates:

\[\begin{split}\tilde d(u,v) = \begin{bmatrix} x(u,v) \\ y(u,v) \\ 1 \end{bmatrix}, \qquad \hat d(u,v)=\frac{\tilde d(u,v)}{\lVert\tilde d(u,v)\rVert}.\end{split}\]

The learning problem is therefore to estimate the two scalar fields \(x(u,v)\) and \(y(u,v)\).

Zernike basis (unit disk)

We map the image plane to the unit disk using:

\[\tilde u = \frac{u-u_0}{R},\qquad \tilde v = \frac{v-v_0}{R},\]

where \(u_0,v_0\) are the image center coordinates and \(R\) is a radius covering the full image (circumscribed circle).

We use real Zernike polynomials \(Z_k(\rho,\theta)\) (defined for \(\rho\in[0,1]\)) and approximate:

\[x(u,v) = \sum_{k=1}^{K} a_k Z_k(\tilde u,\tilde v), \qquad y(u,v) = \sum_{k=1}^{K} b_k Z_k(\tilde u,\tilde v).\]

In the implementation (CentralRayFieldZernike), \(K\) is set by the maximum radial order nmax (modes up to \(n\le n_{\max}\)).
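The unit-disk mapping and the mode count can be illustrated with a minimal NumPy sketch. It is not the code in zernike.py: the mode ordering, the absence of normalization factors, the choice of image center and the helper names (zernike_modes, radial, design_matrix) are all assumptions made for illustration.

```python
import numpy as np
from math import factorial

def zernike_modes(nmax):
    """All (n, m) pairs with n <= nmax, |m| <= n and n - |m| even."""
    return [(n, m) for n in range(nmax + 1) for m in range(-n, n + 1, 2)]

def radial(n, m, rho):
    """Zernike radial polynomial R_n^|m|(rho), standard finite sum."""
    m = abs(m)
    out = np.zeros_like(rho, dtype=float)
    for s in range((n - m) // 2 + 1):
        c = (-1) ** s * factorial(n - s) / (
            factorial(s)
            * factorial((n + m) // 2 - s)
            * factorial((n - m) // 2 - s)
        )
        out += c * rho ** (n - 2 * s)
    return out

def design_matrix(u, v, u0, v0, R, nmax):
    """Rows: pixels. Columns: real Zernike modes (unnormalized, assumed order)."""
    ut, vt = (u - u0) / R, (v - v0) / R           # map pixels to the unit disk
    rho, theta = np.hypot(ut, vt), np.arctan2(vt, ut)
    cols = []
    for n, m in zernike_modes(nmax):
        ang = np.cos(m * theta) if m >= 0 else np.sin(-m * theta)
        cols.append(radial(n, m, rho) * ang)
    return np.stack(cols, axis=1)

# Hypothetical 640x480 image; R is half the diagonal (circumscribed circle).
W, H = 640, 480
u0, v0, R = W / 2, H / 2, 0.5 * np.hypot(W, H)
u = np.array([0.0, 320.0, 639.0])
v = np.array([0.0, 240.0, 479.0])
A = design_matrix(u, v, u0, v0, R, nmax=8)
print(A.shape, len(zernike_modes(12)))
```

With this enumeration the number of modes is \(K=(n_{\max}+1)(n_{\max}+2)/2\), which gives 45 modes for nmax=8 and 91 for nmax=12.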

Part A — GT fit (ridge / Tikhonov regression)

With GT data, each 3D point \(P_i\) lies on the ray defined by its pixel. In normalized coordinates:

\[x_i = \frac{X_i}{Z_i},\qquad y_i = \frac{Y_i}{Z_i}.\]

We build a design matrix \(A\in\mathbb{R}^{N\times K}\) with \(A_{ik}=Z_k(\tilde u_i,\tilde v_i)\), and estimate coefficients with ridge regression (also called \(L^2\) regularization or Tikhonov):

\[\hat a = \arg\min_a\ \lVert Aa-x\rVert^2 + \lambda\lVert a\rVert^2, \qquad \hat b = \arg\min_b\ \lVert Ab-y\rVert^2 + \lambda\lVert b\rVert^2.\]

This baseline yields a compact ray-field without non-linear optimization (closed-form ridge regression).
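The closed-form solution is \(\hat a=(A^\top A+\lambda I)^{-1}A^\top x\) (and likewise for \(\hat b\)). The sketch below solves both systems at once with NumPy. To stay self-contained it uses a tiny affine stand-in basis and ideal pinhole pixels instead of the Zernike design matrix; the focal length, point ranges and the name ridge_fit are illustrative assumptions.

```python
import numpy as np

def ridge_fit(A, targets, lam):
    """Closed-form ridge: solve (A^T A + lam I) coef = A^T targets."""
    K = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(K), A.T @ targets)

rng = np.random.default_rng(0)
# Synthetic GT points (mm) in the camera frame: columns X, Y, Z.
P = rng.uniform([-100.0, -80.0, 900.0], [100.0, 80.0, 1100.0], size=(200, 3))
x, y = P[:, 0] / P[:, 2], P[:, 1] / P[:, 2]      # normalized targets x_i, y_i

# Stand-in for the Zernike design matrix: ideal pinhole pixels (relative to
# the image center, hypothetical focal length) and an affine basis [1, u, v].
f_px = 1000.0
u, v = f_px * x, f_px * y
A = np.stack([np.ones_like(u), u / 500.0, v / 500.0], axis=1)

# One call fits both scalar fields: column 0 holds a-hat, column 1 holds b-hat.
coef = ridge_fit(A, np.stack([x, y], axis=1), lam=1e-6)
fit_rms = np.sqrt(np.mean((A @ coef - np.stack([x, y], axis=1)) ** 2))
print(coef.shape, fit_rms)
```

Solving the normal equations with np.linalg.solve avoids forming an explicit inverse; for poorly conditioned bases a QR or SVD route on the augmented system would be a more careful choice.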

Triangulation and metrics

For a pair \((p^L_i, p^R_i)\) we obtain two rays:

\[\ell^L_i(t)= C_L + t\,\hat d_L(p^L_i), \qquad \ell^R_i(s)= C_R + s\,\hat d_R(p^R_i).\]

We reconstruct \(\hat P_i\) using midpoint triangulation (midpoint of the common perpendicular segment). We report:

3D error: \(e_i = \lVert \hat P_i - P_i\rVert\) (mm),
skew-ray distance: \(d^{\mathrm{skew}}_i = \mathrm{dist}(\ell^L_i,\ell^R_i)\) (mm), i.e., the length of the common perpendicular segment.
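Midpoint triangulation and the skew-ray distance can be sketched as follows for a single correspondence. The function name and the numeric example (baseline of 100 mm, one point at about 1 m) are assumptions for illustration; the sketch does not enforce \(t\ge 0\) and rejects near-parallel rays.

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the common perpendicular between two rays.

    Rays are c1 + t*d1 and c2 + s*d2 with unit directions d1, d2.
    Returns (midpoint, skew_distance).
    """
    w = c1 - c2
    b = d1 @ d2
    denom = 1.0 - b * b                  # tends to 0 for parallel rays
    if denom < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    # Closed-form minimizer of || (c1 + t d1) - (c2 + s d2) ||^2.
    t = (b * (d2 @ w) - (d1 @ w)) / denom
    s = ((d2 @ w) - b * (d1 @ w)) / denom
    q1, q2 = c1 + t * d1, c2 + s * d2
    return 0.5 * (q1 + q2), np.linalg.norm(q1 - q2)

# Synthetic convention of this page: C_L at the origin, C_R = (B, 0, 0).
B = 100.0
C_L, C_R = np.zeros(3), np.array([B, 0.0, 0.0])
P = np.array([10.0, 20.0, 1000.0])
d_L = (P - C_L) / np.linalg.norm(P - C_L)
d_R = (P - C_R) / np.linalg.norm(P - C_R)

P_hat, skew = triangulate_midpoint(C_L, d_L, C_R, d_R)
err_mm = np.linalg.norm(P_hat - P)
print(P_hat, skew, err_mm)
```

With exact directions the two rays intersect, so both the 3D error and the skew distance vanish up to rounding; with noisy directions the skew distance becomes a per-point consistency indicator.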

Pinhole “oracle” baseline (reference)

On a synthetic pinhole dataset, we know the exact parameters:

focal length \(f\) (via sim_params.f_um),
Brown distortion (via sim_params.distortion_left/right),
pixel pitch (via meta.json).

We can therefore map a distorted pixel to an undistorted ray as:

pixel \((u,v)\) → sensor coordinates \((x_{\mu m},y_{\mu m})\),
distorted normalization: \(x_d=x_{\mu m}/f_{\mu m}\), \(y_d=y_{\mu m}/f_{\mu m}\),
Brown inversion: \((x,y)=\mathrm{undistort}(x_d,y_d)\),
direction: \(\hat d = \mathrm{normalize}([x,y,1])\).

This is not a “fit”: it is an oracle, i.e., the expected lower bound on reconstruction error for this dataset.
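The four steps above can be sketched as below. This is an illustration only: it assumes the OpenCV-style Brown convention with coefficients (k1, k2, p1, p2), a principal point given in pixels, and a fixed-point inversion; the dataset's actual convention and field layout should be checked against DATASET_SPEC.md, and all numeric values are made up.

```python
import numpy as np

def brown_distort(x, y, k1, k2, p1, p2):
    """Forward Brown model on normalized coordinates (assumed convention)."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return xd, yd

def brown_undistort(xd, yd, k1, k2, p1, p2, iters=20):
    """Invert the Brown model by fixed-point iteration."""
    x, y = xd, yd
    for _ in range(iters):
        r2 = x * x + y * y
        radial = 1.0 + k1 * r2 + k2 * r2 * r2
        dx = 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
        dy = p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
        x, y = (xd - dx) / radial, (yd - dy) / radial
    return x, y

def pixel_to_ray(u, v, cx, cy, pitch_um, f_um, dist):
    """Distorted pixel -> unit ray direction in the camera frame."""
    xd = (u - cx) * pitch_um / f_um      # sensor coordinates (um) over focal (um)
    yd = (v - cy) * pitch_um / f_um
    x, y = brown_undistort(xd, yd, *dist)
    d = np.array([x, y, 1.0])
    return d / np.linalg.norm(d)

dist = (-0.10, 0.01, 1e-3, -5e-4)        # hypothetical k1, k2, p1, p2
x_true, y_true = 0.10, -0.05
xd, yd = brown_distort(x_true, y_true, *dist)
x_rec, y_rec = brown_undistort(xd, yd, *dist)
ray = pixel_to_ray(400.0, 200.0, 320.0, 240.0, 5.0, 16000.0, dist)
print(x_rec, y_rec, ray)
```

The fixed-point iteration converges quickly for moderate distortion; for strong distortion a Newton step on the forward model would be a safer choice.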

Full GT example and comparison

Command:

.venv/bin/python paper/experiments/compare_pinhole_vs_rayfield3d_gt.py \
  --scene dataset/v0_png/train/scene_0000 \
  --gt gt_points.npz \
  --nmax 12 \
  --lam 1e-3

Metrics (mm, px, %)

We report:

triangulation error in mm: \(e_i = \lVert \hat P_i - P_i\rVert\),
relative error (order-of-magnitude): \(100\,e_i / \bar Z\) (%) where \(\bar Z\) is mean depth,
reprojection error in pixels (left/right), by reprojecting \(\hat P_i\) through the GT Brown pinhole model and comparing to GT \((u,v)\).

Outputs (summary, order-of-magnitude):

Tab. 5 3D comparison on GT (pinhole oracle vs central 3D ray-field).
3D method	Triangulation RMS (mm)	Triangulation RMS (% depth)	Reproj RMS L/R (px)	Skew RMS (mm)
Pinhole oracle (GT params)	\(\approx 1\times 10^{-4}\)	\(\approx 1\times 10^{-5}\)	\(\approx 5\times 10^{-6}\)	\(\approx 1\times 10^{-5}\)
Central 3D ray-field (Zernike)	\(\approx 3.2\times 10^{-1}\)	\(\approx 2.4\times 10^{-2}\)	\(\approx 4\times 10^{-2}\)	\(\approx 2.8\times 10^{-2}\)

Quick reading:

On pinhole data, the pinhole oracle is nearly perfect (as expected).
The central 3D ray-field is a compact approximation: its performance depends strongly on nmax (capacity) and \(\lambda\) (smoothness). It mainly serves as a stepping stone towards future “complex optics” models.

Code references

Zernike basis (real modes + design matrix): src/stereocomplex/core/model_compact/zernike.py
Central model CentralRayFieldZernike: src/stereocomplex/core/model_compact/central_rayfield.py
Pinhole-oracle vs 3D ray-field GT comparison: paper/experiments/compare_pinhole_vs_rayfield3d_gt.py

Part B — From images: detection + 2D ray-field, then reconstruction (GT-assisted)

This section connects the 3D ray-field chapter to the 2D identification pipeline:

OpenCV ChArUco detection on images (measured pixels),
center correction using the 2D ray-field (rayfield_tps_robust),
3D reconstruction by triangulation, with two 3D methods:
- pinhole oracle: rays by Brown inversion using GT synthesis parameters,
- central 3D ray-field: fit Zernike on \((u,v)\leftrightarrow P\) (GT) then triangulate.

Script

.venv/bin/python paper/experiments/compare_3d_from_images_rayfield2d.py \
  dataset/v0_png \
  --split train --scene scene_0000 \
  --tps-lam 10 --tps-huber 3 --tps-iters 3 \
  --nmax 12 --lam3d 1e-3

The script writes a JSON metrics file (default: paper/tables/3d_from_images_rayfield2d.json) and prints the same content to stdout.

Results (example)

The table below illustrates a run on scene_0000 (5 frames). On these synthetic images, the 2D ray-field correction substantially reduces 2D pixel error, and triangulation improves accordingly for both 3D reconstructions.

Tab. 6 3D reconstruction from images (OpenCV raw vs 2D ray-field), with two 3D methods.
2D method	2D RMS L/R (px)	Pinhole oracle: 3D RMS (mm)	3D ray-field: 3D RMS (mm)
OpenCV raw	\(\approx 0.38 / 0.36\)	\(\approx 3.82\)	\(\approx 3.82\)
2D ray-field (`rayfield_tps_robust`)	\(\approx 0.23 / 0.14\)	\(\approx 1.28\)	\(\approx 1.33\)

Note: in this section the “3D ray-field” fit can use GT 3D correspondences to isolate the impact of 2D measurement noise. A full calibration without GT 3D is covered next.

Compression stress-test (PNG lossless vs WebP lossy)

To evaluate whether the ray-based pipeline remains usable under strong image compression, we compare the same scene stored as:

PNG (lossless),
WebP (lossy, low quality).

This comparison runs three pipelines:

OpenCV pinhole calibration from raw ChArUco corners,
OpenCV pinhole calibration from 2D ray-field refined corners,
central 3D ray-field (bundle adjustment) calibrated from 2D ray-field refined corners.

Command (prints a markdown summary and writes a JSON report):

.venv/bin/python paper/experiments/compare_compression_3d_methods.py \
  --png dataset/compression_sweep/png_lossless \
  --webp dataset/compression_sweep/webp_q70 \
  --split train --scene scene_0000 \
  --out paper/tables/compression_compare/compression_compare_3d_methods.json

For a more complete WebP quality sweep and a discussion of non-intuitive effects (compression can sometimes help by acting as a low-pass filter), see: Image compression and 3D reconstruction.

Part C — Ray-based calibration (no GT 3D): point↔ray bundle adjustment

This section replaces the “GT-assisted 3D fit” with a full calibration from:

multi-pose board correspondences \((u,v)\leftrightarrow (X,Y,0)\),
a compact central ray-field \(d(u,v)\) (Zernike),
per-frame board poses \((R_i,t_i)\).

Connection to the general imaging model (non-central cameras)

The long-term goal (complex/non-central optics) is to represent each pixel by a 3D line rather than by a single central ray (constant origin). A standard way to represent a 3D line is via Plücker coordinates, typically written as a stacking of direction and moment vectors \((\mathbf d,\mathbf m)\), with the orthogonality constraint \(\langle \mathbf d,\mathbf m\rangle=0\). This is precisely the “general imaging model” viewpoint (pixel \(\rightarrow\) 3D line) and enables non-central cameras.

Miraldo et al. propose a compact, continuous version of the general imaging model by interpolating the line parameters with radial basis functions (RBF), and derive a linear calibration procedure from point↔line incidence constraints (Point-based Calibration using a Parametric Representation of the General Imaging Model, ICCV 2011, DOI: 10.1109/ICCV.2011.6126511).

Our current bundle-adjustment formulation is a central specialization: lines pass through a constant origin \(C\), so a pixel maps to a unit direction \(\hat{\mathbf d}(u,v)\) only. Extending the current code to non-central optics can follow the same structure (compact interpolation + global regularization), but with a per-pixel line representation (e.g., Plücker) instead of a single origin.
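For readers unfamiliar with the Plücker representation mentioned above, the following sketch builds \((\mathbf d,\mathbf m)\) from a point on the line and a direction, checks the orthogonality constraint, and evaluates a point↔line distance. It is a generic illustration, not part of the current code base; the function names are hypothetical.

```python
import numpy as np

def plucker_from_point_dir(p, d):
    """Plucker coordinates (d, m) of the line through p with direction d."""
    d = d / np.linalg.norm(d)
    m = np.cross(p, d)                   # moment vector, independent of the choice of p
    return d, m

def point_line_distance(x, d, m):
    """Distance from point x to the line (d, m), assuming ||d|| = 1."""
    # cross(x, d) - m = cross(x - p, d), whose norm is the distance to the line.
    return np.linalg.norm(np.cross(x, d) - m)

# Line parallel to the x axis, passing through (0, 0, 5).
d, m = plucker_from_point_dir(np.array([0.0, 0.0, 5.0]), np.array([2.0, 0.0, 0.0]))
dist = point_line_distance(np.array([3.0, 4.0, 5.0]), d, m)
print(d, m, d @ m, dist)

# Central special case: every line passes through the origin C = 0, so m = 0.
d_c, m_c = plucker_from_point_dir(np.zeros(3), np.array([0.1, -0.2, 1.0]))
print(m_c)
```

In the central case with \(C=0\) the moment vanishes, and the distance above reduces to the norm of the point↔ray residual used in the bundle adjustment.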

Geometric residual

For an observation \((u_{ij},v_{ij})\) of board point \(P_j\) in image \(i\):

camera-frame point: \(P^{\mathrm{cam}}_{ij}=R_i P_j + t_i\),
unit direction: \(\hat d_{ij}=\hat d(u_{ij},v_{ij})\).

The ray is \(\ell_{ij}(t)=C+t\hat d_{ij}\) (here \(C\) is constant, and we fix \(C=(0,0,0)^\top\)).

We minimize the point↔ray distance using the vector residual:

\[r_{ij} = (I - \hat d_{ij}\hat d_{ij}^\top)\,P^{\mathrm{cam}}_{ij}.\]

This residual is minimized with a robust loss (Huber) and an \(L^2\) regularization on Zernike coefficients.
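A vectorized version of the residual can be sketched as follows. Since \(\hat d\) is a unit vector, \((I-\hat d\hat d^\top)P = P-(P\cdot\hat d)\,\hat d\), so no 3×3 projector needs to be formed. The function name and the sample pose are illustrative assumptions.

```python
import numpy as np

def point_ray_residual(P_cam, d_hat):
    """Residual (I - d d^T) P for N points (N, 3) and N unit directions (N, 3)."""
    along = np.sum(P_cam * d_hat, axis=1, keepdims=True)   # P . d per row
    return P_cam - along * d_hat

# Board points (Z = 0 in the board frame) moved to the camera frame.
board = np.array([[0.0, 0.0, 0.0], [30.0, 0.0, 0.0], [0.0, 30.0, 0.0]])
R_i = np.eye(3)                          # hypothetical pose: no rotation
t_i = np.array([5.0, -10.0, 600.0])      # board 600 mm in front of the camera
P_cam = board @ R_i.T + t_i

# Exact directions from the origin C = 0 give a zero residual.
d_hat = P_cam / np.linalg.norm(P_cam, axis=1, keepdims=True)
r_exact = point_ray_residual(P_cam, d_hat)

# Replacing every direction with the optical axis gives the lateral offset.
d_axis = np.tile([0.0, 0.0, 1.0], (3, 1))
r_axis = point_ray_residual(P_cam, d_axis)
print(np.linalg.norm(r_exact, axis=1), r_axis)
```

Each residual is a 3-vector expressed in mm, orthogonal to the ray, whose norm is the point↔ray distance.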

Joint optimization (stereo)

In the stereo version, we optimize simultaneously:

Zernike coefficients of \(d_L(u,v)\) and \(d_R(u,v)\),
a single rigid pose of the rig \((R_{RL},t_{RL})\) such that \(P_R = R_{RL}P_L+t_{RL}\),
board poses per image in the left-camera frame \((R_i,t_i)\).

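We solve with scipy.optimize.least_squares (a robust Gauss-Newton/LM-type trust-region solver) using Huber loss and \(L^2\) coefficient regularization. Note that SciPy's method='lm' supports only the linear loss, so a Huber loss requires the 'trf' or 'dogbox' method.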
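To show the solver call in isolation, here is a deliberately reduced toy problem: a single camera with known ray directions, where only one board pose (rotation vector + translation) is refined with the point↔ray residual and a Huber loss. It is not the stereo bundle adjustment in central_stereo_ba.py; the board geometry, the pose values and the perturbed starting point (standing in for a homography-based initialization) are assumptions. In the full problem, Zernike coefficients and the rig pose would be added to the parameter vector, with the coefficient regularization appended as extra residuals.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, board_pts, d_hat):
    """Stacked point-ray residuals for one board pose (rotvec, t)."""
    rotvec, t = params[:3], params[3:]
    P_cam = Rotation.from_rotvec(rotvec).apply(board_pts) + t
    along = np.sum(P_cam * d_hat, axis=1, keepdims=True)
    return (P_cam - along * d_hat).ravel()

# Planar board (mm), Z = 0 in the board frame.
gx, gy = np.meshgrid(np.linspace(-100, 100, 6), np.linspace(-75, 75, 5))
board = np.stack([gx.ravel(), gy.ravel(), np.zeros(gx.size)], axis=1)

# Ground-truth pose and the exact ray directions it induces (origin C = 0).
x_true = np.array([0.10, -0.20, 0.05, 10.0, -5.0, 800.0])
P_true = Rotation.from_rotvec(x_true[:3]).apply(board) + x_true[3:]
d_hat = P_true / np.linalg.norm(P_true, axis=1, keepdims=True)

# Perturbed initial guess, then robust refinement (f_scale in mm).
x0 = x_true + np.array([0.02, -0.02, 0.01, 5.0, 5.0, 20.0])
res = least_squares(residuals, x0, args=(board, d_hat),
                    loss="huber", f_scale=1.0, method="trf")
print(res.x, res.cost)
```

The f_scale argument plays the role of the --fscale-mm option: residuals below it are treated quadratically, larger ones linearly.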

Script (images → 2D ray-field → 3D ray-field bundle adjustment → stereo)

.venv/bin/python paper/experiments/calibrate_central_rayfield3d_from_images.py \
  dataset/v0_png \
  --split train --scene scene_0000 \
  --max-frames 5 \
  --method2d rayfield_tps_robust \
  --tps-lam 10 --tps-huber 3 --tps-iters 3 \
  --nmax 8 --lam-coeff 1e-3 --outer-iters 3 --fscale-mm 1.0

Output: JSON (default: paper/tables/rayfield3d_ba_from_images.json) with:

estimated baseline (mm + px-equivalent at mean depth),
3D errors (mm and % depth),
reprojection errors (px), and skew-ray distances (mm),
optimization diagnostics (cost per iteration),
an opencv_pinhole_calib section: OpenCV pinhole calibration (intrinsics + distortion + stereo rig) on the same 2D points.

Results (example)

On scene_0000 (5 frames), the “pinhole oracle” remains a lower bound (pinhole + GT Brown). The central 3D ray-field (Zernike, central model) is calibrated without solvePnP and without a known \(K\): initial board poses are obtained from homographies (Zhang-style) only as an initialization, and the solver then directly optimizes the point↔ray cost (robust Gauss-Newton via SciPy).

Tab. 7 Central ray-based calibration from images: comparison to the pinhole oracle (example).
3D method (same 2D points)	Baseline abs. err. (mm)	Baseline abs. err. (px)	3D RMS (mm)	Reproj RMS L/R (px)
Pinhole oracle (GT params)	\(0\)	\(0\)	\(\approx 1.28\)	\(\approx 0.20 / 0.15\)
OpenCV pinhole calibrated (images, non-GT)	\(\approx 0.32\)	\(\approx 0.29\)	\(\approx 14.48\)	\(\approx 3.02 / 2.77\)
3D ray-field (bundle adjustment, central Zernike)	\(\approx 0.21\)	\(\approx 0.19\)	\(\approx 1.55\)	\(\approx 1.36 / 1.33\)

Note: for the “3D ray-field” row, the 3D RMS and reprojections are computed after a fixed-origin similarity alignment (rotation + scale, no translation) between the reconstruction and the GT reference. Without this step, errors “in the GT frame” become arbitrarily large because the point↔ray cost does not, by itself, fix the global frame choice (gauge).

Discussion: (i) baseline, (ii) reprojection, (iii) triangulation

The table highlights three important points:

Baseline is now better with the ray-field. Here, ray-based calibration yields a smaller baseline error than OpenCV pinhole calibration (mm and px-equivalent). This is consistent with the fact that the ray-based optimization is constrained by a single rig \((R_{RL},t_{RL})\) and a geometric point↔ray cost over all observations, which limits the “intrinsics ↔ distortion ↔ extrinsics” compensations typical of pinhole calibration on planar targets.
Baseline: norm vs direction. A small error on \(\lVert C_R\rVert\) does not guarantee a perfect direction. In this example, both methods produce a slightly off-axis baseline (non-zero \(y,z\) components), so the script also reports the angle to the \(x\) axis and the off-axis norm (see the script JSON).

For example on scene_0000: the angle is about \(3.38^\circ\) (ray-field) versus \(2.62^\circ\) (OpenCV), despite a smaller norm error on the ray-field side. This illustrates why both baseline norm and baseline direction matter.
Why “non-GT pinhole” can have a decent baseline but poor reprojection/3D vs GT. OpenCV minimizes its own image error, but the errors reported here are measured against the GT model (synthetic pinhole + Brown). A pinhole calibration can thus be self-consistent (low mono_rms_*) while still far from the GT parameters (high GT reprojection error), especially due to identifiability couplings on planar targets.
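The norm-versus-direction distinction can be made concrete with a small helper. The exact definitions used in the script JSON are not reproduced here; the sketch assumes that the GT baseline lies along \(+x\), that the angle is measured between the estimated \(C_R\) and the \(x\) axis, and that the off-axis norm is the norm of the \((y,z)\) components. The input vector is made up.

```python
import numpy as np

def baseline_diagnostics(C_R, B_gt_mm):
    """Norm error, angle to the x axis and off-axis norm of an estimated C_R."""
    C_R = np.asarray(C_R, dtype=float)
    norm = np.linalg.norm(C_R)
    off_axis = np.linalg.norm(C_R[1:])               # sqrt(y^2 + z^2)
    angle_deg = np.degrees(np.arctan2(off_axis, C_R[0]))
    return {
        "norm_mm": norm,
        "norm_abs_err_mm": abs(norm - B_gt_mm),
        "angle_to_x_deg": angle_deg,
        "off_axis_mm": off_axis,
    }

# Hypothetical estimate: small norm error, visibly tilted direction.
diag = baseline_diagnostics([100.0, 3.0, 4.0], B_gt_mm=100.0)
print(diag)
```

In this made-up example the norm error is about 0.12 mm while the baseline is tilted by almost 3 degrees, which is the kind of situation the two indicators are meant to separate.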

Practical takeaway:

in robotics (rectification, dense stereo), baseline accuracy and epipolar coherence often dominate matching success;
in metrology (stereo-DIC), ray-based calibration can stabilize stereo geometry when a global pinhole becomes only an approximation.

Discussion: why the ray-field calibration needs an “aligned” comparison

The point↔ray cost

\[r_{ij}=(I-\hat d_{ij}\hat d_{ij}^\top)\,P^{\mathrm{cam}}_{ij}\]

is invariant to a global rotation of the camera frame (the origin is fixed at \(C\), so no translation is involved) and, to some extent, to a scale factor coupled to depth (limited identifiability on planar targets). In other words: the calibration is defined up to a gauge, while GT enforces an absolute reference frame (left camera, \(x\) axis aligned with the baseline, etc.). To avoid conflating “bad geometry” with “different frame”, we report:

an “aligned” 3D RMS (rotation + scale),
an “aligned” reprojection RMS (GT projection after alignment).

These metrics reflect practical reconstruction interest (coherence and stability), while baseline (mm and px-equivalent) remains a directly interpretable stereo-vision quantity.
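A fixed-origin similarity alignment (rotation + scale, no translation) has a closed form based on an SVD, similar to the Kabsch/Umeyama procedure but without centering the point sets. The sketch below is one possible implementation under that assumption; it is not taken from the script, and the function name is hypothetical.

```python
import numpy as np

def align_rotation_scale(A, B):
    """Find s > 0 and rotation R minimizing sum_i || s R a_i - b_i ||^2.

    A, B: (N, 3) corresponding points; the origin is kept fixed (no centering).
    """
    Hm = A.T @ B                                  # sum_i a_i b_i^T
    U, S, Vt = np.linalg.svd(Hm)
    sign = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    D = np.diag([1.0, 1.0, sign])
    R = Vt.T @ D @ U.T
    s = (S[0] + S[1] + sign * S[2]) / np.sum(A * A)
    return s, R

rng = np.random.default_rng(1)
A = rng.uniform(-100.0, 100.0, size=(50, 3)) + np.array([0.0, 0.0, 800.0])

# Synthetic gauge: rotation about z then x, plus a 2% scale factor.
cz, sz, cx, sx = np.cos(0.3), np.sin(0.3), np.cos(-0.2), np.sin(-0.2)
Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
R_true, s_true = Rx @ Rz, 1.02
B = s_true * A @ R_true.T

s_hat, R_hat = align_rotation_scale(A, B)
aligned_rms = np.sqrt(np.mean(np.sum((s_hat * A @ R_hat.T - B) ** 2, axis=1)))
print(s_hat, aligned_rms)
```

The “aligned” 3D RMS is then the RMS of the residuals after applying \((s,R)\) to the reconstruction.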

Post-hoc pinhole identification from the ray-field 3D reconstruction

To assess whether a ray-based 3D reconstruction can help identify a conventional pinhole model, the script also performs a post-hoc pinhole fit:

input: reconstructed 3D points (left-camera frame) from the ray-field 3D model, and the corresponding observed pixels,
model: Brown pinhole \((K, d)\) per camera,
solver: scipy.optimize.least_squares (Huber loss), with an additional per-camera global rotation (gauge correction).

This produces a new JSON block:

pinhole_from_rayfield3d (estimated \(K,d\) + reprojection RMS on the same correspondences),
pinhole_vs_gt (relative parameter errors vs GT for synthetic datasets).

On the same example (scene_0000, 5 frames), the distortion-field error relative to GT decreases compared to the direct OpenCV pinhole calibration:

Tab. 8 Pinhole parameter identification vs GT (example; lower is better).
Method	dist err L (% of GT)	dist err R (% of GT)	fx err L (%)	fx err R (%)
OpenCV pinhole calib (images → pinhole)	18.81	19.26	0.94	0.44
Pinhole from ray-field 3D (images → ray-field 3D → pinhole)	13.55	13.43	0.75	0.85

Commentary (Tab. 8): the distortion-field error (measured in pixel space) is reduced by about 28% on the left camera (18.81 → 13.55) and about 30% on the right camera (19.26 → 13.43) compared to the direct OpenCV pinhole calibration. The focal-length errors are mixed: fx improves on the left camera (0.94 → 0.75) but degrades on the right (0.44 → 0.85). This suggests that, even on synthetic pinhole data, reconstructing a geometrically consistent 3D first (ray-based) can improve the identification of the distortion part of a conventional pinhole + Brown model when compared to fitting pinhole parameters directly from noisy 2D detections.

Notes:

The “dist err (% of GT)” is computed in pixel space via distortion-displacement vectors on sampled circles (see pinhole_vs_gt.*.distortion_displacement_vs_gt in the script JSON).
The post-hoc reprojection RMS reported under pinhole_from_rayfield3d.reprojection_error_* is a self-consistency metric on the same correspondences used to reconstruct the 3D points, and should not be interpreted as a standalone accuracy guarantee.

Reference: Miraldo, P., Araujo, H., Queiro, J., “Point-based Calibration using a Parametric Representation of the General Imaging Model”, ICCV 2011. DOI: 10.1109/ICCV.2011.6126511.

Usage after identification (robotics / stereo-DIC)

This section clarifies what is required and what it costs to use a ray-field model once calibrated, i.e., “after identification” of 2D correspondences (ChArUco, dense stereo matching, optical flow, DIC correlation, etc.).

Minimal inputs/outputs

To reconstruct a 3D field from a stereo pair (left/right), one needs:

Stereo model (calibration):
- rig \((R_{RL},t_{RL})\),
- left ray-field \(d_L(u,v)\) and right ray-field \(d_R(u,v)\) (Zernike coefficients, central model).
2D correspondences:
- either pairs \((u_L,v_L)\leftrightarrow(u_R,v_R)\),
- or a disparity map \(d(u,v)\) on a rectified image (classic robotics case).

Outputs:

a point cloud (or field) \(\hat P\) in mm in the left-camera frame,
optionally a per-point quality metric (skew-ray distance).

Per-point computation and algorithmic cost

For each correspondence:

Pixel → ray (left and right):
- pinhole: normalization + (un)distortion + normalization → \(\hat d\),
- ray-field: evaluate \(x(u,v),y(u,v)\) (Zernike), then \(\hat d=\mathrm{normalize}([x,y,1])\).
Triangulation (least-squares intersection):
- midpoint of the common perpendicular segment (a few vector operations).

In terms of complexity for \(N\) correspondences:

pinhole: \(\mathcal{O}(N)\) (small constant cost),
Zernike ray-field: \(\mathcal{O}(N\,K)\) if evaluating \(K\) modes explicitly (e.g., \(K=45\) for nmax=8), plus \(\mathcal{O}(N)\) for triangulation.

In practice, runtime can be brought to the same order as pinhole by precomputing a ray-direction map:

precompute once: \(d(u,v)\) for all image pixels (amortized cost),
real-time: lookup \(d\) + triangulation → \(\mathcal{O}(N)\).

This precompute stores a \((H\times W\times 3)\) float32 array per camera, i.e., \(12\,HW\) bytes (about 3.5 MiB at 640×480, about 24 MiB at 1920×1080), which is typically acceptable in robotics.
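A minimal sketch of the precompute-then-lookup idea is given below. To stay self-contained, the calibrated Zernike ray-field is replaced by a stand-in ideal pinhole direction function (an assumption), and sub-pixel correspondences are handled by bilinear interpolation followed by renormalization, which is one possible choice rather than the project's documented behavior.

```python
import numpy as np

def pinhole_directions(u, v, f_px=1000.0, cx=32.0, cy=24.0):
    """Stand-in for the calibrated ray-field: unit directions, shape (N, 3)."""
    d = np.stack([(u - cx) / f_px, (v - cy) / f_px, np.ones_like(u)], axis=1)
    return d / np.linalg.norm(d, axis=1, keepdims=True)

def build_direction_map(H, W, direction_fn):
    """Evaluate the ray-field once on the pixel grid -> (H, W, 3) float32."""
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    return direction_fn(u.ravel(), v.ravel()).reshape(H, W, 3).astype(np.float32)

def lookup_bilinear(dmap, u, v):
    """Bilinear lookup at sub-pixel (u, v), then renormalization."""
    H, W, _ = dmap.shape
    u0 = np.clip(np.floor(u).astype(int), 0, W - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, H - 2)
    a, b = (u - u0)[:, None], (v - v0)[:, None]
    d = ((1 - a) * (1 - b) * dmap[v0, u0] + a * (1 - b) * dmap[v0, u0 + 1]
         + (1 - a) * b * dmap[v0 + 1, u0] + a * b * dmap[v0 + 1, u0 + 1])
    return d / np.linalg.norm(d, axis=1, keepdims=True)

H, W = 48, 64                                    # tiny grid for the example
dmap = build_direction_map(H, W, pinhole_directions)
u = np.array([10.0, 20.5])
v = np.array([5.0, 30.25])
d = lookup_bilinear(dmap, u, v)
print(dmap.nbytes, d)
```

The map occupies 12 bytes per pixel, and each per-point lookup is a constant-time operation regardless of nmax.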

For a robustness evaluation across board sizes/focal lengths/aberrations (OpenCV-only sweep), see: Robustness sweep.

Real-time pipeline (robotics)

For dense stereo in robotics (depth map), a realistic pipeline is:

precompute \(d_L(u,v)\) and \(d_R(u,v)\) (ray directions) over the image grid,
compute correspondences (stereo matching):
- either by rectifying to a virtual (pinhole) camera then using standard disparity,
- or directly on non-rectified images using a more general matcher,
triangulate point-by-point and output depth / point cloud.

If rectifying to a virtual camera, the additional step compared to pinhole is building remap tables once, then running cv2.remap (real-time, optimized).

Two-time pipeline (stereo-DIC)

With two stereo pairs (reference + deformed), a 3D displacement field can be obtained by:

identifying stereo correspondences at \(t_0\) and \(t_1\) (or tracking points between \(t_0\to t_1\)),
triangulating \(\hat P(t_0)\) and \(\hat P(t_1)\) with the same stereo model,
computing \(\Delta \hat P = \hat P(t_1)-\hat P(t_0)\).

Again, the ray-field overhead relative to pinhole is concentrated in pixel→ray evaluation; with a precomputed map, reconstruction remains compatible with high frame rates.

Model size (“parameter complexity”)

A central Zernike ray-field remains compact:

per camera: \(2K\) coefficients (for \(x\) and \(y\)), e.g. \(2\times 45=90\) scalars at nmax=8,
stereo: +6 rig parameters \((R_{RL},t_{RL})\).

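This is roughly ten times the parameter count of a Brown pinhole model (about 9 scalars per camera for focal lengths, principal point and five distortion coefficients), yet it remains small in absolute terms, and the representation is more flexible (it does not enforce a particular polynomial distortion form).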

Code references

Central stereo point↔ray bundle adjustment: src/stereocomplex/ray3d/central_stereo_ba.py
Experimental driver (images → bundle adjustment): paper/experiments/calibrate_central_rayfield3d_from_images.py