3D ray-field (central) and ray-based calibration
This page introduces a first “ray-based 3D” building block targeting optical systems more complex than a global pinhole model (e.g., CMO). The current implementation starts validation on synthetic pinhole data to provide a clear “oracle” reference and controlled ablations.
We compare this ray-field pipeline against Pycaso-style polynomial stereo (direct mapping + Soloff/LM) because Pycaso represents a canonical “non-parametric” stereo baseline: it learns a non-pinhole mapping between correspondences and 3D points, so it already captures some of the flexibility needed for complex optics. The new PYCASO_Z_SWEEP benchmark (and the compression stress test documented in docs/ROBUSTNESS_SWEEP.md) makes it easy to reproduce side-by-side comparisons between Pycaso and the ray-field 3D bundle adjustment.
What this enables (and what it does not)
Goal: reconstruct 3D by intersecting two rays (left/right) without assuming a global pinhole model.
Current prototype: a central ray-field (single origin per camera) where only the ray direction varies with the pixel. The origin is constant (\(o(u,v)=C\)).
Runtime: once calibrated, evaluating a ray has a “pinhole-like” cost (a small basis evaluation plus a normalization); triangulation is then a closed-form two-line least-squares problem.
This chapter is organized in three parts:
GT baseline (ray-field 3D regression): learn a compact central ray-field from perfect GT correspondences.
From images (GT-assisted): detect corners in images (optionally denoised by the 2D ray-field) and evaluate reconstruction; the 3D ray-field fit can still use GT 3D to isolate measurement effects.
From images (no GT 3D): calibrate a central 3D ray-field and stereo rig from multi-pose planar observations with a point↔ray bundle adjustment (no solvePnP, no known K).
The ChArUco denoising pipeline (2D ray-field) and its impact on OpenCV stereo calibration are described in: Stereo 3D reconstruction (OpenCV).
The GT baseline (Part A) compares two reconstructions:
an “oracle” reconstruction using pinhole + Brown distortion (exact synthesis parameters),
a reconstruction using a central 3D ray-field represented with a Zernike basis.
Conventions and data
We use dataset v0 (see DATASET_SPEC.md). For a given scene, we load gt_points.npz (or gt_charuco_corners.npz):
\(P_i = (X_i, Y_i, Z_i)^\top\): 3D coordinates (mm) in the left-camera frame,
\(p^L_i = (u^L_i, v^L_i)^\top\) and \(p^R_i = (u^R_i, v^R_i)^\top\): distorted pixel projections (left/right),
baseline \(B\) (mm) from meta.json→sim_params.baseline_mm.
We adopt the synthetic convention: the left camera center is \(C_L=(0,0,0)^\top\) and the right camera center expressed in the left frame lies on the \(x\) axis at the baseline distance:
\[
C_R = (B,\,0,\,0)^\top .
\]
Central 3D ray-field model
A pixel \(p=(u,v)\) defines a 3D ray:
\[
\ell(t) = C + t\,\hat d(u,v), \qquad t \ge 0,
\]
where:
\(C\) is a constant origin (central model),
\(\hat d(u,v)\in\mathbb{R}^3\) is a unit direction.
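We parametrize direction via “normalized” coordinates:
\[
\hat d(u,v) = \frac{\bigl(x(u,v),\,y(u,v),\,1\bigr)^\top}{\bigl\lVert \bigl(x(u,v),\,y(u,v),\,1\bigr)^\top \bigr\rVert}.
\]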
The learning problem is therefore to estimate the two scalar fields \(x(u,v)\) and \(y(u,v)\).
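As a minimal sketch of this parametrization (the function names `direction_from_normalized` and `point_on_ray` are illustrative, not taken from the code base), the two scalar fields can be turned into unit directions and ray points as follows:

```python
import numpy as np

def direction_from_normalized(x, y):
    """Unit ray directions normalize([x, y, 1]) from normalized coordinates."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.stack([x, y, np.ones_like(x)], axis=-1)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)

def point_on_ray(C, d_hat, t):
    """Points C + t * d_hat along central rays sharing the origin C."""
    t = np.asarray(t, dtype=float)
    return np.asarray(C, dtype=float) + t[..., None] * d_hat

# Two pixels whose normalized coordinates are (0, 0) and (0.75, 0).
d = direction_from_normalized([0.0, 0.75], [0.0, 0.0])
pts = point_on_ray([0.0, 0.0, 0.0], d, [100.0, 100.0])
```

Fixing the third component to 1 before normalizing assumes that all rays point into the half-space in front of the camera, which is the usual situation for a central model with a moderate field of view.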
Zernike basis (unit disk)
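We map the image plane to the unit disk using:
\[
\tilde u = \frac{u-u_0}{R}, \qquad \tilde v = \frac{v-v_0}{R}, \qquad \rho = \sqrt{\tilde u^2+\tilde v^2}, \qquad \theta = \operatorname{atan2}(\tilde v,\tilde u),
\]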
where \(u_0,v_0\) are the image center coordinates and \(R\) is a radius covering the full image (circumscribed circle).
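We use real Zernike polynomials \(Z_k(\rho,\theta)\) (defined for \(\rho\in[0,1]\)) and approximate, with coefficient vectors denoted here \(a,b\in\mathbb{R}^K\):
\[
x(u,v) \approx \sum_{k=1}^{K} a_k\, Z_k(\rho,\theta), \qquad y(u,v) \approx \sum_{k=1}^{K} b_k\, Z_k(\rho,\theta).
\]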
In the implementation (CentralRayFieldZernike), \(K\) is set by the maximum radial order nmax (modes up to \(n\le n_{\max}\)).
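The sketch below shows one way to enumerate real Zernike modes up to a radial order and to build a design matrix on the unit disk. It is not the code of `zernike.py`: the mode ordering, the absence of a normalization factor, and the function names are assumptions made for illustration.

```python
import numpy as np
from math import factorial

def zernike_modes(nmax):
    """All (n, m) pairs with n <= nmax, |m| <= n and n - |m| even."""
    return [(n, m) for n in range(nmax + 1) for m in range(-n, n + 1, 2)]

def zernike_radial(n, m, rho):
    """Radial polynomial R_n^|m|(rho) from its standard finite sum."""
    m = abs(m)
    out = np.zeros_like(rho, dtype=float)
    for s in range((n - m) // 2 + 1):
        c = (-1) ** s * factorial(n - s) / (
            factorial(s)
            * factorial((n + m) // 2 - s)
            * factorial((n - m) // 2 - s)
        )
        out += c * rho ** (n - 2 * s)
    return out

def zernike_design_matrix(u, v, u0, v0, R, nmax):
    """Matrix A[i, k] = Z_k at pixel i, after mapping pixels to the unit disk."""
    ut = (np.asarray(u, dtype=float) - u0) / R
    vt = (np.asarray(v, dtype=float) - v0) / R
    rho = np.hypot(ut, vt)
    theta = np.arctan2(vt, ut)
    cols = []
    for n, m in zernike_modes(nmax):
        # Cosine modes for m >= 0, sine modes for m < 0 (unnormalized).
        angular = np.cos(m * theta) if m >= 0 else np.sin(-m * theta)
        cols.append(zernike_radial(n, m, rho) * angular)
    return np.stack(cols, axis=-1)

# 640 x 480 image: the circumscribed radius is half the diagonal (400 px).
A = zernike_design_matrix([320.0, 520.0], [240.0, 240.0], 320.0, 240.0, 400.0, nmax=8)
```

With this enumeration the number of modes is \(K=(n_{\max}+1)(n_{\max}+2)/2\), i.e. 45 modes for nmax=8 and 91 for nmax=12.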
Part A — GT fit (ridge / Tikhonov regression)
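With GT data, each 3D point \(P_i\) lies on the ray defined by its pixel. In normalized coordinates, for a camera with center \(C=(C_x,C_y,C_z)^\top\) whose axes are aligned with the left-camera frame:
\[
x_i = \frac{X_i - C_x}{Z_i - C_z}, \qquad y_i = \frac{Y_i - C_y}{Z_i - C_z},
\]
which reduces to \(x_i=X_i/Z_i\), \(y_i=Y_i/Z_i\) for the left camera (\(C_L=0\)).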
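We build a design matrix \(A\in\mathbb{R}^{N\times K}\) with \(A_{ik}=Z_k(\tilde u_i,\tilde v_i)\), and estimate coefficients with ridge regression (also called \(L^2\) regularization or Tikhonov). Writing \(x\in\mathbb{R}^N\) for the stacked targets \(x_i\) and \(a\in\mathbb{R}^K\) for the coefficients of \(x(u,v)\):
\[
\hat a = \arg\min_{a}\ \lVert A a - x\rVert^2 + \lambda \lVert a\rVert^2 = (A^\top A + \lambda I)^{-1} A^\top x,
\]
and likewise for the coefficients of \(y(u,v)\) with the targets \(y_i\).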
This baseline yields a compact ray-field without non-linear optimization (closed-form ridge regression).
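A minimal sketch of the closed-form ridge solve follows. The design matrix here is random and only stands in for a Zernike design matrix; how the actual implementation scales \(\lambda\) (for example by the number of samples) is not specified in this page, so the unscaled form below is an assumption.

```python
import numpy as np

def ridge_fit(A, target, lam):
    """Solve (A^T A + lam * I) a = A^T target for the coefficient vector a."""
    K = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(K), A.T @ target)

# Stand-in data: a random design matrix and noiseless targets.
rng = np.random.default_rng(0)
A_demo = rng.normal(size=(200, 6))
a_true = np.array([0.5, -1.0, 0.25, 0.0, 2.0, -0.75])
target = A_demo @ a_true

a_exact = ridge_fit(A_demo, target, lam=0.0)    # plain least squares
a_shrunk = ridge_fit(A_demo, target, lam=50.0)  # coefficients shrink toward 0
```

Solving the normal equations with `np.linalg.solve` avoids forming an explicit inverse; for poorly conditioned design matrices, an SVD- or QR-based solve would be a more robust alternative.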
Triangulation and metrics
For a pair \((p^L_i, p^R_i)\) we obtain two rays:
\[
\ell^L_i(t) = C_L + t\,\hat d_L(p^L_i), \qquad \ell^R_i(s) = C_R + s\,\hat d_R(p^R_i).
\]
We reconstruct \(\hat P_i\) using midpoint triangulation (midpoint of the common perpendicular segment). We report:
3D error: \(e_i = \lVert \hat P_i - P_i\rVert\) (mm),
skew-ray distance: \(d^{\mathrm{skew}}_i = \mathrm{dist}(\ell^L_i,\ell^R_i)\) (mm), i.e., the length of the common perpendicular segment.
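Midpoint triangulation and the skew-ray distance can be sketched together, since both come from the closest points on the two lines. The function name and the vectorized layout are illustrative; near-parallel rays (vanishing denominator) are deliberately not handled.

```python
import numpy as np

def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the common perpendicular of lines o1 + t*d1 and o2 + s*d2.

    Inputs broadcast to shape (N, 3). Returns (midpoints, skew distances).
    """
    o1, d1, o2, d2 = (np.atleast_2d(np.asarray(a, dtype=float)) for a in (o1, d1, o2, d2))
    w = o1 - o2
    a = np.sum(d1 * d1, axis=-1)
    b = np.sum(d1 * d2, axis=-1)
    c = np.sum(d2 * d2, axis=-1)
    d = np.sum(d1 * w, axis=-1)
    e = np.sum(d2 * w, axis=-1)
    denom = a * c - b * b  # zero for parallel lines (not handled here)
    t = (b * e - c * d) / denom
    s = (a * e - b * d) / denom
    p1 = o1 + t[:, None] * d1  # closest point on the first line
    p2 = o2 + s[:, None] * d2  # closest point on the second line
    return 0.5 * (p1 + p2), np.linalg.norm(p1 - p2, axis=-1)

# Rays from two centers 100 mm apart toward the same 3D point.
P = np.array([10.0, 20.0, 500.0])
C_L = np.zeros(3)
C_R = np.array([100.0, 0.0, 0.0])
dL = P / np.linalg.norm(P)
dR = (P - C_R) / np.linalg.norm(P - C_R)
P_hat, skew = triangulate_midpoint(C_L, dL, C_R, dR)
```

For exactly intersecting rays the skew distance is zero and the midpoint is the intersection; with noisy directions the skew distance gives a per-point consistency indicator that needs no ground truth.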
Pinhole “oracle” baseline (reference)
On a synthetic pinhole dataset, we know the exact parameters:
focal length \(f\) (via sim_params.f_um),
Brown distortion (via sim_params.distortion_left/right),
pixel pitch (via meta.json).
We can therefore map a distorted pixel to an undistorted ray as:
pixel \((u,v)\) → sensor coordinates \((x_{\mu m},y_{\mu m})\),
distorted normalization: \(x_d=x_{\mu m}/f_{\mu m}\), \(y_d=y_{\mu m}/f_{\mu m}\),
Brown inversion: \((x,y)=\mathrm{undistort}(x_d,y_d)\),
direction: \(\hat d = \mathrm{normalize}([x,y,1])\).
This is not a “fit”: it is an oracle (the expected lower bound on error for this dataset).
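The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the dataset's exact convention: it assumes a Brown model with two radial and two tangential coefficients `(k1, k2, p1, p2)`, a principal point `(cx, cy)` given in pixels, and a fixed-point iteration for the inversion. All function and parameter names are hypothetical.

```python
import numpy as np

def brown_distort(x, y, k1, k2, p1, p2):
    """Forward Brown model on normalized coordinates (assumed convention)."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return xd, yd

def brown_undistort(xd, yd, k1, k2, p1, p2, iters=20):
    """Invert the Brown model by fixed-point iteration (fine for mild distortion)."""
    x, y = xd, yd
    for _ in range(iters):
        r2 = x * x + y * y
        radial = 1.0 + k1 * r2 + k2 * r2 * r2
        dx = 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
        dy = p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
        x = (xd - dx) / radial
        y = (yd - dy) / radial
    return x, y

def pixel_to_ray(u, v, cx, cy, pitch_um, f_um, dist):
    """Distorted pixel -> sensor (um) -> normalized -> undistorted unit ray."""
    x_um = (np.asarray(u, dtype=float) - cx) * pitch_um
    y_um = (np.asarray(v, dtype=float) - cy) * pitch_um
    x, y = brown_undistort(x_um / f_um, y_um / f_um, *dist)
    d = np.stack([x, y, np.ones_like(x)], axis=-1)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)

dist = (-0.1, 0.01, 1e-3, -5e-4)  # illustrative coefficients only
xd, yd = brown_distort(0.1, -0.05, *dist)
x_rec, y_rec = brown_undistort(xd, yd, *dist)
ray_center = pixel_to_ray(320.0, 240.0, 320.0, 240.0, 3.45, 16000.0, dist)
```

The fixed-point inversion converges quickly when the distortion displacement is small relative to the normalized radius; strongly distorted lenses would call for a Newton solve instead.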
Full GT example and comparison
Command:
.venv/bin/python paper/experiments/compare_pinhole_vs_rayfield3d_gt.py \
--scene dataset/v0_png/train/scene_0000 \
--gt gt_points.npz \
--nmax 12 \
--lam 1e-3
Metrics (mm, px, %)
We report:
triangulation error in mm: \(e_i = \lVert \hat P_i - P_i\rVert\),
relative error (order-of-magnitude): \(100\,e_i / \bar Z\) (%) where \(\bar Z\) is mean depth,
reprojection error in pixels (left/right), by reprojecting \(\hat P_i\) through the GT Brown pinhole model and comparing to GT \((u,v)\).
Outputs (summary, order-of-magnitude):
| 3D method | Triangulation RMS (mm) | Triangulation RMS (% depth) | Reproj RMS L/R (px) | Skew RMS (mm) |
|---|---|---|---|---|
| Pinhole oracle (GT params) | \(\approx 1\times 10^{-4}\) | \(\approx 1\times 10^{-5}\) | \(\approx 5\times 10^{-6}\) | \(\approx 1\times 10^{-5}\) |
| Central 3D ray-field (Zernike) | \(\approx 3.2\times 10^{-1}\) | \(\approx 2.4\times 10^{-2}\) | \(\approx 4\times 10^{-2}\) | \(\approx 2.8\times 10^{-2}\) |
Quick reading:
On pinhole data, the pinhole oracle is nearly perfect (as expected).
The central 3D ray-field is a compact approximation: its performance depends strongly on nmax (capacity) and \(\lambda\) (smoothness). It mainly serves as a stepping stone towards future “complex optics” models.
Code references
Zernike basis (real modes + design matrix): src/stereocomplex/core/model_compact/zernike.py
Central model CentralRayFieldZernike: src/stereocomplex/core/model_compact/central_rayfield.py
Pinhole-oracle vs 3D ray-field GT comparison: paper/experiments/compare_pinhole_vs_rayfield3d_gt.py
Part B — From images: detection + 2D ray-field, then reconstruction (GT-assisted)
This section connects the 3D ray-field chapter to the 2D identification pipeline:
OpenCV ChArUco detection on images (measured pixels),
center correction using the 2D ray-field (rayfield_tps_robust),
3D reconstruction by triangulation, with two 3D methods:
pinhole oracle: rays by Brown inversion using GT synthesis parameters,
central 3D ray-field: fit Zernike on \((u,v)\leftrightarrow P\) (GT) then triangulate.
Script
.venv/bin/python paper/experiments/compare_3d_from_images_rayfield2d.py \
dataset/v0_png \
--split train --scene scene_0000 \
--tps-lam 10 --tps-huber 3 --tps-iters 3 \
--nmax 12 --lam3d 1e-3
The script writes a JSON metrics file (default: paper/tables/3d_from_images_rayfield2d.json) and prints the same content to stdout.
Results (example)
The table below illustrates a run on scene_0000 (5 frames). On these synthetic images, the 2D ray-field correction substantially reduces the 2D pixel error, and triangulation improves as a direct consequence for both 3D reconstructions.
| 2D method | 2D RMS L/R (px) | Pinhole oracle: 3D RMS (mm) | 3D ray-field: 3D RMS (mm) |
|---|---|---|---|
| OpenCV raw | \(\approx 0.38 / 0.36\) | \(\approx 3.82\) | \(\approx 3.82\) |
| 2D ray-field | \(\approx 0.23 / 0.14\) | \(\approx 1.28\) | \(\approx 1.33\) |
Note: in this section the “3D ray-field” fit can use GT 3D correspondences to isolate the impact of 2D measurement noise. A full calibration without GT 3D is covered next.
Compression stress-test (PNG lossless vs WebP lossy)
To evaluate whether the ray-based pipeline remains usable under strong image compression, we compare the same scene stored as:
PNG (lossless),
WebP (lossy, low quality).
This comparison runs three pipelines:
OpenCV pinhole calibration from raw ChArUco corners,
OpenCV pinhole calibration from 2D ray-field refined corners,
central 3D ray-field (bundle adjustment) calibrated from 2D ray-field refined corners.
Command (prints a markdown summary and writes a JSON report):
.venv/bin/python paper/experiments/compare_compression_3d_methods.py \
--png dataset/compression_sweep/png_lossless \
--webp dataset/compression_sweep/webp_q70 \
--split train --scene scene_0000 \
--out paper/tables/compression_compare/compression_compare_3d_methods.json
For a more complete WebP quality sweep and a discussion of non-intuitive effects (compression can sometimes help by acting as a low-pass filter), see: Image compression and 3D reconstruction.
Part C — Ray-based calibration (no GT 3D): point↔ray bundle adjustment
This section replaces the “GT-assisted 3D fit” by a full calibration from:
multi-pose board correspondences \((u,v)\leftrightarrow (X,Y,0)\),
a compact central ray-field \(d(u,v)\) (Zernike),
per-frame board poses \((R_i,t_i)\).
Connection to the general imaging model (non-central cameras)
The long-term goal (complex/non-central optics) is to represent each pixel by a 3D line rather than by a single central ray (constant origin). A standard way to represent a 3D line is via Plücker coordinates, typically written as a stacking of direction and moment vectors \((\mathbf d,\mathbf m)\), with the orthogonality constraint \(\langle \mathbf d,\mathbf m\rangle=0\). This is precisely the “general imaging model” viewpoint (pixel \(\rightarrow\) 3D line) and enables non-central cameras.
Miraldo et al. propose a compact, continuous version of the general imaging model by interpolating the line parameters with
radial basis functions (RBF), and derive a linear calibration procedure from point↔line incidence constraints
(Point-based Calibration using a Parametric Representation of the General Imaging Model, ICCV 2011, DOI: 10.1109/ICCV.2011.6126511).
Our current bundle-adjustment formulation is a central specialization: lines pass through a constant origin \(C\), so a pixel maps to a unit direction \(\hat{\mathbf d}(u,v)\) only. Extending the current code to non-central optics can follow the same structure (compact interpolation + global regularization), but with a per-pixel line representation (e.g., Plücker) instead of a single origin.
Geometric residual
For an observation \((u_{ij},v_{ij})\) of board point \(P_j\) in image \(i\):
camera-frame point: \(P^{\mathrm{cam}}_{ij}=R_i P_j + t_i\),
unit direction: \(\hat d_{ij}=\hat d(u_{ij},v_{ij})\).
The ray is \(\ell_{ij}(t)=C+t\hat d_{ij}\) (here \(C\) is constant, and we fix \(C=(0,0,0)^\top\)).
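We minimize the point↔ray distance using the vector residual:
\[
r_{ij} = \bigl(I - \hat d_{ij}\hat d_{ij}^\top\bigr)\,\bigl(P^{\mathrm{cam}}_{ij} - C\bigr),
\]
i.e., the component of \(P^{\mathrm{cam}}_{ij}-C\) orthogonal to the ray, whose norm is the point↔ray distance.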
This residual is minimized with a robust loss (Huber) and an \(L^2\) regularization on Zernike coefficients.
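One common way to write a point↔ray vector residual is to remove from \(P^{\mathrm{cam}}-C\) its component along the unit direction. The sketch below uses that orthogonal-projection form; whether `central_stereo_ba.py` uses this form or an equivalent cross-product form is not stated here, and the function name is illustrative.

```python
import numpy as np

def point_ray_residual(P_cam, d_hat, C=(0.0, 0.0, 0.0)):
    """Component of (P_cam - C) orthogonal to the unit direction d_hat.

    P_cam and d_hat have shape (N, 3); the residual norm is the
    Euclidean point-to-ray distance.
    """
    w = np.asarray(P_cam, dtype=float) - np.asarray(C, dtype=float)
    d_hat = np.asarray(d_hat, dtype=float)
    along = np.sum(w * d_hat, axis=-1, keepdims=True)
    return w - along * d_hat

# A point 5 mm off the optical axis at 10 mm depth, and a point on the axis.
P_cam = np.array([[3.0, 4.0, 10.0], [0.0, 0.0, 250.0]])
d_hat = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
r = point_ray_residual(P_cam, d_hat)
dist_mm = np.linalg.norm(r, axis=-1)
```

A 3-vector residual (rather than the scalar distance) keeps the problem smooth near zero, which suits Gauss-Newton-type solvers.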
Joint optimization (stereo)
In the stereo version, we optimize simultaneously:
Zernike coefficients of \(d_L(u,v)\) and \(d_R(u,v)\),
a single rigid pose of the rig \((R_{RL},t_{RL})\) such that \(P_R = R_{RL}P_L+t_{RL}\),
board poses per image in the left-camera frame \((R_i,t_i)\).
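We solve with scipy.optimize.least_squares using a Huber loss and \(L^2\) coefficient regularization. This is a robust Gauss-Newton-type trust-region solve: SciPy's Levenberg-Marquardt backend (method="lm") supports only the linear loss, so a robust loss requires method="trf" or method="dogbox".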
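To illustrate the solver configuration on a problem small enough to read at a glance, the sketch below fits the normalized coordinates \((x,y)\) of a single ray to noisy 3D points containing gross outliers. It is a toy stand-in for the bundle adjustment, not the project's cost: the synthetic data, the penalty weight, and the function name are all assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def ray_fit_residuals(params, points, lam):
    """Point-to-ray residuals for a ray through the origin, plus an L2 penalty.

    The penalty is appended as extra residuals sqrt(lam) * params; note that
    SciPy applies the robust loss to these rows as well.
    """
    d = np.array([params[0], params[1], 1.0])
    d /= np.linalg.norm(d)
    along = points @ d
    perp = points - along[:, None] * d
    return np.concatenate([perp.ravel(), np.sqrt(lam) * params])

# Synthetic points along a known ray, with small noise and a few outliers.
rng = np.random.default_rng(0)
d_true = np.array([0.1, -0.2, 1.0])
d_true /= np.linalg.norm(d_true)
ranges = np.linspace(200.0, 800.0, 40)
points = ranges[:, None] * d_true + rng.normal(scale=0.05, size=(40, 3))
points[::10] += 5.0  # four gross outliers (5 mm on each axis)

fit = least_squares(
    ray_fit_residuals,
    x0=np.zeros(2),
    args=(points, 1e-6),
    loss="huber",
    f_scale=1.0,   # residuals above about 1 mm are down-weighted
    method="trf",  # "lm" would reject a robust loss
)
x_hat, y_hat = fit.x
```

Here `f_scale` plays the role of the inlier/outlier threshold in residual units (mm), which is presumably what the `--fscale-mm` option of the calibration script controls.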
Script (images → 2D ray-field → 3D ray-field bundle adjustment → stereo)
.venv/bin/python paper/experiments/calibrate_central_rayfield3d_from_images.py \
dataset/v0_png \
--split train --scene scene_0000 \
--max-frames 5 \
--method2d rayfield_tps_robust \
--tps-lam 10 --tps-huber 3 --tps-iters 3 \
--nmax 8 --lam-coeff 1e-3 --outer-iters 3 --fscale-mm 1.0
Output: JSON (default: paper/tables/rayfield3d_ba_from_images.json) with:
estimated baseline (mm + px-equivalent at mean depth),
3D errors (mm and % depth),
reprojection errors (px), and skew-ray distances (mm),
optimization diagnostics (cost per iteration),
an opencv_pinhole_calib section: OpenCV pinhole calibration (intrinsics + distortion + stereo rig) on the same 2D points.
Results (example)
On scene_0000 (5 frames), the “pinhole oracle” remains a lower bound (pinhole + GT Brown). The central 3D ray-field (Zernike, central model) is calibrated without solvePnP and without a known \(K\): initial board poses are obtained from homographies (Zhang-style) only as an initialization, and the solver then directly optimizes the point↔ray cost (robust Gauss-Newton via SciPy).
| 3D method (same 2D points) | Baseline abs. err. (mm) | Baseline abs. err. (px) | 3D RMS (mm) | Reproj RMS L/R (px) |
|---|---|---|---|---|
| Pinhole oracle (GT params) | \(0\) | \(0\) | \(\approx 1.28\) | \(\approx 0.20 / 0.15\) |
| OpenCV pinhole calibrated (images, non-GT) | \(\approx 0.32\) | \(\approx 0.29\) | \(\approx 14.48\) | \(\approx 3.02 / 2.77\) |
| 3D ray-field (bundle adjustment, central Zernike) | \(\approx 0.21\) | \(\approx 0.19\) | \(\approx 1.55\) | \(\approx 1.36 / 1.33\) |
Note: for the “3D ray-field” row, the 3D RMS and reprojections are computed after a fixed-origin similarity alignment (rotation + scale, no translation) between the reconstruction and the GT reference. Without this step, errors “in the GT frame” become arbitrarily large because the point↔ray cost does not, by itself, fix the global frame choice (gauge).
Discussion: (i) baseline, (ii) reprojection, (iii) triangulation
The table highlights three important points:
Baseline is now better with the ray-field. Here, ray-based calibration yields a smaller baseline error than OpenCV pinhole calibration (mm and px-equivalent). This is consistent with the fact that the ray-based optimization is constrained by a single rig \((R_{RL},t_{RL})\) and a geometric point↔ray cost over all observations, which limits the “intrinsics ↔ distortion ↔ extrinsics” compensations typical of pinhole calibration on planar targets.
Baseline: norm vs direction. A small error on \(\lVert C_R\rVert\) does not guarantee a perfect direction. In this example, both methods produce a slightly off-axis baseline (non-zero \(y,z\) components), so the script also reports the angle to the \(x\) axis and the off-axis norm (see the script JSON). For example, on scene_0000 the angle is about \(3.38^\circ\) (ray-field) versus \(2.62^\circ\) (OpenCV), despite a smaller norm error on the ray-field side. This illustrates why both baseline norm and baseline direction matter.
Why “non-GT pinhole” can have a decent baseline but poor reprojection/3D vs GT. OpenCV minimizes its own image error, but the errors reported here are measured against the GT model (synthetic pinhole + Brown). A pinhole calibration can thus be self-consistent (low mono_rms_*) while still far from the GT parameters (high GT reprojection error), especially due to identifiability couplings on planar targets.
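The baseline direction diagnostics mentioned above can be computed from the estimated right-camera center alone. The sketch below assumes the angle is measured to the positive \(x\) axis and that the off-axis norm is the norm of the \((y,z)\) components; the script's exact definitions may differ, and the function name and sample values are illustrative.

```python
import numpy as np

def baseline_direction_metrics(C_R):
    """Baseline norm (mm), angle to the +x axis (deg), and off-axis norm (mm)."""
    C_R = np.asarray(C_R, dtype=float)
    norm = np.linalg.norm(C_R)
    off_axis = np.linalg.norm(C_R[1:])  # length of the (y, z) part
    angle_deg = np.degrees(np.arctan2(off_axis, C_R[0]))
    return norm, angle_deg, off_axis

# A baseline of nearly correct length that is tilted away from the x axis.
norm, angle_deg, off_axis = baseline_direction_metrics([99.8, 4.0, 3.0])
```

In this illustrative case the norm is within about 0.1 mm of 100 mm while the direction is off by almost 3 degrees, which is the situation the discussion warns about.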
Practical takeaway:
in robotics (rectification, dense stereo), baseline accuracy and epipolar coherence often dominate matching success;
in metrology (stereo-DIC), ray-based calibration can stabilize stereo geometry when a global pinhole becomes only an approximation.
Discussion: why the ray-field calibration needs an “aligned” comparison
The point↔ray cost is invariant to a global rotation of the camera frame about the fixed origin and, to some extent, to a scale factor coupled to depth (limited identifiability on planar targets). In other words: the calibration is defined up to a gauge, while GT enforces an absolute reference frame (left camera, \(x\) axis aligned with the baseline, etc.). To avoid conflating “bad geometry” with “different frame”, we report:
an “aligned” 3D RMS (rotation + scale),
an “aligned” reprojection RMS (GT projection after alignment).
These metrics reflect practical reconstruction interest (coherence and stability), while baseline (mm and px-equivalent) remains a directly interpretable stereo-vision quantity.
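A fixed-origin similarity alignment (rotation + scale, no translation) has a closed-form least-squares solution based on an SVD, similar to the Kabsch/Umeyama procedure but without centering the point sets. The sketch below is one such solution; it is not claimed to be the routine used by the script, and the names are illustrative.

```python
import numpy as np

def align_rotation_scale(Q, P):
    """Least-squares (s, R) such that s * R @ q_i ~ p_i, with no translation.

    Q: reconstructed points (N, 3); P: reference points (N, 3).
    """
    H = Q.T @ P  # 3x3 correlation matrix, not centered (origin stays fixed)
    U, S, Vt = np.linalg.svd(H)
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.array([1.0, 1.0, sign])  # guards against a reflection
    R = Vt.T @ np.diag(D) @ U.T
    s = np.sum(S * D) / np.sum(Q * Q)
    return s, R

# Reference cloud, and a "reconstruction" in a rotated, rescaled frame.
rng = np.random.default_rng(1)
P_ref = rng.normal(size=(50, 3)) + np.array([0.0, 0.0, 5.0])
a = 0.3
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0,        0.0,       1.0]])
Q_rec = (P_ref @ Rz.T) / 1.2

s, R = align_rotation_scale(Q_rec, P_ref)
aligned = s * Q_rec @ R.T
rms = np.sqrt(np.mean(np.sum((aligned - P_ref) ** 2, axis=-1)))
```

Because the origin is kept fixed, this alignment cannot absorb a translation error of the camera center, so the aligned RMS still penalizes a misplaced origin.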
Post-hoc pinhole identification from the ray-field 3D reconstruction
To assess whether a ray-based 3D reconstruction can help identify a conventional pinhole model, the script also performs a post-hoc pinhole fit:
input: reconstructed 3D points (left-camera frame) from the ray-field 3D model, and the corresponding observed pixels,
model: Brown pinhole \((K, d)\) per camera,
solver: scipy.optimize.least_squares (Huber loss), with an additional per-camera global rotation (gauge correction).
This produces a new JSON block:
pinhole_from_rayfield3d (estimated \(K,d\) + reprojection RMS on the same correspondences),
pinhole_vs_gt (relative parameter errors vs GT for synthetic datasets).
On the same example (scene_0000, 5 frames), the distortion-field error relative to GT decreases compared to the direct OpenCV pinhole calibration:
| Method | dist err L (% of GT) | dist err R (% of GT) | fx err L (%) | fx err R (%) |
|---|---|---|---|---|
| OpenCV pinhole calib (images → pinhole) | 18.81 | 19.26 | 0.94 | 0.44 |
| Pinhole from ray-field 3D (images → ray-field 3D → pinhole) | 13.55 | 13.43 | 0.75 | 0.85 |
Commentary (Tab. 8): the distortion-field error (measured in pixel space) is reduced by about 28% on the left camera (18.81 → 13.55) and about 30% on the right camera (19.26 → 13.43) compared to the direct OpenCV pinhole calibration. The focal-length error does not improve uniformly: it decreases on the left camera (0.94% → 0.75%) but increases on the right (0.44% → 0.85%). This suggests that, even on synthetic pinhole data, reconstructing a geometrically consistent 3D first (ray-based) can improve the identification of the distortion field of a conventional pinhole + Brown model when compared to fitting pinhole parameters directly from noisy 2D detections.
Notes:
The “dist err (% of GT)” is computed in pixel space via distortion-displacement vectors on sampled circles (see pinhole_vs_gt.*.distortion_displacement_vs_gt in the script JSON).
The post-hoc reprojection RMS reported under pinhole_from_rayfield3d.reprojection_error_* is a self-consistency metric on the same correspondences used to reconstruct the 3D points, and should not be interpreted as a standalone accuracy guarantee.
Reference: Miraldo, P., Araujo, H., Queiro, J., “Point-based Calibration using a Parametric Representation of the General Imaging Model”, ICCV 2011. DOI: 10.1109/ICCV.2011.6126511.
Usage after identification (robotics / stereo-DIC)
This section clarifies what is required and what it costs to use a ray-field model once calibrated, i.e., “after identification” of 2D correspondences (ChArUco, dense stereo matching, optical flow, DIC correlation, etc.).
Minimal inputs/outputs
To reconstruct a 3D field from a stereo pair (left/right), one needs:
Stereo model (calibration):
rig \((R_{RL},t_{RL})\),
left ray-field \(d_L(u,v)\) and right ray-field \(d_R(u,v)\) (Zernike coefficients, central model).
2D correspondences:
either pairs \((u_L,v_L)\leftrightarrow(u_R,v_R)\),
or a disparity map \(d(u,v)\) on a rectified image (classic robotics case).
Outputs:
a point cloud (or field) \(\hat P\) in mm in the left-camera frame,
optionally a per-point quality metric (skew-ray distance).
Per-point computation and algorithmic cost
For each correspondence:
Pixel → ray (left and right):
pinhole: normalization + (un)distortion + normalization → \(\hat d\),
ray-field: evaluate \(x(u,v),y(u,v)\) (Zernike), then \(\hat d=\mathrm{normalize}([x,y,1])\).
Triangulation (least-squares intersection):
midpoint of the common perpendicular segment (a few vector operations).
In terms of complexity for \(N\) correspondences:
pinhole: \(\mathcal{O}(N)\) (small constant cost),
Zernike ray-field: \(\mathcal{O}(N\,K)\) if evaluating \(K\) modes explicitly (e.g., \(K=45\) for nmax=8), plus \(\mathcal{O}(N)\) for triangulation.
In practice, runtime can be brought to the same order as pinhole by precomputing a ray-direction map:
precompute once: \(d(u,v)\) for all image pixels (amortized cost),
real-time: lookup \(d\) + triangulation → \(\mathcal{O}(N)\).
This precompute stores one \((H\times W\times 3)\) float32 array per camera, i.e., \(12\,HW\) bytes: about 3.5 MiB at 640×480 and about 24 MiB at 1920×1080, which is typically acceptable in robotics.
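A minimal sketch of the precompute-then-lookup idea follows. The callable `xy_field` stands in for the calibrated ray-field evaluation (here a toy pinhole-like field), and nearest-pixel lookup is an assumption chosen for brevity; the names are hypothetical.

```python
import numpy as np

def precompute_ray_map(height, width, xy_field):
    """Evaluate unit ray directions once for every pixel: (H, W, 3) float32."""
    v, u = np.mgrid[0:height, 0:width].astype(np.float64)
    x, y = xy_field(u, v)
    d = np.stack([x, y, np.ones_like(x)], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    return d.astype(np.float32)

def lookup_rays(ray_map, u, v):
    """Nearest-pixel lookup of precomputed directions for N pixel positions."""
    h, w = ray_map.shape[:2]
    ui = np.clip(np.rint(u).astype(int), 0, w - 1)
    vi = np.clip(np.rint(v).astype(int), 0, h - 1)
    return ray_map[vi, ui]

def toy_xy_field(u, v):
    """Stand-in for the calibrated field x(u, v), y(u, v)."""
    return (u - 320.0) / 800.0, (v - 240.0) / 800.0

ray_map = precompute_ray_map(480, 640, toy_xy_field)
rays = lookup_rays(ray_map, np.array([320.0, 10.4]), np.array([240.0, 99.6]))
size_mib = ray_map.nbytes / 2**20
```

For sub-pixel correspondences (typical in DIC), bilinear interpolation of the map followed by renormalization, or direct evaluation of the basis at the few points of interest, would avoid the rounding error of nearest-pixel lookup.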
For a robustness evaluation across board sizes/focal lengths/aberrations (OpenCV-only sweep), see: Robustness sweep.
Real-time pipeline (robotics)
For dense stereo in robotics (depth map), a realistic pipeline is:
precompute \(d_L(u,v)\) and \(d_R(u,v)\) (ray directions) over the image grid,
compute correspondences (stereo matching):
either by rectifying to a virtual (pinhole) camera then using standard disparity,
or directly on non-rectified images using a more general matcher,
triangulate point-by-point and output depth / point cloud.
If rectifying to a virtual camera, the additional step compared to pinhole is building remap tables once, then running cv2.remap (real-time, optimized).
Two-time pipeline (stereo-DIC)
With two stereo pairs (reference + deformed), a 3D displacement field can be obtained by:
identifying stereo correspondences at \(t_0\) and \(t_1\) (or tracking points between \(t_0\to t_1\)),
triangulating \(\hat P(t_0)\) and \(\hat P(t_1)\) with the same stereo model,
computing \(\Delta \hat P = \hat P(t_1)-\hat P(t_0)\).
Again, the ray-field overhead relative to pinhole is concentrated in pixel→ray evaluation; with a precomputed map, reconstruction remains compatible with high frame rates.
Model size (“parameter complexity”)
A central Zernike ray-field remains compact:
per camera: \(2K\) coefficients (for \(x\) and \(y\)), e.g. \(2\times 45=90\) scalars at nmax=8,
stereo: +6 rig parameters \((R_{RL},t_{RL})\).
This is larger than a pinhole model (on the order of ten scalars per camera for focal length, principal point, and Brown distortion) but remains small in absolute terms, and the representation is more flexible (it does not enforce a particular polynomial distortion form).
Code references
Central stereo point↔ray bundle adjustment:
src/stereocomplex/ray3d/central_stereo_ba.pyExperimental driver (images → bundle adjustment):
paper/experiments/calibrate_central_rayfield3d_from_images.py