Radiance Fields from Deep-Learning-Based Structure-from-Motion


Posted by Hwan Heo on July 21, 2024

1. Introduction


What is Structure-from-Motion?

Figure 1. Structure-from-Motion, source: Privacy Preserving Structure-from-Motion

Structure from Motion (SfM) is a computer vision technique that reconstructs 3D structures from 2D images. It's widely used in various fields such as drone mapping, robotics, and virtual reality. The primary goal of SfM is to estimate both the camera positions and the 3D point cloud from multiple overlapping images.
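
Formally, given 2D observations of scene points across images, SfM jointly estimates camera parameters and 3D points by minimizing reprojection error. In standard multi-view geometry notation (not tied to any specific method in this post):

$$
\min_{\{K_i, R_i, \mathbf{t}_i\},\ \{\mathbf{X}_j\}} \ \sum_{(i,j)} \left\| \mathbf{x}_{ij} - \pi\!\left( K_i \left( R_i \mathbf{X}_j + \mathbf{t}_i \right) \right) \right\|^2
$$

where $\mathbf{x}_{ij}$ is the observed 2D location of point $j$ in image $i$, $(K_i, R_i, \mathbf{t}_i)$ are the intrinsics and pose of camera $i$, and $\pi$ denotes perspective division. Bundle adjustment, discussed below, optimizes exactly this objective.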

COLMAP: The Gold Standard in SfM


Figure 2. COLMAP, source: COLMAP

COLMAP is a widely recognized software for Structure from Motion (SfM) and Multi-View Stereo (MVS), playing a crucial role in the 3D reconstruction community.

Developed by Johannes L. Schönberger, COLMAP has become the gold standard for 3D reconstruction tasks due to its robustness, accuracy, and versatility.

How COLMAP Works


Figure. The COLMAP pipeline.

  1. Feature Extraction: The first step in COLMAP is to detect and extract features from input images using methods like SIFT (Scale-Invariant Feature Transform).
  2. Feature Matching: COLMAP matches features between pairs of images to find correspondences, which are essential for 3D reconstruction.
  3. Incremental Reconstruction: COLMAP incrementally builds the 3D structure by starting with an initial image pair and gradually adding more images, refining the 3D model along the way.
  4. Bundle Adjustment: After the incremental reconstruction, COLMAP performs global optimization (Bundle Adjustment) to refine the camera poses and 3D points.
  5. Dense Reconstruction: Finally, COLMAP generates a dense point cloud by performing Multi-View Stereo on the calibrated images.
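
As a concrete illustration of steps 1–4, the sparse part of this pipeline can be scripted with the pycolmap Python bindings. This is a minimal sketch: the paths are placeholders, and exact function signatures can vary between pycolmap versions.

```python
# Minimal sparse-reconstruction sketch using the pycolmap bindings.
# Paths are placeholders; API details may vary between pycolmap versions.
import pycolmap

database_path = "scene/database.db"  # SQLite DB holding features & matches
image_dir = "scene/images"           # directory of overlapping input images
output_dir = "scene/sparse"          # where reconstructions are written

# 1. Feature extraction (SIFT by default)
pycolmap.extract_features(database_path, image_dir)

# 2. Exhaustive pairwise feature matching
pycolmap.match_exhaustive(database_path)

# 3-4. Incremental mapping with bundle adjustment; returns a dict of
# reconstructions, one per connected component of the scene graph
reconstructions = pycolmap.incremental_mapping(database_path, image_dir, output_dir)
for idx, rec in reconstructions.items():
    print(idx, rec.summary())  # registered images, 3D points, etc.
```

Dense reconstruction (step 5) additionally runs the MVS stages (undistortion, patch-match stereo, fusion), which require a CUDA-enabled COLMAP build.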

Limitations of COLMAP

Despite its strengths, COLMAP has some limitations:

  • Sensitivity to Image Quality: The accuracy of COLMAP heavily depends on the quality and overlap of the input images. Poorly aligned or low-quality images can lead to errors in the reconstruction.
  • Limited Scalability & Robustness: While COLMAP works well for medium-sized datasets, it can struggle with very large collections (its incremental approach scales poorly) and with very small or sparsely overlapping ones, where there are too few correspondences to register images reliably.
  • Time-Consuming: The reconstruction process, especially dense reconstruction, can be time-consuming, making it less suitable for real-time applications.

2. Deep Learning-Based Camera Pose Reconstruction

To overcome the limitations of traditional SfM methods like COLMAP, recent developments have introduced deep learning-based approaches for camera pose reconstruction. Later in this post, we'll compare two notable methods, VGGSfM and MASt3R, evaluating their performance and discussing their strengths and weaknesses.

VGGSfM: Visual Geometry Grounded Deep Structure from Motion

VGGSfM is a deep-learning-based approach to Structure-from-Motion (SfM) that offers a significant advancement over traditional methods like COLMAP. Unlike conventional SfM, which relies on a sequential, non-differentiable process, VGGSfM employs an end-to-end differentiable pipeline.

Figure 3. Overview of VGGSfM

It starts with deep feature extraction and 2D point tracking, followed by simultaneous initialization of all camera poses, differentiable triangulation, and joint optimization through bundle adjustment.

This integration allows VGGSfM to refine each step during training, leading to more accurate and robust 3D reconstructions. The method scales effectively, handles complex scenes, and has shown state-of-the-art results on various benchmarks. In contrast, while COLMAP remains a powerful tool, its reliance on traditional techniques limits its adaptability and performance in comparison to VGGSfM.
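
To make "end-to-end differentiable" concrete, here is a toy PyTorch sketch of a differentiable bundle-adjustment step. It is illustrative only (not VGGSfM's actual code), with simplified rotation handling and random stand-in data; the point is that the projection and loss are built from differentiable ops, so gradients reach both poses and structure (and, in VGGSfM, the upstream networks).

```python
# Toy differentiable bundle adjustment in PyTorch (illustrative only; not
# VGGSfM's implementation).
import torch

N_CAMS, N_PTS = 4, 100
K = torch.eye(3)                      # shared, known intrinsics (toy assumption)

# Parameters optimized jointly: camera rotations/translations and 3D points.
# NOTE: optimizing a raw 3x3 matrix does not stay on SO(3); real systems use
# a proper rotation parametrization (quaternion, axis-angle, ...).
R = torch.eye(3).repeat(N_CAMS, 1, 1).requires_grad_()
t = torch.zeros(N_CAMS, 3, requires_grad=True)
X = torch.randn(N_PTS, 3, requires_grad=True)

obs = torch.randn(N_CAMS, N_PTS, 2)   # 2D tracks: point j observed in camera i

optim = torch.optim.Adam([R, t, X], lr=1e-3)
for _ in range(100):
    cam_pts = torch.einsum("cij,pj->cpi", R, X) + t[:, None, :]  # R_i X_j + t_i
    proj = torch.einsum("ij,cpj->cpi", K, cam_pts)               # K (R X + t)
    xy = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)          # perspective divide

    loss = ((xy - obs) ** 2).sum(-1).mean()  # mean squared reprojection error
    optim.zero_grad()
    loss.backward()                          # gradients reach poses AND structure
    optim.step()
```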

MASt3R: Grounding Image Matching in 3D with MASt3R

MASt3R (Matching And Stereo 3D Reconstruction) is a cutting-edge framework designed to enhance 3D scene reconstruction and dense image matching, particularly in challenging conditions with extreme viewpoint changes. It builds on the DUSt3R architecture by employing a shared Vision Transformer (ViT) encoder and cross-attention decoders to capture spatial relationships and 3D geometry between image pairs.

Figure 4. Overview of MASt3R.
MASt3R's contributions relative to the DUSt3R framework are highlighted in blue.

MASt3R predicts dense 3D pointmaps and local feature maps, using a coarse-to-fine matching strategy to achieve high accuracy in matching. Its use of a fast nearest-neighbor algorithm and iterative fine-tuning ensures robust performance. Compared to traditional 2D methods like those in COLMAP, MASt3R's 3D-centric approach excels in environments requiring precise visual localization and mapping.
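
At its core, the matching step reduces to reciprocal (mutual) nearest-neighbor search between the two predicted local feature maps. Below is a brute-force sketch of that criterion in PyTorch; MASt3R's actual fast nearest-neighbor algorithm uses an iterative subsampled search for speed, which this toy version omits.

```python
# Brute-force mutual nearest-neighbor matching between two dense descriptor
# sets (illustrative; MASt3R uses a faster iterative subsampled search).
import torch

def mutual_nn_matches(desc1: torch.Tensor, desc2: torch.Tensor):
    """desc1: (N, D), desc2: (M, D) L2-normalized local descriptors.
    Returns index pairs (i, j) that are each other's nearest neighbor."""
    sim = desc1 @ desc2.T              # (N, M) cosine similarity
    nn12 = sim.argmax(dim=1)           # best match in image 2 for each i
    nn21 = sim.argmax(dim=0)           # best match in image 1 for each j
    i = torch.arange(desc1.shape[0])
    mutual = nn21[nn12] == i           # keep only reciprocal matches
    return i[mutual], nn12[mutual]

# usage: flatten per-pixel descriptors of two views, then match
d1 = torch.nn.functional.normalize(torch.randn(1000, 24), dim=-1)
d2 = torch.nn.functional.normalize(torch.randn(1200, 24), dim=-1)
idx1, idx2 = mutual_nn_matches(d1, d2)
print(idx1.shape)  # number of mutual matches
```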

3. Radiance Fields from Deep-Learning-Based SfM

To validate the geometric feasibility of these in-the-wild, deep-learning-based Structure-from-Motion methods, we show radiance field reconstructions built from VGGSfM and MASt3R outputs. The objective is to compare their performance and understand the advantages and limitations of each approach.

Point Cloud Reconstruction

Point cloud reconstructions from VGGSfM (left) and MASt3R (right) on the NLE, pen, and guitar scenes.

Radiance Fields Reconstruction

Radiance field reconstructions from VGGSfM (left) and MASt3R (right) on the NLE, pen, and guitar scenes.

Summary

MASt3R is not well suited to inverse rendering, but it provides denser and more diverse point cloud reconstructions than VGGSfM. VGGSfM's accurate camera pose estimation, refined by bundle adjustment, makes it the better fit for inverse rendering. Concretely, VGGSfM's camera poses show an angular distance error of less than 0.01 relative to COLMAP, while MASt3R's exceeds 0.1.
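
For reference, the angular distance between an estimated rotation $R_1$ and a reference rotation $R_2$ is commonly measured as the geodesic distance on SO(3) (the post does not spell out its definition, so this is my assumption):

$$
d(R_1, R_2) = \arccos\!\left( \frac{\operatorname{tr}\!\left( R_1^{\top} R_2 \right) - 1}{2} \right)
$$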

  • Both methods are more robust than COLMAP: in my experiments, COLMAP failed to reconstruct every one of the datasets above.

  • Both methods are constrained by VRAM capacity, but VGGSfM copes better: it can reconstruct more than 90 images on a single RTX 4090, whereas MASt3R struggles beyond roughly 30 images.

Further Camera Pose Refinement

As discussed in InstantSplat, MASt3R (and VGGSfM) poses can serve as a good initialization for camera pose optimization during radiance field training (BARF-like methods). Below is a toy experiment combining MASt3R with further camera pose optimization (using Splatfacto).
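
A minimal sketch of the underlying mechanism, assuming initial poses from MASt3R and some differentiable renderer: attach a small learnable SE(3) correction to each initial pose and optimize it with the photometric loss. Here `render` and `gt_images` are hypothetical stand-ins for the radiance-field model (e.g. Splatfacto) and the training views; the real camera optimizer differs in detail.

```python
# BARF-style pose refinement toy: learn small SE(3) corrections on top of
# initial (e.g. MASt3R) poses, driven by the rendering loss. `render` and
# `gt_images` are hypothetical stand-ins, not a real API.
import torch

def apply_delta(R: torch.Tensor, t: torch.Tensor, xi: torch.Tensor):
    """Apply a small correction xi = (w, v) to pose (R, t) using a
    first-order approximation of the SE(3) exponential map (toy version;
    production code uses the closed-form exp map)."""
    w, v = xi[:3], xi[3:]
    zero = torch.zeros((), dtype=xi.dtype)
    skew = torch.stack([
        torch.stack([zero, -w[2], w[1]]),
        torch.stack([w[2], zero, -w[0]]),
        torch.stack([-w[1], w[0], zero]),
    ])
    dR = torch.eye(3) + skew           # exp(skew(w)) ≈ I + skew(w) for small w
    return dR @ R, dR @ t + v

n_views = 30
init_R = [torch.eye(3)] * n_views      # MASt3R rotations would go here
init_t = [torch.zeros(3)] * n_views    # MASt3R translations would go here
deltas = torch.zeros(n_views, 6, requires_grad=True)  # per-camera corrections
optim = torch.optim.Adam([deltas], lr=1e-4)

for step in range(1000):
    i = step % n_views
    R, t = apply_delta(init_R[i], init_t[i], deltas[i])  # refined pose
    pred = render(R, t)                          # hypothetical renderer
    loss = (pred - gt_images[i]).abs().mean()    # photometric L1
    optim.zero_grad()
    loss.backward()
    optim.step()
```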

Closing

This post compared two 3D point cloud reconstruction methods, VGGSfM and MASt3R. VGGSfM excels in camera pose reconstruction, making it more suitable for inverse rendering, while MASt3R provides denser point clouds but is less appropriate for inverse rendering tasks. Both methods offer robustness over COLMAP and can yield better results in certain scenes, although they require further parameter tuning and optimization.

GitHub Repository: [Link]

