TL;DR: In this article, we explore how to apply radiance-field techniques (NeRF and Gaussian Splatting) beyond the photometric domain. I will also share a short tip for quickly developing a viewer for a NeRF / GS project that ships without an official one.
Introduction
Recently, I had the serendipitous opportunity to explore how Neural Rendering could be applied within the medical domain, specifically in Tomography (X-ray), referencing SAX-NeRF.
Unlike photography, which captures the reflection of visible light, tomography constructs images based on signals that have penetrated a substance. By tracing how the authors adapted the standard Neural Rendering setting to Tomography, I was able to deepen my intuition for and comprehension of Neural Rendering.
In this post, I aim to elucidate the foundational intuitions behind Neural Rendering, NeRF, and GS through a personal review of SAX-NeRF. Additionally, I will offer insights into a technique for creating a simple web viewer using viser when an official viewer is unavailable.
1. Tomography vs. Photography
1.1. Photography
- What are the underlying physical principles of the act of 'seeing'?
'Seeing' physically denotes the process where light emitted from a source interacts with an object, causing the object's surface to absorb or reflect specific wavelengths (or energy) of light. The unabsorbed, reflected light then reaches the observer's eye (or detection device), leading to visual perception.
This process encompasses the interaction between light and an object (absorption and reflection) and the subsequent transmission of the interaction's result to the observer's visual system.
Neural Rendering and Novel View Synthesis primarily focus on scenes 'observed within the visible light spectrum'. Consequently, the standard NeRF rendering equation (emission-absorption ray casting) mathematically represents the intuition behind the act of 'seeing'.
$$ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \cdot \sigma(\mathbf{r}(t)) \cdot c(\mathbf{r}(t), \mathbf{d}) \, dt $$
$$ \text{where } T(t) = \exp\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds \right) $$
In this equation, $C(\mathbf{r})$ represents the final accumulated color along a ray. The constituent components of the equation are interpreted as follows:
- $\mathbf{r}(t)$: The point at distance $t$ along the ray of light.
- $\sigma(\mathbf{r}(t))$: The density (opacity) or absorption coefficient at the point $t$ along the ray, indicating the extent of interaction between the ray and the object.
- $c(\mathbf{r}(t), \mathbf{d})$: The color of the light reflected by the object at that point ($\mathbf{d}$ is the viewing direction, included to model view-dependent color).
- $T(t)$: The accumulated transmittance, indicating how much previously encountered matter blocked (opaque) or allowed passage of (transparent) the ray.
Thus, NeRF's rendering equation mathematically formalizes the process of visual object perception. By integrating all elements of the physical light-object interaction (absorption, reflection, transparency, etc.), NeRF calculates the cumulative contribution of light along the ray path to ultimately generate an image.
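In practice, this integral is evaluated with a discrete quadrature over samples along each ray. Below is a minimal illustrative sketch of that accumulation (not any particular implementation), assuming per-sample densities sigmas, colors colors, and sample spacings deltas have already been obtained from the scene model:

import torch

def render_ray_color(sigmas, colors, deltas):
    # Discretized emission-absorption rendering along a single ray.
    # sigmas: (N,) densities; colors: (N, 3); deltas: (N,) sample spacings.
    # alpha_i = 1 - exp(-sigma_i * delta_i): opacity of the i-th interval.
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    # T_i = prod_{j < i} (1 - alpha_j): transmittance accumulated before sample i.
    trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:1]),
                                     1.0 - alphas + 1e-10]), dim=0)[:-1]
    weights = trans * alphas                       # per-sample contribution
    return (weights[:, None] * colors).sum(dim=0)  # accumulated color C(r)

The weights encode exactly the physical story above: a sample contributes to the pixel only if light survives to reach it ($T_i$) and interacts with matter there ($\alpha_i$).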
1.2. Tomography
So, what considerations are necessary for applying Neural Rendering to domains outside the visible light spectrum?
A quintessential example in this domain is Tomography, including X-ray and CT scans. Tomography utilizes X-rays, which have shorter wavelengths (and higher energy) than visible light. Instead of relying on reflection, the process is based on X-rays penetrating the object while their intensity is attenuated by the object's internal density; the internal structure is then reconstructed by analyzing the intensity of the radiation that passes through and reaches the detector.
In Tomography, this penetration and attenuation process is modeled by the Beer-Lambert Law, which is specifically expressed as:
$$ I_\text{gt}(\mathbf{r}) = I_0 \cdot \exp\left( -\int_{t_n}^{t_f} \rho(\mathbf{r}(t)) \, dt \right) $$
- Here, $I_\text{gt}(\mathbf{r})$ is the final intensity measured by the detector, and $I_0$ is the initial intensity value.
- $\rho$ is the radiodensity value, representing the degree to which X-rays are attenuated.
The resemblance of the exponential term is notable. It exhibits an identical form to the Accumulated Transmittance equation used in NeRF modeling. In fact, both terms share the fundamental assumption that the degree of intensity reduction is proportional to both the current light intensity and the opacity at the current point.
The underlying intuition regarding light intensity in this modeling is further elucidated by the following derivation:
- Derivation of the Beer-Lambert Law (which mirrors the derivation of Accumulated Transmittance)
$$ I_\text{gt}(\mathbf{r}) = I_0 \cdot \exp\left( -\int_{t_n}^{t_f} \rho(\mathbf{r}(t)) \, dt \right) $$
$$ \rightarrow \ln(I_\text{gt}(\mathbf{r})) - \ln(I_0) = -\int_{t_n}^{t_f} \rho(\mathbf{r}(t)) \, dt $$
$$ \rightarrow \int_{t_n}^{t_f} \frac{1}{I(\mathbf{r}(t))} \, dI(\mathbf{r}(t)) = -\int_{t_n}^{t_f} \rho(\mathbf{r}(t)) \, dt $$
$$ \rightarrow \frac{1}{I(\mathbf{r}(t))} \, dI(\mathbf{r}(t)) = -\rho(\mathbf{r}(t)) \, dt $$
$$ \therefore \frac{d}{dt} I(t) = -I(t) \cdot \rho(t) $$
The reconstruction of a NeRF model from Tomography data necessitates the substitution of the conventional emission-absorption ray casting with a rendering equation grounded in the Beer-Lambert Law.
The ultimate intensity rendering term is expressed in a discretized form, analogous to that in NeRF, as follows:
$$ I_\text{pred}(\mathbf{r}) = I_0 \cdot \exp\left(-\sum_{i=1}^{N} \rho_i \delta_i\right) $$
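In code, assuming per-sample radiodensities rhos and spacings deltas along a ray (hypothetical names for illustration), this rendering reduces to a single attenuation term:

import torch

def render_ray_intensity(rhos, deltas, I0=1.0):
    # Beer-Lambert: I_pred = I0 * exp(-sum_i rho_i * delta_i).
    # Unlike the photometric case, there is no color term and no per-sample
    # weighting: each point contributes only through its attenuation.
    return I0 * torch.exp(-(rhos * deltas).sum())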
2. Modeling
Returning our focus to NeRF, while variations such as Hash Grid NeRF and TensoRF exist, the fundamental principle of NeRF is to parameterize and represent a 3D scene.
When this parameterization is achieved through an MLP, it is classified within the NeRF family, whereas the utilization of explicit 3D Gaussians or 2D Gaussian surfels aligns it with the Gaussian Splatting methodology.
Conventionally, these parameter models are designed to receive a 3D Cartesian Coordinate $(x,y,z)$ as input, subsequently producing the density and color $(\sigma, c)$ at the specified location.
However, the rendering equation pertinent to Tomography necessitates the radiodensity $\rho$ as opposed to $(\sigma, c)$. Consequently, the application of NeRF to Tomography mandates a distinct modeling strategy compared to that employed in Photography.
Given that the rendering equation for Tomography, adapted based on the Beer-Lambert Law, exhibits sole dependency on radiodensity $\rho$, a Tomography-NeRF model can be architected to accept a 3D Cartesian Coordinate $(x,y,z)$ as input and exclusively generate $\rho$ as the output.
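To make this concrete, below is a minimal, hypothetical sketch of such a density-only field (not SAX-NeRF's actual architecture): a plain MLP with a frequency positional encoding, mapping $(x, y, z)$ to a non-negative $\rho$.

import torch
import torch.nn as nn

class FreqEncoding(nn.Module):
    # NeRF-style sinusoidal positional encoding of 3D coordinates.
    def __init__(self, n_freqs: int = 6):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs) * torch.pi)

    def forward(self, x):                  # x: (..., 3)
        xf = x[..., None] * self.freqs     # (..., 3, n_freqs)
        enc = torch.cat([xf.sin(), xf.cos()], dim=-1)
        return torch.cat([x, enc.flatten(-2)], dim=-1)

class TomographyNeRF(nn.Module):
    # (x, y, z) -> rho: a density-only field, with no color or
    # view-direction head as in the photometric setting.
    def __init__(self, n_freqs: int = 6, hidden: int = 128):
        super().__init__()
        self.enc = FreqEncoding(n_freqs)
        in_dim = 3 + 3 * 2 * n_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):
        # Softplus keeps the predicted radiodensity non-negative.
        return nn.functional.softplus(self.mlp(self.enc(xyz)))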
Beyond this, SAX-NeRF introduces an alternative design: a Transformer architecture in place of an MLP to accommodate X-ray-specific attributes, with a ray-wise locality inductive bias incorporated into the attention mechanism. However, since this aspect is secondary to the central discussion, I will not elaborate further; readers seeking details are encouraged to consult the original publication.
3. Quick Viewer Development Tip
3.1. Viser Viewer
While the code for SAX-NeRF is publicly available, it lacks an official viewer. Consequently, the basic visualization code provided is limited, necessitating the implementation of a viewer for more interactive exploration of the results.
To interactively visualize a NeRF/GS model, three key elements are required:
1) Information about the viewing angle of the 3D scene ($[\mathbf{R|t}]$).
2) Rays generated from that angle.
3) Rendering (applying the rendering equation).
Typically, the implementation of elements 2) and 3) is less challenging, provided that the corresponding code is publicly accessible. However, camera coordinate systems can vary across projects, particularly between OpenGL and OpenCV, which often leads to errors. Therefore, it is crucial to generate rays specifically tailored to the project being implemented.
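For reference, the two conventions differ only in the orientation of the camera axes: OpenGL cameras look along $-z$ with $y$ up, while OpenCV cameras look along $+z$ with $y$ down. Converting a camera-to-world matrix between the two therefore reduces to flipping the $y$ and $z$ columns, as in this small helper:

import numpy as np

def opengl_to_opencv_c2w(c2w: np.ndarray) -> np.ndarray:
    # Flip the camera's y and z axes: (right, up, backward) -> (right, down,
    # forward). The flip is an involution, so the same helper converts back.
    out = c2w.copy()
    out[:3, 1:3] *= -1.0
    return out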
Another challenge lies in implementing 1) from scratch so that the scene can be controlled as desired. Fortunately, viser, developed by the NeRFStudio team, offers a remarkably easy solution. A 2D Gaussian Splatting viewer I developed previously also utilized viser.
Below is a very simple SAX-NeRF viewer implemented using viser. This implementation utilizes the nerfview package, a minimal viewer layer built on top of viser.
from typing import Tuple
import time
import os
import argparse

import numpy as np
import torch

import viser
import nerfview

from src.network import get_network
from src.encoder import get_encoder
from src.config.configloading import load_config
from src.render import render
def normalize(img):
    # Min-max normalize an image to [0, 1] for display.
    max_val = img.max()
    min_val = img.min()
    return (img - min_val) / (max_val - min_val)
@torch.no_grad()
def make_rays(K, c2w, img_wh, dsd=1.5, device='cuda'):
    # Build cone-beam rays for an H x W detector: pixel offsets on the
    # detector plane are divided by the source-to-detector distance (dsd)
    # to obtain ray directions in the camera frame.
    H, W = img_wh
    pose = create_sax_pose_from_camera(c2w, DSO=1.0)
    i, j = torch.meshgrid(torch.linspace(0, W - 1, W, device=device),
                          torch.linspace(0, H - 1, H, device=device), indexing="ij")
    uu = (i.t() + 0.5 - W / 2) * 0.001
    vv = (j.t() + 0.5 - H / 2) * 0.001
    dirs = torch.stack([uu / dsd, vv / dsd, torch.ones_like(uu)], -1)
    rays_d = torch.sum(torch.matmul(pose[:3, :3], dirs[..., None]).to(device), -1)
    rays_o = pose[:3, -1].expand(rays_d.shape)
    # Pack each ray as (origin, direction, near, far); 0.904 / 1.1 are the
    # near / far sampling bounds passed to the SAX-NeRF renderer.
    rays = torch.cat([rays_o, rays_d,
                      torch.full_like(rays_o[..., :1], 0.904),
                      torch.full_like(rays_o[..., :1], 1.1)], dim=-1)
    return rays.reshape(-1, 8)
@torch.no_grad()
def create_sax_pose_from_camera(c2w: torch.Tensor, DSO: float):
    # Map the viser camera orientation to a SAX-NeRF acquisition pose:
    # only the azimuth of the viewing direction is kept, and the X-ray
    # source is placed on a circle of radius DSO (distance source-origin).
    forward_vector = c2w[:3, 2]
    angle = np.arctan2(forward_vector[1], forward_vector[0])
    phi1 = -np.pi / 2
    R1 = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(phi1), -np.sin(phi1)],
                   [0.0, np.sin(phi1), np.cos(phi1)]])
    phi2 = np.pi / 2
    R2 = np.array([[np.cos(phi2), -np.sin(phi2), 0.0],
                   [np.sin(phi2), np.cos(phi2), 0.0],
                   [0.0, 0.0, 1.0]])
    R3 = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
    rot = np.dot(np.dot(R3, R2), R1)
    trans = np.array([DSO * np.cos(angle), DSO * np.sin(angle), 0])
    pose = np.eye(4)
    pose[:-1, :-1] = rot
    pose[:-1, -1] = trans
    return torch.tensor(pose, dtype=torch.float32, device='cuda')
class NerfViewer:
    def __init__(self, args):
        self.args = args
        os.environ["CUDA_DEVICE_ORDER"] = 'PCI_BUS_ID'
        os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu_id
        self.cfg = load_config(args.config)
        self.device = torch.device("cuda")
        self.dsd_value = 1.5  # source-to-detector distance, adjustable in the GUI

        # Build the SAX-NeRF network (coarse + optional fine model),
        # mirroring the setup in the official test script.
        self.network = get_network(self.cfg["network"]["net_type"])
        self.cfg["network"].pop("net_type", None)
        self.encoder = get_encoder(**self.cfg["encoder"])
        self.model = self.network(self.encoder, **self.cfg["network"]).to(self.device)
        self.model_fine = None
        n_fine = self.cfg["render"]["n_fine"]
        if n_fine > 0:
            self.model_fine = self.network(self.encoder, **self.cfg["network"]).to(self.device)

        # Load pretrained weights.
        ckpt = torch.load(args.weights)
        print(f"Loaded checkpoint from epoch {ckpt['epoch']}")
        self.model.load_state_dict(ckpt["network"])
        if n_fine > 0:
            self.model_fine.load_state_dict(ckpt["network_fine"])
        self.model.eval()
        if self.model_fine is not None:
            self.model_fine.eval()

        self.render_W = args.size
        self.render_H = args.size
    @torch.no_grad()
    def render_fn(self, camera_state: nerfview.CameraState, img_wh: Tuple[int, int]) -> np.ndarray:
        # Called by nerfview whenever the camera moves; renders at a fixed
        # internal resolution, then resizes to the requested viewport size.
        W, H = img_wh
        render_img_wh = (self.render_W, self.render_H)
        c2w = camera_state.c2w
        K = camera_state.get_K(img_wh)
        rays = make_rays(K, c2w, render_img_wh, self.dsd_value, device=self.device)

        # Render in chunks to bound GPU memory usage.
        chunk_size = 1048576
        num_rays = rays.shape[0]
        all_imgs = []
        for i in range(0, num_rays, chunk_size):
            rays_chunk = rays[i:min(i + chunk_size, num_rays)]
            rendered_chunk = render(rays_chunk, self.model, self.model_fine, **self.cfg["render"])["acc"]
            all_imgs.append(rendered_chunk)

        # Assemble the grayscale attenuation image, upsample it to the
        # viewport size, and map it to uint8 for display.
        img = torch.cat(all_imgs, dim=0).reshape(self.render_H, self.render_W, 1)
        img = img.repeat(1, 1, 3)
        img = torch.nn.functional.interpolate(img.unsqueeze(0).permute(0, 3, 1, 2),
                                              size=(H, W), mode='bilinear',
                                              align_corners=False).squeeze(0).permute(1, 2, 0)
        img = (normalize(img.cpu().numpy()) * 255).astype(np.uint8)
        return img

    def update_dsd(self, value):
        self.dsd_value = value
def config_parser():
    cat = 'foot'
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu_id", default="0", help="gpu to use")
    parser.add_argument("--method", default="Lineformer", help="name of the tested method")
    parser.add_argument("--category", default=cat, help="category of the tested scene")
    parser.add_argument("--config", default=f"config/Lineformer/{cat}_50.yaml", help="path to config file")
    parser.add_argument("--weights", default=f"pretrained/{cat}.tar", help="path to the pretrained weights")
    parser.add_argument("--output_path", default="output", help="path to the output folder")
    parser.add_argument("--vis", default="True", help="visualization or not?")
    parser.add_argument("--size", type=int, default=256, help="internal render resolution")
    return parser
def main(args):
    nerf_viewer = NerfViewer(args)
    with torch.no_grad():
        # Start the viser server and register the render callback.
        server = viser.ViserServer(verbose=True, port=9123)
        _ = nerfview.Viewer(server=server, render_fn=nerf_viewer.render_fn, mode='rendering')

        # GUI slider to adjust the source-to-detector distance interactively.
        dsd_val = server.add_gui_slider(
            "DSD",
            min=0.1,
            max=2.0,
            step=0.05,
            initial_value=nerf_viewer.dsd_value,
        )

        @dsd_val.on_update
        def _(_) -> None:
            nerf_viewer.update_dsd(dsd_val.value)

        # Keep the main thread alive while the server runs.
        while True:
            time.sleep(1.0)


if __name__ == "__main__":
    parser = config_parser()
    args = parser.parse_args()
    main(args)
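Assuming the script is saved as, e.g., viewer.py in the SAX-NeRF repository root (so the src.* imports resolve), it can be launched with python viewer.py --config config/Lineformer/foot_50.yaml --weights pretrained/foot.tar, after which the viewer is served at http://localhost:9123.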
Captured Results:
As demonstrated, a viewer for a new NeRF / GS model can be quickly created with minimal coding. I initially considered uploading this to GitHub, but due to its simplicity, I will only include it in this blog post.
3.2. Marching Cube Extraction
Another method for interactively examining a NeRF / GS scene is to convert its scalar field into a polygonal mesh using algorithms such as marching cubes.
However, it is important to note that standard NeRF or GS models, which are not specifically designed for surface reconstruction like 2D GS or SDF, may not align perfectly with conventional mesh conversion algorithms. The density field obtained from NeRF or GS represents a volumetric density rather than a clear surface. Therefore, the extracted mesh might appear noisy or require careful tuning of the isosurface value.
This should be considered as a supplementary approach for gaining a rough geometric understanding of the scene.
Below is an example code snippet for generating a simple mesh from SAX-NeRF using scikit-image:
import numpy as np
import torch
import trimesh
from skimage import measure

from src.config.configloading import load_config
from src.network import get_network
from src.encoder import get_encoder
from src.render import render, run_network

# init model
'''
load the SAX-NeRF model (model, model_fine) and netchunk here;
see: test.py in SAX-NeRF
'''

# Build a voxel grid of query coordinates, centered on the origin and
# scaled to the coordinate range used during training.
voxel_size = 128
x, y, z = np.mgrid[:voxel_size, :voxel_size, :voxel_size]
x = (x - (voxel_size - 1) / 2) / (voxel_size / 2)
y = (y - (voxel_size - 1) / 2) / (voxel_size / 2)
z = (z - (voxel_size - 1) / 2) / (voxel_size / 2)
voxel = np.stack([x, y, z], axis=-1)
voxel /= 16
voxel = torch.tensor(voxel, dtype=torch.float32, device="cuda")

# Query the radiodensity field, then run marching cubes at a hand-tuned
# isosurface level (a noisy field may require a different threshold).
threshold = 0.4
voxel_estimated = run_network(voxel, model_fine if model_fine is not None else model, netchunk)
voxel_estimated = voxel_estimated.squeeze(dim=-1).cpu().numpy()
verts, faces, _, _ = measure.marching_cubes(voxel_estimated, level=threshold)
mesh = trimesh.Trimesh(vertices=verts, faces=faces)
mesh.export("output.obj")
Concluding Remarks
Through the exploration of SAX-NeRF, we've seen how the foundational principles of Neural Rendering can be adapted beyond the realm of traditional photography. By understanding the underlying physical phenomena, such as the Beer-Lambert Law in tomography, we can modify the rendering equation and model architecture to suit different imaging modalities. This adaptation underscores the abstract nature of Neural Rendering and its potential applicability across various scientific domains.
Ultimately, the ability to apply Neural Rendering to diverse data types and the ease with which we can now visualize these results opens up exciting possibilities for future research and applications across various fields.
You may also like,
- Instant-NGP Review & Re-Implementation
- Why Positional Encoding Makes NeRF more Powerful
- A Comprehensive Analysis of Gaussian Splatting Rasterization