Why Positional Encoding Makes NeRF more Powerful


Posted by Hwan Heo on November 28, 2021

TL;DR

In this article, we explore why positional encoding improves NeRF's high-fidelity reconstruction ability by examining the paper: Fourier Features Let Networks Learn High-Frequency Functions in Low-Dimensional Domains.

By leveraging Neural Tangent Kernel (NTK) theory, the authors demonstrate that Fourier features improve the convergence and performance of neural networks on such low-dimensional regression tasks.

1. Introduction

Fourier-featuring is a mapping that embeds a point from coordinate space into frequency space.

A prominent example in deep learning is 'Positional Encoding', which uses sinusoidal functions to embed coordinates into frequency space, thereby injecting positional information that the network cannot otherwise capture.

Figure 1. Coordinate-based MLPs

Building upon NTK theory, this article focuses on a theoretical investigation of how neural networks process coordinate information through Fourier-featuring, especially for coordinate-based MLPs, which map dense, continuous, low-dimensional inputs to high-dimensional outputs (e.g., NeRF).

2. Background


2.1. Kernel Trick

Figure 2. Illustration of the kernel trick

For linearly inseparable data points $x$, let $\phi$ be a non-linear mapping such that $\phi(x)$ becomes linearly separable.

The kernel trick performs kernel regression without explicitly finding the feature map by defining the kernel as follows:

$$K(x, \ x') = \phi(x) ^T \phi(x') $$

This approach can be interpreted as implicitly using a feature map $\phi$ with desirable properties through the kernel, instead of explicitly mapping the input $x$ and then taking the inner product.
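As a minimal illustration (my own toy example, not from the paper), the sketch below compares an explicit quadratic feature map with the equivalent polynomial kernel $K(x, x') = (x^T x')^2$; both routes give the same feature-space inner product:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

def poly_kernel(x, y):
    """Equivalent kernel K(x, y) = (x^T y)^2; no explicit feature map needed."""
    return float(np.dot(x, y)) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])

# Both routes compute the same inner product in feature space.
print(phi(x) @ phi(y))    # 1.0
print(poly_kernel(x, y))  # 1.0
```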

2.2. Neural Tangent Kernel

Neural Tangent Kernel (NTK) theory describes the gradient-descent training of infinitely wide deep neural networks as kernel regression, i.e., it explains neural network training through the kernel trick.

Linearization of NN Training & Kernel

Around its initialization $w_0$, a neural network can be approximated by the first-order Taylor expansion (linearization):

$$f(w, \ x) \simeq f(w_0 , \ x) \ + \ \nabla _w f( w_0 , \ x) ^T (w - w_0 ) $$

This Taylor expansion has the following properties:

  1. It is linear with respect to the weights $w$.
  2. It is non-linear with respect to $x$.

In this linearization, the gradient $\nabla _w f( w_0 , \ x)$ plays the role of a feature map $\phi(x)$: the model is linear in $(w - w_0)$, with features given by the parameter gradient at initialization.

The corresponding kernel $K$ is defined as follows:

$$K(x, \ x') = h_\text{NTK}(x, \ x') = \langle \phi(x) , \ \phi (x') \rangle = \nabla _w f(w_0 , \ x) ^T \ \nabla _w f(w_0, \ x' ) $$
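As a rough numerical sketch of this definition (my own toy example, not code from the paper), the empirical NTK of a tiny one-hidden-layer network can be computed directly from hand-written parameter gradients:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer network f(w, x) = w2^T tanh(W1 x) with scalar output.
d_in, d_hidden = 2, 64
W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
w2 = rng.normal(size=d_hidden) / np.sqrt(d_hidden)

def grad_params(x):
    """Gradient of the scalar output w.r.t. all parameters (W1, w2), flattened.
    This plays the role of the feature map phi(x) = grad_w f(w0, x)."""
    h = np.tanh(W1 @ x)
    dW1 = np.outer(w2 * (1.0 - h**2), x)   # d f / d W1
    dw2 = h                                # d f / d w2
    return np.concatenate([dW1.ravel(), dw2])

def ntk(x, x_prime):
    """Empirical NTK at initialization: inner product of parameter gradients."""
    return grad_params(x) @ grad_params(x_prime)

x, x_prime = rng.normal(size=d_in), rng.normal(size=d_in)
print(ntk(x, x_prime))
```

In NTK theory this kernel stays (approximately) fixed during training as the network width grows.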

Gradient-Based Training & Kernel Regression

The NTK arises naturally from gradient-descent training of the network. For timestep $t$, the gradient descent update is:

$$w(t+1) \ = \ w(t) - \eta \nabla _w l. $$

Dividing by the learning rate and taking the continuous-time limit gives the gradient flow:

$$ { w(t+1) \ - \ w(t) \over \eta } = -\nabla _w l \simeq {dw \over dt}.$$

With least squares (MSE) as the loss function,

$$l(w) = {1 \over 2} \| f(w, x ) - y \|^2,$$

the gradient of the loss with respect to $w$ is

$$ \nabla _w l = \nabla _w f(w, x) \, (f(w, x) - y ). $$

Therefore, neural network training via gradient descent can be represented as NTK kernel regression:

$$\begin{aligned} {d \over dt } y(w) &= \nabla _w f(w, x) ^T \cdot {d \over dt }w \\ &= - \nabla _w f(w, x) ^T \cdot \nabla _w f(w, x) (f(w, x) - y) \\ &= -h_{\text{NTK}} (f(w,x) -y ) \end{aligned}$$

Let $u = y(w) - y$ denote the output residual, where $y(w) \equiv f(w, x)$ is the network prediction on the training data; then at training iteration $t$:

$$ u(t) = u(0) \exp (-\eta h_{\text{NTK}} t ) $$
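This closed form can be checked numerically. Below is a minimal sketch assuming a linear-in-features model, so that the feature map, and hence the NTK, is exactly constant during training; the gradient-descent residual then matches $u(0)\, e^{-\eta \mathbf K t}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear-in-features model f(w, x) = phi(x)^T w, so the feature map (and hence
# the NTK, K = Phi Phi^T) is exactly constant during training.
n, p = 10, 50
Phi = rng.normal(size=(n, p)) / np.sqrt(p)   # features of the n training points
y = rng.normal(size=n)
K = Phi @ Phi.T                              # NTK / Gram matrix

eta, steps = 0.01, 200
w = np.zeros(p)
for _ in range(steps):
    residual = Phi @ w - y                   # u = f(w, x) - y
    w -= eta * Phi.T @ residual              # gradient step on 0.5 * ||f - y||^2

# Closed-form residual u(t) = exp(-eta * K * t) u(0), with u(0) = -y since w(0) = 0.
lam, Q = np.linalg.eigh(K)
u_closed = Q @ (np.exp(-eta * lam * steps) * (Q.T @ (-y)))

# The gap is tiny: discrete gradient descent follows the NTK kernel dynamics.
print(np.abs((Phi @ w - y) - u_closed).max())
```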

2.3. Spectral Bias of DNNs

Based on the NTK approximation, the network's prediction after $t$ iterations for test data $\mathbf X_\text{test}$ is:

$$ \hat{\mathbf{y}}^{(t)} \simeq \mathbf{K}_{\text{test}} \mathbf{K}^{-1} ( \mathbf{I} - e^{-\eta \mathbf{K} t} ) \mathbf{y}$$

For ideal training, $\mathbf K_\text{test} = \mathbf K$, i.e., the expression reduces to the residual dynamics derived at the end of Section 2.2.

By eigendecomposing $\mathbf K = \mathbf Q \mathbf \Lambda \mathbf Q^{\rm T}$, we obtain:

$$\begin{aligned} \mathbf{Q}^{\rm T} (\hat{\mathbf{y}}^{(t)} - \mathbf{y}) &\simeq \mathbf{Q}^{\rm T} \big( \mathbf{K}_{\text{test}} \mathbf{K}^{-1} ( \mathbf{I} - e ^{-\eta \mathbf{K} t} ) \mathbf{y} - \mathbf{y} \big) \\ & \simeq \mathbf{Q}^{\rm T} \big( ( \mathbf{I} - e ^{-\eta \mathbf{K} t} ) \mathbf{y} - \mathbf{y} \big) \\ & \simeq - e ^{-\eta \mathbf{\Lambda} t} \mathbf{Q}^{\rm T} \mathbf y \quad (\because e ^{-\eta \mathbf{K} t} = \mathbf{Q} e ^{-\eta \mathbf{\Lambda} t} \mathbf{Q}^{\rm T} ) \end{aligned}$$

In the above equation, each component of the error decays at a rate set by the corresponding eigenvalue: components associated with larger eigenvalues decay faster and are therefore learned first.

For example, for images, the large eigenvalues (in the spectral domain) correspond to coarse, low-frequency structure such as overall contours, so without a frequency embedding of the input, NeRF converges slowly on high-frequency components.
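The eigenvalue-dependent decay can be illustrated with a toy stationary kernel on a 1-D grid (an RBF stand-in, not the actual NTK); its leading eigenvectors are roughly low-frequency Fourier modes, and only those are learned within a modest number of iterations:

```python
import numpy as np

# A smooth stationary kernel on a uniform 1-D grid (an RBF stand-in for the NTK).
# Its leading eigenvectors are roughly low-frequency Fourier modes.
n = 128
x = np.linspace(0.0, 1.0, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1 ** 2))

lam = np.linalg.eigvalsh(K)[::-1]       # eigenvalues, largest first
lam = np.maximum(lam, 0.0)              # clip tiny negative values from round-off

eta, t = 1e-3, 200
decay = np.exp(-eta * lam * t)          # per-mode remaining error factor after t steps

# Large-eigenvalue (low-frequency) modes are almost fully learned, while
# small-eigenvalue (high-frequency) modes have barely moved.
for i in (0, 1, 5, 20, 60):
    print(f"mode {i:3d}: eigenvalue = {lam[i]:10.4g}, remaining error = {decay[i]:.3f}")
```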

3. Fourier Features for a Tunable Stationary Neural Tangent Kernel

This section explores how embedding the input with Fourier features, and the stationary kernel this induces, addresses the slow convergence of high-frequency components.

3.1. Fourier-Featuring

The Fourier-Feature mapping function $\gamma$ is defined as:

$$\gamma (v) \ = \ \big [a_1 \cos (2 \pi b_1 ^T v), \ a_1 \sin (2 \pi b_1 ^T v), \ \dots , \ a_m \cos (2 \pi b_m ^T v), \ a_m \sin (2 \pi b_m ^T v ) \big ]^T $$
  • Positional Encoding in Transformers: adds positional information to token features in attention-based architectures, defined (approximately, in this notation) as: $a_i =1, \ b_i = 1/10000^{2i / d}$, where $d$ is the embedding dimension
  • Positional Encoding in NeRF: distributes the input evenly over low and high frequencies, defined as: $a_i =1, \ b_i = 2^{i}$ (sketched in the code below)
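A minimal NumPy sketch of the NeRF-style mapping above ($a_j = 1$, $b_j = 2^j$, applied per coordinate; the released NeRF code differs in minor details such as frequency scaling):

```python
import numpy as np

def positional_encoding(v, num_freqs=10):
    """NeRF-style Fourier features with a_j = 1, b_j = 2**j (applied per coordinate).
    v: (..., d) array of coordinates; returns (..., 2 * num_freqs * d) features."""
    freqs = 2.0 ** np.arange(num_freqs)               # 1, 2, 4, ..., 2^(num_freqs - 1)
    angles = 2.0 * np.pi * v[..., None] * freqs       # (..., d, num_freqs)
    feats = np.concatenate([np.cos(angles), np.sin(angles)], axis=-1)
    return feats.reshape(*v.shape[:-1], -1)

xyz = np.array([[0.1, -0.4, 0.7]])                    # one 3-D sample point
print(positional_encoding(xyz).shape)                 # (1, 60)
```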

The kernel induced by this mapping function is:

$$\begin{aligned} K (\gamma (v_1 ) , \ \gamma (v_2) ) &= \gamma (v_1 ) ^T \gamma (v_2) \\ &= \sum _{j=1}^m a^2 _j \cos (2 \pi b_j ^T (v_1 -v_2) ) = h_\gamma (v_1 - v_2 ) \end{aligned}$$
  • remember: $\cos (\alpha - \beta ) = \cos \alpha \cos \beta \ + \ \sin \alpha \sin \beta$

This Fourier-feature kernel is a stationary function, meaning it is translation-invariant:

$$h_\gamma( (v_1 +k ) - (v_2 +k ) ) = h_\gamma (v_1 - v_2 ) $$
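Both the induced-kernel identity and the translation invariance are easy to verify numerically; the sketch below uses random frequency vectors $b_j$ purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 3, 16
a = np.ones(m)                         # amplitudes a_j
B = rng.normal(size=(m, d))            # frequency vectors b_j (rows)

def gamma(v):
    """Fourier-feature map gamma(v) = [a_j cos(2 pi b_j^T v), a_j sin(2 pi b_j^T v)]_j."""
    angles = 2.0 * np.pi * (B @ v)
    return np.concatenate([a * np.cos(angles), a * np.sin(angles)])

def h_gamma(delta):
    """Induced stationary kernel h_gamma(v1 - v2) = sum_j a_j^2 cos(2 pi b_j^T (v1 - v2))."""
    return np.sum(a ** 2 * np.cos(2.0 * np.pi * (B @ delta)))

v1, v2, k = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

print(np.isclose(gamma(v1) @ gamma(v2), h_gamma(v1 - v2)))        # True: induced kernel
print(np.isclose(h_gamma((v1 + k) - (v2 + k)), h_gamma(v1 - v2))) # True: shift-invariance
```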

Coordinate-based MLPs take dense, uniformly distributed coordinate points as input. To ensure consistent performance across the whole scene, features should be extracted isotropically, i.e., in every direction rather than only specific ones.

This is why a stationary, location-invariant kernel can improve performance: positional encoding treats every pair of points at equal distance in the coordinate system uniformly, regardless of their absolute positions, enabling effective reconstruction of the high-dimensional output.

3.2. NTK Kernel with Fourier-Featuring

Applying the NTK to Fourier-featured inputs yields the composed kernel:

$$K( \phi \circ \gamma (x) , \ \phi \circ \gamma (y) ) $$

Kernel regression with this stationary kernel is equivalent to reconstruction by convolutional filtering: the neural network approximates the convolution of the composed kernel of $h_\text{NTK}$ and $h_\gamma$ with a weighted sum over the data points $v_i$ with weights $w_i$.

Thus, in NTK terms, the function learned from Fourier-featured input can be written as:

$$f = ( h_\text{NTK} \circ h_\gamma ) * \sum_{i=1}^n w_i \delta _{v_i}$$

where $\delta_{v_i}$ is the Dirac delta centered at $v_i$.

This expression indicates:

  1. The stationary filter $h_\gamma$ extracts information in a location-invariant manner.
  2. Since convolution is the inverse Fourier transform of multiplication in frequency space, the specific frequencies embedded in $h_\gamma$ determine which frequency components can be extracted, still in a location-invariant way.
  3. A neural network receiving Fourier-featured input is therefore equivalent to kernel regression with the composition of the NTK and a stationary kernel (see the sketch below).
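As a deliberately simplified numerical sketch of this picture (my own example: dense integer frequencies $b_j = j$ and the NTK factor dropped, so it is kernel regression with $h_\gamma$ alone): regression on Fourier-featured coordinates recovers a high-frequency 1-D signal that regression on raw coordinates cannot.

```python
import numpy as np

# Ridge regression on gamma(v) = kernel regression with the induced kernel
# h_gamma; dense integer frequencies b_j = j are used purely for illustration.
def gamma(x, m=32):
    ang = 2.0 * np.pi * x[:, None] * np.arange(1, m + 1)
    return np.concatenate([np.cos(ang), np.sin(ang)], axis=1)

def ridge_fit_predict(feat_fn, x_tr, y_tr, x_te, lam=1e-8):
    """Fit ridge regression on features of x_tr and predict at x_te."""
    F_tr, F_te = feat_fn(x_tr), feat_fn(x_te)
    A = F_tr.T @ F_tr + lam * np.eye(F_tr.shape[1])
    w = np.linalg.solve(A, F_tr.T @ y_tr)
    return F_te @ w

x_tr = np.arange(64) / 64                       # uniform training grid on [0, 1)
x_te = np.arange(256) / 256
target = lambda x: np.sin(2 * np.pi * 12 * x)   # high-frequency 1-D signal

raw = ridge_fit_predict(lambda x: np.stack([np.ones_like(x), x], axis=1),
                        x_tr, target(x_tr), x_te)
ff = ridge_fit_predict(gamma, x_tr, target(x_tr), x_te)

print("raw coordinates, test MSE: ", np.mean((raw - target(x_te)) ** 2))  # ~0.5
print("fourier features, test MSE:", np.mean((ff - target(x_te)) ** 2))   # ~0
```

On this toy problem, the raw-coordinate model reduces to a nearly useless linear fit, while the Fourier-featured model recovers the signal almost exactly, mirroring the gain NeRF obtains from positional encoding.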
