Robust Camera Pose Refinement for Multi-Resolution Hash Encoding

ICML 2023
Naver AI Lab, Korea University

Motivation


Multi-resolution hash encoding (Instant-NGP) has recently been proposed to reduce the computational cost of NeRFs. However, we found that a naive gradient-based camera pose refinement severely deteriorates its performance. We show that the oscillating gradient flows of hash encoding interfere with the registration of camera poses, and our method addresses the issue by using smooth interpolation weighting to stabilize the gradient oscillation for ray samples across hash grids.


Method


Multi-Resolution Hash Encoding

  1. A positional encoding that looks up multi-resolution features from hash tables.
  2. Each feature is the tri-linear interpolation of the eight corner entries of a grid cell, with weights proportional to the relative position of the query point within the cell (a minimal sketch follows below).
  3. Back-propagation through the hash entries is blocked by the random hashing, so gradients flow only through the interpolation weights, and these gradients are discontinuous across grid cells.

Pros. faster convergence with better accuracy

Cons. back-propagation through ray-sampled positions is unstable!
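To make the interpolation concrete, below is a minimal single-level sketch in NumPy (not the actual tiny-cuda-nn implementation); the table size, feature dimension, grid resolution, and spatial-hash primes are illustrative assumptions, and the table values are random stand-ins for learned hash entries.

```python
import numpy as np

# Illustrative single-level hash grid (simplified from Instant-NGP);
# T, F, RES, and the hash primes are assumptions for this sketch.
T, F, RES = 2**14, 2, 16                        # table size, feature dim, grid resolution
table = np.random.randn(T, F).astype(np.float32) * 1e-4
PRIMES = [1, 2654435761, 805459861]

def hash_index(corner):
    """Spatial hash of an integer 3D corner into the table."""
    h = 0
    for c, p in zip(corner.tolist(), PRIMES):
        h ^= int(c) * p
    return h % T

def encode(x):
    """Tri-linearly interpolate the 8 hashed corner features around x in [0, 1)^3."""
    xl = x * RES                                # position in grid coordinates
    c0 = np.floor(xl).astype(np.int64)          # lower corner of the enclosing cell
    w = xl - c0                                 # relative position in [0, 1)^3
    feat = np.zeros(F, dtype=np.float32)
    for i in range(8):                          # 2^d corners of the cell
        offset = np.array([(i >> k) & 1 for k in range(3)])
        corner = c0 + offset
        # per-corner weight: product over axes of w_k (offset 1) or 1 - w_k (offset 0)
        weight = np.prod(np.where(offset == 1, w, 1.0 - w))
        feat += weight * table[hash_index(corner)]
    return feat

print(encode(np.array([0.3, 0.7, 0.1])))
```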



Smooth gradients for unstable back-propagation



The Derivative of Multi-Resolution Hash Encoding

The Jacobian of the $l^{\text{th}}$-level hash encoding $\mathbf{h}_{l}(\mathbf{x})$ with respect to the input position $\mathbf{x}$ is:

$$\begin{aligned} \nabla_{\mathbf{x}}\mathbf{h}_{l}(\mathbf{x}) &= \left[ \frac{\partial {\mathbf{h}_{l}}(\mathbf{x})}{\partial {x}_1}, \dots, \frac{\partial {\mathbf{h}_{l}}(\mathbf{x})}{\partial {x}_d} \right] \\ &= \sum_{i=1}^{2^{d}} \mathcal{H}_{l}\big( h_{l} \big( \mathbf{c}_{i,l} (\mathbf{x}) \big) \big) \cdot \left[ \frac{\partial {{w}_{i,l}}(\mathbf{x})}{\partial {x}_1}, \dots, \frac{\partial {{w}_{i,l}}(\mathbf{x})}{\partial {x}_d} \right]. \end{aligned}$$

Let $\bar{i}_k$ be the index of the corner nearest to $\mathbf{c}_{i,l}$ along the $k^{\text{th}}$ axis, so that $\mathbf{c}_{i,l}$ and $\mathbf{c}_{\bar{i}_k,l}$ form an edge of the unit hypercube. For each axis $k$, the $2^d$ corners form $2^{d-1}$ such pairs. Then the paired weights satisfy:

$$ \begin{aligned} \frac{\partial {{w}_{\bar{i}_k,l}}(\mathbf{x})}{\partial {x}_k} = - \frac{\partial {{w}_{i,l}}(\mathbf{x})}{\partial {x}_k}, \end{aligned}$$

which follows from the definition of the interpolation weights, since the relative positions of $\mathbf{x}$ along the $k^{\text{th}}$ axis are complementary for the two paired corners.

Using this relation and an appropriate choice of the indices, the $k^{\text{th}}$ element of the Jacobian $\nabla_{\mathbf{x}}\mathbf{h}_{l}(\mathbf{x})$ can be rewritten as follows:

$$ \begin{aligned} \frac{\partial {\mathbf{h}_{l}}(\mathbf{x})}{\partial {x}_k} &= \sum_{i=1}^{2^{d}} \mathcal{H}_{l}\big( h_{l} \big( \mathbf{c}_{i,l} (\mathbf{x}) \big) \big) \cdot \frac{\partial {{w}_{i,l}}(\mathbf{x})}{\partial {x}_k} \\ &= \sum_{i=1}^{2^{d-1}} \left( \mathcal{H}_{l}\big( h_{l} \big( \mathbf{c}_{i,l} (\mathbf{x}) \big) \big) - \mathcal{H}_{l}\big( h_{l} \big( \mathbf{c}_{\bar{i}_k,l} (\mathbf{x}) \big) \big) \right) \cdot \frac{\partial {{w}_{i,l}}(\mathbf{x})}{\partial {x}_k} \\ &= \sum_{i=1}^{2^{d-1}} \left( \mathcal{H}_{l}\big( h_{l} \big( \mathbf{c}_{i,l} (\mathbf{x}) \big) \big) - \mathcal{H}_{l}\big( h_{l} \big( \mathbf{c}_{\bar{i}_k,l} (\mathbf{x}) \big) \big) \right) \cdot \prod_{j \neq k} \left( 1 - | \mathbf{x}_{l} - \mathbf{c}_{i,l}(\mathbf{x}) |_j \right ), \end{aligned}$$

where $\prod_{j \neq k} \left( 1 - | \mathbf{x}_{l} - \mathbf{c}_{i,l}(\mathbf{x}) |_j \right )$ and the differences between the hash table entries are constant with respect to $x_k$, which makes ${\partial {\mathbf{h}_{l}}(\mathbf{x})} / {\partial {x}_k}$ constant along the $k^{\text{th}}$ axis within the unit hypercube. Notice that the last terms can be seen as weights satisfying:

$$ \begin{aligned} \sum_{i=1}^{2^{d-1}} \prod_{j \neq k} \left( 1 - | \mathbf{x}_{l} - \mathbf{c}_{i,l}(\mathbf{x}) |_j \right ) = 1,\end{aligned}$$

so ${\partial {\mathbf{h}_{l}}(\mathbf{x})} / {\partial {x}_k}$ is a convex combination of the differences between paired hash table entries, and it jumps to a different constant whenever a ray sample crosses a grid boundary.
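To see the resulting gradient oscillation numerically, here is a tiny self-contained 1D analogue (a simplification under stated assumptions: linear instead of tri-linear interpolation, and random values standing in for learned hash entries). The finite-difference derivative is constant inside each cell and equals the difference of the two neighbouring entries, so it jumps to a different constant at every grid boundary.

```python
import numpy as np

# 1D analogue of one hash-encoding level: features at integer grid corners,
# linearly interpolated in between (random features stand in for hash entries).
rng = np.random.default_rng(0)
RES = 8
features = rng.normal(size=RES + 1)            # one scalar feature per grid corner

def h(x):
    """Linear interpolation of the two corner features around x (x in [0, RES))."""
    c0 = int(np.floor(x))
    w = x - c0                                 # relative position in the cell
    return (1.0 - w) * features[c0] + w * features[c0 + 1]

def dh_dx(x, eps=1e-5):
    """Central finite difference of h."""
    return (h(x + eps) - h(x - eps)) / (2 * eps)

# Inside a cell the derivative is constant and equals the entry difference ...
print(dh_dx(2.2), dh_dx(2.8), features[3] - features[2])
# ... but it jumps to a different constant in the next cell (gradient oscillation).
print(dh_dx(3.2), features[4] - features[3])
```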

Therefore, we change the original interpolation to have an infinitely differentiable smooth gradient using the cosine function,
$$\delta(w_{i,l}) = \frac{1-\cos(\pi w_{i,l})}{2}, \quad \nabla_{\mathbf{x}} \delta(w_{i,l}) = \frac{\pi}{2} \sin (\pi w_{i,l}) \cdot \nabla_{\mathbf{x}} w_{i,l},$$

where $w_{i,l}$ is the weight for the $i$-th corner at the $l$-th resolution level.
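A minimal PyTorch sketch of this cosine re-weighting (variable names are illustrative): autograd reproduces the stated derivative, which vanishes at $w_{i,l} = 0$ and $w_{i,l} = 1$, i.e., at the cell faces.

```python
import torch

def smooth_weight(w):
    """Cosine easing of an interpolation weight: delta(w) = (1 - cos(pi * w)) / 2."""
    return 0.5 * (1.0 - torch.cos(torch.pi * w))

w = torch.tensor([0.0, 0.25, 0.5, 1.0], requires_grad=True)
d = smooth_weight(w)
d.sum().backward()

# Autograd gradient matches (pi / 2) * sin(pi * w): zero at the cell faces w = 0 and w = 1.
print(d.detach())
print(w.grad)
print(0.5 * torch.pi * torch.sin(torch.pi * w.detach()))
```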


Straight-through forward function

Furthermore, since the non-linear interpolation can hinder the original performance of Instant-NGP, we propose a mix-up of the tri-linear interpolation and the smooth gradients:

$$\hat{w}_{i, l} = w_{i,l} + \lambda \delta(w_{i,l}) - \lambda \tilde{\delta}(w_{i,l}),$$

where $\lambda$ is a hyper-parameter and $\tilde{\delta}(w_{i,l})$ denotes $\delta(w_{i,l})$ detached from the computational graph. The forward value therefore equals the original tri-linear weight, while the backward pass additionally receives the smooth gradient scaled by $\lambda$.
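A minimal PyTorch sketch of this straight-through trick (again with illustrative names): detaching $\delta(w_{i,l})$ cancels the non-linear term in the forward value, so rendering still uses the tri-linear weights, while the backward pass picks up the $\lambda$-scaled smooth gradient.

```python
import torch

def straight_through_weight(w, lam=1.0):
    """w_hat = w + lam * delta(w) - lam * detach(delta(w)).

    Forward value equals the tri-linear weight w; the backward pass gets an
    extra lam-scaled gradient from the cosine-smoothed delta(w).
    """
    delta = 0.5 * (1.0 - torch.cos(torch.pi * w))
    return w + lam * delta - lam * delta.detach()

w = torch.tensor([0.25, 0.75], requires_grad=True)
w_hat = straight_through_weight(w, lam=1.0)
w_hat.sum().backward()

print(w_hat.detach())   # identical to w: the forward pass is unchanged
print(w.grad)           # 1 + lam * (pi / 2) * sin(pi * w)
```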


Experiments


Visualization of the progress of pose refinements

The red camera denotes the ground-truth pose, and the red line denotes the pose error vector.


Training time per iteration

  • Inherit Instant-NGP's faster convergence
  • Inherit Instant-NGP's better accuracy
  • Improve stability of pose refinement

Quantitative Results

For the Synthetic (Blender) dataset, our reimplementation built on the tiny-cuda-nn and ngp_pl frameworks achieves a score of 31.54, surpassing the paper's reported 29.86 as well as the 28.96 of GARF and the 28.84 of BARF.