Under the 3D: Geometrically Accurate 2D Gaussian Splatting

In an earlier post, I wrote that NeRF and other radiance-field techniques were not yet ready for practical use.

3D Gaussian Splatting (3D GS) in particular is easier than NeRF to port to game, graphics, and animation engines, but it still struggles with mesh reconstruction. Because 3D GS is essentially a variant of a point-cloud representation, converting it into a mesh is even harder than it is for NeRF.

Among the papers presented at SIGGRAPH 2024, "2D Gaussian Splatting for Geometrically Accurate Radiance Fields" stands out: it is a splatting-based method whose meshes are good enough to actually use, so I am reviewing a paper for the first time in a while. Let's take a look at what 2D GS is!

3D Gaussian Splatting is a technique that reconstructs a 3D scene with a set of anisotropic, explicit primitives: 3D Gaussians.

For ease of training, the original authors of 3D GS parameterize the covariance matrix of each 3D Gaussian by a rotation matrix $R$ and a scale matrix $S$, and define the density function at a point $p$ in space from that covariance.
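As a minimal sketch of this definition, assuming the standard form from the 3D GS paper, $\Sigma = R S S^{\rm T} R^{\rm T}$ and $G(p) = \exp\left(-\frac{1}{2}(p-\mu)^{\rm T}\Sigma^{-1}(p-\mu)\right)$, the density can be evaluated without inverting $\Sigma$: since $\Sigma^{-1} = R S^{-2} R^{\rm T}$, it is enough to rotate the offset into the Gaussian's local frame and divide by the scales. The center symbol $\mu$ and the function names below are illustrative, not taken from any official code.

```python
import math

def rotation_z(angle):
    """Rotation matrix about the z axis (columns are the Gaussian's axes)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def covariance(R, scales):
    """Sigma = R S S^T R^T with S = diag(scales)."""
    return [[sum(R[i][k] * scales[k] ** 2 * R[j][k] for k in range(3))
             for j in range(3)] for i in range(3)]

def gaussian_density(p, mu, R, scales):
    """G(p) = exp(-0.5 * (p - mu)^T Sigma^-1 (p - mu))."""
    d = [p[i] - mu[i] for i in range(3)]
    # local = R^T d: the offset expressed along the Gaussian's own axes
    local = [sum(R[i][k] * d[i] for i in range(3)) for k in range(3)]
    q = sum((local[k] / scales[k]) ** 2 for k in range(3))
    return math.exp(-0.5 * q)

# An anisotropic Gaussian whose longest axis (scale 2) is rotated onto world y.
R = rotation_z(math.pi / 2)
scales = (2.0, 1.0, 0.5)
sigma = covariance(R, scales)
one_sigma = gaussian_density((0.0, 2.0, 0.0), (0.0, 0.0, 0.0), R, scales)
```

Here `one_sigma` is the density one standard deviation along the long axis, so it equals $e^{-1/2}$.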

By definition, 3D GS resembles dense point cloud reconstruction, but it reconstructs space as an explicit radiance field for novel view synthesis. To deal with sparsity, it adopts a refinement strategy: at fixed iteration intervals, it uses the gradients of the differentiable 3D Gaussians to densify and prune them.

This explicit representation offers several advantages:

Speed
Unlike NeRF, which is implicit and must query a multi-layer perceptron (MLP) to obtain information, 3D GS stores the parameters of its 3D Gaussians explicitly, so it can render a scene in real time (over 100 fps) without any MLP query.

Portability
Only the 3D GS rasterization needs to be implemented, so porting to game and graphics engines is much easier than with NeRF. Web viewers are also far simpler to build. (cf. SuperSplat)

Editability
In a trained 3D GS scene you can select and erase specific floaters, delete or keep only part of the scene, or merge it with another 3D GS scene. Such edits are much harder with NeRF because of its reliance on an MLP.

Despite these advantages, one of the best-known drawbacks of 3D GS is that surface reconstruction is difficult. The 2D GS paper gives four reasons why:

Difficulty in Learning Thin Surfaces
The volumetric radiance representation of 3D GS, which learns a three-dimensional scale, struggles to represent thin surfaces.
Absence of Surface Normals
Without surface normals, high-quality surface reconstruction is out of reach. (INN-based methods address this with Signed Distance Functions (SDF) and similar tools.)
Lack of Multi-View Consistency
Rasterizing a 3D Gaussian yields a different 2D intersection surface from each viewpoint, which shows up as artifacts.
Inaccurate Affine Projection
The affine approximation used to project 3D Gaussians into screen space, $\Sigma' = JW\Sigma W^{\rm T}J^{\rm T}$, loses perspective accuracy away from the Gaussian center, which often leads to noisy reconstructions.

cf. In short, the affine projection with the Jacobian $J$ is a first-order Taylor approximation, so the projection error grows as you move away from the center point.

Additionally, although the paper does not mention it, 3D GS also struggles with mesh reconstruction. Like NeRF, it represents volume through accumulated opacity, so methods such as Marching Cubes or Poisson Reconstruction rarely produce a high-quality mesh.
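The first-order Taylor point made for the affine projection can be checked numerically. The sketch below is a simplified one-dimensional stand-in, not the 3D GS rasterizer: it compares the exact perspective projection $f\,x/z$ with its linearization around a center $(x_0, z_0)$ and measures the error at growing offsets. The function names and the chosen center are assumptions made for illustration.

```python
def project(x, z, f=1.0):
    """Exact pinhole projection of a camera-space point onto the image axis."""
    return f * x / z

def project_affine(x, z, x0, z0, f=1.0):
    """First-order Taylor expansion of project() around the center (x0, z0)."""
    j_x = f / z0                 # d(f*x/z)/dx at the center
    j_z = -f * x0 / (z0 * z0)    # d(f*x/z)/dz at the center
    return project(x0, z0, f) + j_x * (x - x0) + j_z * (z - z0)

def affine_error(offset, x0=1.0, z0=2.0):
    """Absolute projection error at a point displaced from the center."""
    x, z = x0 + offset, z0 + offset
    return abs(project(x, z) - project_affine(x, z, x0, z0))

# The error is zero at the center and grows as the point moves away from it.
errors = [affine_error(d) for d in (0.0, 0.1, 0.5, 1.0)]
```

For this particular center the error works out to $d^2 / (4(2+d))$ for an offset $d$, i.e. it grows roughly quadratically near the center, which is why large Gaussians and points far from the center suffer most.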

One previous work tackles these surface reconstruction difficulties from the surface-normal angle (the second problem above): SuGaR (Surface-Aligned Gaussian Splatting), which the 2D GS paper introduces as concurrent work.

SuGaR's core idea is the following assumption:

For well-trained 3D Gaussians, the axis with the shortest scale is parallel to the surface normal.

Under this assumption, the 3D Gaussian density defined earlier can be replaced by a simpler approximation, and SuGaR uses this constraint as a regularizer that makes the 3D Gaussians surface-aligned.

However, SuGaR is a two-stage method that first trains 3D GS and then refines it, so training is complicated. It also does not fix the projection inaccuracy that was one cause of the surface reconstruction difficulty. In practice, when I applied SuGaR to custom scenes, it often failed to produce meshes with geometry as clean as I wanted.

The approach of 2D Gaussian Splatting (2D GS) essentially reverses the intuition behind SuGaR: instead of flattening 3D Gaussians to align them with surfaces (SuGaR), it learns a scene composed of Gaussians that are flat from the start, i.e., 2D Gaussians, also known as surfels.

The rotation matrix $R$ and scale matrix $S$ to be learned for 2D GS are defined accordingly.
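As a sketch of that parameterization, assuming the form used in the 2D GS paper: a splat has a center $\mathbf{p}_k$, two tangent vectors $\mathbf{t}_u, \mathbf{t}_v$, and two scales $s_u, s_v$. The normal is $\mathbf{t}_w = \mathbf{t}_u \times \mathbf{t}_v$, so $R = [\mathbf{t}_u, \mathbf{t}_v, \mathbf{t}_w]$ and $S = \mathrm{diag}(s_u, s_v, 0)$. A point on the splat is $P(u,v) = \mathbf{p}_k + s_u \mathbf{t}_u u + s_v \mathbf{t}_v v$, and the Gaussian value is $\exp(-(u^2+v^2)/2)$. The function names below are illustrative only.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def splat_normal(t_u, t_v):
    """Third axis of R: not learned, derived from the two tangents."""
    return cross(t_u, t_v)

def splat_point(center, t_u, t_v, s_u, s_v, u, v):
    """P(u, v): world-space point at local splat coordinates (u, v)."""
    return [center[i] + s_u * t_u[i] * u + s_v * t_v[i] * v for i in range(3)]

def gaussian_2d(u, v):
    """Value of the 2D Gaussian at local coordinates (u, v)."""
    return math.exp(-0.5 * (u * u + v * v))

t_u, t_v = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
normal = splat_normal(t_u, t_v)
point = splat_point((0.0, 0.0, 5.0), t_u, t_v, 2.0, 0.5, 1.0, 2.0)
```

Because the third scale is zero, every `splat_point` lies exactly in the plane through the center that is perpendicular to `normal`, which is what makes the primitive a surfel rather than a volume.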

In principle, a 2D Gaussian could reuse the 3D GS projection as-is by simply adding a zero scale for the third dimension. However, as mentioned above, the affine projection of 3D GS, $\Sigma' = JW\Sigma W^{\rm T}J^{\rm T}$, keeps only the first-order term of a Taylor expansion, so the approximation error grows with distance from the Gaussian center.

The FAQ in the official repo discusses this point.

Instead of the inaccurate original 3D GS projection, the authors of 2D GS propose projecting 2D Gaussians with a general 2D-to-2D mapping in homogeneous coordinates.

For a world-to-screen transformation matrix $\mathbf{W} \in \mathbb{R}^{4 \times 4}$, a 2D point $(x,y)$ in screen space satisfies the following relation, where $P(u,v)$ is the point on the splat at local coordinates $(u,v)$ and $z$ is its depth:

$$(xz, yz, z, z)^{\rm T} = \mathbf{W} P(u, v)$$

The authors solve the ray-splat intersection by finding the intersection of three non-parallel planes: the $uv$ plane of the splat, an $x$-homogeneous plane, and a $y$-homogeneous plane.

For a given image coordinate $(x, y)$, the ray $\mathbf{x} = (x,y)$ is defined as the intersection line of two homogeneous planes, the $x$-plane $\mathbf{h}_x = (-1, 0, 0, x)$ and the $y$-plane $\mathbf{h}_y = (0, -1, 0, y)$. These planes are then transformed into the splat's $uv$-space, giving two planes $\mathbf{h}_u$ and $\mathbf{h}_v$, and the intersection is computed there.
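The construction can be sketched as follows, assuming the closed form from the 2D GS paper. With $\mathbf{H}$ the $4 \times 4$ matrix that maps splat coordinates $(u, v, 1, 1)^{\rm T}$ to world space (its third column is zero because the splat has no thickness), the transformed planes are $\mathbf{h}_u = (\mathbf{WH})^{\rm T}\mathbf{h}_x$ and $\mathbf{h}_v = (\mathbf{WH})^{\rm T}\mathbf{h}_y$, and solving the two plane equations for $(u, v)$ gives

$$u(\mathbf{x}) = \frac{\mathbf{h}_u^2\mathbf{h}_v^4 - \mathbf{h}_u^4\mathbf{h}_v^2}{\mathbf{h}_u^1\mathbf{h}_v^2 - \mathbf{h}_u^2\mathbf{h}_v^1}, \qquad v(\mathbf{x}) = \frac{\mathbf{h}_u^4\mathbf{h}_v^1 - \mathbf{h}_u^1\mathbf{h}_v^4}{\mathbf{h}_u^1\mathbf{h}_v^2 - \mathbf{h}_u^2\mathbf{h}_v^1}$$

where the superscript is the component index. The simple pinhole matrix used for $\mathbf{W}$ and all function names are assumptions made for this example; this is not the official CUDA rasterizer.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def splat_to_world(center, t_u, t_v, s_u, s_v):
    """H: maps (u, v, 1, 1) to a homogeneous world point; third column is zero."""
    rows = [[s_u * t_u[i], s_v * t_v[i], 0.0, center[i]] for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]

def pinhole(f, cx, cy):
    """Toy W with the camera at the origin, mapping (X, Y, Z, 1) to (xz, yz, z, z)."""
    return [[f, 0.0, cx, 0.0], [0.0, f, cy, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

def splat_to_screen(WH, u, v):
    """Forward map: local (u, v) to pixel (x, y) and depth z."""
    q = matvec(WH, [u, v, 1.0, 1.0])
    return q[0] / q[3], q[1] / q[3], q[3]

def ray_splat_intersection(WH, x, y):
    """Inverse map: pixel (x, y) to local (u, v) via the two transformed planes."""
    WHt = transpose(WH)
    h_u = matvec(WHt, [-1.0, 0.0, 0.0, x])
    h_v = matvec(WHt, [0.0, -1.0, 0.0, y])
    det = h_u[0] * h_v[1] - h_u[1] * h_v[0]  # zero only for an edge-on splat
    u = (h_u[1] * h_v[3] - h_u[3] * h_v[1]) / det
    v = (h_u[3] * h_v[0] - h_u[0] * h_v[3]) / det
    return u, v

# Round trip on a tilted splat: project a local point, then recover it from the pixel.
theta = 0.5
H = splat_to_world((0.2, -0.1, 4.0), (1.0, 0.0, 0.0),
                   (0.0, math.cos(theta), math.sin(theta)), 0.5, 0.3)
WH = matmul(pinhole(500.0, 320.0, 240.0), H)
x, y, z = splat_to_screen(WH, 0.7, -0.4)
u, v = ray_splat_intersection(WH, x, y)
```

The round trip recovers the original local coordinates exactly (up to floating-point error) even though the splat is tilted and off-center, which is the property the affine approximation lacks.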

This closed-form solution gives the $uv$-space coordinates that the screen pixel $(x,y)$ projects to, and the depth $z$ then follows from the earlier relation $(xz, yz, z, z)^{\rm T} = \mathbf{W} P(u, v)$.

The supplementary material of the paper includes a figure comparing the 2D-to-2D projection of 2D GS with the affine projection of 3D GS, and the homogeneous projection is clearly the more accurate of the two.

Unlike NeRF, the volume rendering of 3D GS does not account for the distance between intersected splats. (NeRF's discretized volume rendering uses $\delta_i = t_{i+1} - t_i$, the distance between sampling points.)

As a result, widely spread Gaussian splats can end up with similar color and depth, which makes surface reconstruction harder, since there a ray should intersect only the first visible surface, exactly once.

To mitigate this, the authors introduce a depth distortion loss that, similar to Mip-NeRF360, concentrates the ray's weight distribution near the ray-splat intersection.
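For reference, the loss and the blending weight as I read them from the 2D GS paper, with $z_i$ the depth of the $i$-th ray-splat intersection, $\alpha_i$ the opacity of the $i$-th splat, and $\hat{\mathcal{G}}_i(\mathbf{u}(\mathbf{x}))$ its Gaussian value at the intersection:

$$\mathcal{L}_d = \sum_{i,j} \omega_i \omega_j \left| z_i - z_j \right|, \qquad \omega_i = \alpha_i\, \hat{\mathcal{G}}_i(\mathbf{u}(\mathbf{x})) \prod_{j=1}^{i-1} \left(1 - \alpha_j\, \hat{\mathcal{G}}_j(\mathbf{u}(\mathbf{x}))\right)$$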

The weight has the same form as the volume rendering weight in NeRF (accumulated transmittance times opacity). By the same logic, the weight is large when

the ray is still transparent up to the point along its direction, and
the opacity of the point itself is high.

In other words, the loss is a regularization that reduces the depth difference between ray-splat intersections with high opacity.

The implementation of this loss is described in detail in the appendix of the paper.
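A minimal sketch of the blending weights and the distortion term for a single ray follows. It assumes the per-splat values `alphas` already combine opacity and Gaussian value, and it sums over ordered pairs $j < i$ (the sum over all $i, j$ is simply twice that). The single-pass function is one way to avoid the quadratic loop by using squared depth differences and running sums; treat the squared variant and all names here as assumptions of this sketch rather than a description of the official implementation.

```python
def blend_weights(alphas):
    """w_i = alpha_i * prod_{j<i} (1 - alpha_j), splats sorted front to back."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(transmittance * a)
        transmittance *= 1.0 - a
    return weights

def distortion_pairwise(weights, depths, power=1):
    """Sum over pairs j < i of w_i * w_j * |z_i - z_j| ** power."""
    total = 0.0
    for i in range(len(weights)):
        for j in range(i):
            total += weights[i] * weights[j] * abs(depths[i] - depths[j]) ** power
    return total

def distortion_single_pass(weights, depths):
    """Squared variant in O(N): expands (z_i - z_j)^2 and keeps running sums."""
    sum_w = sum_wz = sum_wz2 = 0.0
    total = 0.0
    for w, z in zip(weights, depths):
        total += w * (z * z * sum_w + sum_wz2 - 2.0 * z * sum_wz)
        sum_w += w
        sum_wz += w * z
        sum_wz2 += w * z * z
    return total

weights = blend_weights([0.5, 0.5, 0.5])
spread = distortion_pairwise(weights, [1.0, 2.0, 4.0])
packed = distortion_pairwise(weights, [2.0, 2.0, 2.0])
```

With identical weights, the ray whose intersections are spread in depth (`spread`) is penalized, while the ray whose intersections coincide (`packed`) has zero loss, which is exactly the pressure that pulls splats onto a single surface.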

In addition to the depth distortion loss, 2D GS introduces a normal consistency loss, which ensures that all 2D splats are aligned with the actual surface.

Because volume rendering allows multiple semi-transparent 2D Gaussians (surfels) to exist along a ray, the authors take the point where the accumulated opacity reaches 0.5 as the actual surface.

They then propose a normal consistency loss that aligns the splat normals with the normal derived from the depth gradient at that point, as follows:
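For reference, the loss as I read it from the 2D GS paper:

$$\mathcal{L}_n = \sum_i \omega_i \left(1 - \mathbf{n}_i^{\rm T} \mathbf{N}\right)$$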

where

$i$ is the index of the splats intersected along the ray
$\omega_i$ is the blending weight of the ray-splat intersection
$\mathbf{n}_i$ is the normal vector of the splat
$\mathbf{N}$ is the normal vector estimated at the point $\mathbf{p}$ from the neighboring depth map.

Specifically, $\mathbf{N}$ is computed with finite differences.
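As a sketch of that computation, assuming the form given in the 2D GS paper, $\mathbf{N}(x,y) = \dfrac{\nabla_x \mathbf{p} \times \nabla_y \mathbf{p}}{\left|\nabla_x \mathbf{p} \times \nabla_y \mathbf{p}\right|}$, where $\mathbf{p}$ is the 3D point unprojected from the depth map: the code below uses simple forward differences between a pixel and its right and lower neighbors, and evaluates the loss $\sum_i \omega_i (1 - \mathbf{n}_i^{\rm T}\mathbf{N})$ for one ray. The choice of forward differences, the sign convention of the normal, and the function names are assumptions of this example.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def depth_normal(p, p_right, p_down):
    """N from forward differences of the unprojected depth points."""
    grad_x = sub(p_right, p)
    grad_y = sub(p_down, p)
    return normalize(cross(grad_x, grad_y))

def normal_consistency(weights, normals, N):
    """L_n = sum_i w_i * (1 - n_i . N) for the splats hit by one ray."""
    return sum(w * (1.0 - dot(n, N)) for w, n in zip(weights, normals))

# Three neighboring depth points on a plane facing the +z direction.
N = depth_normal((0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
aligned = normal_consistency([1.0], [(0.0, 0.0, 1.0)], N)
tilted = normal_consistency([0.5], [(1.0, 0.0, 0.0)], N)
```

A splat whose normal matches the depth-derived normal contributes nothing (`aligned`), while a splat perpendicular to it contributes its full blending weight (`tilted`).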

The surface and mesh reconstruction results reported in the paper are far ahead of the previous work it compares against.

This performance also carries over to real-world custom scenes. Below are the results of an experiment in which I photographed two objects of my own, a guitar and a penguin, and exported meshes with 2D GS.

At this level of quality, I think 2D GS meshes are good enough for real work in games and modeling, provided that the light-condition disentanglement problem is resolved to some degree.

Moreover, because 2D GS can extract (relatively) accurate depth values, a mesh can be generated quickly from the estimated depth with Truncated Signed Distance Function (TSDF) reconstruction. In my experiments this took under a couple of minutes.

cf. Gaussian Surfels, also presented at SIGGRAPH'24, is built on the same idea. However, instead of taking the third axis as the cross product of the first and second axes, it learns the third axis separately and only sets its scale to 0, reusing the original 3D GS rasterization as-is. This means it does not solve the affine projection error. When I tested both algorithms, 2D GS clearly outperformed Gaussian Surfels.
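To make the TSDF step mentioned above concrete, here is a one-dimensional illustration of TSDF fusion along a single ray. It is not the 2D GS export code: it only shows the standard idea of truncating the signed distance between an observed depth and each voxel, averaging it over views, and reading the surface off the zero crossing. The voxel spacing, the truncation distance `trunc`, and the function names are all assumptions made for this sketch.

```python
def tsdf_update(tsdf, weight, voxel_depth, observed_depth, trunc):
    """Fuse one depth observation into one voxel with a running average."""
    sdf = observed_depth - voxel_depth      # positive in front of the surface
    if sdf < -trunc:
        return tsdf, weight                 # far behind the surface: not observed
    value = min(1.0, sdf / trunc)           # truncate and normalize to [-1, 1]
    new_weight = weight + 1.0
    return (tsdf * weight + value) / new_weight, new_weight

def fuse_ray(voxel_depths, observed_depths, trunc):
    """Fuse several depth observations into every voxel along one ray."""
    tsdf = [1.0] * len(voxel_depths)
    weight = [0.0] * len(voxel_depths)
    for d in observed_depths:
        for k, z in enumerate(voxel_depths):
            tsdf[k], weight[k] = tsdf_update(tsdf[k], weight[k], z, d, trunc)
    return tsdf, weight

def zero_crossing(voxel_depths, tsdf, weight):
    """Depth where the fused TSDF changes sign, by linear interpolation."""
    for k in range(len(tsdf) - 1):
        if weight[k] > 0 and weight[k + 1] > 0 and tsdf[k] > 0 >= tsdf[k + 1]:
            step = voxel_depths[k + 1] - voxel_depths[k]
            return voxel_depths[k] + step * tsdf[k] / (tsdf[k] - tsdf[k + 1])
    return None

voxels = [0.1 * k for k in range(21)]
tsdf, weight = fuse_ray(voxels, [1.03, 0.99], trunc=0.2)
surface = zero_crossing(voxels, tsdf, weight)
```

Two slightly inconsistent depth observations are averaged into a single surface between them. The truncation distance decides which voxels each observation may touch, so a value that is too small or too large relative to the depth noise changes which parts of the scene get fused together.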

~~One limitation of 2D GS is the lack of an official viewer.~~ (As of 24.06.10, an SIBR Viewer is provided.)

Because the 2D GS mesh export uses TSDF, meshes come out quite fast, but some hyperparameter tuning is needed: depending on the truncation distance, for example, the clustering can go wrong and produce a bad mesh.

When a mesh has surface artifacts, however, it is hard to tell whether the scene itself was trained badly or the truncation distance needs tuning.

If you add an additional scale dimension to the 2D GS ply file and set its value to 0, you can use a 3D GS viewer as-is, but you lose the accurate Gaussian projection, which is one of the main contributions of the 2D GS authors.

So I built a custom viewer with Viser that renders 2D GS ply files with the homogeneous projection of 2D GS. It offers various visualization and editing functions, making it easier to monitor scene training and generate rendering camera paths.

Only a few months ago I wrote that NeRF and 3D GS were still hard to use in practice, and a good algorithm has already made that post look dated.

Compared with Gaussian Surfels, which amounts to simply flattening 3D GS, 2D GS strikes me as a carefully designed algorithm that takes projection inaccuracy, rasterization, and other details into account.

2D GS inherits the advantages of the explicit representation of 3D GS while resolving its difficulties with surface reconstruction and projection error, which makes it a real step forward in the practical use of radiance fields. It represents a well-designed approach to mesh generation, with promising applications in games and modeling.

Many recent papers put great effort into their project pages, yet actual test results often fall short of what those pages show. It was satisfying to see 2D GS deliver, on real custom scenes, about the same quality as its published results.


Introduction

Recap: Challenges in Mesh Reconstruction for Radiance Fields

1. Background

1.1. 3D Gaussian Splatting

1.2. Surface Reconstruction Problem in 3D GS

1.3. SuGaR: Surface-Aligned Gaussian Splatting

2. 2D Gaussian Splatting

2.1. 2D Gaussian Modeling (Gaussian Surfels)

2.2. Splatting

2D-to-2D Projection

Ray-Splat Intersection w/ Homography

2.3. Training 2D GS

Depth Distortion Regularization

Normal Consistency Regularization

3. Experiments & Custom Viewer

3.1. Qualitative Results

3.2. Custom Viser Viewer for 2D Gaussian Splatting

⭐ Github Project Link

4. Conclusion