NeRF Editing and Inpainting Techniques: Training View Pre-processing

18 Jul 2024

Authors:

(1) Han Jiang, HKUST, equal contribution (hjiangav@connect.ust.hk);

(2) Haosen Sun, HKUST, equal contribution (hsunas@connect.ust.hk);

(3) Ruoxuan Li, HKUST, equal contribution (rliba@connect.ust.hk);

(4) Chi-Keung Tang, HKUST (cktang@cs.ust.hk);

(5) Yu-Wing Tai, Dartmouth College (yu-wing.tai@dartmouth.edu).

Abstract and 1. Introduction

2. Related Work

2.1. NeRF Editing and 2.2. Inpainting Techniques

2.3. Text-Guided Visual Content Generation

3. Method

3.1. Training View Pre-processing

3.2. Progressive Training

3.3. 4D Extension

4. Experiments and 4.1. Experimental Setups

4.2. Ablation and comparison

5. Conclusion and 6. References

3.1. Training View Pre-processing

Text-guided visual content generation is inherently a highly underdetermined problem: for a given text prompt, there are infinitely many feasible object appearances. In our method, we generate content in NeRF by first inpainting its training images and then back-propagating the modified pixels into the NeRF.
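The back-propagation step amounts to fine-tuning the NeRF only on the pixels that inpainting changed. A minimal sketch of such a masked photometric loss is below; the function and variable names are illustrative, not from the paper, and the rendered image here stands in for an actual NeRF render.

```python
import numpy as np

def masked_photometric_loss(rendered, inpainted, mask):
    """Mean squared error restricted to inpainted pixels.

    rendered, inpainted: (H, W, 3) float arrays; mask: (H, W) bool array
    marking the pixels modified by inpainting. Only those pixels supervise
    the NeRF, so the untouched region keeps its original appearance.
    """
    diff = (rendered - inpainted) ** 2
    return diff[mask].mean()

# Toy example: one inpainted pixel differs between render and target.
H = W = 4
render = np.zeros((H, W, 3))
target = np.zeros((H, W, 3))
target[0, 0] = 1.0
mask = np.zeros((H, W), dtype=bool)
mask[0, 0] = True
print(masked_photometric_loss(render, target, mask))  # 1.0
```

Restricting the loss to the mask is what lets the edited region converge to the inpainted appearance without disturbing the rest of the scene.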

If we inpaint each view independently, the inpainted content will not be consistent across views. Some prior 3D generation works, using techniques such as score distillation sampling [21], have demonstrated that multiview convergence is possible even from independently modified training views. Constraining the generation problem by enforcing strong relations among the training views greatly simplifies convergence. Therefore, we first inpaint a small number of seed images associated with a coarse set of cameras covering sufficiently wide viewing angles. For the other views, the inpainted content is strongly conditioned on these seed images.
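One plausible way to pick such a coarse, wide-coverage camera set is greedy farthest-point sampling on the camera view directions; the paper does not commit to a specific selection rule, so the sketch below is an assumption.

```python
import numpy as np

def select_seed_views(directions, k):
    """Greedy farthest-point sampling of k camera view directions.

    directions: (N, 3) vectors, one per training camera. Returns indices
    of k cameras whose view directions are mutually far apart, i.e. a
    coarse set covering wide viewing angles.
    """
    dirs = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    chosen = [0]                          # start from an arbitrary view
    for _ in range(k - 1):
        # cosine similarity to the closest already-chosen direction;
        # the next seed is the camera least similar to all chosen ones
        sim = dirs @ dirs[chosen].T       # (N, len(chosen))
        nearest = sim.max(axis=1)
        chosen.append(int(np.argmin(nearest)))
    return chosen

# Cameras looking along +x, -x, +y: the two opposite views are farthest.
dirs = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0]], float)
print(select_seed_views(dirs, 2))  # [0, 1]
```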

Figure 1. Baseline Overview. Our generative NeRF inpainting is based on the inpainted image of one training view. The other seed images and training images are obtained by using stable diffusion to hallucinate the corrupted details of the unproject-projected raw image. These images are then used to finetune the NeRF, with warmup training for geometric and coarse appearance convergence, followed by iterative training-image updates for fine convergence. For the 4D extension, we first obtain a temporally consistent inpainted seed video based on the first seed image. Then, for each frame, we infer inpainted images on the other views by projection and correction, as in our 3D baseline.

Stable diffusion correction. Since our planar depth estimation is not always accurate, the raw projected results, while generally plausible, contain many small artifacts that make the affected images unsuitable for training. To make them look more natural, we propose to cover the projection artifacts with details hallucinated by stable diffusion. First, we blend the raw projection with random noise in stable diffusion's latent space for t timesteps, where t is small relative to the total number of stable diffusion timesteps. Then, starting from the last t steps, we denoise with stable diffusion to generate the hallucinated image. The resulting image is then taken as the initial training image. With this step, the training-image pre-processing stage is complete.
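The partial noising step can be done in closed form with the standard DDPM forward process, jumping directly to step t rather than iterating. The sketch below shows why a small t preserves most of the raw projection's structure; it uses a plain array in place of a real stable diffusion latent, and the schedule values are the common DDPM defaults, not taken from the paper.

```python
import numpy as np

def partially_noise(x0, t, betas, rng):
    """Jump t steps into the DDPM forward process in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.

    A small t keeps most of x0's structure, so subsequent denoising
    only repaints small artifacts instead of regenerating the image.
    """
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)           # common DDPM schedule
x0 = np.ones((4, 4))                            # stand-in for a latent
x_small = partially_noise(x0, 50, betas, rng)   # mild corruption
x_large = partially_noise(x0, 900, betas, rng)  # mostly noise
print(np.abs(x_small - x0).mean() < np.abs(x_large - x0).mean())  # True
```

Denoising x_small from step t then reproduces the projection with its small artifacts replaced by hallucinated detail, which is the effect the correction step relies on.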

This paper is available on arxiv under CC 4.0 license.