Scaled Inverse Graphics:
Efficiently Learning Large Sets of 3D Scenes

* equal contribution
1 Criteo AI Lab, Paris, France
2 LASTIG, Université Gustave Eiffel, IGN-ENSG, F-94160 Saint-Mandé
3 Université Côte d’Azur, CNRS, I3S, France

Abstract

While the field of inverse graphics has witnessed continuous growth, techniques devised thus far predominantly focus on learning individual scene representations. In contrast, learning large sets of scenes has remained a considerable bottleneck in NeRF developments: repeatedly applying inverse graphics to a sequence of scenes, though essential for various applications, remains prohibitively expensive in resource costs. We introduce a framework termed "scaled inverse graphics", aimed at efficiently learning large sets of scene representations, and propose a novel method to this end. It operates in two stages: (i) training a compression model on a subset of scenes, then (ii) training NeRF models on the resulting smaller representations, thereby reducing the optimization space per new scene. In practice, we compact the representation of scenes by learning NeRFs in a latent space to reduce the image resolution, and by sharing information across scenes to reduce NeRF representation complexity. We experimentally show that our method presents both the lowest training time and memory footprint in scaled inverse graphics compared to other methods applied independently on each scene. Our codebase is publicly available as open source.
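As a rough illustration of this two-stage recipe, the sketch below shows how a driver loop might be organized. All names here (`learn_scene_set`, `train_stage1`, `train_stage2`) are hypothetical placeholders, not the authors' actual API; the stage helpers are passed in rather than reimplemented.

```python
# Minimal sketch of the two-stage pipeline described above. All names are
# hypothetical placeholders for illustration, not the paper's actual code.
def learn_scene_set(scenes, n_subset, train_stage1, train_stage2):
    subset_t1, rest_t2 = scenes[:n_subset], scenes[n_subset:]
    # Stage (i): jointly fit the compression model (encoder/decoder) and the
    # scene subset T1, producing shared components reused for every new scene.
    encoder, decoder, shared, triplanes = train_stage1(subset_t1)
    # Stage (ii): with the shared components fixed, each remaining scene in T2
    # only optimizes its own small representation in the latent space.
    triplanes += [train_stage2(s, encoder, decoder, shared) for s in rest_t2]
    return (encoder, decoder, shared), triplanes
```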

Method



Figure: Method scheme.

Learning a large set of scenes. We learn a large set of scenes in the latent space of an autoencoder using a two-stage approach. Stage 1 jointly learns the encoder \(E_\phi\) and decoder \(D_\psi\) to optimally compress the training images \(x_{i,j}\), while learning a subset of scenes \(\mathcal{T}_1\). Stage 2 reuses the components learned in the first stage to learn the remaining scenes \(\mathcal{T}_2\). We represent each scene with a Tri-Plane \(T_i\) obtained by concatenating, along the feature dimension, "micro" planes \(T_i^\mathrm{mic}\) carrying scene-specific information and "macro" planes \(T_i^\mathrm{mac}\) encompassing global information. The micro planes \(T_i^\mathrm{mic}\) are learned independently for each scene. The macro planes \(T_i^\mathrm{mac}\) are computed from a set of shared base planes \(\mathcal{B}\) via a weighted sum with weights \(W_i\); \(\mathcal{B}\) is jointly learned with all scenes, while \(W_i\) is learned specifically for each scene. We train a latent Tri-Plane \(T_i\) by matching its rendering \(\tilde{z}_{i,j}\) with the encoded image \(z_{i,j}\) via the reconstructive objective \(\mathcal{L}^\mathrm{(latent)}\). We also align the decoded scene renderings \(\tilde{x}_{i,j}\) with the ground-truth RGB images via \(\mathcal{L}^\mathrm{(RGB)}\). \(\mathcal{L}^\mathrm{(ae)}\) is an autoencoder reconstruction loss used in the first stage only.
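To make the micro-macro decomposition concrete, here is a minimal PyTorch sketch. The class names, plane resolutions, and channel counts are all illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class SharedBases(nn.Module):
    """Shared base planes B, learned jointly with all scenes in stage 1."""
    def __init__(self, n_bases: int = 64, c_mac: int = 24, res: int = 128):
        super().__init__()
        # n_bases sets of 3 axis-aligned planes: (n_bases, 3, c_mac, res, res).
        self.planes = nn.Parameter(0.01 * torch.randn(n_bases, 3, c_mac, res, res))

class MicroMacroTriPlane(nn.Module):
    """Per-scene Tri-Plane T_i = concat(T_i^mic, T_i^mac) along the features."""
    def __init__(self, bases: SharedBases, c_mic: int = 8, res: int = 128):
        super().__init__()
        self.bases = bases  # shared across scenes (kept fixed in stage 2)
        # Scene-specific micro planes, learned independently for each scene.
        self.micro = nn.Parameter(0.01 * torch.randn(3, c_mic, res, res))
        # Per-scene weights W_i mixing the shared base planes.
        n = bases.planes.shape[0]
        self.weights = nn.Parameter(torch.full((n,), 1.0 / n))

    def forward(self) -> torch.Tensor:
        # Macro planes: weighted sum of the bases -> (3, c_mac, res, res).
        macro = torch.einsum('n,npcxy->pcxy', self.weights, self.bases.planes)
        # Concatenate micro and macro along the feature (channel) dimension.
        return torch.cat([self.micro, macro], dim=1)
```

Similarly, a hedged sketch of how the three objectives could be combined for a single training view \((i, j)\); `E`, `D`, and `render_latent` stand in for the encoder, decoder, and latent Tri-Plane renderer, and any relative weighting of the terms is omitted:

```python
import torch.nn.functional as F

def training_loss(E, D, render_latent, triplane, rays, x_ij, stage1: bool):
    z_ij = E(x_ij)                               # encoded GT image z_{i,j}
    z_tilde = render_latent(triplane(), rays)    # latent rendering z~_{i,j}
    loss = F.mse_loss(z_tilde, z_ij)             # L^(latent)
    x_tilde = D(z_tilde)                         # decoded rendering x~_{i,j}
    loss = loss + F.mse_loss(x_tilde, x_ij)      # L^(RGB), against GT RGB
    if stage1:
        loss = loss + F.mse_loss(D(z_ij), x_ij)  # L^(ae), stage 1 only
    return loss
```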

Results

Resource Costs



Figure: Resource costs comparison.

Resource Costs. Comparison of resource costs and novel view synthesis (NVS) quality of recent works when naively scaling the inverse graphics problem to \(N = 2000\) scenes. Circle sizes represent the NVS quality of each method. Our method achieves NVS rendering quality similar to that of Tri-Planes, our base representation, while demonstrating the lowest training time and memory footprint of all methods.



Comparison with classical Tri-Planes

ShapeNet Cars Scenes



Tri-Planes trained in the latent space with a Micro-Macro decomposition.


Tri-Planes trained independently in the RGB space.

Basel Faces Scenes

Tri-Planes trained in the latent space with a Micro-Macro decomposition.
Tri-Planes trained independently in the RGB space.

BibTeX


      @article{scaled-ig,
        title={{Scaled Inverse Graphics: Efficiently Learning Large Sets of 3D Scenes}}, 
        author={Karim Kassab and Antoine Schnepf and Jean-Yves Franceschi and Laurent Caraffa and Flavian Vasile and Jeremie Mary and Andrew Comport and Valérie Gouet-Brunet},
        journal={arXiv preprint arXiv:2410.23742},
        year={2024}
      }