One of the core challenges of inserting virtual imagery into real-world scenes in augmented reality is handling occlusions convincingly. Whether the synthetic object or person being injected into a real scene is generated by neural methods or through traditional CGI, the novel element needs to be hidden behind objects that are fully or partially ‘in front’ of it. Additionally, if the object (such as a synthetic car with transparent glass windows) contains complex holes, these must reveal the ‘real’ background in order to maintain the illusion of immersion in the scene.
This process is traditionally handled by essentially recreating the depth of the scene itself, assigning a z-order (i.e., a placement within the depth layers) to the virtual object, and using those expensively calculated front-most (occluding) objects as mattes to obscure the necessary parts of the virtual object.
However, a new collaboration from Niantic, the University of Edinburgh and University College London hopes to cut out much of the processing and resource overhead of this process by essentially guessing what the outcome of all this calculation might be, and using that guess to create a matte, without needing to create those complex depth estimations.
Left, the unoccluded footage; right, the object with occlusions applied. Source: https://nianticlabs.github.io/implicit-depth/
In tests, the new technique beats out Apple’s ARKit Lidar approach, as well as the conventional regression approach that maintains a parallel depth model, providing more accurate and responsive mattes in a real-time AR situation.
The advantage of guessing mattes is particularly notable when compared to ray-based methods such as Lidar, which require reflective data captured at the time the source footage originates – whereas the new method, based entirely on understanding the conventions of visual behavior in both Lidar and regression scenarios, can create mattes arbitrarily, and without these aids.
Of their new method, the authors comment:
‘Given images of the real world scene and the depth map of the virtual assets, our network directly estimates the mask for compositing. The key advantage is that the network no longer has to estimate the real-valued depth for every pixel, and instead focuses on the binary decision: is the virtual pixel in front or behind the real scene here?
‘Further, at inference time we can use the soft output of a sigmoid layer to softly blend between the real and virtual, which can give visually pleasing [compositions], compared with those created by hard thresholding of depth maps.’
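The soft blend the authors describe can be sketched in a few lines. Below is a minimal, hypothetical illustration (the function name and array shapes are assumptions, not taken from the paper): the sigmoid of the network's per-pixel logits becomes a blending alpha between real and virtual pixels, where hard thresholding would instead snap each pixel entirely to one side.

```python
import numpy as np

def soft_composite(real_rgb, virtual_rgb, logits):
    """Blend real and virtual pixels using the sigmoid of the network's
    per-pixel logits (hypothetical shapes: HxWx3 images, HxW logits).
    alpha near 1 means 'the virtual pixel is in front here'."""
    alpha = 1.0 / (1.0 + np.exp(-logits))  # sigmoid of the binary-decision logit
    return alpha[..., None] * virtual_rgb + (1.0 - alpha[..., None]) * real_rgb

# A hard threshold would instead use (logits > 0) as a binary mask,
# producing jagged edges around soft boundaries such as foliage.
```

The soft alpha is what allows the visually pleasing blending at ambiguous edges that the authors mention.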
The approach, at its core, treats the original computation methods as mere data generation routines for a purely computer vision-based approach. The researchers observe that by framing the problem as a segmentation task, it’s possible to apply temporal smoothing to improve the overall output, and add:
‘Our ‘implicit depth’ model ultimately results in state-of-the-art occlusions on the challenging ScanNetv2 [dataset]. Further, if depths are needed too, we can compute dense depth by gathering multiple binary masks.
‘Surprisingly, this results in state-of-the-art depth estimation.’
Trafalgar Square in London gets flooded with an AR deluge whose depth is estimated rather than calculated, with the new method.
The new work is based on classification-based depth estimation, which attempts to calculate whether any particular pixel in a scene is in front of or behind the virtual object, rather than by using recreated depth maps as a kind of faux CGI model to accomplish this.
This particular task was of great interest to researchers at the tail-end of the 3D boom, since it’s one way of potentially recreating ‘twinned’ stereoscopic image pairs from a single image source.
The new system directly estimates depth and occlusion conditioned on both the RGB source images (i.e., in a webcam or live stream) and on the entire depth hierarchy provided by the virtual object.
It should be noted that the virtual object, whether CGI or neural, naturally comes with extra depth information; that this ‘true’ depth information is essentially ‘free’, since it takes no extra effort to generate; and that the source images lack this depth information, and will be occluded based on estimates of pixel z-order, rather than by comparing their (non-existent) depth data with the virtual object’s real depth data.
Ordinarily, problems like this are handled nowadays by a comparative approach, in that the system trains on paired images, such as CGI game footage and real-world driving footage.
In this case, that won’t work, because the end use-cases for the system are far too diverse to be covered by a broadly-generalized comparative model.
Therefore the researchers have split the task across two components. In the schematic below, the upper path shows the traditional depth-based approach, in which the RGB plate (i.e., ‘background’) image is depth-estimated, and the z-order of occluding objects calculated.
Instead (the lower path in the schematic), the new method passes the image data to a backbone network that extracts features, which are then passed to a Mask Predictor module – a Multi-Layer Perceptron (MLP) that iterates through the features and predicts a depth approximation. This is then correlated with the known depth information of the virtual object, and the occlusion is effected.
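In rough outline, a per-pixel mask predictor of this kind takes the backbone's feature vector for each pixel together with the virtual object's known depth at that pixel, and emits a single ‘in front or behind’ logit. The sketch below is purely illustrative (the layer sizes, weight layout, and function name are assumptions):

```python
import numpy as np

def predict_occlusion_logits(features, virtual_depth, weights):
    """Hypothetical sketch of an MLP mask predictor: concatenate each
    pixel's backbone features (HxWxC) with the virtual object's depth
    at that pixel (HxW), then apply a tiny two-layer MLP to produce
    an HxW map of 'virtual is in front' logits."""
    x = np.concatenate([features, virtual_depth[..., None]], axis=-1)  # HxWx(C+1)
    w1, b1, w2, b2 = weights
    hidden = np.maximum(x @ w1 + b1, 0.0)   # ReLU hidden layer
    logits = (hidden @ w2 + b2)[..., 0]     # collapse to one logit per pixel
    return logits
```

Because the MLP is applied per pixel, the same weights slide across the whole image, and the output plugs directly into the soft-compositing step described earlier.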
The MLP’s approach to segmentation is ‘inspired’, say the authors, by Facebook Research’s 2019 Detectron 2 PointRend image segmentation framework.
The system uses the previous frame’s calculated result to inform the calculation of the next frame in the sequence, allowing for temporal smoothing and aiding, to an extent, the continuity and coherence of predictions regarding the occlusion mattes. This aspect of the pipeline was inspired, the authors state, by two prior papers.
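The general idea of carrying the previous frame’s result forward can be illustrated with a simple exponential moving average over the soft masks – note that this is an illustrative stand-in, not the paper’s actual scheme (which feeds prior-frame information into the network itself), and the momentum value is an arbitrary assumption:

```python
import numpy as np

def smooth_mask(prev_mask, new_mask, momentum=0.8):
    """Illustrative temporal smoothing (not the authors' exact method):
    blend the previous frame's soft occlusion mask with the current
    prediction to reduce frame-to-frame flicker in the matte."""
    if prev_mask is None:          # first frame: nothing to smooth against
        return new_mask
    return momentum * new_mask + (1.0 - momentum) * prev_mask
```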
However, since neither of these datasets contains augmented virtual depths, it was necessary to synthesize these during training.
Since occlusion events can be ambiguous, default results on binary pixel judgements (i.e., ‘is this pixel in front of or behind the object?’) frequently produce estimations of 0.5, which essentially equates to ‘we don’t know!’. To combat this, the researchers have addressed such ‘depth discontinuities’ in the ground truth depth maps through use of a Sobel operator. An L1 loss function is then applied to penalize ambiguous predictions at these points.
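Finding those depth discontinuities amounts to running a standard Sobel edge detector over the ground-truth depth map. A minimal sketch follows – the kernel is the standard 3×3 Sobel pair, but the threshold value and function name are assumptions:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T  # transpose gives the vertical-gradient kernel

def depth_discontinuity_mask(depth, threshold=0.5):
    """Flag pixels where the ground-truth depth changes sharply.
    A training loss can then penalize ambiguous ~0.5 predictions
    specifically at these flagged pixels."""
    h, w = depth.shape
    gx = np.zeros_like(depth)
    gy = np.zeros_like(depth)
    for i in range(1, h - 1):          # naive convolution, for clarity
        for j in range(1, w - 1):
            patch = depth[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(patch * SOBEL_X)
            gy[i, j] = np.sum(patch * SOBEL_Y)
    magnitude = np.hypot(gx, gy)       # gradient magnitude per pixel
    return magnitude > threshold
```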
The authors found that their ‘cheat’ method can actually also be used to create a full-fledged depth map, by using the backbone multiple times to iterate through the guessing process, until a complete depth map is created.
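One way to picture this depth-from-masks idea is as a plane sweep of binary queries: at each hypothetical depth, ask whether a virtual surface placed there would still be in front of the real scene, and record the first depth at which the answer flips. The sketch below is a conceptual illustration only – `query_virtual_in_front` stands in for the network’s binary-mask output and is a hypothetical interface:

```python
import numpy as np

def dense_depth_from_masks(query_virtual_in_front, depth_planes):
    """Conceptual sketch: sweep depth planes from near to far, and take
    the first plane at which the virtual surface falls behind the real
    scene as the per-pixel scene depth estimate. `query_virtual_in_front(d)`
    is any callable returning an HxW boolean mask for plane depth d."""
    first = query_virtual_in_front(depth_planes[0])
    depth = np.full(first.shape, depth_planes[-1], dtype=float)  # fallback: farthest plane
    resolved = np.zeros(first.shape, dtype=bool)
    for d in depth_planes:
        in_front = query_virtual_in_front(d)
        hit = (~in_front) & (~resolved)   # virtual first falls behind the scene here
        depth[hit] = d
        resolved |= hit
    return depth
```

The depth resolution of the result is limited by the number of planes queried, which is the trade-off behind gathering multiple binary masks.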
Though this is an interesting development, potentially, for alternative systems that require depth maps in order to function, this novel approach to depth map generation is not particularly relevant to the scope of the project.
Training and Tests
The models and baselines were trained with Adam at a batch size of 24 across two (unspecified) GPUs, for 40,000 steps at an initial learning rate of 0.0001, with a progressive learning schedule. Standard flip and color augmentations were used, which randomly mirror the images and perturb their colors, both designed to make the final model robust to the vagaries of novel (i.e., unseen) data.
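Such flip-and-color augmentations typically look something like the sketch below. The jitter range and flip probability here are generic assumptions, not values taken from the paper:

```python
import random
import numpy as np

def augment(image, rng=random):
    """Minimal sketch of standard flip and color augmentations applied
    to an HxWx3 float image in [0, 1]. Parameters are assumptions."""
    if rng.random() < 0.5:
        image = image[:, ::-1]  # random horizontal flip (mirror)
    brightness = 1.0 + (rng.random() - 0.5) * 0.4  # roughly +/-20% brightness jitter
    return np.clip(image * brightness, 0.0, 1.0)
```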
The backbone mentioned above is derived from the 2022 SimpleRecon framework, which builds a cost volume through plane sweeping. The results are then fed to a U-Net++ framework, originally designed for medical image segmentation, which outputs the final depth map.
The rival architectures were tested quantitatively on ScanNetV2, using the default training, testing and validation split. For qualitative tests, an additional model was trained on Hypersim. For visual occlusion comparisons, the system was tested against ARKit’s Lidar depth sensing, and also against Facebook’s Fast Depth Densification system.
Two variants of the backbone for the new framework were produced for the testing rounds, one based on SimpleRecon, and another, lighter version that uses fewer resources, based around a ResNet-18 encoder. Additionally, the authors tested a ResNet variant lacking a cost volume estimator, based on NianticLabs’ own MonoDepth2 estimator.
For the evaluation of occlusion quality during object insertion, Intersection-over-Union (IoU) was used, and comparative frameworks were base SimpleRecon, MonoDepth2, and the Niantic-led project ManyDepth.
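Intersection-over-Union itself is a simple metric: the overlap between predicted and ground-truth occlusion masks, divided by their combined area. A minimal sketch:

```python
import numpy as np

def occlusion_iou(pred_mask, gt_mask):
    """Intersection-over-Union between predicted and ground-truth
    occlusion masks (boolean HxW arrays). 1.0 is a perfect match."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else 1.0  # both-empty counts as perfect
```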
Of these results, the authors state:
‘In all cases, our method improves occlusion scores, most notably in difficult cases near surfaces. Additionally, our lightweight ResNet variant yields strong [performance] while operating at a fraction of the compute time (20ms vs 64ms on an A100 GPU).’
Here the researchers comment:
‘Notably, when using [SimpleRecon] as a backbone, our method achieves a new state-of-the-art on ScanNetv2 in depth estimation, alongside our core occlusion-IoU evaluation.’
Regarding the qualitative comparisons (see image below), the authors observe:
‘Our occlusions are typically more realistic than baselines, in particular around soft edges, e.g. leaves. We also avoid catastrophic failures, e.g. around the bars in the final row.’
However, they also note that some interesting failure cases arose, due to limitations in the Hypersim training data. In one case, an occlusion failed to penetrate two layers of glass in a traditional London telephone booth:
The researchers hypothesize that later work could omit even more phases in the occlusion process, by directly calculating the final occluded image, instead of using faux, per-pixel depth estimation to shortcut a full depth-map process.
The new system proposed in this paper exemplifies the hope – terrifying to some – that complex workflows and methodologies could eventually become mere fodder for imitative pipelines that use them as training data and reproduce, or improve upon, their results in a tacit and simplified manner*.
In this case, however, given that latency is a critical issue in AR occlusion scenarios, any process that can reduce the complexity of matte generation is likely to be generally well-received by the research sector and industry interests alike.
* (i.e., by predicting what the output of a process would look like and learning to generate that output, instead of following the original process.)