Faking Depth Occlusion for Better Augmented Reality

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

One of the core challenges of inserting virtual imagery into real-world scenes in augmented reality is handling occlusions convincingly. Whether the synthetic object or person being injected into a real scene is generated by neural methods or through traditional CGI, the novel element needs to be hidden behind objects that are fully or partially ‘in front’ of it. Additionally, if the object (such as a synthetic car with transparent glass windows) contains complex holes, these must reveal the ‘real’ background in order to maintain the illusion that the object is immersed in the scene.

In this case, a 'virtual' toy car must be concealed behind a real-world box in an augmented reality scenario. The occluded mesh is indicated in red. Source: https://marciocerqueira.github.io/docs/publications/2021-TVCG.pdf

This process is traditionally handled by essentially recreating the depth of the scene itself, assigning a z-order (or placement inside the depth layers) to the virtual object, and using those expensively-calculated front-most (occluding) objects as mattes to obscure the necessary parts of the virtual object.
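As a rough illustration of this traditional depth-test compositing, the sketch below assumes that a per-pixel depth map of the real scene has already been obtained by some (expensive) means. The function name and the toy 2x2 scene are hypothetical, and NumPy is used purely for convenience.

```python
import numpy as np

def composite_with_depth_test(real_rgb, real_depth, virtual_rgb, virtual_depth):
    """Show a virtual pixel only where it is closer to the camera than the real scene.

    real_rgb, virtual_rgb: float arrays of shape (H, W, 3).
    real_depth, virtual_depth: float arrays of shape (H, W), in the same units.
    Pixels not covered by the virtual object carry np.inf in virtual_depth.
    """
    visible = virtual_depth < real_depth  # hard, per-pixel matte
    out = np.where(visible[..., None], virtual_rgb, real_rgb)
    return out, visible

# Toy scene: a real box at 1 m fills the left column, a wall at 3 m the right.
real_depth = np.array([[1.0, 3.0], [1.0, 3.0]])
# The virtual object sits at 2 m and does not cover the bottom-left pixel.
virtual_depth = np.array([[2.0, 2.0], [np.inf, 2.0]])
real_rgb = np.zeros((2, 2, 3))     # black real scene
virtual_rgb = np.ones((2, 2, 3))   # white virtual object

image, matte = composite_with_depth_test(real_rgb, real_depth, virtual_rgb, virtual_depth)
```

In this toy case the virtual object is hidden behind the box in the left column and shown in front of the wall in the right column. The quality of the result depends entirely on how good the real-scene depth map is.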

Simultaneous Localization and Mapping (SLAM) is just one of a series of complex and resource-draining stages in a popular 2018 approach to generating depth maps that can be used to infer a z-index (and occlusion) in AR scenarios. Source: https://dl.acm.org/doi/pdf/10.1145/3272127.3275083

However, a new collaboration from Niantic, the University of Edinburgh and University College London hopes to cut out much of the processing and resource overhead of this process by essentially guessing what the outcome of all this calculation might be, and using that guess to create a matte, without needing to create those complex depth estimations.

Left, the unoccluded footage; right, the object with occlusions applied. Source: https://nianticlabs.github.io/implicit-depth/

In tests, the new technique beats out Apple’s ARKit Lidar approach, as well as the conventional regression approach that maintains a parallel depth model, providing more accurate and responsive mattes in a real-time AR situation.

Apple's ARKit leaves the object intact, but poorly-matted; regression mangles the matte; and the new approach produces smooth and accurate matted areas that reveal the object in a plausible manner. Source: https://arxiv.org/pdf/2305.07014.pdf

The advantage of guessing mattes is particularly notable when compared to ray-based methods such as Lidar, which require reflective data to be captured at the time the source footage is recorded. The new method, by contrast, is based entirely on understanding the conventions of visual behavior in both Lidar and regression scenarios, and can create mattes arbitrarily, without these aids.

Of their new method, the authors comment:

‘Given images of the real world scene and the depth map of the virtual assets, our network directly estimates the mask for compositing. The key advantage is that the network no longer has to estimate the real-valued depth for every pixel, and instead focuses on the binary decision: is the virtual pixel in front or behind the real scene here?

‘Further, at inference time we can use the soft output of a sigmoid layer to softly blend between the real and virtual, which can give visually pleasing [compositions], compared with those created by hard thresholding of depth maps.’
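The difference between soft blending and hard thresholding can be sketched as follows. This is a minimal illustration, assuming that the network emits one raw score (logit) per pixel; the function names are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_composite(real_rgb, virtual_rgb, mask_logits):
    """Blend with the sigmoid output used directly as a per-pixel alpha."""
    alpha = sigmoid(mask_logits)[..., None]
    return alpha * virtual_rgb + (1.0 - alpha) * real_rgb

def hard_composite(real_rgb, virtual_rgb, mask_logits):
    """Binary decision: a logit above 0 (probability above 0.5) shows the virtual pixel."""
    alpha = (mask_logits > 0.0).astype(real_rgb.dtype)[..., None]
    return alpha * virtual_rgb + (1.0 - alpha) * real_rgb

# One row of three pixels: confidently behind, uncertain, confidently in front.
logits = np.array([[-4.0, 0.0, 4.0]])
real_rgb = np.zeros((1, 3, 3))
virtual_rgb = np.ones((1, 3, 3))

soft = soft_composite(real_rgb, virtual_rgb, logits)
hard = hard_composite(real_rgb, virtual_rgb, logits)
```

The uncertain middle pixel becomes an even mix of real and virtual in the soft version, whereas the hard version has to commit to one side, which is where jagged matte edges tend to come from.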

The approach, at its core, treats the original computation methods as mere data generation routines for a purely computer vision-based approach. The researchers observe that by framing the problem as a segmentation task, it’s possible to apply temporal smoothing to improve the overall output, and add:

‘Our ‘implicit depth’ model ultimately results in state-of-the-art occlusions on the challenging ScanNetv2 [dataset]. Further, if depths are needed too, we can compute dense depth by gathering multiple binary masks.

‘Surprisingly, this results in state-of-the-art depth estimation.’

Trafalgar Square in London gets flooded with an AR deluge whose depth is estimated rather than calculated, with the new method.

The new paper is titled Virtual Occlusions Through Implicit Depth, and comes from seven researchers across the aforementioned institutions. A project website is also available, with video examples (from which the lower-quality example animations in this article have been drawn).

Approach

The new work is based on classification-based depth estimation, which attempts to determine whether any particular pixel in a scene is in front of or behind the virtual object, rather than using recreated depth maps as a kind of faux CGI model to accomplish this.

This particular task was of great interest to researchers at the tail-end of the 3D boom, since it’s one way of potentially recreating ‘twinned’ stereoscopic image pairs from a single image source.

From the 2016 paper 'Deep3D: Automatic 2D-to-3D Video Conversion with DNNs', a method to use estimated depth maps to create 'fake' stereoscopy from mono-view data. Source: https://arxiv.org/pdf/1604.03650.pdf

The new system directly estimates occlusion, conditioned on both the RGB source images (e.g., from a webcam or live stream) and on the full depth information provided by the virtual object.

It should be noted that the virtual object, whether CGI or neural, is naturally going to come with extra depth information; that this ‘true’ depth information is essentially ‘free’, because it takes no extra effort to generate it; and that the source images lack this depth information, and will be occluded based on estimates of pixel z-order, and not by comparing their (non-existent) depth data with the virtual object’s real depth data.

Ordinarily, problems like this are handled nowadays by a comparative approach, in that the system trains on paired images, such as CGI game footage and real-world driving footage.

In this case, that won’t work, because the end use-cases for the system are far too diverse to be covered by a broadly-generalized comparative model.

Therefore the researchers have split the task across two components. In the schematic below, we see, above, the traditional depth-based approach, where the RGB plate (i.e., ‘background’) image is depth-estimated, and the z-order of occluding objects calculated.

Architectural approach for the new system.

Instead (pictured in the lower part of the image), the new method passes the image data to a backbone network, which extracts features that are then passed to a Mask Predictor module: a Multi-Layer Perceptron (MLP) that iterates through the features and predicts a depth approximation. This is then correlated to the known depth information of the virtual object, and the occlusion effected.
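To make the shape of this pipeline more concrete, here is a minimal sketch of a per-pixel mask predictor. Everything in it is an assumption for illustration: the layer sizes, the choice to concatenate the virtual depth onto each pixel's feature vector, and the random (untrained) weights. It is not the authors' architecture, only an example of an MLP that maps backbone features plus virtual depth to an 'in front' probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden_dim):
    """Random weights for a two-layer MLP (hypothetical sizes, untrained)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, (hidden_dim, 1)),
        "b2": np.zeros(1),
    }

def predict_mask(features, virtual_depth, params):
    """features: (H, W, C) backbone output; virtual_depth: (H, W).

    Returns (H, W) probabilities that the virtual pixel is in front.
    """
    # Assumption: the virtual depth is appended as one extra input channel.
    x = np.concatenate([features, virtual_depth[..., None]], axis=-1)
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU
    logits = (hidden @ params["w2"] + params["b2"])[..., 0]
    return 1.0 / (1.0 + np.exp(-logits))                      # sigmoid

features = rng.normal(size=(4, 6, 8))       # stand-in for backbone features
virtual_depth = np.full((4, 6), 2.0)        # virtual object 2 m from the camera
params = init_mlp(in_dim=8 + 1, hidden_dim=16)
mask = predict_mask(features, virtual_depth, params)
```

Because the same small network is applied independently to every pixel, the cost of the predictor itself is modest; most of the work is done by the backbone that produces the features.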

The MLP’s approach to segmentation is ‘inspired’, say the authors, by Facebook Research’s 2019 Detectron2 PointRend image segmentation framework.

Facebook Research's PointRend system, from 2020. Source: https://github.com/facebookresearch/detectron2/tree/main/projects/PointRend

The system uses the calculated previous frame to inform the calculation of the next frame in the sequence, allowing for temporal smoothing, and aiding, to an extent, in continuity and coherence of predictions regarding the occlusion mattes. This aspect of the pipeline was inspired, the authors state, by two prior papers.
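The general benefit of carrying information from one frame to the next can be illustrated with a much simpler stand-in: an exponential moving average over the per-frame masks. This is explicitly not the authors' mechanism, which is learned inside the network; the function name and the momentum value are assumptions.

```python
import numpy as np

def smooth_masks(per_frame_masks, momentum=0.7):
    """Exponential moving average over a sequence of (H, W) soft masks."""
    smoothed = []
    previous = None
    for mask in per_frame_masks:
        if previous is None:
            current = mask.astype(float)  # first frame has no history
        else:
            current = momentum * previous + (1.0 - momentum) * mask
        smoothed.append(current)
        previous = current
    return smoothed

# A single pixel whose raw prediction flickers between visible (1) and occluded (0).
raw = [np.array([[v]]) for v in (1.0, 0.0, 1.0, 1.0)]
stable = smooth_masks(raw)
```

With these settings the one-frame dropout in the raw sequence is damped to 0.7 rather than falling to 0, so the virtual object no longer blinks out for a single frame. The price of this kind of smoothing is a slower response to genuine changes in occlusion.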

To train the system, the authors required dataset tuples containing pixel-aligned ground truth depth maps and associated camera poses. Datasets used were 2017’s ScanNet and Apple’s 2021 Hypersim.

An example of an annotated scene from the 2017 ScanNet dataset, used in the new system. Source: https://arxiv.org/pdf/1702.04405.pdf

However, since neither of these datasets contains augmented virtual depths, it was necessary to synthesize these during the training.
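One way to picture this synthesis step is sketched below. The sampling strategy (a uniform offset around the ground truth depth) and all names are hypothetical; the only idea taken from the article is that a virtual depth is invented per pixel, and that the binary training label follows from comparing it with the real depth.

```python
import numpy as np

def sample_virtual_depths(real_depth, rng, max_offset=0.5):
    """Sample a synthetic 'virtual' depth per pixel and the matching binary label.

    Assumed strategy: perturb the ground truth depth by a uniform offset, so
    that samples fall on both sides of the real surface.
    """
    offset = rng.uniform(-max_offset, max_offset, size=real_depth.shape)
    virtual_depth = np.clip(real_depth + offset, 0.1, None)  # keep depths positive
    # Label is 1 where the virtual pixel would be visible (in front of the scene).
    in_front = (virtual_depth < real_depth).astype(np.float32)
    return virtual_depth, in_front

rng = np.random.default_rng(0)
real_depth = rng.uniform(1.0, 4.0, size=(8, 8))  # stand-in for a ground truth depth map
virtual_depth, labels = sample_virtual_depths(real_depth, rng)
```

Sampling close to the real surface concentrates the training signal on the hard cases, where the virtual object is only slightly in front of or behind the scene.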

Real and sampled depths in the new system.

Since occlusion events can be ambiguous, default results on binary pixel judgements (i.e., ‘is this pixel in front of or behind the object?’) frequently produce an estimate of 0.5, which essentially equates to ‘We don’t know!’. To combat this, the researchers have addressed such ‘depth discontinuities’ in the ground truth depth maps through use of a Sobel operator. An L1 loss function is then applied to penalize such ambiguous predictions.
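The sketch below shows how a Sobel operator can flag depth discontinuities, and how an L1 term restricted to those pixels would punish a fence-sitting prediction of 0.5. The Sobel filter itself is standard; the exact form of the loss, the threshold, and the function names are assumptions, since the article does not give the paper's formulation.

```python
import numpy as np

def sobel_magnitude(depth):
    """Gradient magnitude of an (H, W) depth map using 3x3 Sobel kernels."""
    p = np.pad(depth, 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) - (
        p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) - (
        p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)

def discontinuity_l1(pred, target, real_depth, edge_threshold=1.0):
    """Mean L1 error over pixels flagged as depth discontinuities (assumed loss form)."""
    edges = sobel_magnitude(real_depth) > edge_threshold
    if not edges.any():
        return 0.0
    return float(np.abs(pred - target)[edges].mean())

# Real scene: a near surface at 1 m on the left, a far surface at 3 m on the right.
real_depth = np.array([[1.0, 1.0, 3.0, 3.0]] * 4)
# Virtual plane at 2 m: visible (1) only where the real scene is farther away.
target = (2.0 < real_depth).astype(float)

unsure = np.full_like(real_depth, 0.5)
loss_unsure = discontinuity_l1(unsure, target, real_depth)
loss_confident = discontinuity_l1(target, target, real_depth)
```

Here the two middle columns are flagged as a discontinuity. A prediction of 0.5 everywhere incurs a loss of 0.5 on those pixels, while a confident and correct prediction incurs none.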

The authors found that their ‘cheat’ method can actually also be used to create a full-fledged depth map, by using the backbone multiple times to iterate through the guessing process, until a complete depth map is created.
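The idea of recovering depth from repeated binary queries can be sketched as a sweep over candidate depths: for each one, ask whether a virtual plane at that depth would be in front of the scene, and record where the answer flips. In the sketch below the trained network is replaced by an 'oracle' built from a known depth map, and the plane spacing is arbitrary; both are assumptions for illustration.

```python
import numpy as np

def depth_from_masks(predict_in_front, candidate_depths):
    """Recover per-pixel depth by querying a binary occlusion predictor at many depths.

    predict_in_front(d) returns (H, W) probabilities that a virtual plane at
    depth d is in front of the real scene. candidate_depths is sorted ascending.
    """
    candidate_depths = np.asarray(candidate_depths)
    probs = np.stack([predict_in_front(d) for d in candidate_depths])  # (K, H, W)
    behind = probs < 0.5
    first_behind = np.argmax(behind, axis=0)     # first plane judged to be behind
    never_behind = ~behind.any(axis=0)           # scene is farther than every plane
    first_behind[never_behind] = len(candidate_depths) - 1
    return candidate_depths[first_behind]

# Stand-in for the trained network: a soft oracle built from a known depth map.
true_depth = np.array([[1.23, 2.67], [3.91, 0.74]])

def oracle(d):
    return 1.0 / (1.0 + np.exp(-(true_depth - d) / 0.05))

planes = np.linspace(0.5, 4.0, 36)   # one candidate plane every 0.1 m
estimate = depth_from_masks(oracle, planes)
```

The recovered depth is only as fine as the spacing of the candidate planes, so accuracy trades off directly against the number of queries made.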

A tacit depth map produced by iterating through pixel 'guesses' with the system's backbone.

Though this is an interesting development, potentially, for alternative systems that require depth maps in order to function, this novel approach to depth map generation is not particularly relevant to the scope of the project.

Training and Tests

The models and baselines were trained with the Adam optimizer at a batch size of 24 across two (unspecified) GPUs, for 40,000 steps at an initial learning rate of 0.0001, with a progressive learning schedule. Standard flip and color augmentations were used, which randomly mirror the images and perturb their colors, both designed to make the final model robust to the vagaries of novel (i.e., unseen) data.
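A minimal sketch of what such augmentations might look like is given below, alongside the hyperparameters reported above. The augmentation details (flip probability, per-channel gain range) are assumptions, and the sketch ignores the camera pose adjustments that flipping would require in a multi-view setting.

```python
import numpy as np

# Hyperparameters as reported in the article.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "batch_size": 24,
    "steps": 40_000,
    "initial_learning_rate": 1e-4,
}

def augment(image, depth, rng, flip_prob=0.5, max_gain=0.2):
    """Random horizontal flip plus a simple per-channel color perturbation.

    image: (H, W, 3) floats in [0, 1]; depth: (H, W).
    The flip is applied to image and depth together so they stay pixel-aligned.
    """
    if rng.random() < flip_prob:
        image = image[:, ::-1]
        depth = depth[:, ::-1]
    # Assumed color augmentation: scale each channel by a random gain.
    gain = 1.0 + rng.uniform(-max_gain, max_gain, size=3)
    image = np.clip(image * gain, 0.0, 1.0)
    return image, depth

rng = np.random.default_rng(0)
image = rng.random((4, 5, 3))
depth = rng.uniform(1.0, 4.0, size=(4, 5))
aug_image, aug_depth = augment(image, depth, rng)
```

Only the image receives the color change, since depth values describe geometry and should not vary with lighting or tint.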

The backbone mentioned above is derived from the 2022 SimpleRecon framework, which builds a cost volume through plane sweeping. The results are then fed to a U-Net++ framework, originally designed for medical image segmentation, which outputs the final depth map.

The rival architectures were tested quantitatively on ScanNetv2, using the default training, testing and validation split. For qualitative tests, an additional model was trained on Hypersim. For visual occlusion comparisons, the system was tested against ARKit’s Lidar depth sensing, and also against Facebook’s Fast Depth Densification system.

Two variants of the backbone for the new framework were produced for the testing rounds, one based on SimpleRecon, and another, lighter version that uses fewer resources, based around a ResNet-18 encoder. Additionally, the authors tested a ResNet variant lacking a cost volume estimator, based on NianticLabs’ own MonoDepth2 estimator.

For the evaluation of occlusion quality during object insertion, Intersection-over-Union (IoU) was used, and comparative frameworks were base SimpleRecon, MonoDepth2, and the Niantic-led project ManyDepth.
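For reference, the basic Intersection-over-Union calculation on two binary mattes can be sketched as below. This is the generic metric only; any refinements in the paper's evaluation protocol (such as restricting the score to regions near surfaces) are not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection-over-Union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)

# Predicted vs. ground truth visibility of the virtual object (1 = visible).
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
score = mask_iou(pred, target)
```

In this example the masks agree on two visible pixels out of four that either mask marks as visible, giving an IoU of 0.5.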

Results for depth regression.

Of these results, the authors state:

‘In all cases, our method improves occlusion scores, most notably in difficult cases near surfaces. Additionally, our lightweight ResNet variant yields strong [performance] while operating at a fraction of the compute time (20ms vs 64ms on an A100 GPU).’

For evaluating depth estimation, the researchers compared their system to the KAIST-led DPSNet, MVDepthNet, DELTAS, GPMVS, DeepVideoMVS (fusion version, which also provides the evaluation protocol for this test), and the two flavors of SimpleRecon described earlier.

Results (right) for depth evaluation under the new system.

Here the researchers comment:

‘Notably, when using [SimpleRecon] as a backbone, our method achieves a new state-of-the-art on ScanNetv2 in depth estimation, alongside our core occlusion-IoU evaluation.’

Regarding the qualitative comparisons (see image below), the authors observe:

‘Our occlusions are typically more realistic than baselines, in particular around soft edges, e.g. leaves. We also avoid catastrophic failures, e.g. around the bars in the final row.’

Qualitative comparisons for the new system, pitted against Apple's Lidar-based framework, and Facebook Research's Fast Depth Densification system. See the paper for better detail and resolution.

However, they also note that some interesting failure cases arose, due to limitations in the Hypersim training data. In one case, an occlusion failed to penetrate two layers of glass in a traditional London telephone booth:

Limitations of the Apple Hypersim data caused some occlusion failures, such as the disappearance of the imposed virtual object when viewed through two planes in a phone booth.

The researchers hypothesize that later work could omit even more phases in the occlusion process, by directly calculating the final occluded image, instead of using faux, per-pixel depth estimation to shortcut a full depth-map process.

Conclusion

The new system proposed in this paper exemplifies the hope – terrifying to some – that complex workflows and methodologies could eventually become mere fodder for imitative pipelines that use them as training data and reproduce, or improve upon, their results in a tacit and simplified manner*.

In this case, however, given that latency is a critical issue in AR occlusion scenarios, any process that can reduce the complexity of matte generation is likely to be generally well-received by the research sector and industry interests alike.

 

* (i.e., by predicting what the output of a process would look like and learning to generate that output, instead of following the original process.)
