Most techniques to create neural renders of humans are solipsistic in nature, in that the target subject’s universe and reality ends at the point of occlusion – the point at which their face and/or body either falls out of frame or is hidden by some object, such as another person who has walked in front of them – or, perhaps, a table that is obscuring their legs:
Typically, older VFX approaches, such as CGI, do not have this limitation. Even if only the upper half of the body is to be rendered, the body is almost certainly going to be entirely represented within the scene (though, as in the above example, it may be partially hidden), because it is usually easier to model at least a basic complete human body – and in many cases the modeled body will have originated from a default template provided in various software packages and CGI workflows.
This non-solipsistic ‘presence’ of the depicted character has a number of advantages. For one thing, it has long been relatively trivial to attach automated systems of physical movement such as inverse kinematics to a CGI figure. This means that if a hidden lower part of a CGI figure moves, the non-hidden upper sections, the parts that the viewer can see, will be realistically affected, and will move plausibly.
This provides an integrity of movement and a synergistic model that is not currently easy to replicate in an equivalent neural representation.
Apart from anything else, CGI offers creative choices not present in neural systems where only the visible pixels have any existence or putative substance. For instance, if it should be required that part of the character’s leg be shown, that leg is already available.
There are thousands of CGI-based VFX shots in movie and TV output each year where the viewer might only see a fleeting eye, arm, hand or mouth of a monster or some other fantastical creature (or de-aged person), but where the entirety of the geometry was available, as necessary, because CGI approaches model an entire parametric and textured world, which can be selectively rendered, as needed.
The emerging demands of AI-VFX also require this kind of flexibility, despite the lack of any standard mechanism to provide it. For instance, to create a consistent deepfake of a face for 4-5 seconds, the source material must be unobstructed, so that the deepfaking system can consistently transform the original face. Should a temporary obstruction occur, such as a hand moving in front of a face, the deepfaking system will attempt to account for the hand, ruining the results:
Typical 2017-era deepfake frameworks such as DeepFaceLab and DeepFaceLive expect unobstructed faces. Source: https://blog.metaphysic.ai/to-uncover-a-deepfake-video-call-ask-the-caller-to-turn-sideways/
The same applies to the emerging field of full-body deepfakes, where neural systems cannot be guaranteed to be supplied with perfect and unobstructed views on which to perform their extraordinary transformative processes. Again, some intrinsic understanding of what’s hidden will often be necessary for an effective shot.
Therefore, despite AI’s massively increased ability to produce convincing human representations, neural workflows often need to have an understanding of hidden geometry – a native function of CGI.
Uncovering a Body
One new offering, from China and the UK, is proposing a superior method of reconstructing occluded regions of the human body in source footage, by separately evaluating the perceived body parts and the parts of the scene with which they might be interacting.
In tests, the new technique proved more effective than prior works, and offers a number of innovations, such as the creation of a ‘free zone’ – a constrained area of the frame which covers all regions that might be occupied by a body and by anything with which it is interacting. This approach is aided by the generation of volumetric point cloud data as well as semantic segmentation, and frees the system from the need to evaluate and locate the body within the context of the entire frame.
Additionally, the new work offers a much-needed novel evaluation metric for this specific task – one that goes further than previous metrics by taking into account not just the perceived surface of the human body, but also the volumetric mass provided by the point cloud, so that a more substantial understanding can be obtained regarding potential touching and collision with real-world objects.
Gaming veterans will be familiar with the latter problem, known as ‘clipping’, having seen many examples where the geometry of a displayed character illogically seems to pass through a supposedly solid object, such as a wall or a door. Since video-games are essentially CGI, the solutions have improved notably in the past couple of decades, with the development of better collision detection systems; but this remains a novel and thorny challenge for neural approaches that have no such native concept of discrete entities in a scene.
The new paper is titled Reconstructing 3D Human Pose from RGB-D Data with Occlusions, and comes from two researchers, from Xi’an Jiaotong University and University College London respectively.
The central idea of the new work, as mentioned above, is to reduce the ‘solution space’ (i.e., the area of the frame relevant to rendering a neural human) so that the system does not need to take into account so many variables, and will therefore be less likely to be confounded by irrelevant information.
The ideal solution to such problems would be entirely AI-based, where evaluation and transformations were effected entirely at a machine learning level. However, the advantages of CGI are so overwhelming that a notable tranche of solutions in this research strand are making greater and greater use of parametric CGI models, whose poses and geometry are converted into neural terms, and which therefore provide a level of control and instrumentality that current neural systems lack.
In the case of the new system, and several that have preceded it, this comes in the form of Skinned Multi-Person Linear Model (SMPL) geometries, an innovation that stems from research from the late 1990s, and that has gained ground in the last 5-6 years, as the excitement over the realism of new neural systems has given way to frustration over how difficult they are to control.
For the new system, the SMPL-X model is used:
The optimization framework for the new work uses the SMPLify-X method (the methodology of the SMPL-X project linked above) to initially obtain a scene mesh, before the addition of the novel Free Zone and the truncated shadow volume (which provides, as mentioned, a limited volumetric point cloud for the neural representation of the person).
By itself, this approach could potentially leave the rendered body ‘floating’ and disembodied from its environment. Therefore the system additionally makes use of the PROX-D framework, which is occupied with evaluating contact surfaces:
‘The effectiveness of this intuitive approach heavily relies on the accuracy and completeness of the scene. In reality, limitations in the scanning devices and the complexity of the scene can cause errors. Besides, this approach becomes even more ineffective in scenarios where the body part penetrates into the scene deeply or penetrates through thin objects.’
(It should be noted that nearly all the technologies used in this project either originated at, or were co-authored by, the Max Planck Institute for Intelligent Systems [MPI], which is the principal or major contributor to 3DMMs and SMPL, as well as various other CGI-based successors aimed at providing parametric interfaces for neural systems.)
The resulting point cloud feature is fed back to the point cloud encoder for refinement and all variables obtained thus far are then passed to the MLP decoder to predict the Free Zone.
Initial delineation of the body area is provided by Google’s DeepLabV3, which performs segmentation on the perceived subject. The segmentation mask will also be used to constrain which points from the noisy point cloud are finally chosen as representing the human in the source image. Ultimately, 1024 points are arrived at, using Farthest Point Sampling (FPS), and the Free Zone network is trained on the L1 distance between the constrained prediction and the ground truth.
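The down-sampling step described above can be illustrated with a minimal sketch of Farthest Point Sampling, assuming NumPy and a randomly generated stand-in for the masked point cloud. The function names, the random seed and the generic L1 helper are illustrative assumptions, not code from the paper:

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Return indices of k well-spread points from an (N, 3) array."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = rng.integers(n)  # arbitrary starting point
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for i in range(1, k):
        chosen[i] = np.argmax(min_dist)  # farthest from the current selection
        dist = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return chosen

def l1_distance(pred, target):
    """Mean absolute difference (a generic L1 loss, not the paper's exact loss)."""
    return float(np.abs(pred - target).mean())

# Hypothetical stand-in for the segmented human points from a depth frame.
cloud = np.random.default_rng(42).normal(size=(5000, 3))
idx = farthest_point_sampling(cloud, 1024)
sampled = cloud[idx]
```

Because each new point is the one farthest from everything already selected, the 1024 samples cover the cloud evenly rather than clustering where the depth sensor happened to return the most points.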
This is still not enough for the requisite accuracy, because the filtered neural person is still present inside a point cloud that contains other material. Therefore a truncated shadow volume is calculated, by evaluating the path of rays from shadows that exist within the source image:
This is not a dissimilar process to Neural Radiance Fields (NeRF), a method which generates volumetric mass by calculating ray directions from a variety of images, before concatenating the results into an explorable environment or object.
Finally, the obtained points must be matched to an SMPL-X model on a per-point basis, so that the model’s own coordinates and other variables can act as proxies for the neural representation. For this, FPS is used to obtain the frontal vertices of the person represented in the frame, but this time by using a virtual camera placed in front of the body. Since the resulting vertices are uniformly distributed, it is then possible to match them to equivalent placements in the SMPL-X model on a point-by-point basis until equivalency is achieved:
The authors comment:
‘Free zone can be seen as a superset of the body and truncated shadow volume can be seen as a subset of the body. We match the human body with free zone points and truncated shadow volume points separately.’
Data and Tests
To test the system, the researchers used the two available PROX datasets – a quantitative set featuring 180 static RGB-D frames, with ground truth, where a single subject wearing motion capture-style markers interacts with living-room furniture; and a qualitative set of 100,000 RGB-D frames with pseudo ground truth, and featuring 20 subjects in a variety of scenarios over 12 scenes. The pseudo ground truth in the second set is provided by SMPL-X parameters that are fitted through PROX-D.
The data was split 4:1 into training and testing sets, with evaluations carried out on the resulting testing set and on the PROX quantitative set. Data with high penetrations (i.e., where the human component was very heavily enmeshed in contact with other items) was excluded.
To train the FZNet, 20,000 query points were sampled, of which 95% were near the surface, and the rest within the volumetric estimation of the body. Perturbations were added as signal markers to determine that particular point’s proximity to the estimated human body. Data augmentation was used to increase diversity and generalization.
The FZNet was trained with the Adam optimizer at a learning rate of 1e-4. A learning rate schedule was also used, with a decay rate (i.e., a reduction in the extent to which each iteration alters the model) of 0.5 after the initial 100 epochs (i.e., complete tours of the available data by the training routine).
The models were trained on an NVIDIA RTX 3090 Ti GPU with 24GB of VRAM, for a total of 200 epochs.
It was necessary to reconcile the two coordinate systems in play, in order to rationalize the data. In this case, the camera coordinate system (coordinates relative to the viewpoint) was transformed to the world coordinate system (absolute and unvarying coordinates that will return uniform position data from any standpoint).
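As a rough illustration of the schedule described above, the sketch below assumes that the 0.5 decay is applied once, from epoch 100 onward. The article does not say whether the decay is applied again later, so this is only one possible reading, and the function name and parameters are hypothetical:

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.5, decay_start=100):
    """Step schedule: base_lr for the first `decay_start` epochs, then base_lr * decay.

    Assumption: the decay is applied a single time, not repeatedly.
    """
    return base_lr if epoch < decay_start else base_lr * decay

# Learning rate for each of the 200 training epochs (0-indexed).
schedule = [learning_rate(epoch) for epoch in range(200)]
```

Under this reading, the first half of training runs at 1e-4 and the second half at half that value, so that later updates make smaller adjustments to the weights.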
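The conversion between the two coordinate systems is usually a single rigid transform. The following is a minimal sketch, assuming NumPy and a hypothetical 4x4 camera-to-world matrix; the pose values are invented for illustration and are not taken from the PROX data:

```python
import numpy as np

def camera_to_world(points_cam, cam_to_world):
    """Apply a 4x4 rigid transform to (N, 3) points given in camera space."""
    ones = np.ones((points_cam.shape[0], 1))
    homogeneous = np.hstack([points_cam, ones])  # (N, 4)
    return (homogeneous @ cam_to_world.T)[:, :3]

# Hypothetical pose: camera rotated 90 degrees about Z, positioned at (1, 2, 3).
theta = np.pi / 2
pose = np.eye(4)
pose[:3, :3] = [
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
]
pose[:3, 3] = [1.0, 2.0, 3.0]

points_cam = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
points_world = camera_to_world(points_cam, pose)
```

Once every frame has been mapped in this way, the body points and the scene geometry share one frame of reference, regardless of where the camera was placed.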
For an initial qualitative (non-competitive) round (with the large results split for convenience into two images below – refer to source paper for better resolution), the authors note that the Free Zone Network can still produce acceptable results even when the subject is in an unusual position (bottom row of second image below):
The paper further states:
‘When the human has a lot of contact with the scene (row 3), such as when the human is lying on a sofa, our method can produce plausible results with little penetration and necessary contact. In cases where the scenes are more complex (row 4), such as when the human is standing between a sofa and a pot of plants, our free zone network can still identify the correct region.
‘Even when the human is not captured by the depth camera (row 5), the free zone network can still estimate the free zone correctly using only the scene information.’
For a quantitative and competitive round, the new approach was pitted against SMPLify-D and PROX-D on a variety of evaluation metrics: 3D reconstruction used Joint Position Error (JPE) and Vertex-to-Vertex Error (V2V), which, respectively, judge the mean error between joint positions, and the mean error between related vertices; to test alignment accuracy with the depth data, a Partial Matching (PM) metric was used to measure mean distance between the body points and their related vertices on the reconstructed mesh; a Non-Collision (NC) metric from a prior 2019 work was also used.
Additionally, the authors devised an augmented version of the NC metric, titled Volume Non-Collision (VNC), which evaluates the SDF score of points inside the neural body. The authors explain:
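Both JPE and V2V reduce to the same calculation, a mean Euclidean distance between corresponding 3D points, applied either to joints or to mesh vertices. A minimal sketch, assuming NumPy and toy data (the function name and values are illustrative, not from the paper):

```python
import numpy as np

def mean_point_error(pred, gt):
    """Mean Euclidean distance between corresponding rows of two (N, 3) arrays."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

# Toy data: three ground-truth joints, and a prediction offset by (3, 4, 0).
gt_joints = np.zeros((3, 3))
pred_joints = gt_joints + np.array([3.0, 4.0, 0.0])

jpe = mean_point_error(pred_joints, gt_joints)  # joints in, so this is a JPE
```

Passing the vertices of the predicted and ground-truth meshes to the same function, in matching order, would give a V2V figure instead.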
‘The VNC score is lower when the penetration is more severe. By considering the points inside the body, this metric provides a more comprehensive evaluation of penetration. In Figure 7 [see image below], the VNC scores of these two examples are 0.99 and 0.79 respectively, suggesting a more accurate evaluation of penetration compared with the NC.’
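The difference between a surface-only score and a volumetric one can be sketched as follows. This is only one plausible reading of the description above, assuming that both scores are the fraction of sample points with a non-negative scene SDF value; the flat-floor scene, the sample points and the function names are all hypothetical:

```python
import numpy as np

def non_collision_ratio(points, scene_sdf):
    """Fraction of points lying outside scene geometry (SDF >= 0)."""
    return float((scene_sdf(points) >= 0).mean())

def floor_sdf(points):
    """Hypothetical scene: everything below z = 0 is solid."""
    return points[:, 2]

# Surface samples of a body, one of which dips below the floor.
surface = np.array([[0, 0, 0.5], [0, 0, 0.1], [0, 0, 0.0], [0, 0, -0.1]])
# Samples from inside the body volume, two of which are below the floor.
interior = np.array([[0, 0, 0.3], [0, 0, -0.05], [0, 0, -0.2], [0, 0, 0.2]])

nc = non_collision_ratio(surface, floor_sdf)  # surface points only
vnc = non_collision_ratio(np.vstack([surface, interior]), floor_sdf)  # surface + volume
```

In this toy case the volumetric score comes out lower than the surface score, because the interior samples expose penetration that the surface samples alone under-report.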
In the results for the initial PROX quantitative test, the authors’ method achieved the lowest error rate across all metrics, although the researchers note that this test covers only one scene in the quantitative set, with simple interactions:
In the qualitative PROX test, the new approach likewise swept the board:
The authors also compared their approach to the prior methods in a broad qualitative test (again, split into two images for convenience):
Regarding these results, the authors comment:
‘In the first 4 rows, where the human is partially occluded by the scene or by themselves, both SMPLify-D and PROX-D produce results where some parts penetrate the scene. However, our method can infer the correct pose of the invisible body parts and avoid penetrations.
‘When some body parts penetrate deeply into objects, such as a leg penetrating into a sofa or a hand penetrating into a wall (row 1, 2), it is hard for current methods to pull the body out of the object completely. However, our method uses the free zone to guide the body away from the object, effectively reducing the penetration. Our method can also handle cases where some body parts penetrate through thin objects like a table (row 3, 4), preventing such penetrations from occurring.
‘In the last 2 rows, where different body parts overlap with each other, our results exhibit better alignment with the scanned body point cloud compared with other methods thanks to the constraint of the truncated shadow volume.’
In a further test (see paper for details), the researchers found that their method was better able to perform reconstruction when larger amounts of the target body were obscured.
Neural rendering is naturally myopic, compared to CGI, and it is quite possible that it will always need to work in conjunction with systems, such as CGI, that offer consistent and rational bounds.
However, this is not a fight that the image synthesis research sector intends to concede lightly, since this would position neural output more in the context of a texture-generator than a full-fledged generative system.
CGI is not the only system that contains ‘off-stage’ information that would be useful in the cases illustrated here; latent diffusion architectures such as Stable Diffusion also have immense prior knowledge about human anatomy – and, unlike CGI, this knowledge is highly abstracted and massively generalized.
Unfortunately, as we have noted many times, LDMs have no internal mechanisms that easily facilitate smooth temporal movement, and so cannot stand in for CGI in cases such as the authors address in the new work, which are aimed at evaluating video content rather than just static images.
Therefore the current state-of-the-art seems set to leave parametric and neural methods in an uneasy and awkward collaborative association, pending further developments that address the issues outlined in the new paper.