Uncovering a Body With AI

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

Most techniques to create neural renders of humans are solipsistic in nature, in that the target subject’s universe and reality end at the point of occlusion – the point at which their face and/or body either falls out of frame or is hidden by some object, such as another person who has walked in front of them – or, perhaps, a table that is obscuring their legs:

In the world of neural rendering, there's no 'tree falling in the woods', no legs under the table, and absolutely nothing 'off-stage'.

Typically, older VFX approaches, such as CGI, do not have this limitation; even if only the upper half of the body is to be rendered, the body is almost certainly going to be entirely represented within the scene (though, as in the above example, it may be partially hidden), because it is usually easier to model at least a basic complete human body – and in many cases the modeled body will have originated from a default template provided in various software packages and CGI workflows.

This non-solipsistic ‘presence’ of the depicted character has a number of advantages. For one thing, it has long been relatively trivial to attach automated systems of physical movement such as inverse kinematics to a CGI figure. This means that if a hidden lower part of a CGI figure moves, the non-hidden upper sections, the parts that the viewer can see, will be realistically affected, and will move plausibly.

This provides an integrity of movement and a synergistic model that is not currently easy to replicate in an equivalent neural representation.
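To make the idea concrete, the sketch below shows the kind of analytic inverse-kinematics solve that CGI rigs perform routinely: given a target position for an end effector (say, an ankle), the joint angles of a limb – visible or hidden – are recovered deterministically. This is a minimal planar two-bone example in Python, not drawn from any particular package:

```python
import numpy as np

def two_bone_ik(target, l1, l2):
    """Solve a planar two-bone IK chain (e.g. hip -> knee -> ankle).

    Returns the root and knee angles (radians) that place the end
    effector at `target`, given the two bone lengths l1 and l2.
    """
    x, y = target
    d = np.hypot(x, y)
    # Clamp the target to the reachable range so the chain stretches or folds gracefully.
    d = np.clip(d, abs(l1 - l2) + 1e-6, l1 + l2 - 1e-6)
    # Law of cosines gives the bend at the 'knee' joint.
    cos_knee = (l1**2 + l2**2 - d**2) / (2 * l1 * l2)
    knee = np.pi - np.arccos(np.clip(cos_knee, -1.0, 1.0))
    # Root angle: direction to the target, minus the offset caused by the bend.
    cos_offset = (l1**2 + d**2 - l2**2) / (2 * l1 * d)
    root = np.arctan2(y, x) - np.arccos(np.clip(cos_offset, -1.0, 1.0))
    return root, knee

# Example: reach a point with a thigh of length 0.45 and shin of length 0.42.
print(two_bone_ik((0.3, -0.6), 0.45, 0.42))
```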

Rigging a human mesh in the open source package Blender – a routine operation, and one which can incorporate natural human movement physics such as inverse kinematics. Source: https://www.youtube.com/watch?v=GmvWOcqHEjw

Apart from anything else, CGI offers creative choices not present in neural systems where only the visible pixels have any existence or putative substance. For instance, if it should be required that part of the character’s leg be shown, that leg is already available.

There are thousands of CGI-based VFX shots in movie and TV output each year where the viewer might only see a fleeting eye, arm, hand or mouth of a monster or some other fantastical creature (or de-aged person), but where the entirety of the geometry was available, as necessary, because CGI approaches model an entire parametric and textured world, which can be selectively rendered, as needed.

The emerging demands of AI-VFX also require this kind of flexibility, despite the lack of any standard mechanism to provide it. For instance, to create a consistent deepfake of a face for 4-5 seconds, the source material must be unobstructed, so that the deepfaking system can consistently transform the original face. Should a temporary obstruction occur, such as a hand moving in front of a face, the deepfaking system will attempt to account for the hand, ruining the results:

Typical 2017-era deepfake frameworks such as DeepFaceLab and DeepFaceLive expect unobstructed faces. Source: https://blog.metaphysic.ai/to-uncover-a-deepfake-video-call-ask-the-caller-to-turn-sideways/

The same applies to the emerging field of full-body deepfakes, where neural systems cannot be guaranteed to be supplied with perfect and unobstructed views on which to perform their extraordinary transformative processes. Again, some intrinsic understanding of what’s hidden will often be necessary for an effective shot.

Therefore, despite AI’s massively increased ability to produce convincing human representations, neural workflows often need to have an understanding of hidden geometry – a native function of CGI.

Uncovering a Body

One new offering, from China and the UK, is proposing a superior method of reconstructing occluded regions of the human body in source footage, by separately evaluating the perceived body parts and the parts of the scene with which they might be interacting.

On the left we see all available input for the new system: top, the actual real-world source footage; middle, the segmented evaluation; bottom, the evaluated point cloud, or shadow volume. In the middle section we see the new approach compared to prior methods. Source: https://arxiv.org/pdf/2310.01228.pdf

In tests, the new technique proved more effective than prior works, and offers a number of innovations, such as the creation of a ‘free zone’ – a constrained area of the frame which covers all regions that might be occupied by a body and by anything with which it is interacting. This approach is aided by the generation of volumetric point cloud data as well as semantic segmentation, and frees the system from the need to evaluate and locate the body within the context of the entire frame.

Here we see the reduced 'sphere of influence' of the Free Zone, allowing the system to concentrate and refine its evaluative resources on a fruitful predicted area.

Additionally, the new work offers a much-needed novel evaluation metric for this specific task – one that goes further than previous metrics by taking into account not just the perceived surface of the human body, but also the volumetric mass provided by the point cloud, so that a more substantial understanding can be obtained regarding potential touching and collision with real-world objects.

Gaming veterans will be familiar with the latter problem, known as ‘clipping’, having seen many examples where the geometry of a displayed character illogically seems to pass through a supposedly solid object, such as a wall or a door. Since video games are essentially CGI, the solutions have improved notably in the past couple of decades, with the development of better collision detection systems; but this remains a novel and thorny challenge for neural approaches, which have no such native concept of discrete entities in a scene.

The new paper is titled Reconstructing 3D Human Pose from RGB-D Data with Occlusions, and comes from two researchers, respectively, from Xi’an Jiaotong University, and University College London.

Approach

The central idea of the new work, as mentioned above, is to reduce the ‘solution space’ (i.e., the area of the frame relevant to rendering a neural human) so that the system does not need to take into account so many variables, and will therefore be less likely to be confounded by irrelevant information.

The ideal solution to such problems would be entirely AI-based, where evaluation and transformations were effected entirely at a machine learning level. However, the advantages of CGI are so overwhelming that a notable tranche of solutions in this research strand are making greater and greater use of parametric CGI models, whose poses and geometry are converted into neural terms, and which therefore provide a level of control and instrumentality that current neural systems lack.

In the case of the new system, and several that have preceded it, this comes in the form of Skinned Multi-Person Linear Model (SMPL) geometries, an innovation that stems from research from the late 1990s, and that has gained ground in the last 5-6 years, as the excitement over the realism of new neural systems has given way to frustration over how difficult they are to control.

For the new system, the SMPL-X model is used:

The SMPL-X model derives key points from a source image (here simply a grab from Getty images), and assembles a core skeleton which is 'fitted' to a default parametric (i.e., vector-based) CGI model. Since all the parameters of the model are explicit, they can provide an extraordinarily effective interface for the sparser parameters available in the neural space. Source: https://smpl-x.is.tue.mpg.de/

The optimization framework for the new work uses the SMPLify-X method (the methodology of the SMPL-X project linked above) to initially obtain a scene mesh, before the addition of the novel Free Zone and the truncated shadow volume (which provides, as mentioned, a limited volumetric point cloud for the neural representation of the person).
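For readers unfamiliar with how such parametric proxies are handled in practice, the following is a minimal sketch using the publicly available smplx Python package; the model path and the zeroed parameter values are placeholders, and this is not the authors' fitting code:

```python
import torch
import smplx  # pip install smplx; model files are downloaded from https://smpl-x.is.tue.mpg.de/

# Load a neutral SMPL-X body model (the folder path here is an assumption).
model = smplx.create(
    model_path='models',        # folder containing the SMPL-X model files
    model_type='smplx',
    gender='neutral',
    use_pca=False,              # full hand articulation rather than PCA hand components
)

# Every pose and shape parameter is an explicit tensor, which is what makes
# the model such a convenient proxy for neural estimates.
betas = torch.zeros(1, 10)             # body shape coefficients
body_pose = torch.zeros(1, 21 * 3)     # axis-angle rotations for the 21 body joints
global_orient = torch.zeros(1, 3)      # root orientation

output = model(betas=betas, body_pose=body_pose,
               global_orient=global_orient, return_verts=True)
print(output.vertices.shape)   # (1, 10475, 3): vertices of the posed mesh
print(output.joints.shape)     # 3D joint locations derived from the same parameters
```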

By itself, this approach could potentially leave the rendered body ‘floating’ and disembodied from its environment. Therefore the system additionally makes use of the PROX-D framework, which is occupied with evaluating contact surfaces:

The PROX project estimates practical boundaries for a human depiction. Source: https://prox.is.tue.mpg.de/

However, PROX-D itself uses the Signed Distance Field (SDF) of an entire scene to achieve its distinctions, which the authors of the current paper consider inefficient:

‘The effectiveness of this intuitive approach heavily relies on the accuracy and completeness of the scene. In reality, limitations in the scanning devices and the complexity of the scene can cause errors. Besides, this approach becomes even more ineffective in scenarios where the body part penetrates into the scene deeply or penetrates through thin objects.’

This limitation is addressed by the system’s Free Zone Network (FZNet), an encoder-decoder architecture – essentially a Multi-Layer Perceptron (MLP) fed with a point cloud feature produced by an Occupancy Networks (OccNet) encoder, which provides a volumetric evaluation.

(It should be noted that nearly all the technologies used in this project either originated at or were co-authored by the Max Planck Institute for Intelligent Systems [MPI], which is the principal or major contributor to 3DMMs and SMPL, as well as various other CGI-based successors aimed at providing parametric interfaces for neural systems.)

The resulting point cloud feature is fed back to the point cloud encoder for refinement, and all variables obtained thus far are then passed to the MLP decoder to predict the Free Zone.
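The paper's own architecture details go beyond what can be reproduced here, but a speculative PyTorch sketch of an encoder-decoder of this general shape – a point-cloud encoder pooled into a global feature, conditioning an MLP decoder that scores arbitrary query points – might look as follows; the layer sizes and names are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class FreeZoneNetSketch(nn.Module):
    """Speculative sketch of an FZNet-style encoder-decoder (not the authors' implementation).

    A PointNet/OccNet-style encoder summarises the scene point cloud into a
    global feature; an MLP decoder then scores arbitrary query points against
    the 'free zone' that the body could plausibly occupy.
    """
    def __init__(self, feat_dim=256):
        super().__init__()
        # Per-point MLP followed by max-pooling: a minimal point-cloud encoder.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        # Decoder conditions each query point on the pooled scene feature.
        self.decoder = nn.Sequential(
            nn.Linear(3 + feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),   # scalar prediction per query point (trained with an L1 loss, per the article)
        )

    def forward(self, scene_points, query_points):
        # scene_points: (B, N, 3); query_points: (B, Q, 3)
        feat = self.point_mlp(scene_points).max(dim=1).values          # (B, feat_dim)
        feat = feat.unsqueeze(1).expand(-1, query_points.shape[1], -1)  # broadcast to each query
        scores = self.decoder(torch.cat([query_points, feat], dim=-1))
        return scores.squeeze(-1)                                       # (B, Q) free-zone scores
```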

Initial delineation of the body area is provided by Google’s DeepLabV3, which performs segmentation on the perceived subject. The segmentation mask will also be used to constrain which points from the noisy point cloud are finally chosen as representing the human in the source image. Ultimately, 1024 points are arrived at, using Farthest Point Sampling (FPS), and the Free Zone network is trained on the L1 distance between the constrained prediction and the ground truth.
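Farthest Point Sampling itself is a simple greedy procedure; the sketch below shows a straightforward PyTorch implementation, with a hypothetical usage line for reducing a mask-filtered body point cloud to the 1024 points mentioned above:

```python
import torch

def farthest_point_sampling(points, k=1024):
    """Greedy Farthest Point Sampling: pick k points that cover the cloud evenly.

    points: (N, 3) tensor of candidate points; returns indices of the k chosen points.
    """
    n = points.shape[0]
    chosen = torch.zeros(k, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    # Start from an arbitrary point; each new pick is the point farthest
    # from everything already selected.
    farthest = torch.randint(0, n, (1,)).item()
    for i in range(k):
        chosen[i] = farthest
        d = ((points - points[farthest]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)
        farthest = torch.argmax(dist).item()
    return chosen

# Hypothetical usage: reduce the mask-filtered body points to 1024 samples.
# body_points = depth_points[segmentation_mask]            # (N, 3)
# sampled = body_points[farthest_point_sampling(body_points, 1024)]
```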

Overview of development for the Free Zone, showing the neural human representation (below left) and the wider segment of the scene, the 'aura' of the Free Zone.

This is still not enough for the requisite accuracy, because the filtered neural person is still present inside a point cloud that contains other material. Therefore a truncated shadow volume is calculated, by evaluating the paths of rays cast from the camera’s viewpoint – the occluded region that the visible body surface ‘shadows’ from the camera:

Calculating the Truncated Shadow Volume by evaluating rays from the virtual camera direction (i.e., the viewpoint of the source footage/frame).

This is not a dissimilar process to Neural Radiance Fields (NeRF), a method which builds up volumetric representations by calculating ray directions through a scene from a variety of images, before consolidating the results into an explorable environment or object.
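As a rough illustration of the idea (an assumption about the mechanics, not the paper's implementation), a truncated shadow volume can be approximated by marching a short, truncated distance along each camera ray behind the observed body surface, since that occluded region must contain body mass; the truncation distance and step count below are arbitrary placeholders:

```python
import numpy as np

def truncated_shadow_volume(body_points, cam_center, max_depth=0.3, steps=8):
    """Sketch of a truncated shadow volume: sample points behind the visible surface.

    For every observed body surface point, march along the camera ray *behind*
    the point, up to `max_depth`; the resulting samples act as a volumetric
    lower bound on where the body must be.
    """
    dirs = body_points - cam_center                         # rays from the camera to each surface point
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    offsets = np.linspace(0.0, max_depth, steps)            # truncation: stop after max_depth metres
    # Stack the marched samples for all points and offsets: (N * steps, 3)
    volume = body_points[:, None, :] + offsets[None, :, None] * dirs[:, None, :]
    return volume.reshape(-1, 3)

# Example with random stand-in data.
pts = np.random.rand(100, 3)
print(truncated_shadow_volume(pts, cam_center=np.zeros(3)).shape)   # (800, 3)
```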

Finally, the obtained points must be matched to an SMPL-X model on a per-point basis, so that the model’s own coordinates and other variables can act as proxies for the neural representation. For this, FPS is used to obtain the frontal vertices of the person represented in the frame, but this time by using a virtual camera placed in front of the body. Since the resulting vertices are uniformly distributed, it is then possible to match them to equivalent placements in the SMPL-X model on a point-by-point basis until equivalency is achieved:

In this representation of the vertices-matching process, the red points represent the rear of the body, and the purple the front.

The authors comment:

‘Free zone can be seen as a superset of the body and truncated shadow volume can be seen as a subset of the body. We match the human body with free zone points and truncated shadow volume points separately.’
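A naive version of such per-point matching can be expressed as a nearest-neighbour fitting term, as in the hedged sketch below; the authors' actual matching scheme may differ in detail:

```python
import torch

def mean_matching_distance(smplx_front_verts, target_points):
    """Illustrative per-point matching term (not the authors' exact scheme).

    For every uniformly-sampled frontal SMPL-X vertex, find its nearest
    neighbour in the target point set (free-zone points or truncated-shadow-volume
    points) and return the mean matched distance, which a fitting routine
    could then minimise.
    """
    d = torch.cdist(smplx_front_verts, target_points)   # pairwise distances: (V, P)
    nearest = d.min(dim=1).values                        # closest target point per vertex
    return nearest.mean()

# Example with random stand-in data.
print(mean_matching_distance(torch.rand(500, 3), torch.rand(1024, 3)))
```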

Data and Tests

To test the system, the researchers used the two available PROX datasets – a quantitative set featuring 180 static RGB-D frames, with ground truth, where a single subject wearing motion capture-style markers interacts with living-room furniture; and a qualitative set of 100,000 RGB-D frames with pseudo ground truth, and featuring 20 subjects in a variety of scenarios over 12 scenes. The pseudo ground truth in the second set is provided by SMPL-X parameters that are fitted through PROX-D.

For test material, the researchers used the POSA dataset from the 2021 MPI paper Populating 3D Scenes by Learning Human-Scene Interaction.

The data was split 4:1, with evaluations carried out on the aforementioned testing set and on the PROX quantitative set. Data with high penetrations (i.e., where the human component was very heavily enmeshed in contact with other items) was excluded.

To train the FZNet, 20,000 query points were sampled, of which 95% were near the surface, and the rest within the volumetric estimation of the body. Perturbations were added as signal markers to determine that particular point’s proximity to the estimated human body. Data augmentation was used to increase diversity and generalization.
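In code, that sampling strategy might look something like the following sketch, where the noise scale is an assumption rather than a value taken from the paper:

```python
import numpy as np

def sample_query_points(surface_points, interior_points, n=20000,
                        near_ratio=0.95, sigma=0.05):
    """Hedged sketch of the training-query sampling described above.

    95% of the queries are surface points jittered with Gaussian noise,
    so the network sees points just inside and just outside the body;
    the remaining 5% are drawn from the body's interior volume.
    The sigma value here is an assumption, not taken from the paper.
    """
    n_near = int(n * near_ratio)
    idx = np.random.randint(0, len(surface_points), n_near)
    near = surface_points[idx] + np.random.normal(0.0, sigma, (n_near, 3))
    idx_in = np.random.randint(0, len(interior_points), n - n_near)
    inside = interior_points[idx_in]
    return np.concatenate([near, inside], axis=0)

# Example with random stand-in point sets.
print(sample_query_points(np.random.rand(5000, 3), np.random.rand(2000, 3)).shape)
```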

The FZNet was trained with the Adam optimizer at a learning rate of 1e-4. A learning rate schedule was also used, halving the learning rate (a decay rate of 0.5) after the initial 100 epochs (an epoch being a complete pass through the available data by the training routine).
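In PyTorch terms, this training recipe corresponds to something like the following; the tiny model and random data are hypothetical stand-ins, and only the learning rate, decay rate and epoch counts come from the paper:

```python
import torch

# Hypothetical stand-ins, purely to make the recipe runnable.
model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
points = torch.randn(1024, 3)          # stand-in query points
targets = torch.rand(1024, 1)          # stand-in ground-truth values

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Decay rate of 0.5 after the initial 100 epochs: halve the learning rate.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

for epoch in range(200):               # 200 epochs in total, as stated below
    pred = model(points)
    loss = torch.nn.functional.l1_loss(pred, targets)   # L1 objective, as described earlier
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```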

The models were trained on an NVIDIA RTX 3090 Ti GPU with 24GB of VRAM, for a total of 200 epochs.

It was necessary to engage with CGI’s two standard coordinate systems in order to rationalize the data. In this case, the camera coordinate system (coordinates relative to the viewpoint) was transformed into the world coordinate system (absolute and unvarying coordinates that will return uniform position data from any standpoint).
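The transformation itself is a standard rigid-body operation: each camera-space point is lifted to homogeneous coordinates and multiplied by the camera-to-world extrinsic matrix. A minimal sketch (the source of the extrinsic matrix is an assumption):

```python
import numpy as np

def camera_to_world(points_cam, extrinsic):
    """Transform points from camera coordinates into world coordinates.

    `extrinsic` is the 4x4 camera-to-world matrix (rotation plus translation).
    """
    homog = np.concatenate([points_cam, np.ones((len(points_cam), 1))], axis=1)  # (N, 4)
    return (extrinsic @ homog.T).T[:, :3]

# Example: identity rotation with a 1-metre translation along z.
T = np.eye(4)
T[2, 3] = 1.0
print(camera_to_world(np.zeros((3, 3)), T))
```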

For an initial qualitative (non-competitive) round (with the large results split for convenience into two images below – refer to source paper for better resolution), the authors note that the Free Zone Network can still produce acceptable results even when the subject is in an unusual position (bottom row of second image below):

Results of an initial non-competitive qualitative test round.

The paper further states:

‘When the human has a lot of contact with the scene (row 3), such as when the human is lying on a sofa, our method can produce plausible results with little penetration and necessary contact. In cases where the scenes are more complex (row 4), such as when the human is standing between a sofa and a pot of plants, our free zone network can still identify the correct region.

‘Even when the human is not captured by the depth camera (row 5), the free zone network can still estimate the free zone correctly using only the scene information.’

For a quantitative and competitive round, the new approach was pitted against SMPLify and PROX-D on a variety of evaluation metrics: 3D reconstruction used Joint Position Error (JPE) and Vertex-to-Vertex Error (V2V), which, respectively, judge the mean error between joint positions, and the mean error between related vertices; to test alignment accuracy with the depth data, a Partial Matching (PM) metric was used to measure mean distance between the body points and their related vertices on the reconstructed mesh; a Non-Collision (NC) metric from a prior 2019 work was also used.
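The first two of these metrics are simple averages of Euclidean distances, as the sketch below illustrates; it assumes that the predicted and ground-truth meshes share the same (SMPL-X) topology so that vertices correspond one-to-one:

```python
import numpy as np

def joint_position_error(pred_joints, gt_joints):
    """JPE: mean Euclidean distance between predicted and ground-truth joints."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

def vertex_to_vertex_error(pred_verts, gt_verts):
    """V2V: mean distance between corresponding mesh vertices,
    assuming both meshes share the same topology."""
    return np.linalg.norm(pred_verts - gt_verts, axis=-1).mean()

# Example with random stand-in data.
print(joint_position_error(np.random.rand(22, 3), np.random.rand(22, 3)))
print(vertex_to_vertex_error(np.random.rand(10475, 3), np.random.rand(10475, 3)))
```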

Examples of collision evaluation from the 2019 MPI-co-authored paper 'Generating 3D People in Scenes without People', which contributes a metric to the current paper's tests. Source: https://arxiv.org/pdf/1912.02923.pdf

Additionally, the authors devised an augmented version of the latter metric, titled Volume Non-Collision (VNC), which evaluates the SDF score of points inside the neural body. The authors explain:

‘The VNC score is lower when the penetration is more severe. By considering the points inside the body, this metric provides a more comprehensive evaluation of penetration. In Figure 7 [see image below], the VNC scores of these two examples are 0.99 and 0.79 respectively, suggesting a more accurate evaluation of penetration compared with the NC.’
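One plausible reading of the VNC idea, sketched below under the assumption that a scene SDF function is available (returning negative values inside scene geometry), is the fraction of points sampled throughout the body volume that do not penetrate the scene – so deeper penetration lowers the score:

```python
import numpy as np

def volume_non_collision(body_volume_points, scene_sdf):
    """Hedged sketch of a VNC-style score (my reading of the metric, not the authors' code).

    `body_volume_points` are points sampled throughout the body's interior,
    not just on its surface; `scene_sdf` maps each point to a signed distance,
    negative where the point sits inside scene geometry.
    """
    sdf_vals = scene_sdf(body_volume_points)
    return float((sdf_vals >= 0).mean())   # 1.0 = no penetration; lower = more severe penetration

# Example with a stand-in SDF: everything below z=0 counts as inside the scene.
pts = np.random.uniform(-1, 1, (1000, 3))
print(volume_non_collision(pts, lambda p: p[:, 2]))
```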

Examples where simple NC (rather than the authors' modified VNC metric) cannot correctly identify penetration areas.

In the results for the initial PROX quantitative test, the authors’ method achieved the lowest error rate among all metrics, although the researchers note that this test covers only one scene in the quantitative set, with simple interactions:

Results for the PROX quantitative test.

In the qualitative PROX test, the new approach likewise swept the board:

Results for the PROX qualitative round.

The authors also compared their approach to the prior methods in a broad qualitative test (again, split into two images for convenience):

Comparisons with the previous frameworks.

Regarding these results, the authors comment:

‘In the first 4 rows, where the human is partially occluded by the scene or by themselves, both SMPLify-D and PROX-D produce results where some parts penetrate the scene. However, our method can infer the correct pose of the invisible body parts and avoid penetrations.

‘When some body parts penetrate deeply into objects, such as a leg penetrating into a sofa or a hand penetrating into a wall (row 1, 2), it is hard for current methods to pull the body out of the object completely. However, our method uses the free zone to guide the body away from the object, effectively reducing the penetration. Our method can also handle cases where some body parts penetrate through thin objects like a table (row 3, 4), preventing such penetrations from occurring.

‘In the last 2 rows, where different body parts overlap with each other, our results exhibit better alignment with the scanned body point cloud compared with other methods thanks to the constraint of the truncated shadow volume.’

In a further test (see paper for details), the researchers found that their method was better able to perform reconstruction when larger amounts of the target body were obscured.

Conclusion

Neural rendering is naturally myopic, compared to CGI, and it is quite possible that it will always need to work in conjunction with systems, such as CGI, that offer consistent and rational bounds.

However, this is not a fight that the image synthesis research sector intends to concede lightly, since this would position neural output more in the context of a texture-generator than a full-fledged generative system.

CGI is not the only system that contains ‘off-stage’ information that would be useful in the cases illustrated here; latent diffusion architectures such as Stable Diffusion also have immense prior knowledge about human anatomy – and, unlike CGI, this knowledge is highly abstracted and massively generalized.

Unfortunately, as we have noted many times, LDMs have no internal mechanisms that easily facilitate smooth temporal movement, and so cannot stand in for CGI in cases such as the authors address in the new work, which are aimed at evaluating video content rather than just static images.

Therefore the current state-of-the-art seems set to leave parametric and neural methods in an uneasy and awkward collaborative association, pending further developments that address the issues outlined in the new paper.
