A recent paper from Berkeley offers a notable advance on the state of the art in human mesh recovery – the task of identifying humans in video or images and reconstructing a 3D model of their pose and body shape.
The new system, as the authors acknowledge, is a novel iteration of an influential prior work, in which Transformer-based attention is used to shortcut many of the elaborate procedures that have characterized other attempts to improve and optimize the older system. With this Transformer-centered approach, the new system, titled HMR 2.0, exceeds the results obtained by its nearest rivals across a range of tasks for which human pose estimation is typically used.
HMR 2.0 (above right) achieves greater accuracy in pose interpretation vs. prior networks. Source: https://shubham-goel.github.io/4dhumans/
One of the notable innovations of the new approach is the ease with which it can evaluate and track multiple subjects from source content:
The system uses Facebook Research’s Detectron 2 to identify humans in source material, as well as a slew of adjunct and complementary technologies and prior approaches to provide an improved human pose inference framework.
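Conceptually, the detection stage reduces to keeping only confident 'person' detections before handing them on to pose estimation. The sketch below illustrates that filtering step with hypothetical detection records; Detectron 2's actual API returns richer structures, so the record format, class id and threshold here are illustrative assumptions only:

```python
# Conceptual sketch: filtering a detector's raw output down to confident
# person detections, as a pipeline like HMR 2.0's does before pose
# estimation. The record format is hypothetical, not Detectron 2's API.

PERSON_CLASS_ID = 0       # "person" in the COCO label set (assumed)
SCORE_THRESHOLD = 0.5     # assumed confidence cut-off

def filter_person_detections(detections):
    """Keep only confident 'person' boxes from raw detector output."""
    return [
        d for d in detections
        if d["class_id"] == PERSON_CLASS_ID and d["score"] >= SCORE_THRESHOLD
    ]

raw = [
    {"class_id": 0, "score": 0.97, "box": (12, 30, 180, 400)},    # confident person
    {"class_id": 0, "score": 0.31, "box": (300, 50, 340, 120)},   # low-confidence person
    {"class_id": 56, "score": 0.88, "box": (200, 220, 260, 300)}, # non-person, ignored
]

people = filter_person_detections(raw)
print(len(people))  # 1
```

Each surviving box is then cropped and passed to the mesh-recovery network, one crop per detected person.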
Source image and SMPL reconstruction derived by HMR 2.0. Since the SMPL model exists in an explicit and known 3D space, it can be accurately interpreted from any angle. See the source website for better detail and resolution, and many other examples – https://shubham-goel.github.io/4dhumans/
In line with similar research from the likes of ETH Zurich, the new system uses the Skinned Multi-Person Linear Model (SMPL) system to provide a CGI-based 3D representation of the extracted pose, and a conventional digital space that can be influenced by the variables emerging from the neural-based analytical processes of the system.
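The core idea of SMPL is that a body mesh is a fixed template deformed by a small set of learned shape and pose parameters, which is what makes it a convenient target for a neural regressor. The toy sketch below keeps only the shape blend-shape step, with made-up dimensions; the real SMPL model has 6,890 vertices, ten-plus shape coefficients, pose-dependent blend shapes and linear blend skinning on top:

```python
import numpy as np

# Toy illustration of SMPL's parametric idea: a mesh is a template
# deformed by low-dimensional shape (beta) parameters. Real SMPL adds
# pose blend shapes and linear blend skinning; sizes here are toy ones.

N_VERTS, N_BETAS = 5, 2                        # toy sizes, not SMPL's real ones
template = np.zeros((N_VERTS, 3))              # mean body shape (T-bar)
shape_dirs = np.random.default_rng(0).normal(size=(N_VERTS, 3, N_BETAS))

def shaped_vertices(betas):
    """T(beta) = template + shape blend-shapes weighted by beta."""
    return template + shape_dirs @ betas

betas = np.array([0.5, -1.0])                  # hypothetical shape coefficients
verts = shaped_vertices(betas)
print(verts.shape)  # (5, 3)
```

Because the output is an explicit mesh in a known coordinate space, downstream CGI tooling can render or re-pose it from any viewpoint.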
HMR 2.0 in action. See the source website for better detail and resolution, and many other examples – https://shubham-goel.github.io/4dhumans/
The authors of the new work comment:
‘We obtain unprecedented accuracy in our single-image 3D [reconstructions] even for unusual poses where previous approaches struggle. In video, we link these reconstructions over time by 3D tracking, in the process bridging gaps due to occlusion or detection failures.’
Demand for Full-Body Pose Tracking
Demand for this kind of pose extraction is notable in VFX pipelines, but also across a range of sectors, including medical analysis, security, fashion and retail applications, among many others. Additionally, the possible coming of interactive augmented reality scenarios will require low-latency interpretation of live human streaming images, in order for their reconstructed and/or augmented versions to appear in the user’s viewpoint. Further, immersive VR video-games are likely also to need this kind of capability.
While live deepfaking of faces is a current technology, live deepfaking of bodies is a relatively new prospect. Just as venerable facial alignment systems such as FAN Align have been powering facial pose recognition for some years, the new generation of body-capable deepfake systems will also need responsive and accurate pose estimation systems that are capable of distinguishing multiple individuals and tracking them over time. The new system proposed by the Berkeley researchers offers this functionality, and, the authors state, is able to re-acquire particular individuals after they have been occluded or left the field of view momentarily.
A perhaps equally pressing imperative is the improved use of SMPL, 3DMM and other such ‘legacy’ CGI technologies as ‘mediators’ between the mysterious latent space of generative and neural systems, and the user-space of the interfaces that will have to leverage them.
Further, the ability to individuate people in a larger group has potential in terms of crowd-counting, as well as in matting and general segmentation research, where it’s necessary to obtain the absolute boundaries of individuals within a frame or image, in order to extract them from their background.
The idea to ‘Transformerize’ the original work was taken, the authors state, from ViTPose, which similarly used Transformers to improve upon existing methods for extracting 2D poses. The 2D pose pipeline is simpler than its 3D equivalent; for instance, FAN Align, used in deepfake frameworks such as DeepFaceLab and FaceSwap, is a 2D facial landmark extraction algorithm, while more sophisticated recent algorithms such as Microsoft’s 2022 offering 3D Face Reconstruction with Dense Landmarks incorporate X/Y/Z understanding of 3D depth into landmark assignation, enabling more effective and applicable landmarks.
Transformers operate similarly to Recurrent Neural Networks (RNNs), except that they are capable of addressing the entire input at once, instead of concentrating on the data segment directly in front of them – effectively the difference between a bird’s eye view of the factory floor and the POV of a conveyor belt operator. This capacity to form an immediate overview allows Transformers to be run in parallel, speeding up the processing of data.
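The 'bird's eye view' described above is scaled dot-product attention: every token is compared against every other token in a single matrix operation, rather than sequentially. A minimal numpy sketch (toy sizes, single head, no learned projections):

```python
import numpy as np

# Minimal scaled dot-product attention: the mechanism that lets a
# Transformer weigh every input token against every other in one
# parallel matrix operation, instead of stepping through the sequence
# one element at a time as an RNN does.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, computed for all tokens at once."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))              # 4 tokens, 8-dim features (toy)
out = attention(tokens, tokens, tokens)       # self-attention
print(out.shape)  # (4, 8)
```

Because the `Q @ K.T` product covers all token pairs simultaneously, the whole computation maps onto parallel hardware, which is the speed advantage the paragraph above describes.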
HMR 2.0 uses a standard Transformer decoder with multi-head attention to process tokens extracted from the input images. The workflow is inspired by the 2019 University of Pennsylvania/Max Planck system Learning to Reconstruct 3D Human Pose and Shape via Model-fitting in the Loop, with the regression of 3D rotations handled by a prior collaboration between USC and Adobe, among others.
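The rotation-regression approach alluded to is, I assume, the continuous 6D rotation representation: rather than predicting angles directly, the network outputs six numbers per joint, and Gram-Schmidt orthonormalization recovers a valid rotation matrix. A sketch of that recovery step:

```python
import numpy as np

# Sketch of the continuous 6D rotation representation commonly used for
# regressing 3D joint rotations (assumed to be the scheme the article
# alludes to): six raw network outputs are mapped, via Gram-Schmidt,
# to a guaranteed-valid 3x3 rotation matrix.

def rotation_from_6d(six):
    """Map a 6-vector (two raw 3D columns) to a 3x3 rotation matrix."""
    a, b = six[:3], six[3:]
    c1 = a / np.linalg.norm(a)        # first column: normalize
    b = b - (c1 @ b) * c1             # remove component along c1
    c2 = b / np.linalg.norm(b)        # second column: orthonormal to c1
    c3 = np.cross(c1, c2)             # third column: right-handed frame
    return np.stack([c1, c2, c3], axis=1)

R = rotation_from_6d(np.array([1.0, 0.2, 0.0, 0.0, 1.0, 0.3]))
# R is orthonormal with determinant +1, i.e. a proper rotation
```

The appeal of this construction is that any six real numbers yield a legal rotation, so the regression target has no discontinuities for the network to fight.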
The system’s predictor is trained with a combination of losses, including 2D, 3D, and a discriminator loss. A discriminator is developed for three factors of the body model: the shape parameters of the SMPL model; the estimated body-pose parameters; and the per-section relative rotations (i.e., rotations relative to a canonical or ‘default’ base posture).
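The combination of losses described above can be sketched as a weighted sum of a 2D reprojection term, a 3D keypoint term, and an adversarial term from the discriminator; the weights and the least-squares adversarial form below are illustrative assumptions, not the paper's actual values:

```python
import numpy as np

# Conceptual sketch of the combined training objective: 2D keypoint
# reprojection error, 3D joint error (where 3D labels exist), and an
# adversarial term pushing the discriminator's score toward "real".
# Weights and the least-squares adversarial form are illustrative only.

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def total_loss(pred_2d, gt_2d, pred_3d, gt_3d, disc_score,
               w2d=1.0, w3d=1.0, wadv=0.1):
    loss_2d = mse(pred_2d, gt_2d)          # keypoint reprojection error
    loss_3d = mse(pred_3d, gt_3d)          # 3D joint error
    loss_adv = (disc_score - 1.0) ** 2     # plausibility of body parameters
    return w2d * loss_2d + w3d * loss_3d + wadv * loss_adv

loss = total_loss(np.ones((17, 2)), np.zeros((17, 2)),
                  np.ones((17, 3)), np.ones((17, 3)),
                  disc_score=0.8)
print(round(loss, 3))  # 1.004
```

In practice the discriminator sees the predicted SMPL parameters factored as described above, which lets it penalize implausible shapes and joint rotations separately.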
HMR 2.0 is also boosted by pseudo-ground truth fitting, where apposite third-party data is incorporated into the training process as a supplementary aid to accuracy. The datasets used here are InstaVariety, AVA, and AI Challenger.
To track people in source frames, the authors leverage PHALP, which has been modified and optimized for the project, for instance to operate on the CGI mesh of the SMPL ‘virtual human’ model that is at the heart of the system. Perhaps a little confusingly, this iteration of the prior work has been named PHALP′ (note the trailing prime mark).
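The re-acquisition behaviour mentioned earlier – recovering an identity after occlusion – comes down to tracking by association: each detection carries an embedding (appearance, pose and location in PHALP; toy 2D vectors here) and is matched to the nearest stored track, including recently-lost ones. The structure and threshold below are illustrative, not PHALP's actual code:

```python
import numpy as np

# Toy sketch of tracking-by-association in the spirit described above:
# a detection's embedding is matched to the closest stored track, which
# lets an identity be re-acquired after an occlusion gap. The threshold
# and data layout are illustrative assumptions, not PHALP's real code.

MATCH_THRESHOLD = 0.5  # assumed maximum embedding distance for a match

def associate(tracks, detection_embedding):
    """Return the id of the nearest stored track, or None if all too far."""
    if not tracks:
        return None
    best_id = min(tracks,
                  key=lambda tid: np.linalg.norm(tracks[tid] - detection_embedding))
    if np.linalg.norm(tracks[best_id] - detection_embedding) <= MATCH_THRESHOLD:
        return best_id
    return None

tracks = {1: np.array([0.0, 0.0]), 2: np.array([5.0, 5.0])}
# A detection close to track 1 re-acquires that identity...
print(associate(tracks, np.array([0.1, -0.1])))   # 1
# ...while a distant one falls outside the threshold and starts a new track.
print(associate(tracks, np.array([10.0, 0.0])))   # None
```

Keeping lost tracks' embeddings alive for a while is what allows a person who leaves the frame momentarily to come back under the same identity.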
For pose prediction, the researchers trained a BERT-like Transformer model on 1 million tracks from Berkeley’s own Lagrangian Action Recognition with Tracking (LART) project.
The authors note that HMR 2.0, unlike previous approaches such as PHD and HMMR, combines reconstruction and tracking into a unified workflow that’s capable of operating on ‘in the wild’ videos (i.e., on unseen data not included at training time) of multiple people – the latter aspect perhaps the most notable innovation of the project.
Data and Tests
To evaluate the system, the Berkeley researchers tested HMR 2.0 for pose accuracy, tracking and action recognition. Datasets used were Human3.6M, MPI-INF-3DHP, COCO and MPII, in addition to the aforementioned InstaVariety, AVA and AI Challenger datasets. Baseline benchmarks adopted were PyMAF, CLIFF, HMAR, PARE and PyMAF-X (body-only performance).
To test pose accuracy, the authors adopted the standards set by the SMPL oPtimization IN the loop (SPIN) project, enacted on the Human3.6M val split. Here the authors comment:
‘We observe that with our HMR 2.0a model, which trains only on the typical datasets, we can outperform all previous baselines across all metrics.’
A second variant of the HMR 2.0 model, HMR 2.0b, was trained for longer on more extensive data from the three aforementioned datasets, and the authors note that this iteration of the system offers superior performance and accuracy of estimation on a wider variety of challenging poses.
For tracking, the project relies on the modifications of PHALP′, which jettisons the model-centric schema of HMAR (see above) and applies the SMPL coordinate space across the majority of the mesh recovery systems featured in the workflow.
Here the authors note*:
‘Using the same bounding box detector as [HMAR and PHALP], 4DHumans outperforms existing approaches on all metrics, improving ID Switches by 16%. Using the improved ViT-Det detector can improve performance further. As a by-product of our temporal prediction [model], we can perform amodal completion and attribute a pose to missing detections.’
For action recognition, the testing methodology of Berkeley’s own LART was used, with vanilla PHALP employed as an off-the-shelf pose and location estimator for predicting action labels. The researchers trained a separate action classification Transformer for each ‘pose-only’ baseline, derived solely from 3D pose and location estimates.
The authors note that their system outperforms all rivals in this regard across various class categories, and that in one category it exceeds the nearest older system by 14%. They further comment*:
‘Since accurate action recognition from poses needs fine-grained pose estimation, this is strong evidence that HMR 2.0 predicts more accurate poses than existing approaches. In fact, when combined with appearance features, [the evaluation] shows that HMR 2.0 achieves the state of the art of 42.3 mAP on AVA action recognition, which is 7% better than the second-best of 39.5 mAP.’
Though the authors state that HMR 2.0 ‘pushes the boundaries’ of what is possible in the reconstruction space, they observe also that the inherent limitations of the SMPL model enforce certain arbitrary constraints upon any system that depends on it. SMPL is intended to capture full-body poses, while other models concentrate (as we have noted) on facial disposition, or on the more complex and intricate posing of hands.
Ideally, a later variant of SMPL could incorporate and orchestrate such separate systems, so that tracking and pose recognition would not be such an ‘either/or’ proposition. However, SMPL is only one example of ‘reproducibility entropy’, where a system that becomes increasingly outdated in the face of new innovation nonetheless becomes increasingly embedded into the research literature, since its long history allows like-for-like comparisons and continuity across diverse research efforts. For this thorny and somewhat political issue, the authors envisage no immediate solution.
They also suggest that the need to individually evaluate multiple figures in a pipeline and then concatenate them into final output could in itself represent a bottleneck, and that very close subject proximity to the camera can degrade the quality of results. Therefore HMR 2.0 operates best within a predictable set of variables and constraints for the sector.
Here the augmented attention of Transformers has been innovatively shoe-horned into a super-structure of prior frameworks – perhaps one of the most complex assemblies of ‘legacy’ approaches in this strand of computer vision literature. It’s encouraging to see multi-person tracking receiving growing interest in the research community, though discouraging to see possible further innovations hamstrung by the limits of older CGI-based parametric approaches such as SMPL.
Perhaps the value of pushing the envelope even further in this line of research will be to force the development of new and better ways to incorporate advanced mesh interfaces such as 3DMM and SMPL into systems that they were never originally designed to serve. It’s hard not to feel that this particular obstruction is due for a notable and epochal breakthrough soon.
* My conversion of inline citations to hyperlinks