It is notoriously difficult to understand how neural networks interpret requests, or which parts of their output relate to which parts of the request that the user originally made. By analogy, the internal workings are as hidden from the user as those of an ATM, or a ‘hole in the wall’ retail outlet.
This leads to the frustrating but frequent scenario in which we have only a limited understanding of how otherwise effective trained networks are actually accomplishing their task.
Without such information, it is difficult to develop new ways to improve neural network architecture – because it’s hard to tell what we did right in the first place.
This Druidic state of affairs is one of the central challenges in explainable AI (XAI). It is frequently addressed by new computer vision initiatives designed to provide human-comprehensible ways of following a requested transformation or generation through the otherwise opaque latent space, in order to see how the request was treated and interpreted.
One notable approach of recent years is the use of Class Activation Maps (CAMs), a method of tracing the influence of the input through the network, resulting in a practical heat-map that shows where the input information most strongly influenced the output:

This approach can naturally be applied to video, since video is simply a sequence of still images:
The best-known formulation of this technique originated in a 2016 paper from Virginia Tech, which introduced Gradient-weighted Class Activation Mapping (Grad-CAM): a systematic approach to creating a kind of neural ‘barium meal’ that enables researchers to study the influence of input data on the output.

It is not just the user input for generative systems that can be traced in such ways, but also the presence of tokens trained into the model. In this way, it’s possible to understand the deeper relationships that are activated when a particular word (or image, or any other form of input) is run through a trained system.
Since Grad-CAM came out, a number of projects have tried to improve upon it, or leverage it more specifically for certain domains or applications. Among its descendants are Grad-CAM++, Ablation-CAM, HiResCAM, Axiom-based Grad-CAM, pytorch-grad-cam, LayerCAM, Eigen-CAM, and Score-CAM.

Thus the current CAM scene is a patchwork of fiefdoms and a plenitude of forks, as different sectors and special interests take the technology down their own particular alleys, without necessarily developing the central original idea in a cohesive way that could be of wider benefit to the community.
For this reason, among others, a formidable association of universities and research institutes has just released a new approach that combines nearly all of these off-shoots into an ensemble method, aimed at obtaining the most accurate possible assessment of paths through a neural network.
Dubbed MetaCAM, the new initiative incorporates the aforementioned approaches, and others, as contributors to an orchestrated overall framework that uses a novel method of averaging to obtain scores while resisting outlier results, in what may be a promising advance on the state of the art in CAM technologies.

The researchers conclude:
‘Our experiments demonstrate that MetaCAM is able to outperform existing CAM methods, both with and without adaptive thresholding. We expect MetaCAM to be of particular use in high-criticality fields.’
The new paper is titled MetaCAM: Ensemble-Based Class Activation Map, and comes from nine researchers across a panoply of research institutes in Ontario, Canada. Contributing institutions include the University of Ottawa, the Children’s Hospital of Eastern Ontario, The Ottawa Hospital, and Prenatal Screening Ontario at the Better Outcomes Registry & Network in Ottawa, among others: an assembly that reflects the extent to which the new work is concerned with improving medically oriented CAM approaches.
Approach
Though it may seem easy to take a bunch of existing variations of a technology and average out their results, the nature of statistics, combined with the way that each fork may favor certain characteristics over others, makes a project of this kind quite challenging.
To this end, the researchers have devised new methods: the Cumulative Residual Effect (CRE), designed to summarize large-scale, multi-contributor output; and adaptive thresholding, which not only helps to provide meaningful, non-deceptive and non-destructive averages for MetaCAM, but can also be applied to the individual older CAM approaches that are ensembled in the new architecture.
The original Grad-CAM approach creates heat-maps by linearly weighting the feature maps produced by the last convolutional layer of a Convolutional Neural Network (CNN).
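For illustration only, here is a minimal PyTorch sketch of that core idea, assuming a standard torchvision-style CNN: the gradients of the target class score are pooled into per-channel weights and used to combine the last convolutional layer’s feature maps into a heat-map. The function and hook names are illustrative, not drawn from the original Grad-CAM code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, last_conv_layer, image, target_class):
    """Minimal Grad-CAM sketch: weight the last conv layer's feature maps
    by the spatially averaged gradients of the target class score."""
    features, grads = {}, {}

    # Capture the layer's forward activations and backward gradients
    def fwd_hook(_, __, output):
        features['maps'] = output
    def bwd_hook(_, __, grad_output):
        grads['maps'] = grad_output[0]

    h1 = last_conv_layer.register_forward_hook(fwd_hook)
    h2 = last_conv_layer.register_full_backward_hook(bwd_hook)

    score = model(image)[0, target_class]   # class score for a (1, 3, H, W) input
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    # Global-average-pool the gradients: one weight per feature map channel
    weights = grads['maps'].mean(dim=(2, 3), keepdim=True)     # (1, C, 1, 1)
    cam = F.relu((weights * features['maps']).sum(dim=1))      # (1, h, w)

    # Upsample to the input resolution and normalise to [0, 1] for display
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode='bilinear', align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```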

The problem with subsequent projects, as the authors of the new paper indicate, is that metrics for performance have evolved inconsistently. They state:
‘[Performance] may be evaluated in a given study by comparing visualizations between various CAMs. Quantitatively, various performance metrics have been proposed including perturbation analysis, object localization and segmentation, and human trust/class discrimination, making relative CAM ranking infeasible.
‘Furthermore, the performance of CAM methods varies across the parameters of individual experiments, such as the chosen images, their target classes, and the CNN model.’
Though the possibility of combining results from diverse CAM methods has been assessed before, in the CodCAM project, the authors consider that the selection of variants there was ill-considered relative to the potential of ensemble-based approaches.
MetaCAM evaluates the pixels that are in highest agreement when a group of CAM methods is applied to a problem. Rather than relying on the internal evaluation of each framework, it reviews the output critically, assessing the significance of the outcome in an objective manner while taking into account the known constraints of each framework.
Simply averaging out what each architecture considers its optimal result would take the usefulness of those results at face value. Moreover, in statistical terms, if one framework produces a result of notably lower quality than the rest of the contributing modules, the anomaly could skew the aggregate in a way that devalues better results from the more representative majority.
MetaCAM uses a total of 11 individual prior CAM frameworks. However, since it is computationally expensive to run 11 such projects simultaneously, the architectures are run in six groups of two or fewer, and the results evaluated afterwards. Since they are not competing directly at runtime, this bottleneck could conceivably be addressed with more powerful computing resources.
To avoid the unbalanced statistics described above, the project uses the Remove and Debias (ROAD) strategy, which employs a ‘Noisy Linear Imputation’ algorithm to arrive at a better-founded selection of pixels from the available results.
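As a rough illustration of the imputation idea only (and not the actual ROAD implementation, which solves a sparse linear system over all removed pixels jointly), the sketch below replaces each pixel marked for removal with a noisy average of its immediate neighbours; the function name and parameters are assumptions.

```python
import numpy as np

def noisy_linear_imputation_sketch(image, mask, noise_std=0.01):
    """Simplified stand-in for ROAD-style noisy linear imputation: each pixel
    flagged for removal is replaced by the mean of its 4-neighbourhood plus a
    small amount of Gaussian noise. Assumes a float image in [0, 1] of shape
    (H, W, C) and a boolean mask of shape (H, W) marking pixels to remove."""
    h, w, c = image.shape
    out = image.astype(float).copy()
    ys, xs = np.where(mask)
    for y, x in zip(ys, xs):
        neighbours = []
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                neighbours.append(image[ny, nx])
        out[y, x] = np.mean(neighbours, axis=0) + np.random.normal(0, noise_std, c)
    return out
```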

However, MetaCAM is not a mere wrapper for ROAD. The authors point out:
‘While a weighted average of CAMs may improve Meta-CAM performance over equal-weighting, poor-performing CAMs are not entirely removed from the overall formulation of MetaCAM and may still negatively affect performance. For this reason, we opt for a consensus-based Meta-CAM formulation.’
The adaptive thresholding method used by MetaCAM takes an average over the top-k% of pixels that are in agreement across the results of all the contributing CAM methods. The activation maps (i.e., heat-maps) are summed, and a threshold is applied below which any contributing pixel scores zero, while the rest contribute to the overall score.
ROAD’s results are thus calculated only across these best-scoring percentages of highly activated pixels for each activation map. ROAD is a pixel-perturbation saliency metric, and MetaCAM averages its results across perturbation levels of 20%, 40%, 60% and 80%.
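A minimal sketch of that consensus-and-thresholding step, as described above, assuming the contributing activation maps are available as same-sized NumPy arrays: the maps are summed, everything below the top-k% cut-off is zeroed, and a ROAD-style score could then be averaged over several perturbation levels. The `road_score` call in the usage comment is hypothetical, not a real API.

```python
import numpy as np

def metacam_consensus(cams, k_percent=10):
    """Sketch of the consensus/adaptive-thresholding step: sum the
    contributing activation maps, keep only the top-k% most highly
    activated pixels, and zero out the rest. Names and shapes are
    illustrative assumptions, not the authors' implementation."""
    combined = np.sum(np.stack(cams, axis=0), axis=0)     # (H, W) consensus map
    cutoff = np.percentile(combined, 100 - k_percent)     # cut-off for the top-k%
    thresholded = np.where(combined >= cutoff, combined, 0.0)
    peak = thresholded.max()
    return thresholded / peak if peak > 0 else thresholded  # normalise to [0, 1]

# Hypothetical usage: average a ROAD-style faithfulness score over several
# perturbation percentages, as the article describes (20%-80%).
# scores = [road_score(img, metacam_consensus(cams), p) for p in (0.2, 0.4, 0.6, 0.8)]
# final_score = np.mean(scores)
```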
Data and Tests
To test the new system, the authors chose diverse images from the ImageNet ILSVRC 2012 validation dataset, along with some example images traditionally used as benchmarks in CAM testing. Images were resized to 256x256px, cropped to 224x224px, and normalized.
ResNet152 and DenseNet161 models were used, each pretrained on the ImageNet-1K dataset as supplied by PyTorch, which comprises 1,000 possible classes.
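A sketch of that setup using torchvision, assuming the usual ImageNet preprocessing pipeline; the exact normalization statistics and weight identifiers are standard values, not taken from the paper:

```python
import torch
from torchvision import models, transforms

# Standard ImageNet preprocessing matching the description above: resize,
# centre-crop from 256x256 to 224x224, and normalise. The mean/std values
# are the usual ImageNet statistics (an assumption, not from the paper).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# The two ImageNet-1K pretrained backbones named in the experiments
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1).eval()
densenet = models.densenet161(weights=models.DenseNet161_Weights.IMAGENET1K_V1).eval()
```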

Given the limited computing resources mentioned earlier, the authors point out that the choice of which CAM variants to run in a particular group should not be arbitrary, and therefore the groupings were established on the basis of the methodological similarity of the frameworks.

Each individual experiment was allocated either an NVIDIA P100 Pascal GPU (16GB VRAM) or a V100 Volta GPU (32GB VRAM), obtained through the Digital Alliance HPC infrastructure.
The aforementioned Cumulative Residual Effect (CRE) score devised by the authors is used to calculate the influence that any one of the six test groups had on the ultimate ROAD scores. The authors state:
‘CRE determines the relative positive/negative effect of each CAM group by taking the residual of the individual MetaCAM score with the median value of all scores for the m = 64 experiments and sum this residual (either a positive or negative value) with each contributing CAM groups within that experiment.
‘This produces a group-wise summary representing the relative impact of including/excluding that CAM group; all CAM groups are included in exactly 32 experiments and excluded in exactly 32 experiments.’
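Read literally, that procedure could be sketched as follows, assuming the 64 experiments correspond to the 2^6 on/off combinations of the six CAM groups; the array names and shapes are illustrative, not the authors’ code.

```python
import numpy as np
from itertools import product

def cumulative_residual_effect(scores, inclusion_masks):
    """Sketch of the CRE calculation described in the quote above: for each of
    the m = 64 experiments, take the residual of that experiment's MetaCAM
    score against the median of all scores, then accumulate that residual onto
    every CAM group included in the experiment. `scores` is a length-64 array;
    `inclusion_masks` is a (64, 6) boolean array marking which of the six CAM
    groups each experiment used."""
    scores = np.asarray(scores, dtype=float)
    residuals = scores - np.median(scores)
    cre = np.zeros(inclusion_masks.shape[1])
    for residual, included in zip(residuals, inclusion_masks):
        cre[included] += residual          # positive or negative contribution
    return cre                             # one summary value per CAM group

# Hypothetical usage: all 2^6 = 64 on/off combinations of the six groups, so
# each group is included in exactly 32 experiments and excluded in 32.
masks = np.array(list(product([False, True], repeat=6)))
```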

The researchers compared MetaCAM against individual scores from the competing visualization methods that constitute the framework itself:

Of these results, the researchers state:
‘MetaCAM outperforms all individual CAM methods, shown by the largest peak at k = 10. Most CAMs have a peak performance using a threshold between the top 10%-30% of most highly activated pixels. This indicates that adaptive thresholding is able to improve performance of all visual explanation methods, by selecting the top relevant pixels for a given (x, c, f (·)).’
In order to fully visualize the most performant MetaCAM groups (i.e., the combinations of contributing methods that work best together) across classes in varying test datasets, the authors literally ROAD-tested the combinations:

They comment:
‘We note that our consensus-based approach reliably detects the target class in all cases with dramatic improvements in ROAD score over individual CAMs.’
Examining the effects of adaptive thresholding across all 11 CAM methods included in the study, the authors found that the individual methods tend to activate around 15%, 30% or 45% of pixels. The thresholding technique developed for MetaCAM, by contrast, reduces the number of non-pertinent activated pixels.

Regarding this, the authors observe:
‘Many of the original CAM visualizations activate large regions of the image, including both the cat and dog despite only using the cat class ID (281) as the target. Adaptive thresholding is able to refine the activations of all CAMs to focus on the desired target class.’
Conclusion
It’s notable that MetaCAM has arisen out of the medical imaging research sector’s need for improved examination and evaluation of results from trained systems. This matters, since that context brings the kind of backing necessary for worthwhile effort and advances. By contrast, at least at this early stage of so nascent a technology, the impetus to advance the state of the art on XAI grounds alone is unlikely to attract the kind of funding that MetaCAM’s medical objectives can potentially engage.
It’s possible, however, that future legislation, not least in the EU, will force the current ‘wild west’ spirit of the image synthesis research sector to take this kind of investigative method more seriously. Certainly, an improved understanding of activation in the latent space is likely to bring benefits across all sectors.