Multiple Cropping of Images May Improve AI Model Performance


There seems to be growing evidence that computer vision and generative models could benefit from a strange and unintuitive new method of cropping the source photos in training datasets, and combining the results from models trained with each type of crop.

The proposed idea is that by merging together multiple trained models, each of which was trained on a differently-cropped variant of the same dataset, the merged model demonstrates improved dimensionality, versatility and performance – though exactly why this occurs is not yet entirely certain, nor is the hypothesis fully proven.

Left, example crops that follow the methodology in the Reddit post; right, a similar cropping scheme that has emerged in a new paper. (https://arxiv.org/pdf/2402.02074.pdf).

For example, on the left side of the image above, we see an instance of this: leftmost is the base source image, of actor Ryan Gosling; to the right of this are three crop variants of the same image, each of which is intended to be included in a separate and distinct training session, and each of which will result in a different model. It’s by merging these models that the superior performance can apparently be obtained.

This method was apparently first presented in a post at the Stable Diffusion sub-Reddit last September (the details of which we’ll look at momentarily).

In the right-hand part of the image above, we see a similar approach, which is presented in a recent paper that offers an apparently-improved method of mesh recovery (obtaining a 3D representation of a human from flat images), also by utilizing multiple crops of a single image.

It's Not Overfitting

If you’re at all familiar with how AI model training works, this all sounds rather like overfitting (a stage beyond memorization) – the tendency of a trained model to reproduce a particular image if that image has been presented to it excessively during training.

Overfitting is an undesirable result for generative or interpretive models, because the objective of training is to assimilate a flexible knowledge of the trained data, and to become capable of novel output – not to merely reproduce the training data.

An overfitted model will not perform well on novel tasks at inference time, and will tend to ‘bleed’ memorized features and traits into unrelated generations.

In fact, even ‘default’ Stable Diffusion models (i.e., official releases), research has found, have been so over-exposed to repeated instances of ‘popular’ images that it is trivial to reproduce the training data in casual generations:

Examples of memorization in Stable Diffusion V2.1. Above are user-prompted images, below, equivalent source images used to train the model. Researchers found that repeated instances of certain source images easily allow the user to re-invoke the source data in a text-prompted generation. Source: https://arxiv.org/pdf/2305.20086.pdf

By contrast, concatenating a ‘compound’ system from different models trained on different crop sizes appears to produce the reverse of overfitting – making the output more versatile and disentangled, and more useful to the end-user.

Multiple Takes

The method outlined last year in the Reddit post is aimed at improving the performance of LoRA models – relatively small adjunct files which can be used to ‘insert’ people and objects into Stable Diffusion, even though these subjects were never originally trained into the officially released models. Yet the principle itself may extend beyond this very specific use.

If a user wants to place themselves into Stable Diffusion output, they feed images of themselves into a framework such as Kohya (either locally installed, or in an online version, such as Colab), train the model for only a few hours, on relatively modest consumer hardware, and apply the resulting LoRA model to Stable Diffusion, in popular frameworks such as AUTOMATIC1111 and ComfyUI.

A Stable Diffusion enthusiast inserts himself into text-to-image output via use of a LoRA. Source: https://old.reddit.com/r/StableDiffusion/comments/15dl7ul/lora_dreamboothd_myself_in_sdxl_great_similarity/

Until fairly recently, the general standard was to create crops of source images in sizes and ratios that will be familiar to the veteran AI enthusiast: 256×256, 512×512, 768×768 – and even (if you have adequate hardware) 1024×1024px.

The respective differences among 'standard' training sizes.

The advent of bucketing (which is implemented by default in Kohya, for LoRA creation) made this kind of rigor unnecessary; and it is now possible to feed in very large images at any ratio. Bucketing can address even very large images in a piecemeal fashion, until all necessary features are obtained.

However, this does not dispense with the advantages of ratio consistency: each bucket is assigned to only one ratio; therefore, if your training set consists of twelve 512x512px images, four 768x768px images, and just one 1024x1024px image (for example), each of these gets its own dedicated bucket, and attention is not evenly distributed across the run of images, which may adversely affect training.

In LoRA training, the system can break down large images into smaller sections and reassemble them, allowing the user to provide any combination of image ratios and dimensions. But uneven distribution of ratios in the source data can lead to uneven attention during the training process.

The dimensions represented in the above illustration are arbitrary – in bucketing, a new bucket is created for each actual ratio (such as 5:4, or 16:9), which can lead to dozens, or even hundreds of unevenly-distributed buckets, in a truly ad hoc dataset.
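
As a rough illustration of the idea (not Kohya's actual implementation), the following Python sketch assigns a hypothetical folder of images to the nearest aspect-ratio bucket and counts how unevenly the buckets end up populated; the directory name and bucket list are assumptions for demonstration purposes:

```python
from collections import Counter
from pathlib import Path
from PIL import Image

# Hypothetical bucket resolutions, all near a 512x512 pixel budget.
BUCKETS = [(512, 512), (512, 768), (768, 512), (640, 448), (448, 640)]

def nearest_bucket(width, height):
    # Choose the bucket whose aspect ratio is closest to the image's own.
    aspect = width / height
    return min(BUCKETS, key=lambda b: abs(aspect - b[0] / b[1]))

counts = Counter()
for path in Path("training_images").glob("*.jpg"):
    with Image.open(path) as im:
        counts[nearest_bucket(*im.size)] += 1

# Uneven counts here translate into unevenly-distributed attention during training.
print(counts)
```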

The Reddit method, instead, proposes a radical stricture: that you curate a dataset and create at least two cropped versions of it – for example, one dataset at 512x512px, and a second version at 512x768px. The post (and some comments related to it) suggests that a third dataset – for instance, 768x512px – can bring even more benefits, though the addition of further variants is reported to yield diminishing returns:

Left, the original source images (many of which are much larger than represented here, but shrunk for convenience). Right, three possible cropped versions of the dataset, each of which is intended to be trained into its own model.
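
In practice, producing the fixed-ratio crop sets can be scripted; the sketch below uses Pillow to centre-crop a hypothetical source folder into the three ratios discussed (in real use, the crops would be framed by hand, or with subject detection, rather than naively centred):

```python
from pathlib import Path
from PIL import Image, ImageOps

# The three fixed ratios discussed above; folder names are hypothetical.
CROP_SETS = {"set_512x512": (512, 512), "set_512x768": (512, 768), "set_768x512": (768, 512)}

for set_name, size in CROP_SETS.items():
    out_dir = Path(set_name)
    out_dir.mkdir(exist_ok=True)
    for src in Path("source_images").glob("*.jpg"):
        with Image.open(src) as im:
            # ImageOps.fit scales and centre-crops to the exact target ratio.
            ImageOps.fit(im, size, method=Image.LANCZOS).save(out_dir / src.name)
```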

Each of the two or three strictly-cropped dataset versions is trained normally (and separately) as a LoRA, with no mixing of aspect ratios among the three sets.

As usual, the best of the multiple checkpoints (automatically saved at various stages of training) is selected for each trained model. The chosen model will represent the optimal compromise between versatility and detail, since lesser-trained models are more versatile, while more heavily-trained models tend towards memorization and inflexibility – but can usually offer better detail.

Kohya features a merge facility, where the user can munge together multiple different trained models that share training parameters, in the hope of combining the best aspects of each into one single, superior model.

Obviously, one can specify how much influence each model should contribute at merging time. For instance, when merging A and B models, the ratio of influence could be 20/80, or 50/50, or whatever the user wishes.

The new method, instead, requires that each model’s merge contribution be set to 100%, giving a three-model merge a total power of (an impossible) 300%:

The merge settings in Kohya, set to an impossible 300% for the final merged model.

With the resulting LoRA’s strength set to a standard 1 (i.e., 100%), the merged model produces noisy garbage. But when using a LoRA strength proportional to the overclocked merge settings (0.4 for a 2x merge, 0.1 for a 3x merge), the LoRA operates normally – and produces outstanding results.
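
As a conceptual sketch only (Kohya's own merge scripts handle LoRA-specific details such as network alphas and key naming), the arithmetic of an 'overclocked' ratio-1.0 merge might look something like this; the file names are hypothetical:

```python
import torch
from safetensors.torch import load_file, save_file

# One LoRA per crop ratio; names are hypothetical.
paths = ["lora_512x512.safetensors", "lora_512x768.safetensors", "lora_768x512.safetensors"]

merged = {}
for p in paths:
    for key, tensor in load_file(p).items():
        if key.endswith("alpha"):
            merged[key] = tensor          # keep scaling alphas as-is rather than summing them
            continue
        # Each model contributes at a full 1.0 ratio, so the merged weights
        # sum to the 'impossible' 300% described above.
        merged[key] = merged.get(key, torch.zeros_like(tensor, dtype=torch.float32)) + tensor.float()

save_file({k: v.to(torch.float16) for k, v in merged.items()}, "lora_merged_3x.safetensors")
# At generation time, the merged LoRA is then applied at a proportionally reduced
# strength (0.1 for a 3x merge, per the post), rather than the usual 1.0.
```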

Having tested this method since it was published, I can anecdotally confirm its efficacy – that the final merged models created by this method perform outstandingly, relative to standard usage and practice, and exhibit minimal entanglement compared to conventional results.

However, in the absence of empirical tests (since the perceived quality of results is subjective, and can be difficult to quantify and assess), it has until now been impossible to discuss or report on the possibilities of this ‘multiple crop’ approach.

Only now that something similar has emerged in the scientific literature – the aforementioned research paper, which we will address momentarily – do I feel that some hard evidence may be emerging of a novel and relatively unexplored relationship between training on varied source image ratios and superior dimensional performance in a subsequently concatenated model.

In the original post, an explanation for the phenomenon is copy/pasted from an apparent associate of the poster, described as ‘someone with a PHD in Machine Learning’*:

‘By training on different aspect ratios and then merging the models, you’ve essentially created a kind of “ensemble” model that brings together the strengths of each individual training run.

‘The fact that it performs well at a lower strength but not so much at full strength makes sense. It’s like you’re damping down the noise each model learned from its specific aspect ratio, and what’s left is the signal they both agreed upon. So you get the common, well-generalized features without the quirks that led to overfitting in the first place.

‘With the models also achieving the desired output at low strengths, it makes the LORAs more versatile if they are to be used with others. Its a pretty unique approach that may have uses in other areas of ML.’

The purported expert goes on to explain:

‘When you train a single model with two sets of differently cropped images, you are introducing “noise” in the form of conflicting or [confusing] signals for the model, making it harder for the model to generalize well. It’s like trying to listen to two songs at the same time—you’re less likely to enjoy or understand either one.

‘In contrast, when you trained two separate models and then merged them, each model had a chance to specialize and [fine-tunes] itself on its specific aspect ratio. Once you blended/merged them, you [got] the best of both worlds, especially at the low strength levels.

‘This lower strength is acting like a filter, dampening the idiosyncrasies specific to each individual training session while keeping the generalized features, hence improving the signal and reducing the noise.’

Typically, modern generative architectures are trained on annotated or captioned images. At the more extreme crops of the original source data, captions/tags will, under the Reddit method, need to be added or removed, depending on what the uncompromising crop ratios leave behind:

Tagging provides crucial information in the training or fine-tuning of text/image generative architectures. Here are some examples of how the WD14 automatic tagging system will produce varying results depending on each image crop, with sample 'unique' tags highlighted in red, for illustrative purposes. This makes caption curation, which is already a laborious and painstaking process, 3-4 times as arduous for the Reddit method. The same applies to alternative annotation methods such as CLIP and BLIP. WD14 tagging provided by https://huggingface.co/spaces/deepghs/wd14_tagging_online

The Reddit method therefore entails an arduous process, though one which could be automated, at least to a certain extent.
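
To take some of the sting out of the curation work, the per-crop caption files could at least be diffed automatically. A minimal sketch, assuming one comma-separated .txt tag file per image in each crop set (the Kohya convention), with hypothetical file names:

```python
from pathlib import Path

def read_tags(path):
    # Kohya-style captions: a single line of comma-separated tags.
    return {tag.strip() for tag in Path(path).read_text().split(",") if tag.strip()}

crop_sets = ("set_512x512", "set_512x768", "set_768x512")
tags_per_set = {name: read_tags(Path(name) / "image_01.txt") for name in crop_sets}

common = set.intersection(*tags_per_set.values())
for name, tags in tags_per_set.items():
    # Tags present in this crop's caption but absent from the set shared by all crops.
    print(name, "unique tags:", sorted(tags - common))
```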

Recently, as mentioned above, a new paper has utilized multiple cropping for a different purpose – but possibly exploiting the same, hitherto unknown principle of concatenated generalization.

Multiple Cropping in Human Mesh Recovery

The new paper deals with the challenge of single-image Human Mesh Recovery (HMR). Such processes have recently become very popular, since creating an interstitial CGI mesh to control neural generative output offers a level of control that most architectures don’t support natively.

Basically, the objective is to create a traditional CGI model from a single image, by evaluating that image, ‘guessing’ what its topology is, and generating an estimated mesh. Systems such as FLAME and SMPL perform such operations (also see our article on 3D Morphable Models for a deeper look at the principles behind this).
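
For the curious, the kind of parametric mesh that such systems regress to can be generated directly from pose and shape coefficients with the smplx Python package (the model files must be downloaded separately from the project site); the path and the zeroed parameters below are purely illustrative:

```python
import torch
import smplx

# Path to the downloaded SMPL-X model files is an assumption.
model = smplx.create("models/", model_type="smplx", gender="neutral")

output = model(
    betas=torch.zeros(1, 10),         # shape coefficients
    body_pose=torch.zeros(1, 63),     # axis-angle rotations for 21 body joints
    global_orient=torch.zeros(1, 3),  # root orientation
)

vertices = output.vertices.detach()   # (1, 10475, 3) – the mesh an HMR network must estimate
print(vertices.shape)
```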

A showcase example of the SMPL-X body estimation system. Source: https://smpl-x.is.tue.mpg.de/

(Our interest in this paper does not extend, this time, to a complete write-up of its methodology and tests, but rather lies in its exploitation of multiple cropping to improve the extraction of a mesh from a single image. The authors of the paper treat this as a novel approach, though I believe it tends to support the principles outlined in the September 2023 Reddit post.)

What we’re concerned with in the new work is the way that multiple crops, similar to the Reddit method, are employed during regression of the source images (the process by which the mesh is estimated, based on features derived from the source image).

From the new paper, an illustration of the multi-crop schema in comparison to a typical single-crop regression scenario (top right).

As we can see in the lower-left part of the paper’s schema illustration above, each of the varying crops is passed through to multiple ‘virtual cameras’ – effectively equivalent to the Reddit method of combining the outcome of training on discrete crop ratios and arriving at multiple models, which are then merged into a single and more performant model.

The paper states:

‘The first relation we can leverage is that these multiple crops contain the same human with the same pose and shape. Therefore, we adopt contrastive learning to extract similar features from these crops. This idea is straightforward, as the content of these crops is dominated by the foreground human rather than the background.

‘Constrained by the contrastive learning loss, the network is encouraged to focus on the foreground human and extract discriminative features for different human with different actions.’
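
The general mechanism can be sketched with a generic InfoNCE-style loss in PyTorch, in which features extracted from crops of the same image are treated as positives and all other crops as negatives; this is an illustrative stand-in, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def multi_crop_contrastive_loss(feats, temperature=0.1):
    """
    feats: (batch, n_crops, dim) - one feature vector per crop of each image.
    Crops of the same image are positives; crops of other images are negatives.
    """
    b, n, d = feats.shape
    z = F.normalize(feats.reshape(b * n, d), dim=-1)
    sim = z @ z.t() / temperature                    # pairwise cosine similarities
    img_id = torch.arange(b).repeat_interleave(n)    # which image each crop came from
    pos_mask = (img_id[:, None] == img_id[None, :])  # positives: crops of the same image
    self_mask = torch.eye(b * n, dtype=torch.bool)
    pos_mask &= ~self_mask                           # exclude self-similarity
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability of the positive pairs for each crop (anchor).
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1)
    return loss.mean()
```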

The authors further observe that different crops of the same image furnish slightly different information, and that it’s beneficial to ‘fuse the features’ of different crops in order to regress the target mesh.

Instead of being processed in separate and discrete external routines, as with the Kohya Reddit method, the features are assimilated separately in a contrastive learning module, before being fused in a crop-aware module:

From the new paper, an effective internal automation of the more rigorous and ad hoc Reddit method.

Nonetheless, the principle seems remarkably similar.

The new approach models pairwise comparisons between the estimated cameras, each of which has been informed by a single and distinct crop, and the authors assert that this ‘encourages the estimation of more accurate cameras’.

Please refer to the paper for further details of the methods used for the new work, most of which have no direct relation to the possibilities of creating better models by fusing multiple crops into a single entity, which is our central interest here. We’re concerned, instead, with the extent to which multiple crops can actually produce better results, compared to single-crop training methods.

To test their approach, the researchers followed standard practice in the current literature, and trained on a mixture of four datasets: Human3.6M; MPI-INF-3DHP; COCO; and MPII.

For evaluation purposes, the authors used the test sets of 3DPW and Human3.6M.

Metrics used were Human3.6M’s Mean Per Joint Position Error (MPJPE), Procrustes-Aligned MPJPE (PA-MPJPE), and mean per-vertex Euclidean distance between mesh surfaces (PVE – please refer to the paper for further specifics of training, as these are not germane to the central proposition here).
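
For reference, these metrics reduce to simple distance computations; below is a minimal NumPy sketch (my own paraphrase, using a standard similarity Procrustes alignment for the PA variant):

```python
import numpy as np

def mpjpe(pred, gt):
    # Mean per-joint (or per-vertex, for PVE) Euclidean distance.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def procrustes_align(pred, gt):
    # Similarity alignment (scale, rotation, translation) of pred onto gt before measuring error.
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    if np.linalg.det(Vt.T @ U.T) < 0:   # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
    R = Vt.T @ U.T
    scale = S.sum() / (P ** 2).sum()
    return scale * P @ R.T + mu_g

joints_pred = np.random.rand(17, 3)     # placeholder predicted joints
joints_gt = np.random.rand(17, 3)       # placeholder ground-truth joints
print("MPJPE:   ", mpjpe(joints_pred, joints_gt))
print("PA-MPJPE:", mpjpe(procrustes_align(joints_pred, joints_gt), joints_gt))
# PVE is the same computation applied to the mesh vertices rather than the joints.
```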

For a qualitative comparison, the new method was pitted against FastMETRO, CLIFF, and ReFit – all single-crop methods.

Qualitative results for the comparison against rival (single-crop) frameworks.

The objective here is to match the inferred result as closely as possible to the target mesh, shown in green in the image above.

The authors comment:

‘The shown cases are challenging, which either contain complex poses or some part of the body is occluded by other body parts. For these cases, our estimated meshes resemble the [ground truth] (green color) better than results of the compared approaches.’

For quantitative tests, the new method was tested against an exhaustive list of near-equivalent frameworks, including base HMR, SPIN, SPEC, PyMAF, PyMAF-X, PARE, HybrIK, CLIFF, MPT, PLIKS, BoPR, ReFit, Deformer, NIKI, and Zolly.

Results from the quantitative round. R50/R34 signifies the use of a ResNet backbone; H48/H32/H64 signifies an HRNet (https://arxiv.org/pdf/1902.09212.pdf) backbone.

Though the multi-crop method improves on nearly all prior methods, the authors suggest that particular attention be paid to the extent to which it improves on CLIFF, which is ‘the single-crop version of our method’.

The paper further illustrates the superiority of the new method on per action (image below, left) and joint (image below, right) comparisons on Human3.6M, observing that ‘[our] method outperforms FastMETRO and CLIFF on almost all kinds of actions and joints.’

Per action (left) and joint (right) comparison against the strongest and most apposite contenders, FastMETRO and CLIFF.

Though we do not normally cover ablation studies, one such test in the new paper is relevant to our purposes, as the authors tested the extent to which additional crops improved accuracy and performance.

Ablation studies examined the extent to which additional crops improved accuracy and affected performance.

The paper states*:

‘An interesting point is how the number of crops influences the regression accuracy…As seen, the accuracy is consistently increased as the number of input crops increases. Experiments of inputting 6 or more crops are not conducted due to memory limit.

‘We find that as the crop number increases, the L2D loss per crop increases too, but we obtain higher regression accuracy.

‘This may indicate that inputting more crops prevents the network from [over-fitting].’

Conclusion

The conclusion drawn here seems, therefore, to be the same as in the Reddit experiment – that multiple image crops, when siloed in some way (towards virtual cameras, in the new paper, or towards different models which will later be merged, in the Reddit method) not only do not cause overfitting, but actively combat it, and facilitate a more performant and flexible model.

It will be interesting to see if the research sector picks up further on multi-cropping, and goes even further into automation of these processes than the new work has done.

In both cases, the need for manual cropping (or at least, some level of oversight on the crops) remains a bottleneck to producing multiple crop silos at scale. In fact, hyperscale scenarios are not considered in either approach, since neither LoRA nor mesh recovery systems require a very high volume of images.

A further interesting pursuit might be to test the principle beyond the realms of computer vision, and to see if equivalent ‘crops’ could be delineated in Large Language Models (LLMs), and other types of machine learning system; and to see whether this three-in-one approach could likewise benefit them.

* My bold emphases.
