The Struggle for Salient Image-Cropping in Generative AI

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

Anyone who has ever attempted to create images of people in a Latent Diffusion Model (LDM) such as Stable Diffusion will likely have inadvertently generated ‘over cropped’ or poorly-composed images such as these:

Examples of the way that Stable Diffusion will arbitrarily crop featured subjects, which reflects the equally random cropping of the contributing dataset to conform to the square 512px x 512px training format. Source: Stable Diffusion V1.5.

Though some of these types of ‘miscomposition’ may have an amusing artistic flavor, mostly they just look like someone tripped up or nudged the photographer at the crucial moment.

Such unfortunate renderings are not limited to Stable Diffusion, either, as frequent users of the OpenAI DALL-E generative series of models have found over the last year or so. This is because the training processes for generative models share a common flaw: the millions of web-scraped images that are used to train the models are not usually square, while the industry standard for training AI models is most definitely square.

So, in the case of Stable Diffusion, as has been noted more than once, the model’s tendency to ‘cut off heads’ and mis-frame subjects happens because the model has been fed automatically cropped versions of the source data.
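To make the effect concrete, here is a minimal sketch of what an automatic square crop does to a portrait-format photo. It assumes a simple centered crop after resizing the shorter side to 512px; the actual preprocessing used for any given model may differ, and the function names are purely illustrative.

```python
def center_crop_box(width, height, size=512):
    """Resize the shorter side to `size`, then take a centered square.

    Returns (left, top, right, bottom) in resized-image pixels.
    A centered crop is an assumption made for illustration only.
    """
    scale = size / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return left, top, left + size, top + size


def fraction_kept(width, height):
    """Share of the original image area that survives a square crop."""
    short, long = sorted((width, height))
    return short / long


# A 2:3 portrait photo, 1000px wide and 1500px tall.
print(center_crop_box(1000, 1500))  # (0, 128, 512, 640)
print(round(fraction_kept(1000, 1500), 3))  # 0.667
```

In this example the top 128 rows of the resized image are discarded, which is exactly where a standing subject's head tends to be.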

One popular GitHub thread for the AUTOMATIC1111 Stable Diffusion web UI highlights a shortcoming of the automatic way that the LAION database images were cropped before being passed to the model for training. Source: https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/2207

If you pass a model such data, it will presume, at least occasionally (after all, the data occurred occasionally), that this is what you want it to output. It is only because enough well-composed, centralized shots statistically survive arbitrary cropping that the average output is, on balance, better than this.

Buckets and Squares

As one would expect, there has been some effort in the research sector to remediate the problem.

In regard to the need for training images to actually be square, it should be noted that square formats accommodate standard PC hardware well because they respect the way that the hardware, and the software written for it, always scale up in powers of two.

Thus the standard training formats for computer vision models are 32px², 64px², 128px², 256px², and 512px² – with increasing use of the more VRAM-hungry intermediate 768px² and the more logically-consistent 1024px² format.

The progressive history of the ability of the DeepFaceLab software to output higher and higher resolutions – and the output sizes are consistent with the sizes of the trained images, always in powers of two. Source: https://github.com/iperov/DeepFaceLab and https://metaphysic.ai/deepfakes-go-high-res-deepfakers-handle/

This proceeds from the way that standard computers address memory, in powers of two, and the way that consumer and business computing hardware accords with this schema. Thus 13kb of data will still occupy a 16kb allocation, because that’s as granular as it gets in standard, non-quantum architecture.
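The 13kb example can be expressed in a few lines. This is a deliberately simplified sketch that assumes allocations are rounded up to the next power of two; real allocators and GPU memory managers use more varied block sizes, and `next_power_of_two` is a hypothetical helper rather than any library's API.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


for kb in (13, 16, 17):
    block = next_power_of_two(kb)
    print(f"{kb}kb requested -> {block}kb block, {block - kb}kb unused")
```

Under this assumption, 13kb of data leaves 3kb of its 16kb block unused, while a request of just 17kb spills over into a 32kb block.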

In terms of getting the most out of limited machine learning resources, you might as well occupy the entire available block – and that’s always going to give you a square image space.

Batch sizes for source training images are limited by the available blocks in a GPU, and so you cannot add arbitrary margins either to the images or to the available memory occupancy. Source: https://blog.metaphysic.ai/future-autoencoder-deepfakes/

However, there is nothing to stop systems from breaking up images into smaller sectors, but treating the re-assembled image as a single training entity. This process, called bucketing, is currently gaining momentum in the training schemas of generative systems, and is presently used (optionally) by the SD Scripts/Kohya Low-Rank Adaptation (LoRA) open source frameworks, which allow users to train ‘sidecar’ files that can be inserted into Stable Diffusion output, without having to fine-tune the whole model.

Many of the thousands of models available online are in LoRA format, which can make use of non-square training images. Source: civitai.com

The trouble with bucketing is that, as the name suggests, it groups similar images, and it does so by ratio and dimensions (they’re not the same thing, since a 2:3 ratio image could have any possible height and corresponding width). Therefore, if 2000 training images are one ratio, and 100 are a different ratio, training can become adversely affected, not least in terms of the time it takes, because the marginal sizes get equal attention, but represent less data. If the ratios and the image sizes are varied, the problem is compounded.
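The imbalance described above can be sketched as follows. The bucket resolutions here are hypothetical and are not taken from SD Scripts, Kohya, or any other trainer; the point is only to show how grouping by nearest aspect ratio produces lopsided buckets.

```python
from collections import Counter

# Hypothetical bucket resolutions as (width, height); illustrative only.
BUCKETS = [(512, 512), (448, 576), (576, 448), (384, 640), (640, 384)]


def assign_bucket(width, height, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


def bucket_counts(sizes, buckets=BUCKETS):
    """Count how many images land in each bucket."""
    return Counter(assign_bucket(w, h, buckets) for w, h in sizes)


# 2000 portrait images and 100 widescreen images, as in the text.
dataset = [(1000, 1500)] * 2000 + [(1600, 900)] * 100
for bucket, count in bucket_counts(dataset).items():
    print(bucket, count)
```

With this toy dataset, one bucket receives 2000 images and another only 100, while the remaining three stay empty.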

In this statistics-based example from Google, we see that the data does not fit equally into the allocated buckets. Source: https://developers.google.com/machine-learning/data-prep/transform/bucketing

Therefore, though bucketing does not require square images, it is non-optimal for collections with unbalanced distributions of image ratios. Though techniques such as quantile bucketing can help to obtain a more even distribution for purely numerical data, they do not apply as well to vision-based systems, which are essentially hijacking the bucketing process to cut up non-standard image sizes.

Consequently, the latest wisdom, at least on LoRA training, is to choose two, perhaps three standard ratio formats and cut your images into those formats, to reduce the likelihood of unbalanced buckets. And this just brings us right back to the issue of intelligently cropping source images, instead of simply using them in the state they were found.
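As a rough sketch of that workflow, the snippet below snaps an image to the nearest of three target ratios and computes the largest crop of that ratio that fits. The three ratios are assumed choices for illustration, not a recommendation from any particular framework.

```python
from fractions import Fraction

# Assumed target formats: square, 2:3 portrait, 3:2 landscape.
TARGET_RATIOS = [Fraction(1, 1), Fraction(2, 3), Fraction(3, 2)]


def nearest_ratio(width, height, targets=TARGET_RATIOS):
    """Return the target ratio closest to the image's own ratio."""
    ratio = Fraction(width, height)
    return min(targets, key=lambda t: abs(t - ratio))


def largest_crop(width, height, ratio):
    """Largest (crop_w, crop_h) of the given ratio that fits in the image."""
    crop_w = min(width, int(height * ratio))
    crop_h = min(height, int(crop_w / ratio))
    return crop_w, crop_h


ratio = nearest_ratio(1600, 900)
print(ratio, largest_crop(1600, 900, ratio))  # 3/2 (1350, 900)
```

Note that this only decides how much to cut, not where to cut. Positioning the 1350px-wide window within the 1600px frame is precisely the saliency problem.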

Intelligent Cropping

It’s a problem which has received some attention in recent years, for instance with the 2014 smartcrop.js project, which has recently been added to the free Bulk Image Resizing Made Easy (BIRME) platform, popular with casual ML enthusiasts as a quick method of conforming various-sized image collections into AI-friendly square formats.

Left, an example of determining salient parts of an image with the smartcrop.js JavaScript library, and right, the algorithm in action in the BIRME platform. Sources: https://29a.ch/2014/04/03/smartcrop-content-aware-image-cropping and https://www.birme.net/

The library is algorithmic rather than AI-trained. Though the project’s GitHub repository currently states that an ML-version is in the works, the appeal of (arguably) dumb systems such as this is that they are likely to run quite fast, and with reasonably economic resources, while neural network inference is likely to slow this down dramatically (though it could be used to produce a better ‘flat’ algorithm).

An example of face-seeking in smartcrop.js, which could potentially reduce the kind of mis-cropping that has plagued the output of latent diffusion models. Source: https://29a.ch/sandbox/2014/smartcrop/examples/testbed.html

This is a crucial factor, because though solutions to the smart cropping problem abound, no solution that is rational and economic has yet come to light. Therefore, as any professional or hobbyist ML practitioner will know well, cropping remains sadly arbitrary in even very recent and cutting-edge frameworks.

This either places a notable burden of data pre-processing onto project teams, or else requires them to consider post-processing or post-training solutions, or other measures designed to mitigate the semantic damage that mis-cropped images wreak on training models.

Part of the problem is that the definition of ‘salient content’ may vary between projects. Though it may surprise those with an interest in deepfakes or neural facial synthesis, model creators may wish to concentrate on other parts of an image of a person than their face:

Jewelry and fashion datasets and models, among many other possible categories of computer vision research, may de-emphasize faces in favor of other parts of a photo, whereas rote systems such as smartcrop.js cannot target such items. Source: Google Images

Additionally, it may be desired to focus on parts of a face, on other aspects of clothing, or on any of a myriad other possibilities that don’t fit a facial synthesis or facial recognition pipeline.

Once you begin to consider non-human subjects such as buildings and environments, the problem of defining a ‘salient’ element grows, since the considerable bias in the datasets that power human neural depiction is no longer available as an easy shortcut.

For instance, in the example below, a cropping algorithm is attempting to second-guess the compositional skills of the original photographer, whose work is seen on the far left. A logical crop, created by a human as ground truth, is seen on the far right, and the relatively poor guess that the system made is seen in the middle column.

Cutting down trees the wrong way. Source: https://arxiv.org/pdf/2310.08892.pdf

A Compositional Approach

The image above is an exceptional failure case featured at the end of an otherwise interesting new paper that offers a novel approach to image cropping, which takes into account the twin needs to both conform to particular ratios (i.e., for web or print design) and to perform the best crop possible within the chosen ratios.

Conforming crops to preset aspect ratios with the best possible balance between aesthetics and conformity.

Though the system is not aimed directly at the needs of dataset preparation, the way that bucketing can take advantage of ratios, among other broader possibilities in computer vision and generative applications, makes it an interesting potential method of preparing datasets.

Above all, the new work (which is titled Image Cropping under Design Constraints, and comes from three authors across the University of Tokyo and CyberAgent, Inc. in Japan) considers well the fact that workable ML-based solutions to the cropping conundrum need to conserve resources and to operate quickly. If it were not so, one could run images with inconvenient ratios programmatically through a generative system such as Adobe’s Firefly, or Stable Diffusion, and simply add any necessary content that might fill out the ratio shortfall.

Besides the electricity and server bills, the wait times for such an approach are unthinkable at scale; and even more modest AI-based solutions come with severe caveats in regard to processing times and potential cost, if dealing with more than a few hundred images.

Therefore the authors of the new paper offer two approaches: a score-based method, which performs quite intensive evaluation on source images and which obtains good results, while taking up only moderate resources; and a heatmap-based method, which is more effective, but comes at a higher compute cost.

Method

The new work addresses the twin considerations of aspect ratio and a layout condition, the latter of which represents a disposition of diverse elements within the composition, defined by bounding boxes.

‘Our objective’, the authors state, ‘is that a result of cropping [an image] is aesthetically improved from the original input image while satisfying the two conditions’.

The system must conform source images both to predefined ratio formats and to saliency of content.

For the score-based function approach, the system uses the pretrained Grid Anchor based Image Cropping (GAIC) system, a PyTorch framework which reduces the traditional millions of possible choices to several hundred, and iterates down to the optimal choices. GAIC therefore provides the proposal-based aesthetics scores for the source images.

However, the implementation is modified in the new system by increasing the number of possible proposals, and imposing requirements for a minimum size of height and width in a linear fashion, and seeking to use the largest possible area of the source image within all the constraints.
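The general shape of a proposal-based search can be sketched as below. This is not the paper's code, nor GAIC's: the grid step, the minimum-size fraction, and the `toy_score` function (a stand-in for a learned aesthetic score) are all assumptions made for illustration.

```python
def generate_proposals(img_w, img_h, aspect, step=32, min_frac=0.5):
    """Enumerate (x, y, w, h) crops of a fixed aspect ratio on a coarse grid.

    `step` and `min_frac` are illustrative values, not the paper's settings.
    """
    proposals = []
    min_w = int(img_w * min_frac)
    for w in range(min_w, img_w + 1, step):
        h = round(w / aspect)
        # Enforce the minimum size and keep the crop inside the image.
        if h > img_h or h < img_h * min_frac:
            continue
        for x in range(0, img_w - w + 1, step):
            for y in range(0, img_h - h + 1, step):
                proposals.append((x, y, w, h))
    return proposals


def best_crop(proposals, score_fn):
    """Pick the highest-scoring proposal, breaking ties by larger area."""
    return max(proposals, key=lambda p: (score_fn(p), p[2] * p[3]))


def toy_score(box, subject=(400, 200)):
    """Hypothetical stand-in for an aesthetic score: prefer crops
    centered on an assumed subject position."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    return -((cx - subject[0]) ** 2 + (cy - subject[1]) ** 2)


candidates = generate_proposals(640, 480, aspect=1.0)
print(len(candidates), best_crop(candidates, toy_score))
```

The tie-break on area loosely mirrors the stated preference for using as much of the source image as the constraints allow.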

Schema for the proposal-based approach, which evaluates scores under the GAIC system.

For the heatmap-based approach, the system adapts the 2022 Human-centric Image Cropping architecture, wherein aesthetic information is extracted via a deep neural network to produce Grad-CAM-style heatmaps, which emphasize salient areas in grayscale (though this is frequently converted in practice to a transparent color overlay, for utility).

Grayscale heatmaps indicate the inferred salient areas.

The authors state:

‘We train a neural network to predict the heatmap. We assume that the heatmap includes sufficient aesthetic information for image cropping and evaluate each cropping result without a repeat of neural network computation using the heatmap.’
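One way to see why a precomputed heatmap makes repeated evaluation cheap is the summed-area table, sketched below. The paper's actual scoring function is not described here, so simply summing the heat inside a candidate window is an assumption, and the tiny heatmap is invented for the example.

```python
def integral_image(heatmap):
    """Summed-area table with an extra leading row and column of zeros."""
    rows, cols = len(heatmap), len(heatmap[0])
    sat = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0.0
        for c in range(cols):
            row_sum += heatmap[r][c]
            sat[r + 1][c + 1] = sat[r][c + 1] + row_sum
    return sat


def box_sum(sat, x, y, w, h):
    """Total heat inside the box, using four table lookups."""
    return sat[y + h][x + w] - sat[y][x + w] - sat[y + h][x] + sat[y][x]


def best_window(heatmap, w, h):
    """Slide a w x h window and return the (x, y) with the most heat."""
    sat = integral_image(heatmap)
    rows, cols = len(heatmap), len(heatmap[0])
    spots = [(x, y) for y in range(rows - h + 1) for x in range(cols - w + 1)]
    return max(spots, key=lambda p: box_sum(sat, p[0], p[1], w, h))


heatmap = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 2, 0],
    [0, 0, 3, 4, 0],
    [0, 0, 0, 0, 0],
]
print(best_window(heatmap, 2, 2))  # (2, 1)
```

The table is built once per image, after which any number of candidate crops can be scored without touching the network again.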

Crops obtained via heatmaps.

Data and Tests

For evaluation purposes, the authors generated a new version of the FLMS dataset, an image-cropping collection associated with the 2014 Dartmouth/Adobe paper Automatic Image Cropping using Visual Composition, Boundary Simplicity and Content Preservation Models.

The authors made selections from the original 500 images in the FLMS dataset that satisfied the layout conditions that they wished to impose, and which had suitable ground truth examples.

Eight types of bounding boxes were considered*:

‘[We] place four narrow boxes along each image side and four large boxes by dividing images from a center point with vertical and horizontal lines. We expect that these blanks are useful for the placement of something like text elements or logos. Further, we add aspect ratio conditions by computing aspect ratios from the bounding box of ground truth.

‘Then we obtain the set of input images with design constraints and ground truth of outputs. For each pair, when the ground truth region encompasses a layout pattern, we simply retain the aspect ratio of the ground truth region as an input condition. Through this process, we achieve a set comprising the image, the layout pattern, the aspect ratio, and the ground truth region.’

Example images from the evaluation dataset derived from the FLMS dataset. On the left, we see the original image and the templates for the layout conditions. Heading right, the red boxes visualize the imposition of the layout conditions, and the blue boxes the ground truth regions (i.e., human-evaluated crops).

In total, 4,426 sets of design constraints and related crops were gathered for the subset.

The researchers devised a baseline method for design-constrained image cropping. In accord with previous works, TranSalNet saliency maps were used. The baseline therefore consists of a target layout mask and a saliency map mask, the needs of which must be balanced during the process.

The two derivative demands are called Saliency & Short Edge and Saliency & Long Edge.

An example of how the imposed desired constraints manifest in a practical case.

The primary evaluation metric was Intersection over Union (IoU), which measures how closely two bounding boxes agree: the area that they share, divided by the total area that they cover together.
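For readers unfamiliar with the metric, a minimal implementation follows. The `(left, top, right, bottom)` box convention is an assumption for this sketch; the paper may represent boxes differently.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes as (left, top, right, bottom)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap is zero when the boxes do not intersect on either axis.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
ground_truth = (5, 5, 15, 15)
print(round(iou(predicted, ground_truth), 3))  # 0.143
```

A score of 1.0 means that the predicted crop matches the human-made crop exactly, while 0.0 means that the two do not overlap at all.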

The feature extractor used was VGG16, and the model was trained with the Adam optimizer with an initial learning rate of 3.5e-4, and a weight decay of 1e-4, running every 5 epochs for a total of 30 epochs (an epoch being a complete tour of the entirety of the data by the training routine).

The testing dataset was the CPCDataset collection, from the Stony Brook/Adobe Good View Hunting project.

For the backbone of the proposal-based (i.e., score-based) approach, a pre-trained GAICv2 model was used, and all models were trained on a sole NVIDIA Tesla T4 with 16GB of VRAM.

The new system comfortably outperformed the proposed baseline:

Results of comparisons for the FLMS dataset.

Qualitative comparisons, with the left image demonstrating the layout requirements. The central four images show the results under varying design constraints, with the ground truth visualized in the far-right column.

The authors comment:

‘We observe that the proposed approaches outperform the baseline quantitatively and qualitatively. The baseline tends to lose aesthetics for satisfying a given aspect ratio and a specific layout while our approaches find a good view in the areas that satisfy given conditions. The heatmap-based approach shows the better score than the proposal-based approach.

‘These results indicate that the score functions are effective for image cropping under design constraints, and the heatmap-based approach can achieve better optimal comparing the proposal-based approach.’

However, larger bounding boxes demand more computation and add latency, meaning that the heatmap-based approach is more demanding and, depending on hardware resources, will take longer to execute.

The paper concludes:

‘[The] heatmap-based approach has the trade-off of the performance and computation cost in the iteration of optimization, and fine fitting leads to large improvement comparing the proposal-based approach, though it requires a high computation cost, e.g., iteration 500 takes 20 seconds.

‘The proposal-based approach achieves better performance than the heatmap-based approach under the same computation cost, while the heatmap-based approach achieves better performance with more computation cost.

‘This result indicates that balancing aesthetically plausible areas and satisfying multiple conditions is not a trivial task and requires sensitive balance, and both proposed approaches are reasonable alternatives.’

Conclusion

Since the human face (and body) is an upended rectangle, and the world of landscapes and architecture tends to favor horizontal rectangles, popular photography ratios remain at war with the square formats that are best-suited for computer vision and generative AI training.

In a way, the tendency towards a non-square lens format is in itself a form of cropping: medium-format high-end cameras such as Hasselblad were technology lines dedicated to the square ratio, while non-rectangular formats have not always been absent from consumer hardware.

However, so long as binary computing systems dictate that square formats are the optimal use of hardware in the training of vision and generative systems, it seems that compromises will continue to be necessary, even if it is just the crude and desperate tactic of adding black borders to rectangular images so that they will accommodate the square format (which is a relatively common practice, though it can lead to borders being produced occasionally at inference time).
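The border-padding tactic is simple enough to sketch, along with the cost it carries. The helper names below are hypothetical, and the calculation assumes that the image is centered on the square canvas.

```python
def square_padding(width, height):
    """Border widths (left, top, right, bottom) that make an image square."""
    side = max(width, height)
    pad_w, pad_h = side - width, side - height
    left, top = pad_w // 2, pad_h // 2
    return left, top, pad_w - left, pad_h - top


def padded_fraction(width, height):
    """Share of the resulting square that is border rather than image."""
    side = max(width, height)
    return 1 - (width * height) / (side * side)


print(square_padding(1000, 1500))  # (250, 0, 250, 0)
print(round(padded_fraction(1000, 1500), 3))  # 0.333
```

For a 2:3 portrait, a third of every training sample would be black border, which helps to explain why such borders can resurface at inference time.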

There are many other possible avenues of research that could lead to better cropping systems for data preparation, such as text-prompted frameworks that are capable of recognizing objects, body parts, etc., and will favor these when cropping; but they tend to come at a significant cost.

If you’re looking to exclusively extract faces, the problem is long-since solved, as deepfake applications such as DeepFaceLab and FaceSwap will automatically seek out facial alignment landmarks and crop the resulting image as closely to the landmarks as possible while conforming to a square format.
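The principle can be sketched in a few lines. This is not the code used by DeepFaceLab or FaceSwap; the landmark coordinates are invented, landmark detection itself is assumed to have happened elsewhere, and the `margin` value is arbitrary.

```python
def square_crop_from_landmarks(landmarks, img_w, img_h, margin=0.25):
    """Return (left, top, side) of a square crop enclosing the landmarks.

    `margin` is padding on each side as a fraction of the landmark extent;
    0.25 is an arbitrary illustrative value.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    extent = max(max(xs) - min(xs), max(ys) - min(ys))
    side = min(extent * (1 + 2 * margin), img_w, img_h)
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    # Clamp so that the square stays inside the image.
    left = min(max(cx - side / 2, 0), img_w - side)
    top = min(max(cy - side / 2, 0), img_h - side)
    return round(left), round(top), round(side)


# Three invented landmarks (two eyes and a chin) in a 640x480 image.
points = [(100, 120), (180, 120), (140, 200)]
print(square_crop_from_landmarks(points, 640, 480))  # (80, 100, 120)
```

The approach works because the landmarks define the salient region unambiguously, which is exactly what is missing for most other kinds of subject.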

But there are many more possible objects of training than just isolated faces, and it could be that compositional approaches such as the one suggested in the new Japanese paper may contribute something valuable to the struggle – at least until such time as the 1:1 ratio will seem like a barbaric requirement, due to pending breakthroughs – or even radical technological shifts, such as quantum training.

* See page 13 of the source paper for examples of these boxes, though the illustration is too large to reproduce here.
