Anyone who has ever attempted to create images of people in a Latent Diffusion Model (LDM) such as Stable Diffusion will likely have inadvertently generated ‘over cropped’ or poorly-composed images such as these:
Though some of these types of ‘miscomposition’ may have an amusing artistic flavor, mostly they just look like someone tripped up or nudged the photographer at the crucial moment.
Such unfortunate renderings are not limited to Stable Diffusion, either, as frequent users of OpenAI’s DALL-E series of generative models have found over the last year or so. This is because the training processes for generative models share a common flaw: the millions of web-scraped images that are used to train the models are not usually square, while the industry standard for training AI models most definitely is.
So, in the case of Stable Diffusion, as has been noted more than once, the model’s tendency to ‘cut off heads’ and mis-frame subjects happens because the model has been fed automatically cropped versions of the source data.
If you pass a model such data, it is going to presume, at least occasionally, that this is what you want it to output (after all, such data occurred occasionally); it is only because there are, statistically, enough good compositions and centered shots capable of surviving arbitrary cropping that the average output is, on balance, better than this.
Buckets and Squares
As one would expect, there has been some effort in the research sector to remediate the problem.
In regard to the need for training images to actually be square, it should be noted that square formats accommodate standard PC hardware well because they respect the way that the hardware and the software written for it always scales up in multiples of 2.
Thus the standard training formats for computer vision models are 32×32px, 64×64px, 128×128px, 256×256px, and 512×512px – with increasing use of the more VRAM-hungry intermediate 768×768px format and the more logically-consistent 1024×1024px format.
This proceeds from the way that standard computers address memory, in powers of two, and the way that consumer and business computing hardware accords with this schema. Thus 13KB of data will still occupy a 16KB allocation, because that’s as granular as it gets in standard, non-quantum architecture.
In terms of getting the most out of limited machine learning resources, you might as well occupy the entire available block – and that’s always going to give you a square image space.
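To make the allocation logic concrete, here is a minimal Python sketch (not tied to any particular framework or real allocator) of rounding a payload up to the next power of two:

```python
def next_pow2(n: int) -> int:
    """Smallest power of two greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p

# A 13KB payload still consumes a full 16KB block at this granularity:
print(next_pow2(13 * 1024) // 1024)  # 16
```

The same rounding logic is why the standard training resolutions listed above all land on powers of two.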
However, there is nothing to stop systems from breaking up images into smaller sectors, but treating the re-assembled image as a single training entity. This process, called bucketing, is currently gaining momentum in the training schemas of generative systems, and is presently used (optionally) by the SD Scripts/Kohya Low-Rank Adaptation (LoRA) open source frameworks, which allow users to train ‘sidecar’ files that can be inserted into Stable Diffusion output, without having to fine-tune the whole model.
The trouble with bucketing is that, as the name suggests, it groups similar images, and it does so by ratio and dimensions (they’re not the same thing, since a 2:3 ratio image could have any possible height and corresponding width). Therefore, if 2000 training images are one ratio, and 100 are a different ratio, training can become adversely affected, not least in terms of the time it takes, because the marginal sizes get equal attention, but represent less data. If the ratios and the image sizes are varied, the problem is compounded.
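The imbalance problem can be illustrated with a few lines of Python – grouping a hypothetical set of image dimensions by reduced aspect ratio, where one bucket ends up with far more data than another (the sizes here are invented for demonstration):

```python
from collections import defaultdict
from fractions import Fraction

def bucket_by_ratio(sizes):
    """Group (width, height) pairs by their reduced aspect ratio."""
    buckets = defaultdict(list)
    for w, h in sizes:
        buckets[Fraction(w, h)].append((w, h))
    return buckets

# Three 2:3 images at different resolutions, one lone square image:
sizes = [(512, 768), (1024, 1536), (640, 960), (768, 768)]
buckets = bucket_by_ratio(sizes)
for ratio, items in buckets.items():
    print(ratio, len(items))  # 2/3 → 3 images, 1 → 1 image
```

Note that the three 2:3 images land in one bucket despite having different pixel dimensions – ratio and dimensions are distinct, just as described above.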
Therefore, though bucketing does not require square images, it is non-optimal for collections with unbalanced distributions of image ratios. Though techniques such as quantile bucketing can help to obtain a more even distribution for purely numerical data, they do not apply as well to vision-based systems, which are essentially hijacking the bucketing process to cut up non-standard image sizes.
Consequently, the latest wisdom, at least for LoRA training, is to choose two, or perhaps three, standard ratio formats and cut your images into those formats, to reduce the likelihood of unbalanced buckets. And this brings us right back to the issue of intelligently cropping source images, instead of simply using them in the state they were found.
It’s a problem which has received some attention in recent years, for instance with the 2014 smartcrop.js project, which has recently been added to the free Bulk Image Resizing Made Easy (BIRME) platform, popular with casual ML enthusiasts as a quick method of conforming various-sized image collections into AI-friendly square formats.
The library is algorithmic rather than AI-trained. Though the project’s GitHub repository currently states that an ML version is in the works, the appeal of (arguably) dumb systems such as this is that they are likely to run quite fast, and on reasonably economical resources, whereas neural network inference would likely slow the process down dramatically (though it could be used to produce a better ‘flat’ algorithm).
This is a crucial factor, because though solutions to the smart cropping problem abound, no solution that is rational and economic has yet come to light. Therefore, as any professional or hobbyist ML practitioners will know well, cropping remains sadly arbitrary in even very recent and cutting-edge frameworks.
This either places a notable burden of data pre-processing onto project teams, or else requires them to consider post-processing or post-training solutions, or other measures designed to mitigate the semantic damage that mis-cropped images wreak on training models.
Part of the problem is that the definition of ‘salient content’ may vary between projects. Though it may surprise those with an interest in deepfakes or neural facial synthesis, model creators may wish to concentrate on other parts of an image of a person than their face:
Additionally, it may be desired to focus on parts of a face, on other aspects of clothing, or on any myriad number of other possibilities that don’t fit a facial synthesis or facial recognition pipeline.
Once you begin to consider non-human subjects such as buildings and environments, the problem of what a ‘salient’ element is magnifies, since the considerable bias in the datasets that power human neural depiction is no longer available as an easy shortcut.
For instance, in the example below, a cropping algorithm is attempting to second-guess the compositional skills of the original photographer, whose work is seen on the far left. A logical crop, created by a human as ground truth, is seen on the far right, and the relatively poor guess that the system made is seen in the middle column.
A Compositional Approach
The image above is an exceptional failure case featured at the end of an otherwise interesting new paper that offers a novel approach to image cropping, which takes into account the twin needs to both conform to particular ratios (i.e., for web or print design) and to perform the best crop possible within the chosen ratios.
Though the system is not aimed directly at the needs of dataset preparation, the way that bucketing can take advantage of ratios, among other broader possibilities in computer vision and generative applications, makes it an interesting potential method of preparing datasets.
Above all, the new work (which is titled Image Cropping under Design Constraints, and comes from three authors across the University of Tokyo and CyberAgent, Inc. in Japan) takes good account of the fact that workable ML-based solutions to the cropping conundrum need to conserve resources and to operate quickly. If it were not so, one could run images with inconvenient ratios programmatically through a generative system such as Adobe’s Firefly, or Stable Diffusion, and simply add whatever content might be needed to fill out the ratio shortfall.
Besides the electricity and server bills, the wait times for such an approach are unthinkable at scale; and even more modest AI-based solutions come with severe caveats in regard to processing times and potential cost, if dealing with more than a few hundred images.
Therefore the authors of the new paper offer two approaches: a score-based method, which performs quite intensive evaluation on source images and which obtains good results, while taking up only moderate resources; and a heatmap-based method, which is more effective, but comes at a higher compute cost.
The new work addresses the twin considerations of aspect ratio and a layout condition, the latter of which represents a disposition of diverse elements within the composition, defined by bounding boxes.
‘Our objective’, the authors state, ‘is that a result of cropping [an image] is aesthetically improved from the original input image while satisfying the two conditions’.
For the score-based function approach, the system uses the pretrained Grid Anchor based Image Cropping (GAIC) system, a PyTorch framework which reduces the traditional millions of possible choices to several hundred, and iterates down to the optimal choices. GAIC therefore provides the proposal-based aesthetics scores for the source images.
However, the implementation is modified in the new system by increasing the number of possible proposals, and imposing requirements for a minimum size of height and width in a linear fashion, and seeking to use the largest possible area of the source image within all the constraints.
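The constraint logic can be sketched in a simplified form – a hypothetical search that honors a target aspect ratio and minimum dimensions while maximizing area, deliberately omitting the proposal-based aesthetic scoring that the actual system layers on top (function name and parameters are illustrative, not the paper’s):

```python
def best_crop(img_w, img_h, aspect, min_w, min_h, step=32):
    """Among crops of the given aspect ratio that meet the minimum
    dimensions, return the largest-area candidate as (x, y, w, h).
    Simplified sketch; aesthetic/placement scoring is omitted."""
    best = None
    w = min_w
    while w <= img_w:
        h = round(w / aspect)
        if min_h <= h <= img_h:
            if best is None or w * h > best[2] * best[3]:
                best = (0, 0, w, h)  # placement fixed at origin here
        w += step
    return best

# A 16:9 crop from a 1920×1080 source can use the whole frame:
print(best_crop(1920, 1080, 16 / 9, 640, 360))  # (0, 0, 1920, 1080)
```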
For the heatmap-based approach, the system adapts the 2022 Human-centric Image Cropping architecture, wherein aesthetic information is extracted via a deep neural network to produce Grad-CAM-style heatmaps, which emphasize salient areas in grayscale (though this is frequently converted in practice to a transparent color overlay, for utility).
The authors state:
‘We train a neural network to predict the heatmap. We assume that the heatmap includes sufficient aesthetic information for image cropping and evaluate each cropping result without a repeat of neural network computation using the heatmap.’
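The idea of scoring many candidate crops from a single predicted heatmap – without re-running the network – can be illustrated with an integral image (summed-area table). This is a standard trick offered here as an assumption about how such evaluation could work cheaply, not a detail confirmed by the paper:

```python
def integral_image(hm):
    """2-D prefix sums, so any rectangle's total is an O(1) query."""
    H, W = len(hm), len(hm[0])
    ii = [[0.0] * (W + 1) for _ in range(H + 1)]
    for y in range(H):
        for x in range(W):
            ii[y + 1][x + 1] = (hm[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def crop_score(ii, x, y, w, h):
    """Summed saliency inside the crop (x, y, w, h)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

# A toy 3×3 heatmap with a salient central peak:
heat = [[0, 1, 0],
        [1, 3, 1],
        [0, 1, 0]]
ii = integral_image(heat)
print(crop_score(ii, 0, 0, 3, 3))  # 7.0 – total saliency
print(crop_score(ii, 1, 1, 1, 1))  # 3.0 – the central peak alone
```

Once the table is built, thousands of candidate crops can be compared at negligible cost – which is the efficiency the authors are after.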
Data and Tests
For evaluation purposes, the authors generated a new version of the FLMS dataset, an image-cropping collection associated with the 2014 Dartmouth/Adobe paper Automatic Image Cropping using Visual Composition, Boundary Simplicity and Content Preservation Models.
The authors made selections from the original 500 images in the FLMS dataset that satisfied the layout conditions that they wished to impose, and which had suitable ground truth examples.
Eight types of bounding boxes were considered*:
‘[We] place four narrow boxes along each image side and four large boxes by dividing images from a center point with vertical and horizontal lines. We expect that these blanks are useful for the placement of something like text elements or logos. Further, we add aspect ratio conditions by computing aspect ratios from the bounding box of ground truth.
‘Then we obtain the set of input images with design constraints and ground truth of outputs. For each pair, when the ground truth region encompasses a layout pattern, we simply retain the aspect ratio of the ground truth region as an input condition. Through this process, we achieve a set comprising the image, the layout pattern, the aspect ratio, and the ground truth region.’
In total, 4,426 sets of design constraints and related crops were gathered for the subset.
The researchers devised a baseline method for design-constrained image cropping. In accord with previous works, TranSalNet saliency maps were used. The baseline therefore consists of a target layout mask and a saliency map mask, the needs of which must be balanced during the process.
The two derivative demands are called Saliency & Short Edge and Saliency & Long Edge.
The primary evaluation metric was Intersection over Union (IoU), which is commonly used in evaluating positions and other qualities for adjacent or overlapping bounding boxes.
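For reference, IoU between two boxes reduces to a few lines of Python; the (x1, y1, x2, y2) corner convention used here is illustrative:

```python
def iou(a, b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 100×100 boxes overlapping by half share a third of their union:
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # 0.333…
```

A score of 1.0 means the predicted crop matches the ground truth exactly; 0 means no overlap at all.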
The feature extractor used was VGG16, and the model was trained with the Adam optimizer at an initial learning rate of 3.5e-4 and a weight decay of 1e-4, with the learning rate stepped down every 5 epochs over a total of 30 epochs (an epoch being a complete pass through the entirety of the data by the training routine).
For the backbone of the proposal-based (i.e., score-based) approach, a pre-trained GAICv2 model was used, and all models were trained on a sole NVIDIA Tesla T4 with 16GB of VRAM.
The new system comfortably outperformed the proposed baseline:
The authors comment:
‘We observe that the proposed approaches outperform the baseline quantitatively and qualitatively. The baseline tends to lose aesthetics for satisfying a given aspect ratio and a specific layout while our approaches find a good view in the areas that satisfy given conditions. The heatmap-based approach shows the better score than the proposal-based approach.
‘These results indicate that the score functions are effective for image cropping under design constraints, and the heatmap-based approach can achieve better optimal comparing the proposal-based approach.’
However, the size of the bounding boxes is tied to higher computational demands and to latency, meaning that the heatmap-based approach is more demanding and, depending on hardware resources, will take longer to execute.
The paper concludes:
‘[The] heatmap-based approach has the trade-off of the performance and computation cost in the iteration of optimization, and fine fitting leads to large improvement comparing the proposal-based approach, though it requires a high computation cost, e.g., iteration 500 takes 20 seconds.
‘The proposal-based approach achieves better performance than the heatmap-based approach under the same computation cost, while the heatmap-based approach achieves better performance with more computation cost.
‘This result indicates that balancing aesthetically plausible areas and satisfying multiple conditions is not a trivial task and requires sensitive balance, and both proposed approaches are reasonable alternatives.’
Since the human face (and body) fits an upright rectangle, and the world of landscapes and architecture tends to favor horizontal rectangles, popular photography ratios remain at war with the square formats that are best suited to computer vision and generative AI training.
In a way, the tendency towards a non-square lens format is in itself a form of cropping: high-end medium-format cameras such as Hasselblad’s were technology lines dedicated to the square ratio, and square formats have not been entirely absent from consumer hardware either.
However, so long as binary computing systems dictate that square formats are the optimal use of hardware in the training of vision and generative systems, it seems that compromises will continue to be necessary, even if it is just the crude and desperate tactic of adding black borders to rectangular images so that they will accommodate the square format (which is a relatively common practice, though it can lead to borders being produced occasionally at inference time).
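The border-padding tactic mentioned above amounts to simple arithmetic; a minimal sketch of computing the (left, top, right, bottom) borders needed to square off an image (the function name is illustrative):

```python
def letterbox_padding(w, h):
    """Border sizes (left, top, right, bottom) needed to pad a
    w×h image to a square whose side is the longer dimension."""
    side = max(w, h)
    dx, dy = side - w, side - h
    # Split any odd remainder between opposite edges:
    return (dx // 2, dy // 2, dx - dx // 2, dy - dy // 2)

# A 16:9 frame gains 420px bars above and below:
print(letterbox_padding(1920, 1080))  # (0, 420, 0, 420)
```

It is precisely these black bars that can leak back out as spurious borders at inference time, as noted above.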
There are many other possible avenues of research that could lead to better cropping systems for data preparation, such as text-prompted frameworks that are capable of recognizing objects, body parts, etc., and will favor these when cropping; but they tend to come at a significant cost.
If you’re looking exclusively to extract faces, the problem is long since solved, as deepfake applications such as DeepFaceLab and FaceSwap will automatically seek out facial alignment landmarks and crop the resulting image as closely to the landmarks as possible while conforming to a square format.
But there are many more possible objects of training than just isolated faces, and it could be that compositional approaches such as the one suggested in the new Japanese paper may contribute something valuable to the struggle – at least until pending breakthroughs, or even radical technological shifts such as quantum training, make the 1:1 ratio seem like a barbaric requirement.
* See page 13 of the source paper for examples of these boxes, though the illustration is too large to reproduce here.