Researchers Say They’ve Solved One of the Most Annoying Flaws in AI Art

In the past few years, artificial intelligence has made stunning strides in generating images, so much so that it’s often hard to tell real images from AI-generated ones. But if you’ve ever asked an AI to generate a non-square image, perhaps a widescreen portrait or a banner-sized artwork, you might have encountered some eerie artifacts. Sometimes these are subtle shadows or bizarre colors; other times they are extra fingers, warped faces, and oddly duplicated objects.

Now, a team of researchers at Rice University has tackled this problem with a new approach called ElasticDiffusion, a method that promises to make AI-generated images more consistent across different aspect ratios. If successful, it could mark a fundamental shift in the way AI images are created, eliminating one of the field’s most persistent flaws.

The picture on the left was generated by a standard method while the picture on the right was generated by ElasticDiffusion. The prompt for both images was, “Envision a portrait of a cute scientist owl in blue and gray outfit announcing their latest breakthrough discovery. His eyes are light brown. His attire is simple yet dignified” (Image courtesy of Moayed Haji Ali/Rice University)

The Achilles’ Heel of AI Image Generators

Generative AI models have dazzled the world with their ability to turn simple text descriptions into lifelike images. But these models are designed to work best with square images and handle unusual formats poorly. When asked to create images at different sizes, such as the 16:9 aspect ratio used in many monitors or the tall, narrow proportions of a smartphone screen, they often struggle.

This limitation stems from the way these models learn. Diffusion models, the dominant approach in AI-generated images, start with a collection of real-world images, then add noise layer by layer—essentially distorting the original images beyond recognition. To generate a new image, the model reverses the process, slowly and iteratively removing the noise until a coherent picture emerges.
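The noising and denoising idea can be sketched in a few lines of Python. This is a minimal illustration, not any particular model's code: the linear noise schedule, the step count, and the function names (make_alpha_bars, add_noise, remove_noise) are all assumptions chosen for the example, and the "image" is just a short list of pixel values. A real system would use a trained neural network to predict the noise; here the true noise stands in for that prediction.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal kept after each step (assumed linear schedule)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, t, alpha_bars, rng):
    """Forward direction: blend clean pixels with Gaussian noise at step t."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(a) * p + math.sqrt(1.0 - a) * n for p, n in zip(pixels, noise)]
    return noisy, noise

def remove_noise(noisy, predicted_noise, t, alpha_bars):
    """Reverse direction: estimate the clean pixels from a noise prediction."""
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * n) / math.sqrt(a)
            for x, n in zip(noisy, predicted_noise)]

# Toy usage: the true noise stands in for a trained network's prediction.
rng = random.Random(0)
alpha_bars = make_alpha_bars()
pixels = [0.1, 0.5, 0.9, -0.3]
noisy, noise = add_noise(pixels, 500, alpha_bars, rng)
restored = remove_noise(noisy, noise, 500, alpha_bars)
```

In practice the reverse direction is run over many small steps rather than one jump, which is why generation is iterative.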

The problem is that these models have been trained primarily on square images. When forced to generate an image in a non-square format, they often duplicate image elements to fill the extra space, leading to the kind of surreal distortions that have made AI-generated hands an internet meme.

“If you train the model on only images that are a certain resolution, they can only generate images with that resolution,” said Vicente Ordóñez-Román, an associate professor of computer science who advised Moayed Haji Ali on his work alongside Guha Balakrishnan, assistant professor of electrical and computer engineering.

One potential fix for this problem is retraining these AI models on a wider variety of aspect ratios. But there’s a catch: training a diffusion model is incredibly expensive.

“You could solve that by training the model on a wider variety of images, but it’s expensive and requires massive amounts of computing power: hundreds, maybe even thousands of graphics processing units,” Ordóñez-Román said.

Overcoming This Problem

The picture on the left was generated by a standard method while the picture on the right was generated by ElasticDiffusion. The prompt for both images was, “Photo of an athlete cat explaining its latest scandal at a press conference to journalists.” Image courtesy of Moayed Haji Ali/Rice University.

ElasticDiffusion solves the aspect ratio problem by separating two different types of image information:

Local information, which includes fine-grained details like the shape of an eye or the texture of fur.
Global information, which captures the overall structure of the image—such as whether it contains a person, a dog, or a tree, and how those elements should be arranged.
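One way to picture this separation is the toy Python sketch below. It is an illustration under stated assumptions, not the ElasticDiffusion implementation: it assumes the local signal is computed tile by tile at a size the model is comfortable with, while the global signal is computed once on a small grid and stretched to the target shape. The function names, the nearest-neighbor stretching, and the weighted sum are all hypothetical choices made for the example.

```python
def upsample_nearest(grid, out_h, out_w):
    """Stretch a small 2D grid to out_h x out_w by nearest-neighbor lookup."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def local_signal_by_tiles(canvas, tile, local_fn):
    """Apply local_fn to each tile-sized patch and stitch the results together.

    local_fn is a stand-in for a model's fine-detail prediction and must
    return a patch of the same shape it was given.
    """
    h, w = len(canvas), len(canvas[0])
    out = [[0.0] * w for _ in range(h)]
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            patch = [row[left:left + tile] for row in canvas[top:top + tile]]
            for r, row in enumerate(local_fn(patch)):
                for c, value in enumerate(row):
                    out[top + r][left + c] = value
    return out

def combine(local, global_small, weight):
    """Add the stretched global signal to the local one (hypothetical weighting)."""
    h, w = len(local), len(local[0])
    global_full = upsample_nearest(global_small, h, w)
    return [[local[r][c] + weight * global_full[r][c] for c in range(w)]
            for r in range(h)]

# Toy usage on a wide 4x8 canvas, processed as two square 4x4 tiles.
canvas = [[1.0] * 8 for _ in range(4)]
halve = lambda patch: [[v * 0.5 for v in row] for row in patch]  # stand-in model
local = local_signal_by_tiles(canvas, 4, halve)
layout = [[1.0, 2.0], [3.0, 4.0]]  # stand-in global structure on a 2x2 grid
result = combine(local, layout, 2.0)
```

The point of the sketch is that fine details never need to be computed at an unfamiliar shape, while the overall layout is decided once for the whole canvas.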

To demonstrate the power of ElasticDiffusion, the researchers tested it against a conventional diffusion model using the same text prompts.

One example was the request for a “photo of an athlete cat explaining its latest scandal at a press conference to journalists.” The conventional model produced an image riddled with odd duplications and visual artifacts, while the ElasticDiffusion-generated cat appeared more natural.

Another test involved generating an owl scientist in a dignified outfit, announcing a breakthrough. Again, the traditional model struggled with odd repetitions, while ElasticDiffusion produced a far cleaner and more natural image.

The results suggest that ElasticDiffusion could be a game-changer for the industry, paving the way for a new type of AI image generator that produces high-quality images at any aspect ratio without the usual weirdness. But there is a trade-off: ElasticDiffusion is currently slower than conventional diffusion models.

“It takes up to six to nine times longer for our method to generate an image compared to something like Stable Diffusion,” Haji Ali admitted. The team’s goal is to optimize the process so that it can generate images as quickly as existing models while preserving its accuracy.

“Where I’m hoping that this research is going is to define…why diffusion models generate these more repetitive parts and can’t adapt to these changing aspect ratios and come up with a framework that can adapt to exactly any aspect ratio regardless of the training, at the same inference time,” said Haji Ali.

You can check out the project demo here and access the code here.

The results have been published in a peer-reviewed paper presented at the Institute of Electrical and Electronics Engineers (IEEE) 2024 Conference on Computer Vision and Pattern Recognition (CVPR) in Seattle.