Last week, Swiss software engineer Matthias Buhlmann discovered that the popular image synthesis model Stable Diffusion can compress existing bitmap images with fewer visual artifacts than the JPEG or WebP formats at high compression ratios, although the technique is currently lossy in its own ways.
Stable Diffusion is an AI image synthesis model that typically creates images based on text descriptions (called "prompts"). The AI model learned this ability by studying millions of images from the Internet. During the training process, the model makes statistical associations between images and related words, distilling key information about each image into a much smaller representation and storing it as "weights", which are mathematical values that represent what the AI model knows about pictures.
When Stable Diffusion analyzes and "compresses" images into the form of weights, those weights exist in what researchers call "latent space", which is a way of saying that they exist as a kind of fuzzy data that can be turned back into images once decoded. With Stable Diffusion 1.4, the weights file is roughly 4GB, but it represents knowledge about hundreds of millions of images.
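A rough back-of-the-envelope calculation illustrates how little per-image capacity those weights can contain. The 4GB figure comes from the article; the training-set size of 300 million images is an illustrative assumption standing in for "hundreds of millions":

```python
# Back-of-the-envelope: weight capacity per training image.
# 4 GB of weights is from the article; 300 million training images is
# an illustrative assumption for "hundreds of millions".
weights_bytes = 4 * 1024**3
training_images = 300_000_000  # assumed figure

per_image = weights_bytes / training_images
print(f"~{per_image:.0f} bytes of weight capacity per training image")
```

Under those assumptions, the model has only about 14 bytes of weight capacity per training image, which is why the weights hold statistical generalizations rather than copies of the images themselves.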
While most people use Stable Diffusion with text prompts, Buhlmann removed the text encoder and instead ran his images through Stable Diffusion's image encoder, a process that takes a low-precision 512×512 image and converts it into a higher-precision 64×64 latent space representation. At this point, the image exists at a much smaller data size than the original, but it can still be expanded (decoded) back into a 512×512 image with fairly good results.
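The size reduction from that encode step can be sketched with simple arithmetic. This is a minimal illustration, assuming the standard Stable Diffusion shapes described above (512×512 RGB in, a 64×64 latent with 4 channels out); the final on-disk sizes Buhlmann reports depend on how aggressively the latent values are quantized afterward:

```python
# Sketch of the size reduction in the encode step described above.
# Assumes standard Stable Diffusion shapes: 512x512 RGB bitmap in,
# 64x64x4 latent tensor out. On-disk size then depends on quantization.

def rgb_bytes(width, height, channels=3):
    """Size of a raw 8-bit-per-channel bitmap, in bytes."""
    return width * height * channels

def latent_values(width=64, height=64, channels=4):
    """Number of values in the latent tensor."""
    return width * height * channels

original = rgb_bytes(512, 512)   # 786,432 bytes uncompressed
latent = latent_values()         # 16,384 values

print(f"raw bitmap:    {original:,} bytes")
print(f"latent tensor: {latent:,} values "
      f"({latent / original:.1%} of the original value count)")
```

Even stored naively at one byte per value, the latent is about 16 KB, roughly 2% of the raw bitmap; further quantization is what brings it down toward the ~5 KB file sizes in the comparison below.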
In testing, Buhlmann found that images compressed with Stable Diffusion subjectively looked better at higher compression (smaller file size) than JPEG or WebP equivalents.
The example above shows a photo of a pastry shop compressed to 5.68 KB using JPEG, 5.71 KB using WebP, and 4.98 KB using Stable Diffusion.
The Stable Diffusion image appears to retain more detail and show fewer obvious compression artifacts than those in the other formats.
However, Buhlmann's method currently has significant limitations:
It does not handle faces or text well, and in some cases it can add features to the decoded image that were not present in the source image. Naturally, no one wants an image compressor that invents details that were never in the original.
Also, decoding requires the roughly 4GB Stable Diffusion weights file and additional decoding time.
Buhlmann's code and further technical details are available on Google Colab and Towards AI.