Consider this block of rainbow static, picked at random from the space of all possible 256×256 8-bit RGB images when you opened this page:
It contains 256×256 pixels × 3 channels/pixel × 8 bits/channel = 1,572,864 bits of random information, making for 2^1,572,864 possible images that could have been shown. I'm confident in calling it a "block of rainbow static", even though many of the 2^1,572,864 possibilities don't match that description, like these four:
If sampling an image uniformly at random from the space of all possible images almost always gives us rainbow static, it follows that the vast majority of possible images are blocks of rainbow static. The images that aren't rainbow static - all the meaningful images that anyone might ever care to create or save or share with their friends - make up only an unimaginably tiny fraction of the whole space of possibilities.
The 1,572,864 bits that it takes to store a 256×256 image work out to 192KiB. 256×256 is a low resolution by modern standards. There are a lot of 1920×1080 images running around nowadays; an image at that resolution is going to be 5.9MiB. These files are getting impractically large.
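The arithmetic above is easy to check. A quick sketch (the figures are for raw, uncompressed pixel data):

```python
# Raw size of an uncompressed 8-bit RGB image:
# width * height * 3 channels * 8 bits per channel.
def raw_bits(width, height):
    return width * height * 3 * 8

small = raw_bits(256, 256)
print(small, "bits =", small // 8 // 1024, "KiB")    # 1,572,864 bits = 192 KiB

hd = raw_bits(1920, 1080)
print(round(hd / 8 / 1024 / 1024, 1), "MiB")         # 5.9 MiB
```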
Are you working with images sampled uniformly from the space of all possible images? Meaningless blocks of static, as it were? I'm afraid that you're going to have to deal with the large file sizes. I'm terribly sorry. You can stop reading now.
...
You're all still here?
Excellent.
If we're working with images picked uniformly at random, it's mathematically impossible to do better than 192KiB per image, but in practice people don't pick images uniformly at random. Some images (meaningful images) are more likely to be picked than others (rainbow static). It should be possible to develop a variable-length image encoding scheme that takes fewer than 1,572,864 bits to store the more common images at the cost of taking more than 1,572,864 bits to store some of the less common images.
Someone has already done the developing for us. Behold, the PNG spec.
PNG took the general purpose DEFLATE compression algorithm and added a preprocessing step to help DEFLATE see structure in images that it couldn't see otherwise. Encoding roughly consists of three steps. To summarize, working backwards:
Huffman coding: The image data stream is a series of integers in the range 0-255 representing how much red, green, or blue light to add to a pixel. Consider this image. 99% of the values are 0; the other 1% are picked from the range 1-255.
We've got a stream of values, each taken from a set of 256 possibilities, and we need to encode them as a string of bits. Our first impulse is to encode each value as one byte, but there's nothing saying we have to do it that way. Instead, let's encode the value 0 as the single bit `0` and all other integers as a `1` bit followed by an eight-bit encoding of the integer. The decoder will look like this:

```
def readPixelValue():
    if readBit() == 0:
        return 0
    else:
        return read8Bits()
```
99% of values are encoded in 1 bit and 1% are encoded in 9 bits, giving an average of 0.99 × 1 + 0.01 × 9 = 1.08 bits per value, a substantial reduction from 8 bits (1 byte) per value. (There's still a little room for improvement left because there are two ways to encode the value 0, namely the bit strings `0` and `100000000`, but you get the idea.)
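An encoder for this scheme, plus a round-trip check, fits in a few lines of Python (a sketch; the function names are mine, not part of any real codec):

```python
def encode(values):
    # 0 -> "0"; any other value -> "1" followed by the value as eight bits.
    return "".join("0" if v == 0 else "1" + format(v, "08b") for v in values)

def decode(bits):
    values, i = [], 0
    while i < len(bits):
        if bits[i] == "0":
            values.append(0)
            i += 1
        else:
            values.append(int(bits[i + 1:i + 9], 2))
            i += 9
    return values

data = [0] * 99 + [200]          # 99% zeros, like the example image
bits = encode(data)
assert decode(bits) == data      # round-trips correctly
print(len(bits) / len(data))     # 1.08 bits per value
```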
If we had an image that was mostly 0s and 255s, we might say `00` → 0, `01` → 255, `1xxxxxxxx` → xxxxxxxx, bumping 0 to a slightly longer code so that we can give 255 a shorter code.
Huffman coding is what you get if you take this line of thinking to its logical conclusion: an algorithm that, given the relative frequencies of the different values that you have to encode, will figure out what tradeoffs need to be made to make the best possible variable-length code to store your data, giving shorter encodings to more frequent values at the expense of taking more bits to store less frequent values. It's quite elegant, and if you want to know more Wikipedia explains it better than I can.
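The algorithm itself fits in a screenful of Python. Here's a generic sketch (my own names, not PNG's implementation): repeatedly merge the two least frequent subtrees, then read each symbol's code off its path from the root.

```python
import heapq

def huffman_code(freqs):
    # Min-heap of (frequency, tiebreak id, tree node); leaves are bare symbols,
    # internal nodes are (left, right) tuples.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new internal node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):    # internal node: recurse into both children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                          # leaf: record this symbol's code
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

Frequent symbols end up near the root with short codes; rare symbols end up deep in the tree with long codes, and no code is a prefix of any other, so the bit stream can be decoded unambiguously.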
LZ77: Now consider this grayscale image, where I've put every value in the range 0-255 in a random order and then repeated that pattern several times.
Each row is identical to the row two positions above it. Because all the possible values occur equally often, Huffman coding can't help us, but there's clearly some structure here. LZ77 detects and compresses repeated sequences of values. The first two rows are written out normally, but after that each new row can be written as an instruction to look back 256 bytes in the data that's just been decoded and repeat 128 bytes from there. In fact, why not encode two rows at a time - "go back 256 bytes, repeat 256 bytes"?
My understanding of LZ77 is even more tenuous than my understanding of Huffman coding; the curious reader is once again directed to Wikipedia.
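The decoding side of the idea, at least, is simple enough to sketch. A token is either a literal byte or a (distance, length) back-reference; copying one byte at a time lets a reference's length exceed its distance, which is how LZ77 expresses runs. (The names below are mine, not DEFLATE's.)

```python
def lz77_decode(tokens):
    out = bytearray()
    for token in tokens:
        if isinstance(token, int):
            out.append(token)             # literal byte
        else:
            distance, length = token
            for _ in range(length):       # byte-at-a-time copy handles overlap
                out.append(out[-distance])
    return bytes(out)

# Two 4-byte "rows", then "go back 8 bytes, repeat 8 bytes" to duplicate them.
rows = lz77_decode([10, 20, 30, 40, 50, 60, 70, 80, (8, 8)])
print(list(rows))  # [10, 20, 30, 40, 50, 60, 70, 80, 10, 20, 30, 40, 50, 60, 70, 80]
```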
Row filters: Now consider this final example image.
Going left to right, then top to bottom, the values we need to encode are [0, 1, 2, ..., 254, 255, 255, 254, 253, ..., 1, 0]. There's structure but DEFLATE can't see it. No one value is more common than the others, so Huffman coding can't help. No sequence is repeated, so LZ77 can't help either.
Before passing pixel data to DEFLATE, PNG runs a pre-processing step that makes the patterns easier for DEFLATE to see. It's the only step in the process that knows that we're looking at image data, not text or code or audio or anything else. For each row of pixels, the encoder picks one of five pre-processing filters to use. In this case, it would probably choose to use the `Sub` filter, which encodes the first pixel of each row normally and then each subsequent pixel as the difference between itself and the pixel to its left: [<new row, using filter Sub>, 0, +1, +1, ..., +1, <new row, using filter Sub>, 255, -1, -1, ..., -1]. In bytes, `Sub` is filter 1 and -1 is stored as its two's complement representation 255, so DEFLATE sees [1 (filter for new row), 0, 1, 1, ..., 1, 1 (filter for new row), 255, 255, 255, ..., 255]. The gradient that DEFLATE didn't understand has become long runs of repeated symbols, a kind of structure DEFLATE does understand and compresses well.
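For a single 8-bit grayscale row, the Sub filter is easy to sketch (a simplification: real PNG applies it per channel, at a bytes-per-pixel stride):

```python
def sub_filter(row):
    # First byte as-is, then each byte minus its left neighbor, modulo 256.
    return bytes([row[0]] + [(row[i] - row[i - 1]) % 256
                             for i in range(1, len(row))])

def sub_unfilter(filtered):
    # Invert the filter: running sum, modulo 256.
    out = [filtered[0]]
    for b in filtered[1:]:
        out.append((out[-1] + b) % 256)
    return bytes(out)

rising = bytes(range(256))             # 0, 1, 2, ..., 255
print(sub_filter(rising)[:4])          # b'\x00\x01\x01\x01' -- a run of 1s
falling = bytes(range(255, -1, -1))    # 255, 254, ..., 0
print(sub_filter(falling)[:4])         # b'\xff\xff\xff\xff' -- a run of 255s
```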
Taken together, these three steps find and exploit structure in intelligible images to give them shorter encodings. In exchange, the format adds a tiny amount of overhead in images where it can't find any structure, if only to say that no fancy encoding tricks were used.
By using PNG, we're saying that we expect to encounter the kind of images that PNG compresses well. (That's usually a safe assumption, but it wouldn't be true if we were working with pictures of unstructured static.) If you tilt your head and squint, PNG defines a probability distribution over the space of possible images. Images that are given shorter encodings are considered more likely to come up than images that are given longer encodings. We set out to save storage space and bandwidth, but along the way we accidentally characterized some of what it is about an image that makes it likely to come up in practice, which is to say, what makes it something other than a block of meaningless static.
Sampling from the distribution that PNG defines ought to produce more intelligible images than sampling from a uniform distribution. How do we sample from the PNG distribution? Well, optimally compressed data looks like random static, so we ought to be able to generate some random bits, assume that they're an optimally compressed image, and feed them through the PNG decoder.
Unfortunately most bit strings aren't well-formed PNG files, so we need to make a few small tweaks to the format to get a decoder that can process random bits without choking. You probably don't care about the details.
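It's easy to convince yourself of this empirically. Feeding random bytes to Python's zlib decompressor (a one-shot zlib-wrapped DEFLATE decoder, standing in here for PNG's) fails essentially every time, since a valid stream needs a correct header, well-formed blocks, and a matching checksum:

```python
import random
import zlib

random.seed(0)
trials, failures = 1000, 0
for _ in range(trials):
    blob = random.randbytes(64)   # 64 random bytes, posing as a zlib stream
    try:
        zlib.decompress(blob)
    except zlib.error:
        failures += 1
print(failures, "of", trials, "random blobs were invalid")
```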
First problem: PNG-the-file-format defines more than just a way of encoding pixel data. There are headers and checksums and places to store metadata and unless everything is just right the decoder will reject your file.
Solution: Pull out the part of the decoder responsible for decoding image data and feed it random bits directly, bypassing the rest of the format. (We were going to have to do something like that anyway if we want to generate random 256x256 8-bit RGB images instead of random-size random-bit-depth random-color-type images.)
Second problem: Not every bit string is a valid DEFLATE stream. A DEFLATE stream is a series of blocks, each of which carries a two-bit type code. A type `00` block contains unencoded data, essentially a fast way to let data that doesn't compress well through unprocessed. A type `01` block is encoded with LZ77 and then a prespecified Huffman code given in the DEFLATE spec. A type `10` block is the same as a type `01` block except it starts with a description of a custom Huffman code tailored to the data being encoded, which is great, except that the format for specifying custom Huffman codes is such that it's almost certain to choke on random data. A type `11` block is invalid.
Solution: Pretend block types `10` and `11` don't exist. Not ideal, but close enough.
Third problem: Even after ignoring block types `10` and `11`, not every bit string is a valid DEFLATE stream. The reasons are many and varied and boring. There are symbols in the predefined Huffman code that you're not supposed to use, it's possible to give LZ77 references that look back past the beginning of the file, type `00` blocks have error checking data for their length fields, and...
Solution: Change the decoder to ignore every type of error that could happen. If you're curious about the details, take a look at the diff between the original and patched decoders.
Fourth problem: PNG throws more than just pixel data into DEFLATE. At the beginning of each row's pixel data it adds one byte saying which filter it used to preprocess the row. There are only 5 filters, so if any of those filter specifier bytes are outside the range 0-4, the image is invalid. I'm pretty sure the filter specifications and the pixel data were mixed more out of convenience than out of a belief that they'd compress especially well sitting alongside each other, so...
Solution: Patch the PNG decoder so that it pulls only pixel data from the DEFLATE stream and takes filter specifiers from another out-of-band source, like a random number generator that generates numbers in the range 0-4.
With those hopefully acceptable departures from the real PNG format, we finally have something that can run on random data without choking.
In conclusion, I am proud to present a freshly generated image. [Interactive demo: pick the pixel dimensions and generation options, then generate an image in your browser.]
For UI source code, use your browser's "view source" button. Source code for the PNG decoder WASM module lives on GitHub.
Published 2023-12-30