Let's Write a Simple JPEG Library, Part-I: The Basics

A few weeks ago I started a small side project - implementing the JPEG specification. Though naive, my attempt was surprisingly fruitful and I managed to get a simple decoder working! But it wasn’t easy; I had to read through the serious & technical specification of JPEG and spent many hours debugging stuff. Which is why I decided to document what I did so that if someone like me ever decides to implement JPEG, they have nothing to worry about! :sunglasses: (That is, imagining that someone even visits this blog). Also, this blog was kinda dead and bare, so I guess I needed more content anyway.

Before we delve into the details, let me set one thing clear: this is a tutorial aimed at beginners like myself. As a result, I tend to limit the discussion to the point of interest, while trying to explain things as nicely as I can. So, it may or may not be perfect. You’re welcome to critic/suggest/discuss in the comments thread below :wink:

What Exactly is Covered Here?

We are going to understand what JPEG is, how it works and all the gory details involved along the way. Then, in subsequent posts, we will imlpement a simple, proof of concept JPEG encoder & decoder. And for brevity’s sake, we aren’t going to implement each and every feature of JPEG. Instead, our focus will be only on sequential, baseline JPEG compression, with a channel depth of 8-bits. Since I mentioned these tuts are directed towards novice programmers, I will intentionally tend to use clearer but inefficient code from time to time.

I hope that you have a good learning experience! Now, let’s roll!

What is an Image?

A moment, frozen for eternity that gives us a glimpse of the scene, transcending time… Oh wait, this is a lesson in programming & computer science, not poetry. :stuck_out_tongue:

In layman terms, a digital image is a 2D array whose elements are pixels. The pixels have one or more components depending upon the color model the image uses. Grayscale images have just one pixel component for denoting the intensity of white light. RGB color model has three components per pixel, red, blue and green, which when combined yield the actual color of the pixel.

Illustration-1 Figure-1: A close up of the RGB pixels (Image “Lenna” borrowed from Wikipedia)

What is Image Compression?

For starters, the term image compression refers to a set of steps, that an algorithm takes in order to churn out a modified version of the input image in such a way that:

These are the two main factors that determine how good the image compression technique is. It’s important to keep in mind that like everything else, there’s no such thing as the ultimate image compression algorithm. Which algorithm to use is totally dependent upon the need of the user. For example, JPEG, which is the focus of this post, is good for compressing raw images of natural scenes, which have smooth, gradual variations of color, and it is unsuitable for images with sharp contrasts. Also, JPEG compression is lossy, meaning that depending upon the compression quality, some amount of the original imagedata is thrown away, forever, to reduce the file size. On the other hand, there’s PNG, which is a lossless mode of image compression, meaning, it is guaranteed that all image data from the input can be restored back when decoding an image compressed using PNG. PNG results in much lower file sizes for compressing sharp contrast, e.g., images of text, which JPEG is bad at handling. Both JPEG and PNG work on two dimensional images. There are some image formats like DICOM, which can store accurate 3D images. Hence, DICOM is extremely popular in the medical community for storing medical data of CT scans, MRIs, etc. Since accuracy and correctness are crucial in this case, the corresponding file sizes are much larger than JPEG or PNG.

So, it’s all a matter of trade offs and necessities that determines which compression technique we will use.

Getting Started With JPEG

JPEG stands for Joint Photographic Experts Group, which develops this image compression technique and specifies the official standard, ITU-T.81, which is implemented by all JPEG libraries. At this point I would like to add that JPEG is the compression technique itself: it just specifies the steps for encoding and decoding. The actual file format for storing the images compressed using JPEG is specified in the JFIF and EXIF standards. They define how the compressed image data is to be stored in a file for exchanging and storing the compressed images.

The Heart of JPEG Compression

Now we’re perhaps at the most important part of this post. Now, I’m going to describe why JPEG achieves such great compression. Instead of dealing with the pixels in the the RGB color space, JPEG converts them to the \(YC_bC_r\) color space. The \(YC_bC_r\) color space has three components per pixel: the luminance, or the brightness of the pixel, \(Y\), the amount of blue in the pixel, \(C_b\), and, the amount of red in the pixel, \(C_r\). The \(Y\) component is called the luma component and \(C_b\) and \(Cr\) are known as the chroma components.

The benefit in doing this color space conversion is because of how human eyes work. The retina of a eye has two major types of photoreceptors, rods and cones. The rods are sensitive to the brightness of light and the cones to the color. And the number of rods in the retina massively outnumber the number of cones. As a result, a human eye is able to tell apart light intensity better than they can discriminate between colors. So, our eyes are more sensitive of small variations of low frequency over a large area than to high frequency variations in the same area. As we will later see, that JPEG drops the high frequncy data of the image and stores only the small frequency variations with greater accuracy, resulting in reduced file size. Apart from these two things, JPEG also employs things like Huffman coding, to achieve even more reduction!

So, in a nutshell, the JPEG exploits the fact that the human eye isn’t perfect!

‘Nuf, Talking! Show Me the Steps Already!

Step #1: Color space transformation from RGB to Y-Cb-Cr

Like we discussed previously in the post, an image is nothing but a 2D array whose elements are pixels. Each pixel has different components depending upon the color model of the image. Mostly, this 2D array of pixels that is fed to the JPEG algorithm uses the RGB color space, meaning, the there are three components: Red (\(R\)), Green (\(G\)) and Blue (\(B\)). To convert the pixel to its corresponding representation in the \(YC_bC_r\) color model, we use the following relations:

Step #2: Chroma Subsampling

We already know that the human eye is better at dealing with the brightness or luminescence than the colors. So, the separation of the luma (\(Y\)) and chroma (\(C_b\) & \(C_r\)) components enables us to throw away some of the information relating to the color, without significant loss of visual quality of the image that can be distinguished by the naked eye. So, what JPEG says is that, “take only x \(C_b\) & \(C_r\) values for every y of them”. Basically, we keep the luma component of each and every pixel, but only keep the chroma components once for every few pixels.

Step #3: Level Shifting

Since JPEG allows only 8 bits per channel, the values of the \(Y\), \(C_b\) and \(C_r\) components have the range 0-255 for their values. Before the next step in the encoding process can be applied, these values have to be level shifted to center them at 0. Hence, 128 is subtracted from the all the component values of each pixel.

Step #4: Discrete Cosine Transform (DCT) of Minimum Coded Units (MCU)

Now begins the fun part of the compression procedure. Instead of just processing the \(YC_bC_r\) pixels directly, they are converted to their spatial frequencies.

Given a signal that is a function of time, it can be decomposed into sinusoidal components of varying frequencies that make it up. Some of you may have heard about Fourier transform, which is one of the frequency transforms that can do this. DCT is a Fourier related frequency transform, which is what JPEG uses. DCT allows us to express a sequence of discrete data elements in terms of sum of different frequencies.

Illustration-2 Figure-2: Frequencies that are used to compose an MCU (we will see what this means in a moment). The ones to the top-left have lower frequencies than the ones closer to bottom-right. (Image borrowed from Wikipedia)

First, The image is subdivided into 8x8 pixel blocks, called minimum coded units or MCUs. In case the image dimensions aren’t multiples of 8, the image edges are modified to make them the nearest multiple of 8. For example, if the image is 317x457, its size is converted to 320x464. This stretching is usually done by just copying the rightmost column of pixels for extending horizontally, and the last row of pixels for extending vertically. Each of these 8x8 blocks can be represented by a combination of horizontal and vertical frequencies as described in Figure-2.

Hence, in a nutshell, what DCT does is takes a look at an MCU and then determines what percentage of each frequency can be combined to represent it (note here that DCT sees this 8x8 block as a single entity. Think about the fact that a signal be expressed in terms of other simpler ones of varying frequencies).

If you are still unsure about what a frequency transform is, take a look at this excellent video from 3Blue1Brown (and also subcribe to his YouTube channel if you havn’t already!).

What is the benefit of doing this frequency transform, you ask? The benefit will become more obvious in the next steps. So, please bear with me and brace yourself for the maths that’s about to come!

Now, let’s take a look at the DCT formula:

\(u\) is the horizontal spatial frequency
\(v\) is the vertical spatial frequency
\(C_u,C_v = \cases{ \frac{1}{\sqrt{2}}, & \texttt{if} u,v = 0\cr 0, & \text{otherwise} }\), are the normalizing factors
\(s_{yx}\), is the pixel value at coordinates \((x,y)\), and,
\(S_{vu}\), is the DCT coefficient at coordinates \((u,v)\)

The first equation, Forward DCT, is used when encoding an image from an uncompressed format to the JPEG format and the second equation, Inverse DCT, is used when decoding the byte stream in a JFIF/EXIF file, that was compressed previously using JPEG encoding.

Let’s consider an example MCU for a channel (the example channel can be any one of \(Y\), \(C_b\) or \(C_r\)):

Before applying FDCT, we have to level shift the values (by subtracting 128 from each element), resulting in the following MCU:

Now, using the above formula for calculating FDCT, we get:

(Note that if we take the IDCT of the above MCU and then restore the level shift, we get back the original MCU.)

Step #5: Quantization

After performing the spatial frequency transformation, it can be seen that the magnitude of the top left coefficients are rather large, and that of the first coefficient is the largest. The top-left entries coresspond to low frequency changes, that we are good at observing, and as we move towards the bottom right, the changes are less obvious to our eyes.

And like we already discussed, this flaw is expoloited by JPEG by throwing away a majority of the high frequency variations. Also, if you look more closely, you will observe that if we somehow rounded off these high frequency coefficients to zero, it wouldnot have much effect on the overall visual quality. Hence, what we do for each transformed MCU is that we quantize them. A quantization matrx is used for the quantization, which consists of 8x8 coefficients, denoting “how much” each coefficient has an importance in the overall MCU. The higher the effect, the lower the value of constant used for quantizing it.

Now, to illustrate this principle on our example, we use the quantization table specified in the JPEG standard (ITU-T.81, Annex-K, Table-K.1, Page-143) for luminance with quality 50%:

(Notice that the constants towards the upper-left are significantly smaller than the ones to the bottom-right.)

Quantization is done using the formula below:

\(A^{\prime}\) is the quantized MCU
\(Q\) is the quantization matrix

Hence, the quantized DCT coefficients become:

As you can see, after quantization, most of the bottom-right entries have become zero. And this is the key to the next step of entropy coding.

Step #6: Entropy Coding

The entropy coding step consists of grouping coefficients of similar fequencies together by ordering the 64 coefficients in the 8x8 MCU in a zig-zag manner, and then performing run-length encoding on the 64 element vector so obtained. Then this run-length encoded data is encoded using Huffman coding to furthur reduce the amount of bytes they take. This step is better explained with a practical example rather than just words.

Zig-Zag Traversal of MCU

First we do the zig-zag traversal of the MCU and convert it to a vector of 64 elements. The order of traversal of elements in the zig-zag manner is as follows:

Here, each element corresponds to it’s position in the traversal.

Let’s get back to the example MCU we were encoding. It’s zig-zag traversal is:

Run-Length Encoding

It is important to state here that the first coefficient, -30, also called the DC coefficient will not be encoded using run-length coding like the remaining 63 coefficients, also called the AC coefficients. It’s hard to explain why immediately, but it will become clear in the subsequent steps. For now, just bear in mind that the run-length encoding step applies only to the 63 AC coefficients.

As it’s obvious, there are a lot of zeros. Instead of using this 64 element stream and to save space, we just store the number of zeros preceding a non-zero value. For example, there are no zeros before the occurence of the first 2. So, it gets encoded as \((0,2)\). Similarly, there are two zeros before the occurrence of the second 1, so it gets encoded as \((2,1)\).

The remaining coefficients after the last non-zero value, i.e., 1, can be simply stored by having a convention that says “the rest are all zeros”. And the convention that is used for this is the pair \((0,0)\), or the EOB.

It’s also worth noting here that if the number of zeros before a non-zero value is more than 16, for instance, say, 23, we have can’t just store 23 in the pair. \(( 23 \text{ zeros}, 58)\) gets stored as \((15, 0), (8, 58)\) and not \((23, 58)\). The pair \((15, 0)\), just like \((0, 0)\), is a special value, used to denote sixteen consecutive zeros. But why do this? The reason behind this restriction becomes more clear in the next step of Huffman coding these AC coefficients.

Hence, the run-length encoding of the AC coefficients that is obtained after this step is:

Huffman Coding of the Run-Length Encoded Data

Huffman coding is a lossless data compression technique that can encode a series of input symbols (bytes, characters, etc.) into variable length codes where the frequently used symbols in the input are assigned shorter codes and the less used ones are assigned longer codes. The variable length codes are obtained by analyzing the frequency or probability of the occurence of each symbol in the input stream. Also, Huffman coding guarantees that no code is a prefix of a longer code (that is, you can’t have two codes in a Huffman coding scheme that are 111 and 1110). The exact steps for performing the analysis of frequencies and constructing the Huffman table for encoding can be found in the JPEG specification. In practice, the entries in the Huffman table are stored in the JFIF/EXIF file, from which the Huffman tree of symbols are constructed. But we are going to use the sample tables provided in the JPEG specification (ITU-T.81, Annex-K, Sections K.3.1 & K.3.2, Page- 149) for the remainder of these series of posts.

Encoding the DC Coefficient

Like I said in the last section, the DC coefficient is encoded a bit differently from the AC coefficients. The DC coefficient of an MCU has the largest magnitude and defines the overall value for that MCU. Also, the DC coefficients of consecutive MCUs are related; they only change gradually with respect to the each other. So, instead of encoding the entire magnitude, what JPEG does it stores the difference between the DC coefficients of corresponding MCUs (Note that the difference between the DC coefficients is calculated for each channel).

\(DC_{i+1} = DC_i + \text{difference}\)

The DC coefficient for the \({i+1}^{th}\) block is obtained by adding the \(\text{difference}\) with the DC coefficient of the \(i^{th}\) block. The \(\text{difference}\) for the \(0^{th}\) MCU is always 0.

Now, every \(\text{difference}\) value has a value category, which denotes the minimum number of bits needed to store the difference value’s bit representation. Don’t confuse bit representation of a value with it’s binary representation; instead take a look at the following difference category table and consider the DC coefficient (-30) from our example:

Range DC Difference Category Bits for the value
0 0 -
-1,1 1 0,1
-3,-2,2,3 2 00,01,10,11
-7,-6,-5,-4,4,5,6,7 3 000,001,010,011,100,101,110,111
-15,..,-8,8,..,15 4 0000,..,0111,1000,..,1111
-31,..,-16,16,..,31 5 00000,..,01111,10000,..,11111
-63,..,-32,32,..,63 6 000000,..,011111,100000,..,111111
-127,..,-64,64,..,127 7 .
-255,..,-128,128,..,255 8 .
-511,..,-256,256,..,511 9 .
-1023,..,-512,512,..,1023 A .
-2047,.., -1024,1024,..,2047 B .
-4095,.., -2048,2048,..,4095 C .
-8191,.., -4096,4096,..,8191 D .
-16383,.., -8192,8192,..,16383 E .
-32767,.., -16384,16384,..,32767 F .

From the table, we can see that -30 lies in the range \(-31 \text{ to } 31\), so it belongs to category 5, and, it’s bit representation is 00001, which makes the \(\text{difference}\) to be coded as \((5, 00001)\).

Now, assuming our example MCU is from the luminance(\(Y\)) channel, let’s take a look at the sample Huffman codes from the JPEG specification for encoding DC luminance coefficients (ITU-T.81, Annex-K, Section-K.3.1, Table-K.3, Page- 149):

Category Code length Code word
0 2 00
1 3 010
2 3 011
3 3 100
4 3 101
5 3 110
6 4 1110
7 5 11110
8 6 111110
9 7 1111110
10 8 11111110
11 9 111111110

From the table, we can see that the Huffman code for category 5 is \(110\). Therefore, the final bit stream that will be written to the JFIF/EXIF file is 11000001.

Encoding the AC Coefficients

Now that we know how the DC coefficients are encoded, it’s time to take a look at the AC coefficients. The run-length encoding of the AC coefficients that was obtained previously:

Each of these pairs of AC coefficients is inspected for the number of zeros preceding it, it’s value category, and finally the bit representation for the coefficient. The information for the number of preceding zeros, called the zero-run is denoted using 4 bits (RRRR) and so is the value category of the AC coefficient (SSSS). This is precisely the reason why JPEG limits that the number of zeros in the run-length coding should not exceed 16 as 4 bits can represent a maximum value of 16.

So, each AC coefficient now looks like this: \((RRRRSSSS, \text{Huffman code for coefficient})\). Assuming that the AC coefficents are for the luminance channel, let’s encode them one by one. The Huffman coding table that we will use can be found in the JPEG specification (ITU-T.81, Annex-K, Section-K.3.1, Table-K.5, Page- 150) and you are encouraged to consult the same as the table is too long to be displayed here.

Let’s start with the first coefficient, \((0,2)\). The zero-run of before 2 is 0 and the value category is 2. So, \(RRRR=0\) and \(SSSS=2\). From Table-K.5 mentioned above, it can be seen that the Huffman code for \((RRRR=0,SSSS=2)\) is \(01\). And from the value category table (the first table in last subsection), we find that the bit representaion for \(2\) is \(10\). Hence, bit stream that will be written to the JFIF/EXIF file is 0110.

Taking another example, for the second coefficient \((0,-5)\), \(RRRR=0\) and \(SSSS=3\), which makes the Huffman code to be \(100\). The bit representation of -5 is \(010\), which makes the bit stream for this coefficient 100010.

Similarly, the bit streams for the other coefficients are calculated:
\((1, -2)\) : 11100101
\((0, 1)\) : 001
\((0, -2)\) : 0101
\((2, 1)\) : 110111
\((3, 1)\) : 1110101
\((3, 1)\) : 1110101
\((0, 0)\) : 1010 (EOB)

Combining all the AC coefficients for the this MCU, the final bit stream turns out to be 0110 100010 11100101 001 0101 110111 1110101 1110101 1010.


To sum up what we covered in this post:

In the next part, we will get our hands dirty and start writing a JPEG decoder in C++ (yay! :smile:). You are free to follow along in your language of choice as understanding the steps is all that’s required.