Summary

This application note describes a two-dimensional Inverse Discrete Cosine Transform (2D IDCT) function implemented on a Xilinx FPGA. The reference design file provides behavioral code for implementation on any Xilinx device. Some advantages of the module include the ability to parametrize the IDCT function and to guarantee performance. The code can be further optimized by instantiating embedded adders and multipliers when targeting the Virtex™-II series of FPGAs. After an initial latency of 84 clock cycles, one IDCT value is output at every clock. A "Video Compression Using DCT" application note (XAPP610) covers DCT compression.

Decompression

Compression is the process of reducing the size of the data sent, thereby, reducing the bandwidth required for the digital representation of a signal. Compression technology can result in reduced transmission time due to less data being transmitted. It also decreases the storage requirements because there is less data. The data compressed at the transmitter needs to be decompressed at the receiver. IDCT is used to decompress DCT compressed data in the decoder. DCT/IDCT are two of the most computation intensive functions in compression. Therefore, a fast and optimized DCT/IDCT implementation is essential in improving the performance of the video coder and decoder. Figure 1 is a block diagram of a compression/decompression system.

IDCT Compression

Technical Aspects of the Compression Algorithm

The following material provides a brief description of IDCT. Further details can be found in "Image and Video Compression Standards" by Vasudev Bhaskaran and Konstantinos Konstantinides[1]. The 16-bit signed input provides 12-bit signed out pixels for the IDCT. In the 16 bits, disregarding the sign bit, there are 15 bits of data including the fraction bit. Doing an IDCT divides these 15 bits of data by eight (represented by 3 bits) resulting in 15 – 3 or 12 bits. Since the IDCT is a perfect inverse of DCT, by starting with an integer, an integer must be received back. From the 12 bits, the one fraction bit is ignored and the sign bit is added back.

![Figure 1: Block Diagram of a Compression/Decompression System][1]
giving a final 12-bit result. The algorithm used for the calculation of the 2D IDCT coefficients is based on the following equation:

\[
X_{C_{pq}} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X_{N_{nm}} \cdot \frac{c(p)c(q)}{4} \cdot \cos\left(\frac{\pi(2m+1)p}{2M}\right) \cdot \cos\left(\frac{\pi(2n+1)q}{2N}\right) \ (EQ \ 1)
\]

First, the 1D IDCT of the rows are calculated and then the 1D DCT/IDCT of the columns are calculated. The 1D DCT coefficients for the rows and columns can be calculated by separating equation 1 into the row part and the column part.

\[
C = K \cdot \cos\left(\frac{2 \cdot \text{col number} + 1 \cdot \text{row number} \cdot \pi}{2 \cdot M}\right) \ (EQ \ 2), \text{ where}
\]

- \( K = \frac{\sqrt{2M}}{N} \) for row = 0
- \( K = \frac{\sqrt{2N}}{M} \) for row ≠ 0

\[
C^t = K \cdot \cos\left(\frac{2 \cdot \text{row number} + 1 \cdot \text{col number} \cdot \pi}{2 \cdot N}\right) \ (EQ \ 3), \text{ where}
\]

- \( K = \frac{\sqrt{2N}}{M} \) for col = 0
- \( K = \frac{\sqrt{2M}}{N} \) for col ≠ 0

and \( M = \text{total number of columns} \), \( N = \text{total number of rows} \).

The constant values for \( C \) and \( C^t \) calculated from equations 2 and 3 are as follows:

\[
C = \begin{bmatrix}
23170 & 23170 & 23170 & 23170 & 23170 & 23170 & 23170 \\
32138 & 27246 & 18205 & 6393 & -6393 & -18205 & -27246 & -32138 \\
30274 & 12540 & -12540 & -30274 & -30274 & -12540 & 12540 & 30274 \\
27246 & -6393 & -32138 & -18205 & 18205 & 32138 & 6393 & -27246 \\
23170 & -23170 & -23170 & -23170 & 23170 & 23170 & 23170 & 23170 \\
18205 & -32138 & 6393 & 27246 & -27246 & -6393 & 32138 & -18205 \\
12540 & -30274 & 30274 & -12540 & -12540 & -30274 & 30274 & 12540 \\
6393 & -18205 & 27246 & -32138 & 32138 & -27246 & 18205 & -6393
\end{bmatrix}
\]

\[
C^t = \begin{bmatrix}
23170 & 23170 & 30274 & 27246 & 23170 & 18205 & 12540 & 6393 \\
32138 & 27246 & 12540 & -6393 & -23170 & -32138 & -30274 & -18205 \\
23170 & 18205 & -12540 & -32138 & -23170 & 6393 & 30274 & 27246 \\
30274 & -12540 & -6393 & -30274 & -18205 & 23170 & 27246 & -12540 & -32138 \\
23170 & -6393 & -30274 & -18205 & 23170 & -27246 & -12540 & 32138 \\
23170 & -12540 & -32138 & -23170 & -6393 & 30274 & -27246 & 18205 & 6393 \\
23170 & 18205 & -6393 & 30274 & -27246 & 23170 & -18205 & 12540 & -6393 \\
23170 & -32138 & 30274 & -27246 & 23170 & -18205 & 12540 & -6393
\end{bmatrix}
\]
A one-dimensional 8-point IDCT followed by an internal double buffer memory, followed by another one-dimensional 8-point IDCT provides the 2D IDCT architecture.

Vector processing using parallel multipliers is a method used for implementation of IDCT. The advantages in the vector processing method are regular structure, simple control and interconnect, and good balance between performance and complexity of implementation.

Behavioral Model

Using vector processing, the output Y of an $8 \times 8$ IDCT for input X is given by $Y = C^tX \cdot C$, where $C$ is the cosine coefficients and $C^t$ are the transpose coefficients. This equation can also be written as $Y = C^tZ$, where $Z = X \cdot C$. Using the cosine $C$ and inverse cosine $C^t$ numbers in equation 2 and 3, the intermediate value $Z = X \cdot C$ is calculated as follows:

$$Z(k,0) = \begin{bmatrix} x00 & x01 & x02 & x03 & x04 & x05 & x06 & x07 \\ x10 & x11 & x12 & x13 & x14 & x15 & x16 & x17 \\ x20 & x21 & x22 & x23 & x24 & x25 & x26 & x27 \\ x30 & x31 & x32 & x33 & x34 & x35 & x36 & x37 \\ x40 & x41 & x42 & x43 & x44 & x45 & x46 & x47 \\ x50 & x51 & x52 & x53 & x54 & x55 & x56 & x57 \\ x60 & x61 & x62 & x63 & x64 & x65 & x66 & x67 \\ x70 & x71 & x72 & x73 & x74 & x75 & x76 & x77 \end{bmatrix}$$

and


Or:

$$Z(k,0) = (23170x_{k0} + 32138x_{k1} + 23170x_{k2} + 23170x_{k3} + 23170x_{k4} + 23170x_{k5} + 23170x_{k6} + 23170x_{k7})$$

$$Z(k,1) = (32138 \cdot 27246 + 18205 \cdot 6393 - 6393 \cdot -18205 - 27246 - 32138)$$

$$Z(k,2) = (30274 \cdot 12540 - 12540 \cdot -30274 - 30274 \cdot -12540 - 30274)$$

$$Z(k,3) = (27246 \cdot -6393 - 32138 \cdot -18205 - 18205 \cdot 32138 - 27246 \cdot 6393)$$

$$Z(k,4) = (23170 \cdot -23170 - 23170 \cdot -23170 - 23170 \cdot -23170 - 23170 \cdot -23170 - 23170 \cdot -23170 - 23170 \cdot -23170)$$

$$Z(k,5) = (18205 \cdot -32138 + 6393 \cdot 27246 - 27246 \cdot -6393 - 32138 \cdot 18205)$$

$$Z(k,6) = (12540 \cdot -30274 + 30274 \cdot 30274 - 12540 \cdot -30274 - 30274 \cdot 30274)$$

$$Z(k,7) = (6393 \cdot -18205 + 27246 \cdot -32138 + 18205 \cdot 6393 - 27246 \cdot 18205)$$

where $k = 0, 2, \ldots, 7$. 

---

**Figure 2:** 2D-IDCT Using Vector Processing
Each element in the first row of the input matrix X is multiplied by each element in the first column of matrix C and added together to get the first value $Z_{00}$ of the intermediate matrix Z. To get $Z_{01}$, each element of row zero in X is multiplied by each element in the first column of C and added and so on. As shown, input $X_{00}$ gets multiplied by all the coefficients in the first row of C, and input $X_{01}$ gets multiplied by all the coefficients in the second row of C, and so on.

\[
\begin{align*}
X_{00} & \rightarrow \begin{bmatrix} 23170 & 23170 & 23170 & 23170 & 23170 & 23170 & 23170 & 23170 \end{bmatrix} \\
X_{01} & \rightarrow \begin{bmatrix} 32138 & 27246 & 18205 & 6393 & -6393 & -18205 & -27246 & -32138 \end{bmatrix} \\
X_{02} & \rightarrow \begin{bmatrix} 30274 & 12540 & -12540 & -30274 & -30274 & -12540 & 12540 & 30274 \end{bmatrix} \\
X_{03} & \rightarrow \begin{bmatrix} 27246 & -6393 & -32138 & -18205 & 18205 & 32138 & 6393 & -27246 \end{bmatrix} \\
X_{04} & \rightarrow \begin{bmatrix} 23170 & -23170 & -23170 & 23170 & 23170 & -23170 & -23170 & 23170 \end{bmatrix} \\
X_{05} & \rightarrow \begin{bmatrix} 18205 & -32138 & 6393 & 27246 & -27246 & -6393 & 32138 & -18205 \end{bmatrix} \\
X_{06} & \rightarrow \begin{bmatrix} 12540 & -30274 & 30274 & -12540 & -12540 & 30274 & -30274 & 12540 \end{bmatrix} \\
X_{07} & \rightarrow \begin{bmatrix} 6393 & -18205 & 27246 & -32138 & 32138 & -27246 & 18205 & -6393 \end{bmatrix}
\end{align*}
\]

The matrix above is implemented by using eight multipliers and storing the coefficients in ROMs. At the first clock, the eight inputs $x_{00}$ to $x_{07}$ are multiplied by the eight values in column one, resulting in eight products ($P_{00,0}$ to $P_{00,7}$). At the eighth clock, the eight inputs are multiplied by the eight values in column eight resulting in eight products ($P_{07,0}$ to $P_{07,7}$). From the equations for Z, the intermediate values for the first row of Z are computed:

\[
\begin{align*}
Z_{00} &= P_{00,0} + P_{00,1} + P_{00,2} + P_{00,3} + P_{00,4} + P_{00,5} + P_{00,6} + P_{00,7} \\
Z_{01} &= P_{01,0} + P_{01,1} + P_{01,2} + P_{01,3} + P_{01,4} + P_{01,5} + P_{01,6} + P_{01,7} \\
Z_{ij} &= P_{ij,0} + P_{ij,1} + P_{ij,2} + P_{ij,3} + P_{ij,4} + P_{ij,5} + P_{ij,6} + P_{ij,7}
\end{align*}
\]

Where the matrix X row number $i = 0$ to $7$, and the matrix C column number $j = 0$ to $7$. The intermediate values for the second row of Z, $Z_{1K}$, are computed by using $P_{1K}$ values. The $8 \times 8$ Z matrix can be calculated using these equations. The block diagram for the implementation of the 1D-IDCT is shown in Figure 3.
The values for $Z_{0(0 - 7)}$ can be calculated in eight clock cycles. If the adder output is also registered, $Z_{07}$ is calculated at the ninth clock cycle. All 64 values of $Z$ are calculated in 64 clock cycles and then the process is repeated. The values of $Z$ correspond to the 1D-IDCT of the input $X$. Once the $Z$ values are calculated, the 2D-IDCT function $Y$ can be calculated from $Y = CZ$.

$$
C' = \begin{bmatrix}
23170 & 32138 & 30274 & 27246 & 23170 & 18205 & 12540 & 6393 \\
23170 & 27246 & 12540 & -6393 & -23170 & -32138 & -30274 & -27246 \\
23170 & 18205 & -12540 & -32138 & -23170 & -6393 & -30274 & -27246 \\
23170 & 6393 & -30274 & -18205 & 23170 & 27246 & -12540 & -32138 \\
23170 & -6393 & -30274 & 18205 & 23170 & -27246 & -12540 & 32138 \\
23170 & -18205 & -12540 & 32138 & -23170 & -6393 & 30274 & -27246 \\
23170 & -27246 & 12540 & 6393 & -23170 & 32138 & -30274 & 18205 \\
23170 & -32138 & 30274 & -27246 & 23170 & -18205 & 12540 & -6393 \\
\end{bmatrix} \quad Z = \begin{bmatrix}
z_{00} & z_{01} & z_{02} & z_{03} & z_{04} & z_{05} & z_{06} & z_{07} \\
z_{10} & z_{11} & z_{12} & z_{13} & z_{14} & z_{15} & z_{16} & z_{17} \\
z_{20} & z_{21} & z_{22} & z_{23} & z_{24} & z_{25} & z_{26} & z_{27} \\
z_{30} & z_{31} & z_{32} & z_{33} & z_{34} & z_{35} & z_{36} & z_{37} \\
z_{40} & z_{41} & z_{42} & z_{43} & z_{44} & z_{45} & z_{46} & z_{47} \\
z_{50} & z_{51} & z_{52} & z_{53} & z_{54} & z_{55} & z_{56} & z_{57} \\
z_{60} & z_{61} & z_{62} & z_{63} & z_{64} & z_{65} & z_{66} & z_{67} \\
z_{70} & z_{71} & z_{72} & z_{73} & z_{74} & z_{75} & z_{76} & z_{77} \\
\end{bmatrix}
$$
Y = CZ

\[ Y_{0X} = 23170(Z_{0X} + Z_{7X}) + 23170(Z_{1X} + Z_{6X}) + 23170(Z_{2X} + Z_{5X}) + 23170(Z_{3X} + Z_{4X}) \]
\[ Y_{1X} = 32138(Z_{0X} - Z_{7X}) + 27246(Z_{1X} - Z_{6X}) + 18205(Z_{2X} - Z_{5X}) + 6393(Z_{3X} - Z_{4X}) \]
\[ Y_{2X} = 30274(Z_{0X} + Z_{7X}) + 12540(Z_{1X} + Z_{6X}) - 12540(Z_{2X} + Z_{5X}) - 30274(Z_{3X} + Z_{4X}) \]
\[ Y_{3X} = 27246(Z_{0X} - Z_{7X}) - 6393(Z_{1X} - Z_{6X}) - 32138(Z_{2X} - Z_{5X}) - 18205(Z_{3X} - Z_{4X}) \]
\[ Y_{4X} = 23170(Z_{0X} + Z_{7X}) - 23170(Z_{1X} + Z_{6X}) - 23170(Z_{2X} + Z_{5X}) + 23170(Z_{3X} + Z_{4X}) \]
\[ Y_{5X} = 18205(Z_{0X} - Z_{7X}) - 32138(Z_{1X} - Z_{6X}) + 6393(Z_{2X} - Z_{5X}) + 27246(Z_{3X} - Z_{4X}) \]
\[ Y_{6X} = 12540(Z_{0X} + Z_{7X}) - 30274(Z_{1X} + Z_{6X}) + 30274(Z_{2X} + Z_{5X}) - 12540(Z_{3X} + Z_{4X}) \]
\[ Y_{7X} = 6393(Z_{0X} - Z_{7X}) - 18205(Z_{1X} - Z_{6X}) + 27246(Z_{2X} - Z_{5X}) - 32138(Z_{3X} - Z_{4X}) \]

Each element in the first row in the coefficient matrix C is multiplied by each element in the first column of matrix Z and added together to get the first value \( Y_{00} \) of the output matrix Y. To get \( Y_{01} \), each element of row zero in C is multiplied by each element in the first column of Z and is added as before. Multiplying row k of C by column zero of Z results in the coefficient \( Y_{k0} \). All the elements in the first row of matrix Z are multiplied by all the elements in the first column of matrix C.

\[
\begin{align*}
23170 & \rightarrow [z_{00} \ z_{01} \ z_{02} \ z_{03} \ z_{04} \ z_{05} \ z_{06} \ z_{07}] \\
23170 & \rightarrow [z_{10} \ z_{11} \ z_{12} \ z_{13} \ z_{14} \ z_{15} \ z_{16} \ z_{17}] \\
23170 & \rightarrow [z_{20} \ z_{21} \ z_{22} \ z_{23} \ 24 \ z_{25} \ z_{26} \ z_{27}] \\
23170 & \rightarrow [z_{30} \ z_{31} \ z_{32} \ z_{33} \ z_{34} \ z_{35} \ z_{36} \ z_{37}] \\
23170 & \rightarrow [z_{40} \ z_{41} \ z_{42} \ z_{43} \ z_{44} \ z_{45} \ z_{46} \ z_{47}] \\
23170 & \rightarrow [z_{50} \ z_{51} \ z_{52} \ z_{53} \ z_{54} \ z_{55} \ z_{56} \ z_{57}] \\
23170 & \rightarrow [z_{60} \ z_{61} \ z_{62} \ z_{63} \ z_{64} \ z_{65} \ z_{66} \ z_{67}] \\
23170 & \rightarrow [z_{70} \ z_{71} \ z_{72} \ z_{73} \ z_{74} \ z_{75} \ z_{76} \ z_{77}] \\
\end{align*}
\]

The calculation is implemented by using eight multipliers and storing the coefficients in ROMs. At the first clock, the eight coefficients in row zero of C are multiplied by the eight values in the first column of Z and are added together to result in \( Y_{00} \). At the eighth clock, the eight values of row zero of C are multiplied by the eight values in column eight of Z. The resultant is added to give \( Y_{07} \).
The block diagram for the implementation of the 2D-IDCT is shown in Figure 4.

**Figure 4:** 2D-IDCT Implementation (K is 0-7)

**Implementation Description**

Figure 3 and Figure 4 show the block diagrams for a reference design implementation. The 1D-DCT values are first calculated and stored in a RAM. The second 1D-DCT is done on the values stored in the RAM. For each 1D implementation, input data are loaded into the first input of a multiplier. The constant coefficient multiplication values are stored in a ROM and fed into the second input of the multiplier. At each clock, eight input values are multiplied with eight constant coefficients. The output of the eight multipliers are added together giving the 1D coefficient values stored in the RAM. The values stored in the intermediate RAM are read out one column at a time (i.e., every eighth value is read out every clock). This is the input for the second DCT. The RDY_IN signal is used as a hand shake signal between the DCT and the IDCT functions. When High, the RDY_IN signal indicates valid DCT outputs from the DCT function to the IDCT function.
Reference Design

This application note and reference design includes a basic explanation of IDCT. The Verilog and VHDL code shows the implementation of a 2D IDCT. The code implements a 2D IDCT with 9-bit input data and 8-bit output data. The number of input/output data bits can be changed as needed. The code is available on the Xilinx web site at:


The code has been tested and simulated against the vectors given in "Image and Video Compression Standards," second edition, by Vasudev Bhaskaran and Konstantinos Konstantinides, ISBN 0-7923-9952-8 [1], on page 60. This code has not been tested for IEEE compliance.

Table 1: Resource Utilization

<table>
<thead>
<tr>
<th>Device</th>
<th>Speed Grade</th>
<th>Package</th>
<th>Pre-Map (Synthesis Constraint)</th>
<th>Post-Route</th>
<th>Slices</th>
</tr>
</thead>
<tbody>
<tr>
<td>XCV300E</td>
<td>-8</td>
<td>BG352</td>
<td>83.96 MHz (120 MHz)</td>
<td>57.89 MHz</td>
<td>1343 (43%)</td>
</tr>
<tr>
<td>XC2V250</td>
<td>-6</td>
<td>FG456</td>
<td>176.99 MHz (180 MHz)</td>
<td>120.95 MHz</td>
<td>944 (61%)</td>
</tr>
<tr>
<td>XCV300</td>
<td>-6</td>
<td>PQ240</td>
<td>104.29 MHz (100 MHz)</td>
<td>46.97 MHz</td>
<td>1341 (43%)</td>
</tr>
<tr>
<td>XC2S200</td>
<td>-6</td>
<td>FG256</td>
<td>93.56 MHz (80 MHz)</td>
<td>53.49 MHz</td>
<td>1372 (58%)</td>
</tr>
</tbody>
</table>

References and Recommended Links

2. A variety of cores developed by Xilinx or by partners are available on the Xilinx web site:

Conclusion

The IDCT design files demonstrate an efficient implementation of the IDCT algorithm using Virtex devices. The IDCT reference design files can be used to target any Xilinx device. Optimize the code by instantiating the adder/subtractor and multiplier units when targeting Virtex devices.

Revision History

The following table shows the revision history for this document.

<table>
<thead>
<tr>
<th>Date</th>
<th>Version</th>
<th>Revision</th>
</tr>
</thead>
<tbody>
<tr>
<td>03/20/02</td>
<td>1.0</td>
<td>Initial Xilinx release.</td>
</tr>
<tr>
<td>06/25/02</td>
<td>1.1</td>
<td>Revised Figure 3 and Figure 4.</td>
</tr>
</tbody>
</table>