ISSN: 1682-3915 © Medwell Journals, 2016

# Design of Reconfigurable Discrete Cosine Transform in Multicore Architecture

<sup>1</sup>R. Radhika and <sup>2</sup>R. Manimegalai
<sup>1</sup>S.A. Engineering College, Chennai, India
<sup>2</sup>Park College of Engineering and Technology, Coimbatore, India

**Abstract:** Image processing plays an important role in multimedia systems. Images are processed and compressed to reduce the storage they require, and compression relies on converting the image from the spatial domain to the frequency domain. This study proposes applying the DCT algorithm on a multi-core FPGA (Field Programmable Gate Array) to enhance performance by running all cores simultaneously. The image in matrix format is transformed into a serial (hexadecimal) format using MATLAB. To perform the DCT in parallel, the serial values are loaded into XILINX. Each core experiences severe delay and the processing speed is reduced because memory read operations take more time. The transformation performance of the multi-core design is compared with that of a single core. Parameters such as delay and power consumption are reduced in the multi-core design compared with the single core.

**Key words:** DCT, FPGA, multi core, MATLAB, single core

## INTRODUCTION

Nowadays, image processing is used in many applications such as photography, X-ray imaging, electronic imaging, remote sensing, color processing, pattern recognition and video processing. A compression technique is needed for efficient storage and processing. Compressing an image requires a transformation from the spatial domain to the frequency domain. By applying the DCT, the low-frequency and high-frequency components of an image can be separated. Because the human eye is less sensitive to changes at high frequencies, the high-frequency components can be discarded to reduce memory usage. There are eight standard DCT variants.
The DCT-II is the most widely used discrete cosine transform in image processing, and the DCT-III is widely used to compute the inverse DCT. The DCT provides high energy compaction and hence converts correlated components into uncorrelated components.

A multi-core processor has two or more cores. Each core executes instructions like an individual computer, although the physical processors still reside on one chip. On this chip, every core looks much like the others, and the cores work in parallel. A dual-core processor uses two independent cores. Among other internal components, a processor includes a central processing unit, referred to as a core, that performs arithmetic and logic operations at high speed. A desktop with a single-core processor performs one process at a time but, like a short-order cook, it switches quickly between tasks, giving the impression that it is doing many things at once. A quad-core processor, by contrast, has four CPUs on a single chip and executes four separate operations concurrently, reducing delay times and improving the computer's productivity. Although a quad-core processor has several CPUs, they share other elements, such as the random access memory. Memory bandwidth, the rate at which the chip accesses data in RAM, can become a bottleneck when all the processors need to read information and store results.

**Literature review:** Radhika and colleagues reviewed various issues in multi-core processors arising from power and performance tradeoffs. Aruli et al. (2015) reviewed various issues in multi-core processors in terms of parameters such as the number of cores, cache coherence and power dissipation (Kika and Greca, 2013). A fast 2D-DCT parallel JPEG encoding is implemented to enhance the performance of image compression.
This approach is run on 1, 2, 4, 8, 16 and 32 processors in SESC, while on the GPU the number of cores is fixed at 96 (Shatnawi and Shatnawi, 2014). A 512×512×8-bit image is transformed using the 2D DCT in about 26.3 ms with a 10 MHz input clock. That study discusses how to manage power effectively by reducing the number of arithmetic operations and minimizing the bit-width of the arithmetic logic, after which the 2D DCT is implemented on an FPGA (Aruli et al., 2015). The tradeoff between robustness and power efficiency is discussed for both hardware and software approaches, for which a Coarse-Grained Reconfigurable Architecture (CGRA) processor and the Explicit Redundancy Linear Array (EReLA) are evaluated by Yao et al. (2014). To handle complex and time-consuming mathematical calculations, DCT/Inverse DCT (IDCT) and quantization/de-quantization algorithms are demonstrated using multiple processing cores (Kakoulli et al., 2012). An 8-point DCT and IDCT with faster implementations is designed in Verilog HDL for image and video compression. In addition, floating-point coefficients are converted to integer values in the Loeffler factorization used for the image transform (Yao et al., 2014). An efficient cache controller is designed for use in FPGA-based processors. This research work aims to achieve lower circuit complexity. The challenges faced in multi-core architecture design and its performance are described and analyzed in detail to arrive at a parallel programming solution (Mahajan and Chitode, 2014). Both single-core and multi-core CPUs are considered for testing several image processing algorithms. From the experiments, the authors showed that the multithreading approach improves performance: when critical image processing algorithms are implemented, multithreading can efficiently speed up processing on a multi-core CPU. Khan et al. (2013) utilized a mixed-cell architecture to improve multi-core performance by making use of both robust and non-robust cells. This mechanism improves performance by 17% and reduces the dynamic power in the L1 data cache by 50% (Shukla et al., 2013).

**Image compression concepts:** Image compression represents the original data with a limited amount of data without noticeably distorting it. It is an application of data compression that encodes the original image with fewer bits. The objective of data compression is to minimize the redundancy in the image and to store or transmit the data in an economical form. The main aim of such a system is to reduce the required storage capacity as much as possible while keeping the decoded image shown on the screen as close as possible to the original image. The two stages of the image compression algorithm are described in Fig. 1 and 2. Images are compressed by the Joint Photographic Experts Group (JPEG) algorithm. This scheme does not reproduce the image data exactly; it uses a reduced amount of memory while the image still appears very much the same. Most of the time, the resulting JPEG images look almost identical to the original images, unless the quality setting is reduced significantly. The JPEG algorithm takes advantage of the fact that humans cannot perceive color changes at high frequencies. These high-frequency components are the data points in the image that are discarded during compression. JPEG compression works efficiently on images with smooth color variation.

## MATERIALS AND METHODS

**Discrete cosine transform:** In the JPEG compression scheme, an image block of size N×N is converted from the spatial domain to the frequency domain using the Discrete Cosine Transform (DCT). The DCT decomposes the signal into its spatial frequency components, which are called the DCT coefficients. The lowest-frequency DCT coefficients appear in the top-left corner of the DCT matrix.
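
As an illustration of this concentration of low-frequency content in the top-left corner, the following Python sketch evaluates the 2D DCT of Eq. (1) directly on a small block. It is a minimal reference computation, not the FPGA design described in this paper. The function name `dct2` and the constant 8×8 test block are assumptions made for the example, and the normalization is assumed to be the standard one, with C(0) = 1/√2 and C(u) = 1 for u > 0.

```python
import math


def dct2(block):
    """Direct 2D DCT-II of a square block, following the form of Eq. (1)."""
    n = len(block)

    def c(u):
        # Assumed normalization: 1/sqrt(2) for the DC index, 1 otherwise.
        return 1.0 / math.sqrt(2.0) if u == 0 else 1.0

    scale = 1.0 / math.sqrt(2.0 * n)
    coeffs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * i * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * j * math.pi / (2 * n)))
            coeffs[i][j] = scale * c(i) * c(j) * total
    return coeffs


# Hypothetical test input: a perfectly smooth 8x8 gray block (all pixels 100).
flat_block = [[100] * 8 for _ in range(8)]
coeffs = dct2(flat_block)

# The top-left entry is the lowest-frequency (DC) coefficient.
dc = coeffs[0][0]
# Largest magnitude among all remaining (higher-frequency) coefficients.
largest_ac = max(abs(coeffs[i][j])
                 for i in range(8) for j in range(8) if (i, j) != (0, 0))
```

For this constant block, `dc` comes out at 800 and `largest_ac` is zero up to floating-point rounding, so the whole block is described by the single top-left coefficient. The direct four-loop form is chosen for readability only; it costs on the order of N<sup>4</sup> operations per block and is not meant to reflect how the transform is organized in hardware.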
The higher-frequency coefficients appear in the lower right-hand corner of the DCT matrix. The Human Visual System (HVS) is less sensitive to errors in the high-frequency coefficients and more sensitive to errors in the low-frequency coefficients. Because of this, the higher-frequency components can be discarded or quantized more coarsely. The values of the matrix indicate the pixel intensity. Each pixel of a gray-scale image holds an 8-bit value, which ranges from 0 to 255 in decimal.

A Discrete Cosine Transform (DCT) expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies. DCTs are useful in many applications in science and engineering, from compression of audio such as MP3 (Vikram and Harika, 2014) and of images such as JPEG to spectral methods for the solution of partial differential equations. The use of cosine rather than sine functions matters for compression, since fewer cosine functions are needed to approximate a typical signal, whereas for differential equations the cosines express a particular choice of boundary conditions. The two-dimensional DCT of an N×N block is given by:

$$D(i,j) = \frac{1}{\sqrt{2N}} C(i)C(j) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} P(x,y) \cos\left[\frac{(2x+1)i\pi}{2N}\right] \cos\left[\frac{(2y+1)j\pi}{2N}\right]$$ (1)

$$C(u) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{if } u = 0 \\ 1 & \text{if } u > 0 \end{cases}$$ (2)

where P(x, y) is the pixel intensity at position (x, y) and D(i, j) is the DCT coefficient at position (i, j).

Fig. 1: DCT based encoder processing steps

Fig. 2: DCT based decoder processing steps

**Multicore architecture:** A multi-core processor is designed by combining two or more processors to achieve better performance, lower power consumption and more efficient concurrent processing of multiple tasks. The instructions are ordinary CPU instructions such as read, write data and fetch, but multiple cores can execute several instructions at the same time, increasing the overall speed of modules that are amenable to parallel processing. Manufacturers usually integrate the cores into a single integrated circuit (known as a chip multiprocessor) or place multiple cores in a single chip package. Processors were originally produced with only a single core. In the 1980s, Rockwell International manufactured devices with two cores on a single chip, sharing the chip's pins on different clock cycles. Multi-core processors were produced in the early 2000s by Intel and AMD. A multi-core processor implements multiprocessing in a single package. Designers may couple the cores in a multi-core device tightly or loosely.

**Implementation of DCT in multicore:** Performing the DCT in a multi-core architecture with the image held in matrix format is a complex task. Hence, the image is transformed from matrix values into hexadecimal values, which are referred to as serial values. These hexadecimal image values are then loaded into Xilinx to perform the DCT in parallel. Four cores are used in this work. For example, an original matrix of size $64\times64$ is split into four parts of size $32\times32$, and each part is assigned to one core so that the parts are processed in parallel. This is depicted in Fig. 3.

In a fast integer DCT method on a multi-core processor, the instructions executed by the digital image processor are allocated with regular, well-formed data flows to improve the hardware utilization of every execution unit of the processor. Common terms therefore yield regular arithmetic operations, and these regular arithmetic instructions are arranged so that the execution units run in parallel. The load on the digital image processor is thus reduced when computing the integer discrete cosine transform, and the results are generated more quickly (Yogesh and Khobragade, 2013).

Fig. 3: Representation of DCT implementation in multi core

Fig. 4: Representation of original image in matrix format

Table 1: Performance comparison for single core and multi core DCT implementation

| Particulars             | Single core | 4 cores | 8 cores | 16 cores |
|-------------------------|-------------|---------|---------|----------|
| No. of slices           | 1013        | 4345    | 9059    | 19296    |
| No. of slice flip-flops | 1125        | 4504    | 8805    | 18256    |
| No. of 4-input LUTs     | 1503        | 6880    | 4805    | 31312    |
| No. of bonded IOBs      | 68          | 269     | 529     | 1072     |
| Delay (ns)              | 10.143      | 15.371  | 15.436  | 15.436   |
| Frequency (MHz)         | 98.592      | 65.058  | 64.784  | 64.784   |

With the trend toward higher compression ratios and the higher resolution demanded of multimedia data compression techniques, real-time encoding and decoding is required, and fast algorithms and decoding modules are widely needed. The discrete transform is the main operation in multimedia systems.

## RESULTS AND DISCUSSION

The multi-core DCT design is developed using MATLAB and XILINX. The hardware circuit complexity is reduced by converting the image data from matrix format into serial format. The DCT coefficients for this input data are computed using XILINX, and the image can then be reconstructed using MATLAB software. The multi-core implementation is compared with the single-core implementation for transforming the image from the spatial domain to the frequency domain using the DCT algorithm, and the execution speed is increased in the multi-core architecture. Figure 4 depicts the original image in matrix format. All the modules designed here have been simulated to verify that they perform as required. After simulation, the whole design has been run to obtain its timing summary, device utilization summary, area report and power report.
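
The matrix-to-serial conversion and the four-way split used in the implementation can be sketched in Python as follows. This is only an illustration of the data layout under stated assumptions: the helper names `split_quadrants` and `to_hex_stream`, the row-major ordering and the two-digit hexadecimal encoding of 8-bit pixels are hypothetical choices, since the exact serial format passed from MATLAB to XILINX is not specified here.

```python
def split_quadrants(image):
    """Split a square 2N x 2N matrix into four N x N quadrants, one per core."""
    half = len(image) // 2
    quadrants = []
    for row_start in (0, half):
        for col_start in (0, half):
            quadrants.append([row[col_start:col_start + half]
                              for row in image[row_start:row_start + half]])
    return quadrants


def to_hex_stream(block):
    """Serialize a block row by row as two-digit hex strings (assumed format)."""
    return [format(pixel, "02X") for row in block for pixel in row]


# Hypothetical 64x64 gray-scale image with 8-bit pixel values (0-255).
image = [[(r * 64 + c) % 256 for c in range(64)] for r in range(64)]

quadrants = split_quadrants(image)               # four 32x32 blocks
streams = [to_hex_stream(q) for q in quadrants]  # one serial stream per core
```

Each of the four streams holds 32 × 32 = 1024 values. In a design like the one described here, each stream would be handed to a separate core so that the four blocks are transformed concurrently.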
This section presents the tools used, the simulation results and the waveforms of all designed modules to show the cache controller working. The design has been simulated and synthesized using ISE Design Suite. By applying the DCT, the data needed to represent the image is reduced thanks to the energy compaction property of the DCT. The significant lower-frequency details of the image are located in the top-left corner, and the less important higher-frequency components are located in the bottom-right corner. The high-frequency components may be eliminated since they are less important visually. The performance of the single-core and multi-core DCT implementations is compared in Table 1.

## CONCLUSION

An efficient and optimized design of a reconfigurable Discrete Cosine Transform in a multi-core architecture is proposed in this paper, with the intent of enhancing the performance of image compression by implementing the DCT in a multi-core architecture. It targets image processing with reduced transformation time through the parallel and simultaneous processing offered by the multi-core architecture. Parameters such as time and power are greatly improved by the multi-core DCT implementation compared with the single-core architecture. The operations of each core in the processor are executed in parallel using cache memory, and hence the delay, memory usage and power consumption are reduced. The design has been implemented on a SPARTAN 3 evaluation board. From the synthesis report, the maximum output required time, i.e., hold time after clock, is 15.436 ns, and the minimum input time before clock arrival, i.e., setup time for the designed module, is 10.384 ns. The maximum clock frequency is 64.058 MHz.

## REFERENCES

- Aruli, K., B. Nivetha, G.N. Jayabhavani and D.M. Saravanan, 2015. A review on trends in multi core processor based on cache and power dissipation. Int. Res. J. Eng. Technol., 2: 187-190.
- Kakoulli, E., V. Soteriou and T. Theocharides, 2012. Intelligent hotspot prediction for network-on-chip-based multicore systems. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 31: 418-431.
- Khan, S.M., A.R. Alameldeen, C. Wilkerson, J. Kulkarni and D.A. Jimenez, 2013. Improving multi-core performance using mixed-cell cache architecture. Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA2013), February 23-27, 2013, IEEE, College Station, Texas, ISBN: 978-1-4673-5585-8, pp: 119-130.
- Kika, A. and S. Greca, 2013. Multithreading image processing in single-core and multicore CPU using Java. Int. J. Adv. Comput. Sci. Appl., 4: 165-169.
- Mahajan, N.V. and J.S. Chitode, 2014. DCT/IDCT implementation with Loeffler algorithm. Int. J. Appl. Innov. Eng. Manage., 3: 353-357.
- Shatnawi, M.K.A. and H.A. Shatnawi, 2014. A performance model of fast 2D-DCT parallel JPEG encoding using CUDA GPU and SMP-architecture. Proceedings of the 2014 IEEE Conference on High Performance Extreme Computing (HPEC), September 9-11, 2014, IEEE, Al-Kharj, Saudi Arabia, ISBN: 978-1-4799-6233-4, pp: 1-6.
- Shukla, S.K., V. Trivedi and A. Choukse, 2013. Challenges on performance analysis and enhancement of multi-core architecture, a solution parallel programming languages. Int. J. Eng. Adv. Technol., 3: 110-114.
- Vikram, A. and T. Harika, 2014. Low power DCT architecture for image/video coders. IPASJ Int. J. Electron. Commun., 2: 56-65.
- Yao, J., M. Saito, S. Okada, K. Kobayashi and Y. Nakashima, 2014. EReLA: A low-power reliable coarse-grained reconfigurable architecture processor and its irradiation tests. IEEE Trans. Nuclear Sci., 61: 3250-3257.
- Yogesh, S.W. and A.S. Khobragade, 2013. Design of cache memory with cache controller using VHDL. Int. J. Innov. Res. Sci. Eng. Technol., 2: 2914-2919.