Improving the performance of the FFT algorithm on shared-memory systems

Tuesday May 10, 2016
Location: CIC Panther Hollow Room
Time: 4:30PM

Doru Thom Popovici (Carnegie Mellon University)


Current microprocessor trends show a steady increase of the number of cores and/or threads present on the same CPU die. While this increase improves the performance of compute-bound applications, the benefits to memory-bound applications is limited. The discrete Fourier transform (DFT) is an example of such a memory-bounded application, where increasing the number of cores does not yield a corresponding increase in performance. In this paper, we present a solution for improving the data movement from main memory to the compute units that perform the DFT. We sacrifice some cores/threads so that data can be loaded while computation is being performed. Overlapping memory accesses from computation permits us to preload and to reshape the data before computation requires it. We show that this implementation can improve the performance of multi-dimensional DFTs by at least 1.5x over implementations offered by libraries such as MKL and FFTW.


I am a fourth year PhD student in the Electrical and Computer Engineering Department at Carnegie Mellon University. I am advised by Dr. Franz Franchetti. Graduated Computer Science from the Polytechnic University in Timisoara, Romania. My interests are automatic code generation and optimization, high performance computating.