[ANN] AcceleratedDCTs.jl: Device-agnostic 1D/2D/3D DCTs

Hi everyone,

I am excited to announce a new package, AcceleratedDCTs.jl.

Motivation
My research group's simulation tool needs high-performance DCTs in up to three dimensions, especially on GPUs. While Oceananigans.jl includes a 1D DCT targeting GPUs, no available implementation fully supported N-dimensional transforms while remaining device-agnostic.

Features
AcceleratedDCTs.jl implements the DCT-II and its inverse (IDCT-II, equivalent to a scaled DCT-III).

  • Dimensions: Supports 1D, 2D, and 3D arrays.
  • Device Agnostic: Built with KernelAbstractions.jl, so it runs on both CPUs and GPUs.
  • Performance: Uses Real-to-Complex (R2C) FFTs instead of Complex-to-Complex (C2C) FFTs to reduce memory usage and improve speed.
  • Plans: Supports precomputed plans and the mul! interface, as in AbstractFFTs.jl.
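
To illustrate why R2C transforms save memory: the spectrum of a real signal is Hermitian-symmetric, so only N÷2 + 1 of the N complex bins need to be stored. The sketch below is plain Python rather than the package's Julia code, and the helper names (dft, rdft, full_from_half) are made up for this example. It uses a naive O(N^2) DFT as a stand-in for a library FFT, so it demonstrates the storage saving only, not the speed-up.

```python
import cmath


def dft(signal):
    """Naive O(N^2) DFT, standing in for a library FFT."""
    n = len(signal)
    return [
        sum(signal[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def rdft(signal):
    """Keep only the non-redundant bins 0..N//2 of a real signal's spectrum.

    For simplicity this computes the full DFT and truncates it; a real R2C
    routine avoids computing the redundant half in the first place.
    """
    return dft(signal)[: len(signal) // 2 + 1]


def full_from_half(half, n):
    """Rebuild all N bins from the half spectrum via X[N-k] = conj(X[k])."""
    return [half[k] if k <= n // 2 else half[n - k].conjugate() for k in range(n)]


x = [0.5, -1.0, 2.0, 3.5, 0.0, 1.25, -2.0, 4.0]
half = rdft(x)
rebuilt = full_from_half(half, len(x))
max_err = max(abs(a - b) for a, b in zip(rebuilt, dft(x)))
print(len(half), "of", len(x), "bins stored; max reconstruction error:", max_err)
```

In three dimensions the same symmetry roughly halves the size of the complex work array along one axis.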

Algorithm
The default implementation follows Makhoul's algorithm, which wraps a single multi-dimensional FFT over all dimensions in one global pre-processing step and one global post-processing step. See the documentation for more details.
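
As a rough 1D illustration of the idea (not the package's code): reorder the input so that even-indexed samples come first and odd-indexed samples follow in reverse, take one N-point DFT, then multiply by a phase factor and keep the real part. The sketch below is plain Python with made-up function names, assumes the unnormalized convention X[k] = Σ x[n] cos(π(2n+1)k / 2N), and uses a naive DFT in place of an FFT. The package's normalization and its R2C-based multi-dimensional version may differ.

```python
import cmath
import math


def dft(signal):
    """Naive O(N^2) DFT, standing in for a library FFT."""
    n = len(signal)
    return [
        sum(signal[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def dct2_direct(x):
    """Reference DCT-II straight from the definition (unnormalized)."""
    n = len(x)
    return [
        sum(x[m] * math.cos(math.pi * (2 * m + 1) * k / (2 * n)) for m in range(n))
        for k in range(n)
    ]


def dct2_makhoul(x):
    """DCT-II from a single N-point DFT, following Makhoul's reordering."""
    n = len(x)
    # Pre-processing: even-indexed samples, then odd-indexed samples reversed.
    v = x[0::2] + x[1::2][::-1]
    spectrum = dft(v)
    # Post-processing: quarter-sample phase shift, then take the real part.
    return [
        (cmath.exp(-1j * cmath.pi * k / (2 * n)) * spectrum[k]).real
        for k in range(n)
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fast = dct2_makhoul(x)
reference = dct2_direct(x)
max_diff = max(abs(a - b) for a, b in zip(fast, reference))
print("max difference from the direct definition:", max_diff)
```

The appeal of this formulation is that the transform length stays N, rather than the 2N or 4N needed by the mirrored-extension approaches.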

For comparison, I also implemented:

  • dct_batch.jl: performs batched 1D FFTs along each dimension (labeled “Batched DCT” in benchmarks).
  • dct_slow.jl: performs sequential 1D FFTs along each dimension.
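
Both comparison variants rely on the DCT being separable: a multi-dimensional DCT equals a 1D DCT applied along each dimension in turn. A minimal 2D sketch of that property follows, in plain Python with made-up function names and the unnormalized DCT-II convention. It says nothing about how dct_batch.jl or dct_slow.jl are actually written.

```python
import math


def dct2_1d(x):
    """Unnormalized 1D DCT-II from the definition."""
    n = len(x)
    return [
        sum(x[m] * math.cos(math.pi * (2 * m + 1) * k / (2 * n)) for m in range(n))
        for k in range(n)
    ]


def dct2_2d_separable(a):
    """2D DCT-II as 1D transforms along rows, then along columns."""
    rows = [dct2_1d(row) for row in a]
    # zip(*rows) walks the columns; transform each, then transpose back.
    cols = [dct2_1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def dct2_2d_direct(a):
    """2D DCT-II from the double-sum definition, for comparison."""
    n1, n2 = len(a), len(a[0])
    return [
        [
            sum(
                a[i][j]
                * math.cos(math.pi * (2 * i + 1) * k1 / (2 * n1))
                * math.cos(math.pi * (2 * j + 1) * k2 / (2 * n2))
                for i in range(n1)
                for j in range(n2)
            )
            for k2 in range(n2)
        ]
        for k1 in range(n1)
    ]


a = [[1.0, 2.0, 0.5, -1.0], [0.0, 3.0, 1.5, 2.5], [4.0, -2.0, 1.0, 0.25]]
sep = dct2_2d_separable(a)
ref = dct2_2d_direct(a)
max_diff_2d = max(abs(s - r) for srow, rrow in zip(sep, ref) for s, r in zip(srow, rrow))
print("max difference between separable and direct 2D DCT:", max_diff_2d)
```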

Benchmarks
Here are some performance results for 3D DCTs on varying grid sizes (N^3). Results were collected using in-place mul! to exclude allocation overhead. Lower is better.

GPU Performance (Nvidia RTX 2080 Ti)

| Grid Size (N^3) | cuFFT (Baseline) | Opt 3D DCT | Batched DCT (Old) |
|---|---|---|---|
| 16^3 | 0.068 ms | 0.104 ms | 0.883 ms |
| 32^3 | 0.064 ms | 0.117 ms | 0.908 ms |
| 64^3 | 0.112 ms | 0.237 ms | 1.138 ms |
| 128^3 | 0.818 ms | 1.414 ms | 3.228 ms |
| 256^3 | 5.980 ms | 10.455 ms | 23.120 ms |

Note: Opt 3D DCT stays within roughly 1.5x to 2.1x of the cuFFT baseline at every size. For N=256, it is about 2.2x faster than the batched implementation (10.455 ms vs 23.120 ms).

CPU Performance (8 Threads, Intel Xeon Gold 6132)

| Grid Size (N^3) | FFTW rfft | Opt 3D DCT | FFTW dct | Batched DCT |
|---|---|---|---|---|
| 16^3 | 0.015 ms | 0.150 ms | 0.058 ms | 0.424 ms |
| 32^3 | 0.100 ms | 0.508 ms | 0.426 ms | 0.693 ms |
| 64^3 | 1.241 ms | 3.856 ms | 4.336 ms | 8.703 ms |
| 128^3 | 14.905 ms | 38.795 ms | 47.904 ms | 96.860 ms |
| 256^3 | 243.146 ms | 332.595 ms | 420.537 ms | 1066.093 ms |

Note: On a multi-threaded CPU, Opt 3D DCT outperforms FFTW dct from N=64 upward; at N=256 it is about 1.26x faster (332.6 ms vs 420.5 ms). At N=16 and N=32, FFTW dct is still faster.

Feedback
I am looking for feedback from the community. If anyone can help verify the correctness of the results (so far they have only been checked preliminarily in the test suite) or confirm the performance benefits, that would be very helpful. Pull requests are also welcome.

Repository: liuyxpp/AcceleratedDCTs.jl on GitHub (device-agnostic implementation of 1D/2D/3D DCTs).
