Hi everyone,
I am excited to announce a new package, AcceleratedDCTs.jl.
Motivation
To improve the simulation tool of my research group, we require high-performance DCTs up to 3D, especially on GPU. While Oceananigans.jl includes a 1D DCT targeting GPU, there was no implementation available that fully supported N-dimensional transforms while remaining device-agnostic.
Features
AcceleratedDCTs.jl implements DCT-II and IDCT-II.
- Dimensions: Supports 1D, 2D, and 3D arrays.
- Device Agnostic: Built with KernelAbstractions.jl, so it runs on both CPUs and GPUs.
- Performance: Uses Real-to-Complex (R2C) FFTs instead of Complex-to-Complex (C2C) to reduce memory usage and improve speed.
- Plans: Supports plans and the `mul!` interface, as in AbstractFFTs.jl.
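For readers unfamiliar with the plan-based workflow, here is a minimal sketch of the AbstractFFTs.jl pattern the last bullet refers to. It uses FFTW.jl's `plan_rfft` rather than AcceleratedDCTs.jl itself, since I am not assuming any particular function names for the package here; the array size is arbitrary.

```julia
using FFTW, LinearAlgebra

N = 16
x = rand(Float64, N, N, N)                      # real-valued 3D input

# Create the plan once; it can be reused for any array of the same size and type.
p = plan_rfft(x)

# R2C output keeps only the non-redundant half along the first dimension.
y = Array{ComplexF64}(undef, N ÷ 2 + 1, N, N)

# Apply the plan into the preallocated buffer.
mul!(y, p, x)

# The allocating form p * x gives the same result.
@assert y ≈ p * x
```

Reusing both the plan and the output buffer is what keeps allocation out of the timed region in a hot loop.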
Algorithm
The default implementation follows Makhoul’s algorithm, which combines a global pre-processing step (a reordering of the input) and a global post-processing step (a twiddle-factor multiplication) with a single simultaneous multi-dimensional FFT. See the documentation for more details.
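To make the pre-process / FFT / post-process structure concrete, here is a minimal 1D sketch of Makhoul's idea. It is not the package's code: the function name `dct2_makhoul` is hypothetical, it uses a full complex `fft` from FFTW.jl instead of an R2C transform, and it uses the unnormalized convention of FFTW's `REDFT10`.

```julia
using FFTW

# Hypothetical helper: unnormalized 1D DCT-II via one length-N FFT,
#   X[k] = 2 * sum_n x[n] * cos(pi * (2n + 1) * k / (2N)),  k = 0, ..., N-1
function dct2_makhoul(x::AbstractVector{<:Real})
    N = length(x)
    h = cld(N, 2)
    v = similar(x, float(eltype(x)))
    # Pre-processing: even-indexed samples in order, odd-indexed samples reversed.
    v[1:h] = x[1:2:N]
    v[N:-1:h+1] = x[2:2:N]
    # One FFT of the reordered signal (real input is promoted to complex).
    V = fft(v)
    # Post-processing: multiply by the twiddle factor and keep the real part.
    k = 0:N-1
    return 2 .* real.(cis.(-π .* k ./ (2N)) .* V)
end

x = rand(8)
# Cross-check against FFTW's own unnormalized DCT-II.
@assert dct2_makhoul(x) ≈ FFTW.r2r(x, FFTW.REDFT10)
```

The N-dimensional version applies the same reordering along every axis before one multi-dimensional FFT, with a correspondingly more involved post-processing step.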
For comparison, I also implemented:
- `dct_batch.jl`: performs batched 1D FFTs along each dimension (labeled “Batched DCT” in benchmarks).
- `dct_slow.jl`: performs sequential 1D FFTs along each dimension.
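Both comparison variants rely on the separability of the multi-dimensional DCT: transforming along each dimension in turn gives the same result as the full N-dimensional transform. A small sketch of that property, using FFTW.jl's built-in `dct` rather than the files above (the helper name `dct_by_dims` is hypothetical):

```julia
using FFTW

# Hypothetical helper: apply a 1D DCT-II along each dimension, one after another.
function dct_by_dims(A::AbstractArray{<:Real})
    B = copy(A)
    for d in 1:ndims(A)
        B = dct(B, d)   # transform along dimension d only
    end
    return B
end

A = rand(4, 5, 6)
# Dimension-by-dimension result matches the full 3D DCT.
@assert dct_by_dims(A) ≈ dct(A)
```

Each pass over a dimension reads and writes the whole array, which is one reason a single multi-dimensional FFT can be faster.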
Benchmarks
Here are some performance results for 3D DCTs on varying grid sizes (N^3). Results were collected using in-place `mul!` to exclude allocation overhead. Lower is better.
GPU Performance (Nvidia RTX 2080 Ti)
| Grid Size (N^3) | cuFFT (Baseline) | Opt 3D DCT | Batched DCT (Old) |
|---|---|---|---|
| 16^3 | 0.068 ms | 0.104 ms | 0.883 ms |
| 32^3 | 0.064 ms | 0.117 ms | 0.908 ms |
| 64^3 | 0.112 ms | 0.237 ms | 1.138 ms |
| 128^3 | 0.818 ms | 1.414 ms | 3.228 ms |
| 256^3 | 5.980 ms | 10.455 ms | 23.120 ms |
Note: `Opt 3D DCT` stays within about 1.5x to 2.1x of the cuFFT baseline at every size tested. For N=256, it is about 2.2x faster than the batched implementation.
CPU Performance (8 Threads, Intel Xeon Gold 6132)
| Grid Size (N^3) | FFTW rfft | Opt 3D DCT | FFTW dct | Batched DCT |
|---|---|---|---|---|
| 16^3 | 0.015 ms | 0.150 ms | 0.058 ms | 0.424 ms |
| 32^3 | 0.100 ms | 0.508 ms | 0.426 ms | 0.693 ms |
| 64^3 | 1.241 ms | 3.856 ms | 4.336 ms | 8.703 ms |
| 128^3 | 14.905 ms | 38.795 ms | 47.904 ms | 96.860 ms |
| 256^3 | 243.146 ms | 332.595 ms | 420.537 ms | 1066.093 ms |
Note: On multi-threaded CPU, `Opt 3D DCT` (332 ms) outperforms `FFTW.dct` (420 ms) at N=256, where it is ~1.26x faster. It is also faster at N=64 and N=128, while `FFTW.dct` remains faster at N=16 and N=32.
Feedback
I am looking for feedback from the community. If anyone can help verify the correctness of the results (so far they have only been preliminarily verified in the tests) or confirm the performance benefits, that would be very helpful. Pull requests are also welcome.
Repository: GitHub - liuyxpp/AcceleratedDCTs.jl: Device agnostic implementation of 1D/2D/3D DCTs.