[ANN] AcceleratedDCTs.jl: Device-agnostic 1D/2D/3D DCTs

liuyxpp · January 26, 2026, 2:38pm

Hi everyone,

I am excited to announce a new package, AcceleratedDCTs.jl.

Motivation
My research group's simulation tool requires high-performance DCTs in up to three dimensions, especially on GPUs. While Oceananigans.jl includes a 1D DCT targeting GPUs, no available implementation fully supported N-dimensional transforms while remaining device-agnostic.

Features
AcceleratedDCTs.jl implements DCT-II and its inverse, IDCT-II (a scaled DCT-III).

Dimensions: Supports 1D, 2D, and 3D arrays.
Device Agnostic: Built with KernelAbstractions.jl, so it runs on both CPUs and GPUs.
Performance: Uses Real-to-Complex (R2C) FFTs instead of Complex-to-Complex (C2C) to reduce memory usage and improve speed.
Plans: Supports precomputed plans and the mul! interface, as in AbstractFFTs.jl.
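
For readers wondering why R2C helps, here is a minimal sketch in plain Python (not the package's code). It relies on the fact that the DFT of a real signal is conjugate-symmetric, so only the first N/2+1 coefficients need to be computed and stored. The helper `naive_dft` is a made-up O(N^2) reference written for clarity, not a real FFT.

```python
import cmath

def naive_dft(x):
    """O(N^2) reference DFT; hypothetical helper, for illustration only."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

x = [0.5, -1.25, 3.0, 2.0, -0.75, 1.5, 0.25, -2.0]  # real input, N = 8
n = len(x)
X = naive_dft(x)

# Conjugate symmetry of a real-input DFT: X[N-k] == conj(X[k]).
symmetric = all(abs(X[(n - k) % n] - X[k].conjugate()) < 1e-9 for k in range(n))

# An R2C transform therefore only has to return the first N//2 + 1 outputs.
half = X[: n // 2 + 1]

# The remaining outputs can be rebuilt from the half spectrum when needed.
rebuilt = half + [half[n - k].conjugate() for k in range(n // 2 + 1, n)]
```

The half spectrum holds 5 complex values instead of 8 here; in 3D the saving applies along one dimension, which roughly halves the size of the complex work array.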

Algorithm
The default implementation follows Makhoul’s algorithm: one pre-processing pass (reordering the input) and one post-processing pass (recovering the DCT coefficients from the FFT output) over the whole array, wrapped around a single multi-dimensional FFT. See the documentation for more details.
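
To give a feel for the idea, here is a minimal 1D sketch of Makhoul's reordering trick in plain Python. It is an illustration under stated assumptions, not the package's implementation: it uses the unnormalized convention X[k] = sum_n x[n] cos(pi (2n+1) k / (2N)), which may differ from the package's scaling, and the hypothetical helper `naive_dft` stands in for a real FFT.

```python
import cmath
import math

def naive_dft(x):
    """O(N^2) reference DFT standing in for an FFT (hypothetical helper)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dct2_direct(x):
    """Unnormalized DCT-II straight from the definition (assumed convention)."""
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                for j in range(n))
            for k in range(n)]

def dct2_makhoul(x):
    """DCT-II via one length-N DFT, following Makhoul's 1D scheme."""
    n = len(x)
    # Pre-processing: even-indexed samples in order, then odd-indexed reversed.
    v = x[0::2] + x[1::2][::-1]
    V = naive_dft(v)
    # Post-processing: rotate each bin by exp(-i*pi*k/(2N)), keep the real part.
    return [(cmath.exp(-1j * math.pi * k / (2 * n)) * V[k]).real
            for k in range(n)]

x_even = [1.0, 2.0, -0.5, 4.0, 0.25, -3.0, 2.5, 1.5]
x_odd = [0.5, -1.0, 2.0, 3.5, -2.25]
```

Both `dct2_makhoul(x_even)` and `dct2_makhoul(x_odd)` should agree with `dct2_direct` to rounding error. The multi-dimensional version applies the same reordering along every axis before a single N-dimensional FFT, with a correspondingly more involved post-processing step.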

For comparison, I also implemented:

dct_batch.jl: performs batched 1D FFTs along each dimension (labeled “Batched DCT” in benchmarks).
dct_slow.jl: performs sequential 1D FFTs along each dimension.
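
Both comparison implementations rest on the separability of the multi-dimensional DCT: transforming along one dimension at a time gives the same result as the full N-dimensional definition. A minimal 2D sketch in plain Python follows. It is not the code in dct_batch.jl or dct_slow.jl; it uses direct cosine sums instead of FFTs and assumes an unnormalized DCT-II, and all function names are made up for the example.

```python
import math

def dct2_1d(x):
    """Unnormalized 1D DCT-II from the definition (assumed convention)."""
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                for j in range(n))
            for k in range(n)]

def dct2_2d_separable(a):
    """2D DCT-II as 1D transforms along each dimension in turn."""
    rows = [dct2_1d(row) for row in a]                 # along the second dimension
    cols = [dct2_1d(list(col)) for col in zip(*rows)]  # along the first dimension
    return [list(r) for r in zip(*cols)]               # transpose back

def dct2_2d_direct(a):
    """2D DCT-II straight from the double-sum definition."""
    n1, n2 = len(a), len(a[0])
    return [[sum(a[i][j]
                 * math.cos(math.pi * (2 * i + 1) * k1 / (2 * n1))
                 * math.cos(math.pi * (2 * j + 1) * k2 / (2 * n2))
                 for i in range(n1) for j in range(n2))
             for k2 in range(n2)]
            for k1 in range(n1)]

a = [[1.0, -2.0, 0.5, 3.0],
     [0.0, 1.5, -1.0, 2.0],
     [4.0, 0.25, -0.75, 1.0]]
sep = dct2_2d_separable(a)
ref = dct2_2d_direct(a)
```

Here `sep` and `ref` should match to rounding error. The per-dimension approach is simple but makes one pass over the array per axis, which is the cost the default single-FFT path is meant to avoid.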

Benchmarks
Here are some performance results for 3D DCTs on varying grid sizes (N^3). Timings were collected using in-place mul! to exclude allocation overhead. Lower is better.

GPU Performance (Nvidia RTX 2080 Ti)

Grid Size (N^3)	`cuFFT` (Baseline)	`Opt 3D DCT`	`Batched DCT` (Old)
16^3	0.068 ms	0.104 ms	0.883 ms
32^3	0.064 ms	0.117 ms	0.908 ms
64^3	0.112 ms	0.237 ms	1.138 ms
128^3	0.818 ms	1.414 ms	3.228 ms
256^3	5.980 ms	10.455 ms	23.120 ms

Note: Opt 3D DCT stays within roughly 1.5x to 2.1x of the cuFFT baseline time at every size tested. For N=256, it is about 2.2x faster than the batched implementation.

CPU Performance (8 Threads, Intel Xeon Gold 6132)

Grid Size (N^3)	`FFTW rfft`	`Opt 3D DCT`	`FFTW dct`	`Batched DCT`
16^3	0.015 ms	0.150 ms	0.058 ms	0.424 ms
32^3	0.100 ms	0.508 ms	0.426 ms	0.693 ms
64^3	1.241 ms	3.856 ms	4.336 ms	8.703 ms
128^3	14.905 ms	38.795 ms	47.904 ms	96.860 ms
256^3	243.146 ms	332.595 ms	420.537 ms	1066.093 ms

Note: On a multi-threaded CPU, Opt 3D DCT is faster than FFTW.dct for N ≥ 64 and slower for N ≤ 32. At N=256 it takes 332.6 ms versus 420.5 ms, which is about 1.26x faster.

Feedback
I am looking for feedback from the community. If anyone can help verify the correctness of the results (so far they have only been preliminarily verified in the test suite) or confirm the performance benefits, that would be very helpful. Pull requests are also welcome.

Repository: GitHub - liuyxpp/AcceleratedDCTs.jl: Device agnostic implementation of 1D/2D/3D DCTs.

liuyxpp · February 23, 2026, 6:17am

New version: v0.4.1

This release integrates the newly registered VkDCT_jll (v1.3.4) package, eliminating the need for manual CUDA shim compilation. GPU-accelerated 3D DCT-I transforms based on VkFFT now work out of the box.

Unlike cuFFT, VkFFT provides DCTs natively, and they are much faster than our own implementation in AcceleratedDCTs.jl.
