[ANN] FastCUDASSIM.jl

Hi everyone,

I’m happy to announce my first package: FastCUDASSIM.jl. It does what it says on the tin: it efficiently computes the Structural Similarity Index Measure (SSIM) via CUDA.jl, and can optionally output gradients as well.

Performance

As far as I know there currently exist two Julia implementations: assess_ssim from ImageQualityIndexes.jl (CPU only) and ssim from SSIMLoss.jl. The SSIM computation requires convolutions to obtain local image statistics. The straightforward approach, used by both packages, is to call a high-level convolution operator such as ImageFiltering.imfilter or NNlib.conv. From a GPU perspective the downside is that this requires many kernel launches and reads and writes global memory multiple times. FastCUDASSIM.jl instead tries to keep everything in shared memory as much as possible. The performance benefits can be seen in the benchmark below, where we compute the SSIM between two full HD RGB images.
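To make the "local statistics via convolutions" point concrete, here is a minimal CPU sketch of the standard SSIM map. It uses a uniform (box) window instead of the usual 11Γ—11 Gaussian for brevity, and all names are illustrative, not the package API. Note the five local-mean passes (over x, y, xΒ², yΒ², and xy), which is exactly where a GPU implementation pays for repeated global-memory traffic:

```julia
# Minimal CPU sketch of the convolution-based SSIM approach.
# Box window instead of the usual Gaussian; illustrative names only.
function local_mean(x::AbstractMatrix, r::Int)
    h, w = size(x)
    ΞΌ = similar(x, float(eltype(x)))
    for j in 1:w, i in 1:h
        is = max(1, i - r):min(h, i + r)
        js = max(1, j - r):min(w, j + r)
        ΞΌ[i, j] = sum(@view x[is, js]) / (length(is) * length(js))
    end
    return ΞΌ
end

function ssim_map_naive(x, y; r = 5, c1 = 0.01^2, c2 = 0.03^2)
    ΞΌx, ΞΌy = local_mean(x, r), local_mean(y, r)
    Οƒx² = local_mean(x .^ 2, r) .- ΞΌx .^ 2      # local variances
    Οƒy² = local_mean(y .^ 2, r) .- ΞΌy .^ 2
    Οƒxy = local_mean(x .* y, r) .- ΞΌx .* ΞΌy     # local covariance
    # Five local means in total: over x, y, x², y², and x*y.
    return @. ((2ΞΌx * ΞΌy + c1) * (2Οƒxy + c2)) /
              ((ΞΌx^2 + ΞΌy^2 + c1) * (Οƒx² + Οƒy² + c2))
end

x = rand(Float32, 32, 32)
ssim_map_naive(x, x)  # identical images β†’ a map of ones
```

A fused GPU kernel computes all five local statistics in one pass over shared-memory tiles instead of launching a convolution per statistic.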

# Note: FastCUDASSIM is not registered yet, so add from GitHub.

using CUDA, cuDNN  # cuDNN is needed for SSIMLoss's CUDA extension
using ImageQualityIndexes, SSIMLoss, FastCUDASSIM
using ImageCore
using Random, BenchmarkTools

c, h, w, b = 3, 1080, 1920, 1  # number of color channels, image height, width, batch size

println("ImageQualityIndexes (CPU):")
img1_iqi = rand(RGB{Float32}, h, w)
img2_iqi = rand(RGB{Float32}, h, w)
@benchmark(
    assess_ssim($img1_iqi, $img2_iqi; crop = false);
    setup = ( rand!($img1_iqi); rand!($img2_iqi) )
) |> display
println()

println("SSIMLoss (GPU):")
img1_sl = CUDA.rand(Float32, h, w, c, b)
img2_sl = CUDA.rand(Float32, h, w, c, b)
kernel = ssim_kernel(img1_sl)
@benchmark(
    CUDA.@sync SSIMLoss.ssim($img1_sl, $img2_sl, $kernel; crop = false);
    setup = ( rand!($img1_sl); rand!($img2_sl) )
) |> display
println()

println("FastCUDASSIM:")
img1_fcs = CUDA.rand(Float32, c, h, w)
img2_fcs = CUDA.rand(Float32, c, h, w)
@benchmark(
    CUDA.@sync FastCUDASSIM.ssim($img1_fcs, $img2_fcs);
    setup = ( rand!($img1_fcs); rand!($img2_fcs) )
) |> display

Results on an Intel i7-7700K CPU and NVIDIA RTX 3070 GPU:

ImageQualityIndexes (CPU):
BenchmarkTools.Trial: 9 samples with 1 evaluation per sample.
 Range (min … max):  452.374 ms … 969.059 ms  β”Š GC (min … max): 40.11% …  9.81%
 Time  (median):     542.845 ms               β”Š GC (median):    47.23%
 Time  (mean Β± Οƒ):   587.951 ms Β± 163.271 ms  β”Š GC (mean Β± Οƒ):  41.55% Β± 13.82%

  β–β–ˆ      ▁ ▁   ▁ ▁            ▁                              ▁  
  β–ˆβ–ˆβ–β–β–β–β–β–β–ˆβ–β–ˆβ–β–β–β–ˆβ–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  452 ms           Histogram: frequency by time          969 ms <

 Memory estimate: 883.72 MiB, allocs estimate: 3063.

SSIMLoss (GPU):
BenchmarkTools.Trial: 82 samples with 1 evaluation per sample.
 Range (min … max):  56.141 ms … 112.839 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     57.190 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   61.553 ms Β±  12.072 ms  β”Š GC (mean Β± Οƒ):  0.60% Β± 1.69%

  β–ˆβ–ˆ
  β–ˆβ–ˆβ–†β–β–†β–β–…β–†β–β–…β–β–β–…β–β–…β–β–†β–β–…β–β–…β–β–β–β–β–β–β–β–β–β–β–β–…β–β–β–β–β–β–β–β–β–β–β–…β–β–β–…β–β–β–β–β–β–…β–β–β–β–β–β–β–… ▁
  56.1 ms       Histogram: log(frequency) by time       110 ms <

 Memory estimate: 109.14 KiB, allocs estimate: 2934.

FastCUDASSIM:
BenchmarkTools.Trial: 4858 samples with 1 evaluation per sample.
 Range (min … max):  913.800 ΞΌs …  1.823 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     978.250 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   991.398 ΞΌs Β± 71.640 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

          β–ˆβ–ƒ
  β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–„β–ˆβ–ˆβ–ˆβ–‡β–„β–ƒβ–„β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–‚β–β–β–β–‚β–β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚ β–‚
  914 ΞΌs          Histogram: frequency by time         1.34 ms <

 Memory estimate: 5.05 KiB, allocs estimate: 294.

So FastCUDASSIM.jl is about 60x faster than SSIMLoss.jl, the previously fastest Julia implementation!

Gradients

Unfortunately we cannot benchmark the gradient computation against SSIMLoss.jl, as it currently errors. Presumably this is not the fault of the package itself (it relies on autodiff for its gradients, which should in principle just work automatically), but of breaking changes in NNlib.jl or Zygote.jl.

julia> using Zygote

julia> Zygote.gradient(x -> SSIMLoss.ssim(x, img2_sl), img1_sl)
ERROR: Gradient Thunk(ChainRules.var"#...) should be a tuple
[...]

On the Benchmarks page of the documentation we therefore compare only against two Python packages.

Flexible inputs

In the benchmark above we used a CuArray{Float32} of size (c, h, w, b), which is the image format we primarily had in mind when writing the package (in its current form). But thanks to some internal helper functions, the channel and batch axes are optional. Additionally, thanks to multiple dispatch, duck typing, and a few more helper functions, other eltypes such as RGB{N0f8} are also supported:

julia> using TestImages; using FastCUDASSIM: ssim  # (not ssim from SSIMLoss)

julia> ssim(cu(testimage("cameraman.tif")), cu(testimage("mandril_gray.tif")))
0.16261981f0

In particular, you can simply load images with Images.jl, move them to the GPU with cu, and start using our methods, without first worrying about converting to Float32 or permuting the axes. We do expect normalised images, with intensities in [0, 1].

Automatic differentiation

We provide ChainRules.jl rrule methods for ssim and dssim (where DSSIM = 1 - SSIM is the same as SSIMLoss.jl's ssim_loss). These wrap our ssim_with_gradient and dssim_with_gradient, so Zygote.jl is supported out of the box.

julia> using Zygote

julia> imgs1 = CUDA.rand(1, 10, 20, 8); imgs2 = CUDA.rand(size(imgs1)...);

julia> Zygote.withgradient(x -> sum(dssim(x, imgs2)), imgs1)
(val = 5.7896996f0, grad = (Float32[0.0054326034 ...]))
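The idea behind these rrules is to compute the gradient in the same pass as the value and close over it in the pullback. Here is a self-contained toy sketch of that pattern; toy_loss_with_gradient is an illustrative stand-in (a sum of squared differences), not FastCUDASSIM internals:

```julia
# Sketch of the fused value-and-gradient pattern: one forward pass
# returns both the loss and βˆ‚loss/βˆ‚x, and the pullback merely scales
# the stored gradient by the incoming cotangent.
function toy_loss_with_gradient(x, y)
    d = x .- y
    val = sum(abs2, d)
    grad_x = 2 .* d            # βˆ‚val/βˆ‚x, computed alongside the value
    return val, grad_x
end

# What an rrule built on top of such a function boils down to:
function toy_loss_pullback_demo(x, y)
    val, grad_x = toy_loss_with_gradient(x, y)
    pullback(Δ) = Δ .* grad_x  # cotangent w.r.t. x
    return val, pullback
end

x, y = rand(Float32, 4), rand(Float32, 4)
val, pb = toy_loss_pullback_demo(x, y)
pb(1.0f0)  # equals the analytic gradient 2(x - y)
```

The real rrules additionally return NoTangent() for the non-differentiated arguments, but the closure-over-precomputed-gradient structure is the same.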

Next steps

From short-term to long-term:

  1. FastCUDASSIM.jl is not yet registered, so for now you need to install it with ] add https://github.com/LaurensDiels/FastCUDASSIM.jl. Assuming there are no complaints about the name or the API, I will bump it to v1.0.0 and register it in the near future.
  2. In the Python world there exists the package fused-ssim, which employs a similar strategy and achieves similar performance. Actually, it is slightly more efficient (see the benchmark results in the documentation). We could probably close the gap by incorporating the last two bullet points from the fused-ssim repo: exploiting the symmetry of the (1D) Gaussian kernel, and merging the five forward-pass convolutions into one (and similarly for the backward pass). For Float32 3-channel images this would still fit in ordinary shared memory, so it’s definitely worth a try.
    (As a side note, both FastCUDASSIM.jl and fused-ssim use images in (channels, height, width) order. But Python land is row-major, so the memory order is actually completely opposite. In particular, for FastCUDASSIM.jl we cannot easily separate out the channels axis, which makes the shared memory sizes scale with it.)
  3. From what I understand, Enzyme.jl and Mooncake.jl are supplanting Zygote.jl as the go-to AD frameworks, so it makes sense to provide first-class support that does not pass through ChainRules.jl.
    (Importing the rrules caused freezes for me, so that route did not even work. Also, my initial attempts at implementing Enzyme’s augmented_primal and reverse always caused hard crashes, which is not exactly the ideal development environment. But I can keep trying. :slight_smile: )
  4. Port the code to KernelAbstractions.jl, resulting in a new package simply called FastSSIM.jl, I guess. I don’t think we do anything CUDA-specific in the code, so this should be possible. At the moment I don’t have access to AMD/Intel/Apple GPUs, though, so I wouldn’t be able to test it.
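The kernel-symmetry trick from point 2 can be illustrated on the CPU: for a symmetric 1D kernel with g[k] == g[-k], the taps at Β±k share a weight and can be folded together, roughly halving the multiplications per output. A self-contained sketch (not package code; replicate-padding at the borders via clamp):

```julia
# A normalised 1D Gaussian kernel over offsets -r:r.
gauss(r, Οƒ) = (g = [exp(-k^2 / (2Οƒ^2)) for k in -r:r]; g ./ sum(g))

# Naive 1D convolution: one multiply per tap (2r + 1 per output).
function conv1d_naive(x, g, r)
    n = length(x)
    [sum(g[k + r + 1] * x[clamp(i + k, 1, n)] for k in -r:r) for i in 1:n]
end

# Symmetry-exploiting version: fold the Β±k taps together first,
# so each output needs only r + 1 multiplies instead of 2r + 1.
function conv1d_symmetric(x, g, r)
    n = length(x)
    [g[r + 1] * x[i] +
     sum(g[k + r + 1] * (x[clamp(i - k, 1, n)] + x[clamp(i + k, 1, n)])
         for k in 1:r)
     for i in 1:n]
end

x, g = rand(Float32, 100), gauss(5, 1.5)
conv1d_naive(x, g, 5) β‰ˆ conv1d_symmetric(x, g, 5)  # same result
```

Inside a shared-memory kernel the saving is in register pressure and instruction count rather than memory traffic, but the folding is the same.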

Since this is my first package, it’s certainly possible I’m not following best practices everywhere, so feel free to suggest improvements! Also, the const settings, in particular TILE_HEIGHT and TILE_WIDTH, were tuned on a single GPU, the aforementioned RTX 3070. Given that warps have size 32, TILE_HEIGHT == 32 probably makes sense, but I’m not sure why TILE_WIDTH == 8 is a sweet spot (and I have some arguments why it shouldn’t be :slight_smile: ). So if anyone could verify that these values also work best on their GPU, that would be helpful.
