[ANN] FastCUDASSIM.jl

Hi everyone,

I’m happy to announce my first package: FastCUDASSIM.jl. It does what it says on the tin: it efficiently computes the Structural Similarity Index Measure (SSIM) via CUDA.jl, and can optionally output gradients as well.

Performance

As far as I know there currently exist two Julia implementations: assess_ssim from ImageQualityIndexes.jl (CPU only) and ssim from SSIMLoss.jl. The SSIM computation requires convolutions to obtain local image statistics. The straightforward approach, used by both packages, is to call a high-level convolution operator such as ImageFiltering.imfilter or NNlib.conv. From a GPU perspective the downside is that this requires many kernel launches and reads and writes global memory multiple times. FastCUDASSIM.jl instead tries to keep everything in shared memory as much as possible. The performance benefits can be seen in the benchmark below, where we compute the SSIM between two full HD RGB images.
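To make the "local statistics via convolutions" point concrete, here is a minimal CPU sketch of the standard SSIM map. It uses a uniform (box) window instead of the usual 11Γ—11 Gaussian for brevity, and all names are illustrative, not the package API. Note the five local-mean passes (over x, y, xΒ², yΒ², and xy), which is exactly where a GPU implementation pays for repeated global-memory traffic:

```julia
# Minimal CPU sketch of the convolution-based SSIM approach.
# Box window instead of the usual Gaussian; illustrative names only.
function local_mean(x::AbstractMatrix, r::Int)
    h, w = size(x)
    ΞΌ = similar(x, float(eltype(x)))
    for j in 1:w, i in 1:h
        is = max(1, i - r):min(h, i + r)
        js = max(1, j - r):min(w, j + r)
        ΞΌ[i, j] = sum(@view x[is, js]) / (length(is) * length(js))
    end
    return ΞΌ
end

function ssim_map_naive(x, y; r = 5, c1 = 0.01^2, c2 = 0.03^2)
    ΞΌx, ΞΌy = local_mean(x, r), local_mean(y, r)
    Οƒx² = local_mean(x .^ 2, r) .- ΞΌx .^ 2      # local variances
    Οƒy² = local_mean(y .^ 2, r) .- ΞΌy .^ 2
    Οƒxy = local_mean(x .* y, r) .- ΞΌx .* ΞΌy     # local covariance
    # Five local means in total: over x, y, x², y², and x*y.
    return @. ((2ΞΌx * ΞΌy + c1) * (2Οƒxy + c2)) /
              ((ΞΌx^2 + ΞΌy^2 + c1) * (Οƒx² + Οƒy² + c2))
end

x = rand(Float32, 32, 32)
ssim_map_naive(x, x)  # identical images β†’ a map of ones
```

A fused GPU kernel computes all five local statistics in one pass over shared-memory tiles instead of launching a convolution per statistic.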

# Note: FastCUDASSIM is not registered yet, so add from GitHub.

using CUDA, cuDNN  # cuDNN is needed for SSIMLoss's CUDA extension
using ImageQualityIndexes, SSIMLoss, FastCUDASSIM
using ImageCore
using Random, BenchmarkTools

c, h, w, b = 3, 1080, 1920, 1  # number of color channels, image height, width, batch size

println("ImageQualityIndexes (CPU):")
img1_iqi = rand(RGB{Float32}, h, w)
img2_iqi = rand(RGB{Float32}, h, w)
@benchmark(
    assess_ssim($img1_iqi, $img2_iqi; crop = false);
    setup = ( rand!($img1_iqi); rand!($img2_iqi) )
) |> display
println()

println("SSIMLoss (GPU):")
img1_sl = CUDA.rand(Float32, h, w, c, b)
img2_sl = CUDA.rand(Float32, h, w, c, b)
kernel = ssim_kernel(img1_sl)
@benchmark(
    CUDA.@sync SSIMLoss.ssim($img1_sl, $img2_sl, $kernel; crop = false);
    setup = ( rand!($img1_sl); rand!($img2_sl) )
) |> display
println()

println("FastCUDASSIM:")
img1_fcs = CUDA.rand(Float32, c, h, w)
img2_fcs = CUDA.rand(Float32, c, h, w)
@benchmark(
    CUDA.@sync FastCUDASSIM.ssim($img1_fcs, $img2_fcs);
    setup = ( rand!($img1_fcs); rand!($img2_fcs) )
) |> display

Results on an Intel i7-7700K CPU and NVIDIA RTX 3070 GPU:

ImageQualityIndexes (CPU):
BenchmarkTools.Trial: 9 samples with 1 evaluation per sample.
 Range (min … max):  452.374 ms … 969.059 ms  β”Š GC (min … max): 40.11% …  9.81%
 Time  (median):     542.845 ms               β”Š GC (median):    47.23%
 Time  (mean Β± Οƒ):   587.951 ms Β± 163.271 ms  β”Š GC (mean Β± Οƒ):  41.55% Β± 13.82%

  β–β–ˆ      ▁ ▁   ▁ ▁            ▁                              ▁  
  β–ˆβ–ˆβ–β–β–β–β–β–β–ˆβ–β–ˆβ–β–β–β–ˆβ–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  452 ms           Histogram: frequency by time          969 ms <

 Memory estimate: 883.72 MiB, allocs estimate: 3063.

SSIMLoss (GPU):
BenchmarkTools.Trial: 82 samples with 1 evaluation per sample.
 Range (min … max):  56.141 ms … 112.839 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     57.190 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   61.553 ms Β±  12.072 ms  β”Š GC (mean Β± Οƒ):  0.60% Β± 1.69%

  β–ˆβ–ˆ
  β–ˆβ–ˆβ–†β–β–†β–β–…β–†β–β–…β–β–β–…β–β–…β–β–†β–β–…β–β–…β–β–β–β–β–β–β–β–β–β–β–β–…β–β–β–β–β–β–β–β–β–β–β–…β–β–β–…β–β–β–β–β–β–…β–β–β–β–β–β–β–… ▁
  56.1 ms       Histogram: log(frequency) by time       110 ms <

 Memory estimate: 109.14 KiB, allocs estimate: 2934.

FastCUDASSIM:
BenchmarkTools.Trial: 4858 samples with 1 evaluation per sample.
 Range (min … max):  913.800 ΞΌs …  1.823 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     978.250 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   991.398 ΞΌs Β± 71.640 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

          β–ˆβ–ƒ
  β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–„β–ˆβ–ˆβ–ˆβ–‡β–„β–ƒβ–„β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–‚β–β–β–β–‚β–β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚ β–‚
  914 ΞΌs          Histogram: frequency by time         1.34 ms <

 Memory estimate: 5.05 KiB, allocs estimate: 294.

So FastCUDASSIM.jl is about 60x faster than SSIMLoss.jl, the previously fastest Julia implementation!

Gradients

Unfortunately we cannot benchmark the gradient computation against SSIMLoss.jl, as it currently errors. Presumably this is not the fault of the package itself (it relies on autodiff for its gradients, which should in principle just work automatically), but of breaking changes in NNlib.jl or Zygote.jl.

julia> using Zygote

julia> Zygote.gradient(x -> SSIMLoss.ssim(x, img2_sl), img1_sl)
ERROR: Gradient Thunk(ChainRules.var"#...) should be a tuple
[...]

On the Benchmarks page of the documentation we therefore compare only against two Python packages.

Flexible inputs

In the benchmark above we used a CuArray{Float32} of size (c, h, w, b), which is the image format we primarily had in mind when writing the package (in its current form). But thanks to some internal helper functions, the channel and batch axes are optional. Additionally, thanks to multiple dispatch, duck typing, and a few more helper functions, other eltypes such as RGB{N0f8} are also supported:

julia> using TestImages; using FastCUDASSIM: ssim  # (not ssim from SSIMLoss)

julia> ssim(cu(testimage("cameraman.tif")), cu(testimage("mandril_gray.tif")))
0.16261981f0

In particular, you can simply load images with Images.jl, move them to the GPU with cu, and start using our methods, without first worrying about converting to Float32 or permuting the axes. We do expect normalised images, with intensities in [0, 1].

Automatic differentiation

We provide ChainRules.jl rrule methods for ssim and dssim (where DSSIM = 1 - SSIM is the same as SSIMLoss.jl's ssim_loss). These wrap our ssim_with_gradient and dssim_with_gradient, so Zygote.jl is supported out of the box.

julia> using Zygote

julia> imgs1 = CUDA.rand(1, 10, 20, 8); imgs2 = CUDA.rand(size(imgs1)...);

julia> Zygote.withgradient(x -> sum(dssim(x, imgs2)), imgs1)
(val = 5.7896996f0, grad = (Float32[0.0054326034 ...]))
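The idea behind these rrules is to compute the gradient in the same pass as the value and close over it in the pullback. Here is a self-contained toy sketch of that pattern; toy_loss_with_gradient is an illustrative stand-in (a sum of squared differences), not FastCUDASSIM internals:

```julia
# Sketch of the fused value-and-gradient pattern: one forward pass
# returns both the loss and βˆ‚loss/βˆ‚x, and the pullback merely scales
# the stored gradient by the incoming cotangent.
function toy_loss_with_gradient(x, y)
    d = x .- y
    val = sum(abs2, d)
    grad_x = 2 .* d            # βˆ‚val/βˆ‚x, computed alongside the value
    return val, grad_x
end

# What an rrule built on top of such a function boils down to:
function toy_loss_pullback_demo(x, y)
    val, grad_x = toy_loss_with_gradient(x, y)
    pullback(Δ) = Δ .* grad_x  # cotangent w.r.t. x
    return val, pullback
end

x, y = rand(Float32, 4), rand(Float32, 4)
val, pb = toy_loss_pullback_demo(x, y)
pb(1.0f0)  # equals the analytic gradient 2(x - y)
```

The real rrules additionally return NoTangent() for the non-differentiated arguments, but the closure-over-precomputed-gradient structure is the same.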

Next steps

From short-term to long-term:

  1. FastCUDASSIM.jl is not yet registered, so for now you need to install it with ] add https://github.com/LaurensDiels/FastCUDASSIM.jl. Assuming there are no complaints about the name or the API, I will bump it to v1.0.0 and register it in the near future.
  2. In the Python world there exists the package fused-ssim, which employs a similar strategy and achieves similar performance. Actually, it is slightly more efficient (see the benchmark results in the documentation). We could probably close the gap by incorporating the last two bullet points from the fused-ssim repo: exploiting the symmetry of the (1D) Gaussian kernel, and merging the five forward-pass convolutions into one (and similarly for the backward pass). For Float32 3-channel images this would still fit in ordinary shared memory, so it’s definitely worth a try.
    (As a side note, both FastCUDASSIM.jl and fused-ssim use images in (channels, height, width) order. But Python land is row-major, so the memory order is actually completely opposite. In particular, for FastCUDASSIM.jl we cannot easily separate out the channels axis, which makes the shared memory sizes scale with it.)
  3. From what I understand, Enzyme.jl and Mooncake.jl are supplanting Zygote.jl as the go-to AD frameworks, so it makes sense to provide first-class support that does not pass through ChainRules.jl.
    (Importing the rrules caused freezes for me, so that route did not even work. Also, my initial attempts at implementing Enzyme’s augmented_primal and reverse always caused hard crashes, which is not exactly the ideal development environment. But I can keep trying. :slight_smile: )
  4. Port the code to KernelAbstractions.jl, resulting in a new package simply called FastSSIM.jl, I guess. I don’t think we do anything CUDA-specific in the code, so this should be possible. At the moment I don’t have access to AMD/Intel/Apple GPUs, though, so I wouldn’t be able to test it.
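The kernel-symmetry trick from point 2 can be illustrated on the CPU: for a symmetric 1D kernel with g[k] == g[-k], the taps at Β±k share a weight and can be folded together, roughly halving the multiplications per output. A self-contained sketch (not package code; replicate-padding at the borders via clamp):

```julia
# A normalised 1D Gaussian kernel over offsets -r:r.
gauss(r, Οƒ) = (g = [exp(-k^2 / (2Οƒ^2)) for k in -r:r]; g ./ sum(g))

# Naive 1D convolution: one multiply per tap (2r + 1 per output).
function conv1d_naive(x, g, r)
    n = length(x)
    [sum(g[k + r + 1] * x[clamp(i + k, 1, n)] for k in -r:r) for i in 1:n]
end

# Symmetry-exploiting version: fold the Β±k taps together first,
# so each output needs only r + 1 multiplies instead of 2r + 1.
function conv1d_symmetric(x, g, r)
    n = length(x)
    [g[r + 1] * x[i] +
     sum(g[k + r + 1] * (x[clamp(i - k, 1, n)] + x[clamp(i + k, 1, n)])
         for k in 1:r)
     for i in 1:n]
end

x, g = rand(Float32, 100), gauss(5, 1.5)
conv1d_naive(x, g, 5) β‰ˆ conv1d_symmetric(x, g, 5)  # same result
```

Inside a shared-memory kernel the saving is in register pressure and instruction count rather than memory traffic, but the folding is the same.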

Since this is my first package, it’s certainly possible I’m not following best practices everywhere, so feel free to suggest improvements! Also, the const settings, in particular TILE_HEIGHT and TILE_WIDTH, were tuned on a single GPU, the aforementioned RTX 3070. Given that warps have size 32, TILE_HEIGHT == 32 probably makes sense, but I’m not sure why TILE_WIDTH == 8 is a sweet spot (and I have some arguments why it shouldn’t be :slight_smile: ). So if anyone could verify that these values also work best on their GPU, that would be helpful.
