Hi everyone,
I'm happy to announce my first package: FastCUDASSIM.jl. It does what it says on the tin: efficiently compute the Structural Similarity Index Measure (SSIM) via CUDA.jl. Optionally, it can also output gradients.
Performance
As far as I know, there currently exist two implementations in Julia: assess_ssim from ImageQualityIndexes.jl (CPU only) and ssim from SSIMLoss.jl. The SSIM calculation requires convolutions to obtain local image statistics. The straightforward approach, used by these packages, is to apply a high-level convolution operator like ImageFiltering.imfilter or NNlib.conv. From a GPU perspective the downside is that this requires many kernel launches and multiple reads from and writes to global memory. Instead, FastCUDASSIM.jl tries to keep everything in shared memory as much as possible. The performance benefits can be seen in the benchmark below, where we compute the SSIM between two full HD RGB images.
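For reference, here is a hedged sketch of that straightforward approach (pure-Julia stand-ins, not the actual code of any of these packages): the local statistics require five separate blur passes (means of x, y, x², y², and xy), each of which would be at least one kernel launch plus a round trip through global memory on a GPU. The constants follow the standard SSIM choices K1 = 0.01, K2 = 0.03 with dynamic range 1.

```julia
# Separable Gaussian weights; σ = 1.5, 11 taps is the common SSIM window.
function gaussian1d(σ = 1.5f0, n = 11)
    g = Float32[exp(-((i - (n + 1) / 2)^2) / (2σ^2)) for i in 1:n]
    g ./ sum(g)
end

# Naive separable convolution with replicated borders, standing in for
# ImageFiltering.imfilter / NNlib.conv.
function blur(a::AbstractMatrix{Float32}, g = gaussian1d())
    r = length(g) ÷ 2
    clampi(i, n) = clamp(i, 1, n)
    h, w = size(a)
    tmp = similar(a)
    out = similar(a)
    for j in 1:w, i in 1:h  # vertical pass
        tmp[i, j] = sum(g[k + r + 1] * a[clampi(i + k, h), j] for k in -r:r)
    end
    for j in 1:w, i in 1:h  # horizontal pass
        out[i, j] = sum(g[k + r + 1] * tmp[i, clampi(j + k, w)] for k in -r:r)
    end
    out
end

# Mean SSIM via five blur passes: μx, μy, E[x²], E[y²], E[xy].
function ssim_reference(x, y; C1 = 0.01f0^2, C2 = 0.03f0^2)
    μx, μy = blur(x), blur(y)
    σx², σy² = blur(x .^ 2) .- μx .^ 2, blur(y .^ 2) .- μy .^ 2
    σxy = blur(x .* y) .- μx .* μy
    s = @. ((2μx * μy + C1) * (2σxy + C2)) /
           ((μx^2 + μy^2 + C1) * (σx² + σy² + C2))
    sum(s) / length(s)
end
```

FastCUDASSIM.jl's point is that all five of these passes can share the same tile of shared memory instead of being separate global-memory round trips.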
```julia
# Note: FastCUDASSIM is not registered yet, so add from GitHub.
using CUDA, cuDNN # cuDNN is needed for SSIMLoss's CUDA extension
using ImageQualityIndexes, SSIMLoss, FastCUDASSIM
using ImageCore
using Random, BenchmarkTools

c, h, w, b = 3, 1080, 1920, 1 # number of color channels, image height, width, batch size

println("ImageQualityIndexes (CPU):")
img1_iqi = rand(RGB{Float32}, h, w)
img2_iqi = rand(RGB{Float32}, h, w)
@benchmark(
    assess_ssim($img1_iqi, $img2_iqi; crop = false);
    setup = ( rand!($img1_iqi); rand!($img2_iqi) )
) |> display
println()

println("SSIMLoss (GPU):")
img1_sl = CUDA.rand(Float32, h, w, c, b)
img2_sl = CUDA.rand(Float32, h, w, c, b)
kernel = ssim_kernel(img1_sl)
@benchmark(
    CUDA.@sync SSIMLoss.ssim($img1_sl, $img2_sl, $kernel; crop = false);
    setup = ( rand!($img1_sl); rand!($img2_sl) )
) |> display
println()

println("FastCUDASSIM:")
img1_fcs = CUDA.rand(Float32, c, h, w)
img2_fcs = CUDA.rand(Float32, c, h, w)
@benchmark(
    CUDA.@sync FastCUDASSIM.ssim($img1_fcs, $img2_fcs);
    setup = ( rand!($img1_fcs); rand!($img2_fcs) )
) |> display
```
Results on an Intel i7-7700K CPU and NVIDIA RTX 3070 GPU:
```
ImageQualityIndexes (CPU):
BenchmarkTools.Trial: 9 samples with 1 evaluation per sample.
 Range (min … max):  452.374 ms … 969.059 ms  ┊ GC (min … max): 40.11% … 9.81%
 Time  (median):     542.845 ms               ┊ GC (median):    47.23%
 Time  (mean ± σ):   587.951 ms ± 163.271 ms  ┊ GC (mean ± σ):  41.55% ± 13.82%
 [histogram omitted]
 Memory estimate: 883.72 MiB, allocs estimate: 3063.

SSIMLoss (GPU):
BenchmarkTools.Trial: 82 samples with 1 evaluation per sample.
 Range (min … max):  56.141 ms … 112.839 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     57.190 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   61.553 ms ± 12.072 ms   ┊ GC (mean ± σ):  0.60% ± 1.69%
 [histogram omitted]
 Memory estimate: 109.14 KiB, allocs estimate: 2934.

FastCUDASSIM:
BenchmarkTools.Trial: 4858 samples with 1 evaluation per sample.
 Range (min … max):  913.800 μs … 1.823 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     978.250 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   991.398 μs ± 71.640 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
 Memory estimate: 5.05 KiB, allocs estimate: 294.
```
So FastCUDASSIM.jl is about 60x faster than SSIMLoss.jl, the previously fastest Julia implementation!
Gradients
Unfortunately, we cannot compare against SSIMLoss.jl when benchmarking gradient computation speed, as its gradient currently errors. Presumably this is not the fault of the package itself (it relies on autodiff for its gradients, which should in principle just work automatically), but rather of breaking changes in NNlib.jl or Zygote.jl.
```julia
julia> using Zygote

julia> Zygote.gradient(x -> SSIMLoss.ssim(x, img2_sl), img1_sl)
ERROR: Gradient Thunk(ChainRules.var"#...) should be a tuple
[...]
```
On the Benchmarks page of the documentation, we therefore only compare against two Python packages.
Flexible inputs
In the benchmark above, we used a CuArray{Float32} of size (c, h, w, b), which is the image format we primarily had in mind when writing the package (in its current form). But thanks to some internal helper functions, the channels and batch axes are optional. Additionally, thanks to multiple dispatch, duck typing, and some more helper functions, other eltypes like RGB{N0f8} are also supported:
```julia
julia> using TestImages; using FastCUDASSIM: ssim # (not ssim from SSIMLoss)

julia> ssim(cu(testimage("cameraman.tif")), cu(testimage("mandril_gray.tif")))
0.16261981f0
```
In particular, you can just load images with Images.jl, put them on the GPU using cu, and start using our methods. You don't have to worry about converting to Float32 first or permuting the axes. We do expect normalised images, with intensities in [0, 1].
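Conceptually, the input handling amounts to something like the sketch below (the package's actual helper functions differ; to_chw is a hypothetical name for illustration): colorant eltypes are converted to Float32 and a leading channels axis is exposed, which is why RGB{N0f8} images "just work".

```julia
using ImageCore  # for Colorant, channelview, RGB, N0f8

# Hypothetical helpers, not FastCUDASSIM's real API:
# colorants become a (channels, height, width) Float32 array...
to_chw(img::AbstractMatrix{<:Colorant}) = Float32.(channelview(img))
# ...and plain grayscale matrices get a singleton channels axis.
to_chw(img::AbstractMatrix{<:Real}) = reshape(Float32.(img), 1, size(img)...)

img = rand(RGB{N0f8}, 4, 6)
A = to_chw(img)
size(A), eltype(A)  # ((3, 4, 6), Float32)
```

Since N0f8 already encodes intensities in [0, 1], the resulting Float32 values are automatically normalised.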
Automatic differentiation
We provide ChainRules.jl rrule methods for ssim and dssim (where DSSIM = 1 - SSIM, the same as SSIMLoss.jl's ssim_loss). These wrap our ssim_with_gradient and dssim_with_gradient, so Zygote.jl is supported out of the box.
```julia
julia> using Zygote

julia> imgs1 = CUDA.rand(1, 10, 20, 8); imgs2 = CUDA.rand(size(imgs1)...);

julia> Zygote.withgradient(x -> sum(dssim(x, imgs2)), imgs1)
(val = 5.7896996f0, grad = (Float32[0.0054326034 ...]))
```
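The pattern behind such an rrule can be sketched as follows, with placeholder math standing in for the real DSSIM (my_dssim and my_dssim_with_gradient below are illustrative stand-ins, not FastCUDASSIM's functions): the forward call also produces the gradient, and the pullback merely scales it, so autodiff never has to trace through the CUDA kernels.

```julia
using ChainRulesCore

# Placeholder loss: mean squared difference, NOT the package's real DSSIM.
my_dssim(x, y) = sum(abs2, x .- y) / length(x)
# The "with_gradient" variant returns the value plus the gradient w.r.t. x.
my_dssim_with_gradient(x, y) = (my_dssim(x, y), 2 .* (x .- y) ./ length(x))

function ChainRulesCore.rrule(::typeof(my_dssim), x, y)
    val, ∇x = my_dssim_with_gradient(x, y)     # forward pass computes both
    pullback(Δ) = (NoTangent(), Δ .* ∇x, NoTangent())  # gradient w.r.t. x only
    return val, pullback
end
```

Zygote picks up rrule methods automatically, so this is all that is needed for Zygote.gradient to work on the wrapped function.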
Next steps
From short-term to long-term:
- FastCUDASSIM.jl is not yet registered, so currently you need to install it via `] add https://github.com/LaurensDiels/FastCUDASSIM.jl`. Assuming there are no complaints about the name or API, I will bump it to v1.0.0 and register it in the near future.
- In the Python world there exists the package fused-ssim, which employs a similar strategy and achieves similar performance. Well, actually it is slightly more efficient (see the benchmark results in the documentation). We could probably close the gap by incorporating the last two bullet points from the fused-ssim repo: exploiting the symmetry of the (1D) Gaussian kernel, and merging the five forward-pass convolutions into one (and similarly for the backward pass). For Float32 3-channel images this would still fit in normal shared memory, so it's definitely worth a try.
  (On a side note, both FastCUDASSIM.jl and fused-ssim use images in (channels, height, width) order. But Python land is row-major, so the memory order is actually completely opposite. In particular, for FastCUDASSIM.jl we cannot easily separate out the channels axis, which results in shared memory sizes scaling with it.)
- From what I understand, Enzyme.jl and Mooncake.jl are supplanting Zygote.jl as the go-to AD frameworks, so it makes sense to provide first-class support for them instead of passing through ChainRules.jl. (Importing the rrules caused freezing for me, so that did not even work. Also, my initial attempts at implementing Enzyme's augmented_primal and reverse always caused hard crashes, not exactly the ideal development environment. But I can keep trying.)
- Port the code to KernelAbstractions.jl, resulting in a new package simply called FastSSIM.jl, I guess. I don't think we're doing anything CUDA-specific in the code, so this should be possible. At this time I don't have access to AMD/Intel/Apple GPUs, though, so I wouldn't be able to test it.
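To illustrate the portability idea, here is a toy KernelAbstractions.jl kernel (not FastCUDASSIM code): the same kernel definition runs on the CPU backend below, and would run unchanged on CUDABackend() when handed CuArrays.

```julia
using KernelAbstractions

# Toy kernel: y[i] = a * x[i]. FastCUDASSIM's kernels would be ported
# to this vendor-agnostic style.
@kernel function scale!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] = a * x[i]
end

x = ones(Float32, 8)
y = zeros(Float32, 8)
backend = KernelAbstractions.get_backend(y)      # CPU() here; CUDABackend() for CuArrays
scale!(backend)(y, 2f0, x; ndrange = length(y))  # launch over all 8 indices
KernelAbstractions.synchronize(backend)
```

KernelAbstractions also exposes shared memory via @localmem, so the shared-memory tiling strategy should carry over.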
Since this is my first package, it's certainly possible I'm not following best practices everywhere. Feel free to suggest improvements! Also, the const settings, in particular TILE_HEIGHT and TILE_WIDTH, were tuned on a single GPU, the aforementioned RTX 3070. Given that warps have size 32, using TILE_HEIGHT == 32 probably makes sense, but I'm not sure why TILE_WIDTH == 8 provides a sweet spot (and I have some arguments why it shouldn't). So if anyone could verify that these values also work best on their GPUs, that'd be helpful.