How to accelerate the imfilter() operation?

My task involves hundreds of thousands of convolution operations like the following one. Each currently costs 11 s. I wonder if anyone can help accelerate it.

using ImageFiltering
@time imfilter(Float32, rand(Int8, 300, 300, 300), rand(Int8, 100, 100, 100));

11.034938 seconds (83 allocations: 3.212 GiB, 4.00% gc time)

One approach that may help is running the convolution on the GPU, but

imfilter(ArrayFireLibs(), rand(Int8, 300, 300, 300), rand(Int8, 100, 100, 100))

does not seem to work, since the ArrayFire package cannot be installed with add ArrayFire.
Any help is greatly appreciated.

Your case is using “images” with many channels.
Your best bet might be a deep-learning-based convolution kernel.
Those are usually optimized for small kernels, but it is still worth trying.

The imfilter function is supposedly multithreaded, but my 8-core machine runs at 24% utilization for your convolution example (on Julia v1.11).

Can your hundreds of thousands of convolutions be done in parallel? You might get better utilization this way. Unfortunately imfilter allocates a ton - this usually results in poor multithreading performance, since the garbage collector is not fully multithreaded.

There is an in-place version, imfilter!, but it allocates only marginally less than imfilter.
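If the convolutions really are independent, a minimal sketch of batching them across threads might look like the following (sizes shrunk purely for illustration; note that imfilter itself may spawn threads, and the per-call allocations mean GC pressure can limit scaling):

```julia
using ImageFiltering

# Hypothetical batch: many independent volumes filtered with the same kernel.
# Sizes are reduced here purely for illustration.
volumes = [rand(Float32, 50, 50, 50) for _ in 1:8]
kernel = centered(rand(Float32, 9, 9, 9))

results = Vector{Array{Float32,3}}(undef, length(volumes))
Threads.@threads for i in eachindex(volumes)
    # each iteration allocates independently; GC pressure may limit scaling
    results[i] = imfilter(volumes[i], kernel)
end
```

Start Julia with e.g. `julia -t 8` so that `Threads.nthreads()` is greater than 1.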

julia> @benchmark imfilter(Float32,$a,$b)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.246 s …    3.435 s  ┊ GC (min … max): 4.08% … 9.22%
 Time  (median):     3.340 s               ┊ GC (median):    6.72%
 Time  (mean ± σ):   3.340 s ± 133.388 ms  ┊ GC (mean ± σ):  6.72% ± 3.64%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.25 s         Histogram: frequency by time         3.43 s <

 Memory estimate: 3.19 GiB, allocs estimate: 117.

julia> @benchmark imfilter!($blank,$a,$b)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.023 s …    3.206 s  ┊ GC (min … max): 4.82% … 9.61%
 Time  (median):     3.115 s               ┊ GC (median):    7.28%
 Time  (mean ± σ):   3.115 s ± 129.296 ms  ┊ GC (mean ± σ):  7.28% ± 3.39%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.02 s         Histogram: frequency by time         3.21 s <

 Memory estimate: 3.09 GiB, allocs estimate: 114.

Given the size of your filter kernel, imfilter! is most probably using the FFT, in which case this function seems to be the source of the allocations:

function filtfft(A, krn)
    B = rfft(A)
    B .*= conj!(rfft(krn))
    irfft(B, length(axes(A, 1)))
end

If you could convince the authors of ImageFiltering.jl to make an in place version of this function, or write one yourself and submit a PR, you might get much better performance.
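As a hedged sketch of what such a reduced-allocation variant could look like (this is my own assumption of the approach, not ImageFiltering's actual code: it precomputes FFTW plans and the kernel spectrum once, reuses them across same-sized inputs, and performs circular convolution like filtfft; the kernel is assumed to be zero-padded to the size of A already):

```julia
using FFTW, LinearAlgebra

# Hypothetical helper, not part of ImageFiltering: precompute plans and the
# kernel spectrum once, then reuse them across many same-sized convolutions.
struct PlannedConv{P,IP,B}
    fwd::P    # precomputed rfft plan
    inv::IP   # precomputed irfft plan
    Bkrn::B   # conjugated frequency-domain kernel
end

function PlannedConv(A::AbstractArray{<:Real}, krn::AbstractArray{<:Real})
    # krn is assumed to be zero-padded to size(A) already
    fwd = plan_rfft(A)
    Bkrn = conj!(fwd * krn)
    inv = plan_irfft(fwd * A, size(A, 1))
    PlannedConv(fwd, inv, Bkrn)
end

function conv!(out, pc::PlannedConv, A)
    B = pc.fwd * A        # this still allocates the spectrum of A
    B .*= pc.Bkrn
    mul!(out, pc.inv, B)  # inverse transform written directly into `out`
end
```

The one-off planning cost is amortized over the hundreds of thousands of calls, which is exactly the regime in the OP.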

I'm not sure why you're reviving an 11-month-old question, but that imfilter call is a 3D filter (i.e. three spatial dimensions) with a huge kernel, and in deep learning parlance it has a single channel, so it seems unlikely that any deep learning library would be of much help.

I am not sure what you mean.
If imfilter applies a 3D convolution in the case above, then the OP should use Conv3d (borrowing the naming from PyTorch).

In my post I already mentioned that NN operators are usually optimized for small kernels, so in the case above they might not be useful.
In light of that, I'm not sure what your comment adds. Could you elaborate?


Deep learning libraries are specialized for small kernels and multiple channels, neither of which is the case here. I mostly wanted to point out that the claim that this operation worked on many channels was a misunderstanding.

Addendum: Apologies for the inappropriate tone of my first message. I could have conveyed my points more professionally.

Hi,

In the OP I see rand(Int8, 300, 300, 300), which is a 300-channel image.
That is what triggered me to suggest DL libraries. I'd still give it a try.

It's a matter of terminology, but deep learning libraries distinguish between spatial dimensions and channels. A three-dimensional array could be two spatial dimensions and a channel dimension, but in this case it isn't, since imfilter only works with spatial dimensions. Yes, you could append a singleton channel dimension and send it to a deep learning library with support for 3D convolutions and get a result, but it's far from the use cases those libraries are optimized for.
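For what it's worth, the singleton-channel reshaping being discussed can be illustrated with NNlib's conv (a small sketch with tiny sizes; as noted above, direct convolution at the original 300³/100³ sizes would be far too slow to be practical):

```julia
using NNlib

# NNlib expects (spatial..., channels, batch); append singleton dimensions
x = reshape(rand(Float32, 16, 16, 16), 16, 16, 16, 1, 1)  # one channel, one sample
w = reshape(rand(Float32, 5, 5, 5), 5, 5, 5, 1, 1)        # in/out channels = 1

# "valid" 3D convolution: spatial output size is 16 - 5 + 1 = 12 per dimension
y = NNlib.conv(x, w)
```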

Without having tested it, my guess is that the fastest approach for such a big kernel is FFT-based.
This also runs on CUDA; below is a code snippet:

using NDTools, CUDA, FFTW

# untested: FFT-based circular convolution along all three dimensions
conv(x, y; dims=(1, 2, 3)) = irfft(rfft(x, dims) .* rfft(y, dims), size(x, dims[1]), dims)

# FFTW needs floating-point input, so convert the Int8 data first
x = Float32.(rand(Int8, 300, 300, 300))
y = Float32.(rand(Int8, 100, 100, 100))

# if the kernel is large in space, you might need zero padding before
# and cropping after to avoid circular wrap-arounds with FFT-based convolutions
round.(Int, conv(x, select_region(y, new_size=size(x))))
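As an aside, if linear (zero-padded) convolution on the CPU is what's wanted, DSP.jl's conv already takes the FFT route for large inputs with the padding handled internally (a hedged pointer, not benchmarked here; small sizes for illustration):

```julia
using DSP

x = rand(Float32, 30, 30, 30)
k = rand(Float32, 10, 10, 10)

# DSP.conv performs full linear convolution of N-dimensional arrays,
# choosing an FFT-based algorithm for large inputs;
# the output size is size(x) .+ size(k) .- 1
out = DSP.conv(x, k)
size(out)
```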