How to accelerate the imfilter() operation?

My task involves hundreds of thousands of convolution operations like the following one. Each currently costs 11 s. I wonder if anyone can help accelerate it.

using ImageFiltering
@time imfilter(Float32, rand(Int8, 300, 300, 300), rand(Int8, 100, 100, 100));

11.034938 seconds (83 allocations: 3.212 GiB, 4.00% gc time)

One approach that may help is running the convolution on the GPU, but

imfilter(ArrayFireLibs(), rand(Int8, 300, 300, 300), rand(Int8, 100, 100, 100))

does not seem to work, since the ArrayFire package cannot be installed with add ArrayFire.
Any help is greatly appreciated.

Your case is using “images” with many channels.
Your best bet might be a deep-learning-based convolution kernel.
Those are usually optimized for small kernels, but it is still worth trying.

The imfilter function is supposedly multithreaded, but my 8-core machine runs at 24% utilization for your convolution example (on Julia v1.11).

Can your hundreds of thousands of convolutions be done in parallel? You might get better utilization this way. Unfortunately imfilter allocates a ton - this usually results in poor multithreading performance, since the garbage collector is not fully multithreaded.

There is an in-place version, imfilter!, but it allocates only marginally less than imfilter.
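If the convolutions really are independent, a minimal sketch of batching them across threads might look like the following (sizes shrunk purely for illustration; note that imfilter itself may spawn threads, and the per-call allocations mean GC pressure can limit scaling):

```julia
using ImageFiltering

# Hypothetical batch: many independent volumes filtered with the same kernel.
# Sizes are reduced here purely for illustration.
volumes = [rand(Float32, 50, 50, 50) for _ in 1:8]
kernel = centered(rand(Float32, 9, 9, 9))

results = Vector{Array{Float32,3}}(undef, length(volumes))
Threads.@threads for i in eachindex(volumes)
    # each iteration allocates independently; GC pressure may limit scaling
    results[i] = imfilter(volumes[i], kernel)
end
```

Start Julia with e.g. `julia -t 8` so that `Threads.nthreads()` is greater than 1.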

julia> @benchmark imfilter(Float32,$a,$b)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.246 s …    3.435 s  ┊ GC (min … max): 4.08% … 9.22%
 Time  (median):     3.340 s               ┊ GC (median):    6.72%
 Time  (mean ± σ):   3.340 s ± 133.388 ms  ┊ GC (mean ± σ):  6.72% ± 3.64%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.25 s         Histogram: frequency by time         3.43 s <

 Memory estimate: 3.19 GiB, allocs estimate: 117.

julia> @benchmark imfilter!($blank,$a,$b)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.023 s …    3.206 s  ┊ GC (min … max): 4.82% … 9.61%
 Time  (median):     3.115 s               ┊ GC (median):    7.28%
 Time  (mean ± σ):   3.115 s ± 129.296 ms  ┊ GC (mean ± σ):  7.28% ± 3.39%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.02 s         Histogram: frequency by time         3.21 s <

 Memory estimate: 3.09 GiB, allocs estimate: 114.

Given the size of your filter kernel, imfilter! is most probably using the FFT, in which case this function seems to be the source of the allocations:

function filtfft(A, krn)
    B = rfft(A)
    B .*= conj!(rfft(krn))
    irfft(B, length(axes(A, 1)))
end

If you could convince the authors of ImageFiltering.jl to make an in place version of this function, or write one yourself and submit a PR, you might get much better performance.
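As a hedged sketch of what such a reduced-allocation variant could look like (this is my own assumption of the approach, not ImageFiltering's actual code: it precomputes FFTW plans and the kernel spectrum once, reuses them across same-sized inputs, and performs circular convolution like filtfft; the kernel is assumed to be zero-padded to the size of A already):

```julia
using FFTW, LinearAlgebra

# Hypothetical helper, not part of ImageFiltering: precompute plans and the
# kernel spectrum once, then reuse them across many same-sized convolutions.
struct PlannedConv{P,IP,B}
    fwd::P    # precomputed rfft plan
    inv::IP   # precomputed irfft plan
    Bkrn::B   # conjugated frequency-domain kernel
end

function PlannedConv(A::AbstractArray{<:Real}, krn::AbstractArray{<:Real})
    # krn is assumed to be zero-padded to size(A) already
    fwd = plan_rfft(A)
    Bkrn = conj!(fwd * krn)
    inv = plan_irfft(fwd * A, size(A, 1))
    PlannedConv(fwd, inv, Bkrn)
end

function conv!(out, pc::PlannedConv, A)
    B = pc.fwd * A        # this still allocates the spectrum of A
    B .*= pc.Bkrn
    mul!(out, pc.inv, B)  # inverse transform written directly into `out`
end
```

The one-off planning cost is amortized over the hundreds of thousands of calls, which is exactly the regime in the OP.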

I'm not sure why you're reviving an 11-month-old question, but that imfilter call is a 3D filter (i.e. three spatial dimensions) with a huge kernel, and in deep learning parlance it has a single channel, so it seems unlikely that any deep learning library would be of much help.

I am not sure what you mean.
If imfilter applies a 3D convolution in the case above, then the OP should use Conv3d (borrowing the naming from PyTorch).

In my post I already mentioned that NN operators are usually optimized for small kernels, so in the case above they might not be useful.
In light of that, I'm not sure what your comment adds. Could you elaborate?


Deep learning libraries are specialized for small kernels and multiple channels, neither of which is the case here. I mostly wanted to point out that the claim that this operation worked on many channels was a misunderstanding.

Addendum: Apologies for the inappropriate tone of my first message. I could have conveyed my points more professionally.

Hi,

In the OP I see rand(Int8, 300, 300, 300), which is a 300-channel image.
That is what triggered me to suggest DL libraries. I'd still give it a try.

It's a matter of terminology, but deep learning libraries distinguish between spatial dimensions and channels. A three-dimensional array could be two spatial dimensions and a channel dimension, but in this case it isn't, since imfilter only works with spatial dimensions. Yes, you could append a singleton channel dimension and send it to a deep learning library with support for 3D convolutions and get a result, but it's far from the use cases those libraries are optimized for.
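For what it's worth, the singleton-channel reshaping being discussed can be illustrated with NNlib's conv (a small sketch with tiny sizes; as noted above, direct convolution at the original 300³/100³ sizes would be far too slow to be practical):

```julia
using NNlib

# NNlib expects (spatial..., channels, batch); append singleton dimensions
x = reshape(rand(Float32, 16, 16, 16), 16, 16, 16, 1, 1)  # one channel, one sample
w = reshape(rand(Float32, 5, 5, 5), 5, 5, 5, 1, 1)        # in/out channels = 1

# "valid" 3D convolution: spatial output size is 16 - 5 + 1 = 12 per dimension
y = NNlib.conv(x, w)
```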

Without having tested it, my guess is that the fastest approach for such a big kernel is FFT-based.
This also runs on CUDA; below is a code snippet:

using NDTools, CUDA, FFTW

# untested: FFT-based circular convolution along all three dimensions
conv(x, y; dims=(1, 2, 3)) = irfft(rfft(x, dims) .* rfft(y, dims), size(x, dims[1]), dims)

# FFTW needs floating-point input, so convert the Int8 data first
x = Float32.(rand(Int8, 300, 300, 300))
y = Float32.(rand(Int8, 100, 100, 100))

# if the kernel is large in space, you might need zero padding before
# and cropping after to avoid circular wrap-arounds with FFT-based convolutions
round.(Int, conv(x, select_region(y, new_size=size(x))))
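As an aside, if linear (zero-padded) convolution on the CPU is what's wanted, DSP.jl's conv already takes the FFT route for large inputs with the padding handled internally (a hedged pointer, not benchmarked here; small sizes for illustration):

```julia
using DSP

x = rand(Float32, 30, 30, 30)
k = rand(Float32, 10, 10, 10)

# DSP.conv performs full linear convolution of N-dimensional arrays,
# choosing an FFT-based algorithm for large inputs;
# the output size is size(x) .+ size(k) .- 1
out = DSP.conv(x, k)
size(out)
```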