I would like to filter hundreds of thousands of small images (say 35x35), using 5x5 kernels (that do not factor). I only want the “good data,” meaning the 31x31 filtered image that comes from correlating the filter at each location in the image without padding or wrapping.

I am hoping for some advice about how to do this in a computationally efficient manner in Julia. I’ve been a little overwhelmed by the number of ways to skin this cat, and am wondering what the best package/methods would be.

I suggest tiling the small images to form larger images and then computing the convolution using FFT.
I would try to get it as close to a square image as possible for the best efficiency.
Since you do not care about the boundary values, no padding between the image tiles will be needed.
Experiment to find the best tiling size to balance the cost of making the tiling vs the throughput speedup of larger FFTs.

Alternatively, consider it as a 35xm matrix, apply 1D FFTs on the 35 side, then transpose, apply 1D FFTs, apply the precomputed kernels in the Fourier domain, and do the inverse FFTs as on the first step.

Stencils.jl may be suitable, you can do the same trick of tiling the images, then move them to your GPU. Direct might beat the FFT if its GPU vs CPU and your kernel is only 5x5.

GPU and even parallel CPU will be slow for images that small in Stencils.jl - it needs a little more workload than that.