Using BenchmarkTools with CUDAnative and CuArrays and running out of CPU or GPU memory

I’m doing some simple “get acquainted” experimenting with convolutions using DSP, CUDAnative, CUDAdrv, and CuArrays. I create random 3-D arrays with rand(Float32, N, N, N) and then make “device” versions of them by calling cu:

using DSP, CUDAnative, CUDAdrv, CuArrays

N = 64;                      # also tried 120; see below
A = rand(Float32, N, N, N);
B = rand(Float32, N, N, N);
A_d = cu(A);
B_d = cu(B);

I’ve written a simple function to perform a convolution on a pair of arrays:

function cuFFT(A, B)
    C = conv(A, B)    # FFT-based convolution from DSP
    finalize(C)       # eagerly release the result's buffer
    C = nothing
end

Finally, I use BenchmarkTools’s @benchmark macro:

@benchmark cuFFT($A_d, $B_d)

If I set N to, say, 64, Julia returns this error:

ERROR: LoadError: CUFFTError(code 2, cuFFT failed to allocate GPU or CPU memory)

However, if I set N to 120, my script runs to completion.

When I originally posted my question, I was calling fft() directly. As I continued experimenting, I found that I was getting inconsistent results from run to run. However, once I loaded DSP and called conv() instead, the behavior became consistent, along with the new puzzle that a larger N doesn’t crash when a smaller N does. I also realized that I’d been assuming the problem was GPU memory, even though the error message says “CPU or GPU”.
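To tell which side is actually being exhausted, something along these lines should report the free device memory around each call (a sketch; I’m assuming CUDAdrv’s Mem.info() is the right query here and haven’t verified it’s the canonical way):

using CUDAdrv

free, total = CUDAdrv.Mem.info()   # bytes of free / total device memory
println("GPU memory: $(free / 2^20) MiB free of $(total / 2^20) MiB")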

My question: is there some problem in the way that I’m calling DSP.conv(), or some setup that I need to do with BenchmarkTools?
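For what it’s worth, the only BenchmarkTools adjustment I can think of is forcing a GC before each sample and capping how long the benchmark runs, along these lines (a sketch; setup, samples, and seconds are standard @benchmark keywords, but I don’t know whether this is the recommended fix):

using BenchmarkTools

# Run a full GC before each sample so finalized device arrays are actually freed,
# and cap the number of samples so intermediate buffers don't pile up.
@benchmark cuFFT($A_d, $B_d) setup=(GC.gc()) samples=100 seconds=10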