Hi there! I’m benchmarking a GPU implementation of a function that downconverts a signal using a replica of the carrier signal. The original form of the function in the library looks like this:
julia> function downconvert!(
           downconverted_signal::StructArray,
           signal::StructArray,
           carrier::StructArray,
           start_sample::Integer,
           num_samples_left::Integer
       )
           # Process only the requested sample range
           for i = start_sample:num_samples_left + start_sample - 1
               downconverted_signal.re[i] = signal.re[i] * carrier.re[i] + signal.im[i] * carrier.im[i]
               downconverted_signal.im[i] = signal.im[i] * carrier.re[i] - signal.re[i] * carrier.im[i]
           end
           return downconverted_signal
       end
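For reference, the loop body is the element-wise product of the signal with the complex conjugate of the carrier. Here is a minimal CPU-only sketch of that equivalence, assuming plain `Vector{ComplexF32}` inputs. The helper names `downconvert_ref` and `downconvert_conj` are hypothetical and not part of the library:

```julia
# Hypothetical reference helpers, not part of the library.

# Component-wise form, mirroring the loop body above.
function downconvert_ref(signal::Vector{ComplexF32}, carrier::Vector{ComplexF32})
    re_part = real.(signal) .* real.(carrier) .+ imag.(signal) .* imag.(carrier)
    im_part = imag.(signal) .* real.(carrier) .- real.(signal) .* imag.(carrier)
    return complex.(re_part, im_part)
end

# The same arithmetic written as multiplication by the conjugated carrier.
downconvert_conj(signal, carrier) = signal .* conj.(carrier)

signal  = ComplexF32[1 + 2im, 3 - 1im, -2 + 0.5im]
carrier = ComplexF32[0 + 1im, 1 + 1im, 0.5 - 2im]

# Both forms should agree up to floating-point rounding.
@assert downconvert_ref(signal, carrier) ≈ downconvert_conj(signal, carrier)
```

I have only checked this on the CPU with small inputs; whether the `conj` form makes any difference on the GPU is not something I have measured.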
I’m trying out CuArrays to see whether they improve the performance. It doesn’t matter much whether the downconversion is restricted to a specific range of the signal, so consider the for loop irrelevant. This is what I came up with:
julia> function downconvert2!(
           downconverted_signal::CuArray{Complex{Float32}},
           signal::CuArray{Complex{Float32}},
           carrier::CuArray{Complex{Float32}},
           start_sample::Integer,
           num_samples_left::Integer
       )
           # The trailing `+` keeps the imaginary part inside the same broadcast expression;
           # a line that starts with `+` would be parsed as a separate statement.
           @. downconverted_signal = (real(signal) * real(carrier) + imag(signal) * imag(carrier)) +
               1im * (imag(signal) * real(carrier) - real(signal) * imag(carrier))
           return downconverted_signal
       end
julia> @benchmark downconvert2!(gpu_downconverted_signal, gpu_signal, gpu_carrier, 1, 2500)
BenchmarkTools.Trial:
memory estimate: 7.34 KiB
allocs estimate: 118
--------------
minimum time: 87.460 μs (0.00% GC)
median time: 100.357 μs (0.00% GC)
mean time: 104.403 μs (0.86% GC)
maximum time: 9.551 ms (94.39% GC)
--------------
samples: 10000
evals/sample: 1
To fit the original style, I then packed the CuArrays into a StructArray. This is quite advantageous because the CPU functions depend on the signals being kept in a StructArray, and it simplifies the multiple dispatch declarations. This is also why the unused variables are still kept as parameters.
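For context, the wrapping might look like the following sketch. It assumes the StructArrays package is installed and uses plain `Vector{Float32}` components as stand-ins so it runs without a GPU; in my setup the components would be CuArrays instead. The name `s_signal` is only illustrative:

```julia
using StructArrays

# Plain Vectors stand in for the CuArrays here (assumption, so the sketch
# runs without a GPU); the component arrays would be CuArray{Float32} otherwise.
n = 2500
re_part = rand(Float32, n)
im_part = rand(Float32, n)

# One array per component; `.re` and `.im` give access to the component arrays.
s_signal = StructArray{Complex{Float32}}((re_part, im_part))

@assert s_signal.re == re_part
@assert s_signal.im == im_part
@assert s_signal[1] == complex(re_part[1], im_part[1])
```

The real and imaginary parts live in two separate arrays, which is why the function below updates `.re` and `.im` in two separate broadcast statements.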
julia> function downconvert2!(
           downconverted_signal::StructArray,
           signal::StructArray,
           carrier::StructArray,
           start_sample::Integer,
           num_samples_left::Integer
       )
           @. downconverted_signal.re = signal.re * carrier.re + signal.im * carrier.im
           @. downconverted_signal.im = signal.im * carrier.re - signal.re * carrier.im
           return downconverted_signal
       end
julia> @benchmark downconvert2!(s_gpu_downconverted_signal, s_gpu_signal, s_gpu_carrier, 1, 2500)
BenchmarkTools.Trial:
memory estimate: 7.72 KiB
allocs estimate: 162
--------------
minimum time: 145.736 μs (0.00% GC)
median time: 161.017 μs (0.00% GC)
mean time: 166.046 μs (0.72% GC)
maximum time: 12.791 ms (93.01% GC)
--------------
samples: 10000
evals/sample: 1
As the results show, the StructArray of CuArrays is about 60% slower (median 161 μs vs. 100 μs). Is there any way of making it faster? Am I missing something? Thanks!
Edit 1: Corrected variable names