Hi Everyone!

I am recently porting some functions from Matlab to Julia. Julia is mostly amazing except this function that I cannot get it to a similar speed as Matlab.

It basically takes a few input vectors and constructs a 3-dimensional matrix with summation along the first dimension of input arrays. I wrote three versions: 1st with broadcast and matrix multiplication, 2nd with multithreads loop and broadcast, and 3rd with multithreads explicit loop. What I don’t understand is that the explicit loop with the least allocation performs the worst.

Here is the code:

```
function combine_f_mul(f, k_x, k_y, x, y)
ndim1 = length(x)
ndim2 = length(y)
ndim3 = size(f,2)
nsum = size(f,1)
f_out = zeros(eltype(f),ndim1,ndim2,ndim3) ;
y = transpose(y)
@inbounds for i = 1:nsum
@fastmath f_out .+= reshape(f[i,:],1,1,ndim3).*(exp.(1im*k_x[i].*x)*exp.(1im*k_y[i].*y))
end
return f_out
end
function combine_f_threads(f, k_x, k_y, x, y)
ndim1 = length(x)
ndim2 = length(y)
ndim3 = size(f,2)
nsum = size(f,1)
f_out = zeros(eltype(f),ndim1,ndim2,ndim3) ;
for k = 1:ndim3
Threads.@threads for j = 1:ndim2
for s = 1:nsum
@inbounds @fastmath @views f_out[:,j,k] .+= f[s,k]*exp(1im*k_y[s]*y[j]).*exp.(1im*k_x[s].*x)
end
end
end
return f_out
end
function combine_f_threads_(f, k_x, k_y, x, y)
ndim1 = length(x)
ndim2 = length(y)
ndim3 = size(f,2)
nsum = size(f,1)
f_out = Array{eltype(f),3}(undef,ndim1,ndim2,ndim3)
Threads.@threads for id in CartesianIndices(f_out)
i, j, k = getindex(id, 1), getindex(id,2), getindex(id,3)
acc = zero(eltype(f))
@simd for s = 1:nsum
@inbounds @fastmath acc += f[s,k]*exp(1im*k_x[s]*x[i])*exp(1im*k_y[s]*y[j])
end
f_out[id] = acc
end
return f_out
end
fsize = 1003;
f = rand(ComplexF64, fsize,2);
k_x = rand(ComplexF64, fsize); k_y = rand(ComplexF64, fsize);
x = collect(1:403); y = collect(1:203);
julia> @btime combine_f_mul($f, $k_x, $k_y, $x, $y);
492.731 ms (9029 allocations: 1.23 GiB)
julia> @btime combine_f_threads($f, $k_x, $k_y, $x, $y);
412.851 ms (814525 allocations: 52.22 MiB)
julia> @btime combine_f_threads_($f, $k_x, $k_y, $x, $y);
997.155 ms (46 allocations: 2.50 MiB)
```

original Matlab version:

```
function f_out = combine_f(f, k_x, k_y, x, y)
f_out = zeros(length(x), length(y), size(f,2)) ;
y = y.' ;
f = reshape(f,size(f,1),1,size(f,2));
for i = 1 : size(f,1)
f_out = f_out + f(i,1,:).*exp(1i*k_x(i)*x).*exp(1i*k_y(i)*y) ;
end
end
```

However, the above Matlab function with elementwise operation can do the job in only 179.2 ms (more than 2 times faster than my Julia implementation).

I also tried other packages such as Strided, LoopVectorization, Tullio. Those did not help much neither. The CPU does not support AVX512.

Any help?