# Help improving a function that constructs 3D matrix

Hi Everyone!
I am currently porting some functions from Matlab to Julia. Julia has mostly been amazing, except for this one function that I cannot get to run at a speed similar to Matlab's.
It takes a few input vectors and constructs a 3-dimensional array, summing along the first dimension of the input arrays. I wrote three versions: the first uses broadcasting and matrix multiplication, the second a multithreaded loop with broadcasting, and the third a multithreaded explicit loop. What I don't understand is that the explicit loop, which has the fewest allocations, performs the worst.
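In other words, writing what the loops below compute as a formula, for output indices $a$, $b$, $c$:

$$
f_{\mathrm{out}}[a,b,c] \;=\; \sum_{s=1}^{n_{\mathrm{sum}}} f[s,c]\,\exp\!\big(i\,k_x[s]\,x[a]\big)\,\exp\!\big(i\,k_y[s]\,y[b]\big)
$$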
Here is the code:

``````
function combine_f_mul(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nsum = size(f,1)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3)
    y = transpose(y)
    @inbounds for i = 1:nsum
        @fastmath f_out .+= reshape(f[i,:],1,1,ndim3) .* (exp.(1im*k_x[i].*x) * exp.(1im*k_y[i].*y))
    end
    return f_out
end

function combine_f_threads(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nsum = size(f,1)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3)
    for k = 1:ndim3
        Threads.@threads for j = 1:ndim2
            for s = 1:nsum
                @inbounds @fastmath @views f_out[:,j,k] .+= f[s,k]*exp(1im*k_y[s]*y[j]).*exp.(1im*k_x[s].*x)
            end
        end
    end
    return f_out
end

function combine_f_threads_(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nsum = size(f,1)
    f_out = Array{eltype(f),3}(undef, ndim1, ndim2, ndim3)
    Threads.@threads for id in CartesianIndices(f_out)
        i, j, k = Tuple(id)
        acc = zero(eltype(f))
        @simd for s = 1:nsum
            @inbounds @fastmath acc += f[s,k]*exp(1im*k_x[s]*x[i])*exp(1im*k_y[s]*y[j])
        end
        f_out[id] = acc
    end
    return f_out
end

using BenchmarkTools

fsize = 1003;
f = rand(ComplexF64, fsize, 2);
k_x = rand(ComplexF64, fsize); k_y = rand(ComplexF64, fsize);
x = collect(1:403); y = collect(1:203);

julia> @btime combine_f_mul($f, $k_x, $k_y, $x, $y);
492.731 ms (9029 allocations: 1.23 GiB)

julia> @btime combine_f_threads($f, $k_x, $k_y, $x, $y);
412.851 ms (814525 allocations: 52.22 MiB)

julia> @btime combine_f_threads_($f, $k_x, $k_y, $x, $y);
997.155 ms (46 allocations: 2.50 MiB)
``````

original Matlab version:

``````
function f_out = combine_f(f, k_x, k_y, x, y)
    f_out = zeros(length(x), length(y), size(f,2)) ;
    y = y.' ;
    f = reshape(f, size(f,1), 1, size(f,2)) ;
    for i = 1 : size(f,1)
        f_out = f_out + f(i,1,:).*exp(1i*k_x(i)*x).*exp(1i*k_y(i)*y) ;
    end
end
``````

However, the above Matlab function with elementwise operations does the job in only 179.2 ms (more than twice as fast as my Julia implementations).
I also tried other packages such as Strided, LoopVectorization, and Tullio; those did not help much either (a naive Tullio version is shown below for reference). The CPU does not support AVX512.
Any help is appreciated!
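For reference, the naive Tullio formulation I mean is roughly this one-liner (not necessarily exactly what I benchmarked); note that `exp` ends up inside the innermost loops, so it is evaluated for every combination of indices rather than being cached:

``````
using Tullio

# s appears only on the right-hand side, so it is summed over
@tullio f_out[a,b,c] := f[s,c] * exp(im*k_x[s]*x[a]) * exp(im*k_y[s]*y[b])
``````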

Notice that your first version constructs a vector `exp.(1im*k_x[i].*x)` and a transposed vector `exp.(1im*k_y[i].*y)`, and then multiplies them with `*` to make a new matrix, on every iteration. But replacing that `*` with `.*` is much slower, because then the `exp.()` broadcasts get fused into the larger one, and `exp` is run N^2 times instead of 2N times per `i`. I think your explicit loop version has the same issue.
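To make the difference concrete, here is a small sketch (variable names are mine) of the two formulations for a single `i`:

``````
# Outer product: exp is evaluated ndim1 + ndim2 times, then `*` builds the matrix
colx  = exp.(1im .* k_x[i] .* x)               # ndim1 exp calls
rowy  = exp.(1im .* k_y[i] .* transpose(y))    # ndim2 exp calls
outer = colx * rowy                            # ndim1*ndim2 multiplications, no exp

# Fully fused broadcast: the nested exp.() calls are re-evaluated for every (a,b) pair,
# i.e. roughly 2*ndim1*ndim2 exp calls
fused = exp.(1im .* k_x[i] .* x) .* exp.(1im .* k_y[i] .* transpose(y))
``````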

My `cf2()` tries to have the best of both worlds, just using broadcasting. `cf4()` adds threading. (TensorCast is just notation, instead of writing `reshape(@view(f[i,:]),1,1,ndim3)` etc. by hand.)

``````
using TensorCast #, Strided

function cf2(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3)
    ekx = zeros(eltype(f), ndim1)  # cache for exp results
    eky = zeros(eltype(f), ndim2)
    for i in axes(f,1)
        @cast ekx[a] = exp(im * k_x[$i] * x[a])  # strided # is slower?
        @cast eky[b] = exp(im * k_y[$i] * y[b])  # strided
        @cast f_out[a,b,c] += f[$i,c] * ekx[a] * eky[b]
    end
    return f_out
end

function cf4(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nthreads = Threads.nthreads()
    ekx = zeros(eltype(f), ndim1, nthreads)                   # per-thread exp caches
    eky = zeros(eltype(f), ndim2, nthreads)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3, nthreads)   # per-thread accumulators
    Threads.@threads for i in axes(f,1)
        tid = Threads.threadid()
        @cast ekx[a,$tid] = exp(im * k_x[$i] * x[a])
        @cast eky[b,$tid] = exp(im * k_y[$i] * y[b])
        @cast f_out[a,b,c,$tid] += f[$i,c] * ekx[a,$tid] * eky[b,$tid]
    end
    @reduce out[a,b,c] := sum(t) f_out[a,b,c,t]
end

cf2(f, k_x, k_y, x, y) ≈ combine_f_mul(f, k_x, k_y, x, y)
cf4(f, k_x, k_y, x, y) ≈ combine_f_mul(f, k_x, k_y, x, y)

@code_warntype cf2(f, k_x, k_y, x, y)
@code_warntype cf4(f, k_x, k_y, x, y)

#== times ==#

julia> @btime combine_f_mul($f, $k_x, $k_y, $x, $y);
682.006 ms (9029 allocations: 1.23 GiB)

julia> @btime cf2($f, $k_x, $k_y, $x, $y);
306.615 ms (3013 allocations: 2.64 MiB)

julia> @btime cf4($f, $k_x, $k_y, $x, $y);
81.502 ms (7088 allocations: 17.89 MiB)
``````

I think that LoopVectorization doesn't like arrays of complex numbers; you could experiment with StructArrays instead (a rough sketch is below, after the VML snippet). I tried using VML, but this was slower:

``````
using IntelVectorMath
...
@cast ekx[a] = (im * k_x[$i] * x[a])
IVM.exp!(ekx)
...
``````
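If you want to go the StructArrays route, something along these lines might work (an untested sketch; `cf_turbo` and all names in it are mine, not from your code). The idea is to keep the 2N complex `exp` evaluations per `i` as ordinary broadcasts, but store their real and imaginary parts in plain `Float64` arrays so the N² accumulation can run as an `@turbo` loop over real numbers:

``````
using LoopVectorization, StructArrays

# Rough sketch: split complex data into real/imag parts so @turbo sees plain Float64 arrays.
function cf_turbo(f, k_x, k_y, x, y)
    ndim1, ndim2, ndim3 = length(x), length(y), size(f,2)
    out_re = zeros(ndim1, ndim2, ndim3)
    out_im = zeros(ndim1, ndim2, ndim3)
    for i in axes(f,1)
        # 2N exp calls per i, stored with separate .re and .im fields
        ekx = StructArray(exp.(1im .* k_x[i] .* x))
        eky = StructArray(exp.(1im .* k_y[i] .* y))
        exr, exi, eyr, eyi = ekx.re, ekx.im, eky.re, eky.im
        for k in 1:ndim3
            fre, fim = real(f[i,k]), imag(f[i,k])
            @turbo for b in 1:ndim2, a in 1:ndim1
                # complex product f[i,k] * ekx[a] * eky[b], expanded into real arithmetic
                pre = exr[a]*eyr[b] - exi[a]*eyi[b]
                pim = exr[a]*eyi[b] + exi[a]*eyr[b]
                out_re[a,b,k] += fre*pre - fim*pim
                out_im[a,b,k] += fre*pim + fim*pre
            end
        end
    end
    return complex.(out_re, out_im)
end
``````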

Yes. If it sees unsupported data types (like complex numbers), it’ll just run the original loop with `@inbounds @fastmath`.
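If I remember correctly, you can check this up front with `LoopVectorization.check_args`, which is what `@turbo` uses internally to decide whether to vectorize or fall back:

``````
using LoopVectorization

LoopVectorization.check_args(rand(ComplexF64, 8))   # false -> @turbo falls back to the plain loop
LoopVectorization.check_args(rand(Float64, 8))      # true  -> @turbo will vectorize
``````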

Thank you for the solution, and for pointing out that the exp terms need to be handled carefully. I will look more into TensorCast.
My time with cf4() is 72 ms! This is why I love Julia!