# Help improving a function that constructs 3D matrix

Hi Everyone!
I am currently porting some functions from Matlab to Julia. Julia has mostly been amazing, except for this one function that I cannot get to run at a speed similar to Matlab's.
It takes a few input vectors and constructs a 3-dimensional array, summing along the first dimension of the input arrays. I wrote three versions: the first uses broadcasting and matrix multiplication, the second a multithreaded loop with broadcasting, and the third a multithreaded explicit loop. What I don't understand is that the explicit loop, which has the fewest allocations, performs the worst.
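In other words, writing what the loops below compute as a formula, for output indices $a$, $b$, $c$:

$$
f_{\mathrm{out}}[a,b,c] \;=\; \sum_{s=1}^{n_{\mathrm{sum}}} f[s,c]\,\exp\!\big(i\,k_x[s]\,x[a]\big)\,\exp\!\big(i\,k_y[s]\,y[b]\big)
$$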
Here is the code:

``````
function combine_f_mul(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nsum = size(f,1)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3)
    y = transpose(y)
    @inbounds for i = 1:nsum
        @fastmath f_out .+= reshape(f[i,:],1,1,ndim3) .* (exp.(1im*k_x[i].*x) * exp.(1im*k_y[i].*y))
    end
    return f_out
end

function combine_f_threads(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nsum = size(f,1)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3)
    for k = 1:ndim3
        Threads.@threads for j = 1:ndim2
            for s = 1:nsum
                @inbounds @fastmath @views f_out[:,j,k] .+= f[s,k]*exp(1im*k_y[s]*y[j]).*exp.(1im*k_x[s].*x)
            end
        end
    end
    return f_out
end

function combine_f_threads_(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nsum = size(f,1)
    f_out = Array{eltype(f),3}(undef, ndim1, ndim2, ndim3)
    Threads.@threads for id in CartesianIndices(f_out)
        i, j, k = Tuple(id)
        acc = zero(eltype(f))
        @simd for s = 1:nsum
            @inbounds @fastmath acc += f[s,k]*exp(1im*k_x[s]*x[i])*exp(1im*k_y[s]*y[j])
        end
        f_out[id] = acc
    end
    return f_out
end

using BenchmarkTools

fsize = 1003;
f = rand(ComplexF64, fsize, 2);
k_x = rand(ComplexF64, fsize); k_y = rand(ComplexF64, fsize);
x = collect(1:403); y = collect(1:203);

julia> @btime combine_f_mul($f, $k_x, $k_y, $x, $y);
492.731 ms (9029 allocations: 1.23 GiB)

julia> @btime combine_f_threads($f, $k_x, $k_y, $x, $y);
412.851 ms (814525 allocations: 52.22 MiB)

julia> @btime combine_f_threads_($f, $k_x, $k_y, $x, $y);
997.155 ms (46 allocations: 2.50 MiB)
``````

original Matlab version:

``````
function f_out = combine_f(f, k_x, k_y, x, y)
    f_out = zeros(length(x), length(y), size(f,2)) ;
    y = y.' ;
    f = reshape(f, size(f,1), 1, size(f,2)) ;
    for i = 1 : size(f,1)
        f_out = f_out + f(i,1,:).*exp(1i*k_x(i)*x).*exp(1i*k_y(i)*y) ;
    end
end
``````

However, the above Matlab function with elementwise operations does the job in only 179.2 ms (more than twice as fast as my Julia implementations).
I also tried other packages such as Strided, LoopVectorization, and Tullio; those did not help much either (a naive Tullio version is shown below for reference). The CPU does not support AVX512.
Any help is appreciated!
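For reference, the naive Tullio formulation I mean is roughly this one-liner (not necessarily exactly what I benchmarked); note that `exp` ends up inside the innermost loops, so it is evaluated for every combination of indices rather than being cached:

``````
using Tullio

# s appears only on the right-hand side, so it is summed over
@tullio f_out[a,b,c] := f[s,c] * exp(im*k_x[s]*x[a]) * exp(im*k_y[s]*y[b])
``````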

Notice that your first version constructs a vector `exp.(1im*k_x[i].*x)` and a transposed vector `exp.(1im*k_y[i].*y)`, and then multiplies them with `*` to make a new matrix, on every iteration. But replacing that `*` with `.*` is much slower, because then the `exp.()` broadcasts get fused into the larger one, and `exp` is run N^2 times instead of 2N times per `i`. I think your explicit loop version has the same issue.
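To make the difference concrete, here is a small sketch (variable names are mine) of the two formulations for a single `i`:

``````
# Outer product: exp is evaluated ndim1 + ndim2 times, then `*` builds the matrix
colx  = exp.(1im .* k_x[i] .* x)               # ndim1 exp calls
rowy  = exp.(1im .* k_y[i] .* transpose(y))    # ndim2 exp calls
outer = colx * rowy                            # ndim1*ndim2 multiplications, no exp

# Fully fused broadcast: the nested exp.() calls are re-evaluated for every (a,b) pair,
# i.e. roughly 2*ndim1*ndim2 exp calls
fused = exp.(1im .* k_x[i] .* x) .* exp.(1im .* k_y[i] .* transpose(y))
``````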

My `cf2()` tries to have the best of both worlds, just using broadcasting. `cf4()` adds threading. (TensorCast is just notation, instead of writing `reshape(@view(f[i,:]),1,1,ndim3)` etc. by hand.)

``````
using TensorCast #, Strided

function cf2(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3)
    ekx = zeros(eltype(f), ndim1)  # cache for exp results
    eky = zeros(eltype(f), ndim2)
    for i in axes(f,1)
        @cast ekx[a] = exp(im * k_x[$i] * x[a])  # strided # is slower?
        @cast eky[b] = exp(im * k_y[$i] * y[b])  # strided
        @cast f_out[a,b,c] += f[$i,c] * ekx[a] * eky[b]
    end
    return f_out
end

function cf4(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nthreads = Threads.nthreads()
    ekx = zeros(eltype(f), ndim1, nthreads)                   # per-thread exp caches
    eky = zeros(eltype(f), ndim2, nthreads)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3, nthreads)   # per-thread accumulators
    Threads.@threads for i in axes(f,1)
        tid = Threads.threadid()
        @cast ekx[a,$tid] = exp(im * k_x[$i] * x[a])
        @cast eky[b,$tid] = exp(im * k_y[$i] * y[b])
        @cast f_out[a,b,c,$tid] += f[$i,c] * ekx[a,$tid] * eky[b,$tid]
    end
    @reduce out[a,b,c] := sum(t) f_out[a,b,c,t]
end

cf2(f, k_x, k_y, x, y) ≈ combine_f_mul(f, k_x, k_y, x, y)
cf4(f, k_x, k_y, x, y) ≈ combine_f_mul(f, k_x, k_y, x, y)

@code_warntype cf2(f, k_x, k_y, x, y)
@code_warntype cf4(f, k_x, k_y, x, y)

#== times ==#

julia> @btime combine_f_mul($f, $k_x, $k_y, $x, $y);
682.006 ms (9029 allocations: 1.23 GiB)

julia> @btime cf2($f, $k_x, $k_y, $x, $y);
306.615 ms (3013 allocations: 2.64 MiB)

julia> @btime cf4($f, $k_x, $k_y, $x, $y);
81.502 ms (7088 allocations: 17.89 MiB)
``````

I think that LoopVectorization doesn't like arrays of complex numbers; you could experiment with StructArrays instead (a rough sketch is below, after the VML snippet). I tried using VML, but this was slower:

``````
using IntelVectorMath
...
@cast ekx[a] = (im * k_x[$i] * x[a])
IVM.exp!(ekx)
...
``````
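If you want to go the StructArrays route, something along these lines might work (an untested sketch; `cf_turbo` and all names in it are mine, not from your code). The idea is to keep the 2N complex `exp` evaluations per `i` as ordinary broadcasts, but store their real and imaginary parts in plain `Float64` arrays so the N² accumulation can run as an `@turbo` loop over real numbers:

``````
using LoopVectorization, StructArrays

# Rough sketch: split complex data into real/imag parts so @turbo sees plain Float64 arrays.
function cf_turbo(f, k_x, k_y, x, y)
    ndim1, ndim2, ndim3 = length(x), length(y), size(f,2)
    out_re = zeros(ndim1, ndim2, ndim3)
    out_im = zeros(ndim1, ndim2, ndim3)
    for i in axes(f,1)
        # 2N exp calls per i, stored with separate .re and .im fields
        ekx = StructArray(exp.(1im .* k_x[i] .* x))
        eky = StructArray(exp.(1im .* k_y[i] .* y))
        exr, exi, eyr, eyi = ekx.re, ekx.im, eky.re, eky.im
        for k in 1:ndim3
            fre, fim = real(f[i,k]), imag(f[i,k])
            @turbo for b in 1:ndim2, a in 1:ndim1
                # complex product f[i,k] * ekx[a] * eky[b], expanded into real arithmetic
                pre = exr[a]*eyr[b] - exi[a]*eyi[b]
                pim = exr[a]*eyi[b] + exi[a]*eyr[b]
                out_re[a,b,k] += fre*pre - fim*pim
                out_im[a,b,k] += fre*pim + fim*pre
            end
        end
    end
    return complex.(out_re, out_im)
end
``````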

Yes. If it sees unsupported data types (like complex numbers), it’ll just run the original loop with `@inbounds @fastmath`.
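If I remember correctly, you can check this up front with `LoopVectorization.check_args`, which is what `@turbo` uses internally to decide whether to vectorize or fall back:

``````
using LoopVectorization

LoopVectorization.check_args(rand(ComplexF64, 8))   # false -> @turbo falls back to the plain loop
LoopVectorization.check_args(rand(Float64, 8))      # true  -> @turbo will vectorize
``````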

Thank you for the solution, and for pointing out that the exp terms need to be handled carefully. I will look more into TensorCast.
My time with cf4() is 72 ms! This is why I love Julia!