# Help improving a function that constructs 3D matrix

Hi Everyone!
I am currently porting some functions from Matlab to Julia. Julia has mostly been amazing, except for this one function that I cannot get to run at a speed similar to Matlab's.
It takes a few input vectors and constructs a 3-dimensional array, summing along the first dimension of the input arrays. I wrote three versions: the first uses broadcasting and matrix multiplication, the second a multithreaded loop with broadcasting, and the third a multithreaded explicit loop. What I don't understand is that the explicit loop, which has the fewest allocations, performs the worst.
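In other words, writing what the loops below compute as a formula, for output indices $a$, $b$, $c$:

$$
f_{\mathrm{out}}[a,b,c] \;=\; \sum_{s=1}^{n_{\mathrm{sum}}} f[s,c]\,\exp\!\big(i\,k_x[s]\,x[a]\big)\,\exp\!\big(i\,k_y[s]\,y[b]\big)
$$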
Here is the code:

``````
function combine_f_mul(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nsum = size(f,1)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3)
    y = transpose(y)
    @inbounds for i = 1:nsum
        @fastmath f_out .+= reshape(f[i,:],1,1,ndim3) .* (exp.(1im*k_x[i].*x) * exp.(1im*k_y[i].*y))
    end
    return f_out
end

function combine_f_threads(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nsum = size(f,1)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3)
    for k = 1:ndim3
        Threads.@threads for j = 1:ndim2
            for s = 1:nsum
                @inbounds @fastmath @views f_out[:,j,k] .+= f[s,k]*exp(1im*k_y[s]*y[j]).*exp.(1im*k_x[s].*x)
            end
        end
    end
    return f_out
end

function combine_f_threads_(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nsum = size(f,1)
    f_out = Array{eltype(f),3}(undef, ndim1, ndim2, ndim3)
    Threads.@threads for id in CartesianIndices(f_out)
        i, j, k = Tuple(id)
        acc = zero(eltype(f))
        @simd for s = 1:nsum
            @inbounds @fastmath acc += f[s,k]*exp(1im*k_x[s]*x[i])*exp(1im*k_y[s]*y[j])
        end
        f_out[id] = acc
    end
    return f_out
end

using BenchmarkTools

fsize = 1003;
f = rand(ComplexF64, fsize, 2);
k_x = rand(ComplexF64, fsize); k_y = rand(ComplexF64, fsize);
x = collect(1:403); y = collect(1:203);

julia> @btime combine_f_mul($f, $k_x, $k_y, $x, $y);
492.731 ms (9029 allocations: 1.23 GiB)

julia> @btime combine_f_threads($f, $k_x, $k_y, $x, $y);
412.851 ms (814525 allocations: 52.22 MiB)

julia> @btime combine_f_threads_($f, $k_x, $k_y, $x, $y);
997.155 ms (46 allocations: 2.50 MiB)
``````

original Matlab version:

``````
function f_out = combine_f(f, k_x, k_y, x, y)
    f_out = zeros(length(x), length(y), size(f,2)) ;
    y = y.' ;
    f = reshape(f, size(f,1), 1, size(f,2)) ;
    for i = 1 : size(f,1)
        f_out = f_out + f(i,1,:).*exp(1i*k_x(i)*x).*exp(1i*k_y(i)*y) ;
    end
end
``````

However, the above Matlab function with elementwise operations does the job in only 179.2 ms (more than twice as fast as my Julia implementations).
I also tried other packages such as Strided, LoopVectorization, and Tullio; those did not help much either (a naive Tullio version is shown below for reference). The CPU does not support AVX512.
Any help is appreciated!
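For reference, the naive Tullio formulation I mean is roughly this one-liner (not necessarily exactly what I benchmarked); note that `exp` ends up inside the innermost loops, so it is evaluated for every combination of indices rather than being cached:

``````
using Tullio

# s appears only on the right-hand side, so it is summed over
@tullio f_out[a,b,c] := f[s,c] * exp(im*k_x[s]*x[a]) * exp(im*k_y[s]*y[b])
``````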

Notice that your first version constructs a vector `exp.(1im*k_x[i].*x)` and a transposed vector `exp.(1im*k_y[i].*y)`, and then multiplies them with `*` to make a new matrix, on every iteration. But replacing that `*` with `.*` is much slower, because then the `exp.()` broadcasts get fused into the larger one, and `exp` is run N^2 times instead of 2N times per `i`. I think your explicit loop version has the same issue.
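To make the difference concrete, here is a small sketch (variable names are mine) of the two formulations for a single `i`:

``````
# Outer product: exp is evaluated ndim1 + ndim2 times, then `*` builds the matrix
colx  = exp.(1im .* k_x[i] .* x)               # ndim1 exp calls
rowy  = exp.(1im .* k_y[i] .* transpose(y))    # ndim2 exp calls
outer = colx * rowy                            # ndim1*ndim2 multiplications, no exp

# Fully fused broadcast: the nested exp.() calls are re-evaluated for every (a,b) pair,
# i.e. roughly 2*ndim1*ndim2 exp calls
fused = exp.(1im .* k_x[i] .* x) .* exp.(1im .* k_y[i] .* transpose(y))
``````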

My `cf2()` tries to have the best of both worlds, just using broadcasting. `cf4()` adds threading. (TensorCast is just notation, instead of writing `reshape(@view(f[i,:]),1,1,ndim3)` etc. by hand.)

``````
using TensorCast #, Strided

function cf2(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3)
    ekx = zeros(eltype(f), ndim1)  # cache for exp results
    eky = zeros(eltype(f), ndim2)
    for i in axes(f,1)
        @cast ekx[a] = exp(im * k_x[$i] * x[a])  # strided # is slower?
        @cast eky[b] = exp(im * k_y[$i] * y[b])  # strided
        @cast f_out[a,b,c] += f[$i,c] * ekx[a] * eky[b]
    end
    return f_out
end

function cf4(f, k_x, k_y, x, y)
    ndim1 = length(x)
    ndim2 = length(y)
    ndim3 = size(f,2)
    nthreads = Threads.nthreads()
    ekx = zeros(eltype(f), ndim1, nthreads)                   # per-thread exp caches
    eky = zeros(eltype(f), ndim2, nthreads)
    f_out = zeros(eltype(f), ndim1, ndim2, ndim3, nthreads)   # per-thread accumulators
    Threads.@threads for i in axes(f,1)
        tid = Threads.threadid()
        @cast ekx[a,$tid] = exp(im * k_x[$i] * x[a])
        @cast eky[b,$tid] = exp(im * k_y[$i] * y[b])
        @cast f_out[a,b,c,$tid] += f[$i,c] * ekx[a,$tid] * eky[b,$tid]
    end
    @reduce out[a,b,c] := sum(t) f_out[a,b,c,t]
end

cf2(f, k_x, k_y, x, y) ≈ combine_f_mul(f, k_x, k_y, x, y)
cf4(f, k_x, k_y, x, y) ≈ combine_f_mul(f, k_x, k_y, x, y)

@code_warntype cf2(f, k_x, k_y, x, y)
@code_warntype cf4(f, k_x, k_y, x, y)

#== times ==#

julia> @btime combine_f_mul($f, $k_x, $k_y, $x, $y);
682.006 ms (9029 allocations: 1.23 GiB)

julia> @btime cf2($f, $k_x, $k_y, $x, $y);
306.615 ms (3013 allocations: 2.64 MiB)

julia> @btime cf4($f, $k_x, $k_y, $x, $y);
81.502 ms (7088 allocations: 17.89 MiB)
``````

I think that LoopVectorization doesn't like arrays of complex numbers; you could experiment with StructArrays instead (a rough sketch is below, after the VML snippet). I tried using VML, but this was slower:

``````
using IntelVectorMath
...
@cast ekx[a] = (im * k_x[$i] * x[a])
IVM.exp!(ekx)
...
``````
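If you want to go the StructArrays route, something along these lines might work (an untested sketch; `cf_turbo` and all names in it are mine, not from your code). The idea is to keep the 2N complex `exp` evaluations per `i` as ordinary broadcasts, but store their real and imaginary parts in plain `Float64` arrays so the N² accumulation can run as an `@turbo` loop over real numbers:

``````
using LoopVectorization, StructArrays

# Rough sketch: split complex data into real/imag parts so @turbo sees plain Float64 arrays.
function cf_turbo(f, k_x, k_y, x, y)
    ndim1, ndim2, ndim3 = length(x), length(y), size(f,2)
    out_re = zeros(ndim1, ndim2, ndim3)
    out_im = zeros(ndim1, ndim2, ndim3)
    for i in axes(f,1)
        # 2N exp calls per i, stored with separate .re and .im fields
        ekx = StructArray(exp.(1im .* k_x[i] .* x))
        eky = StructArray(exp.(1im .* k_y[i] .* y))
        exr, exi, eyr, eyi = ekx.re, ekx.im, eky.re, eky.im
        for k in 1:ndim3
            fre, fim = real(f[i,k]), imag(f[i,k])
            @turbo for b in 1:ndim2, a in 1:ndim1
                # complex product f[i,k] * ekx[a] * eky[b], expanded into real arithmetic
                pre = exr[a]*eyr[b] - exi[a]*eyi[b]
                pim = exr[a]*eyi[b] + exi[a]*eyr[b]
                out_re[a,b,k] += fre*pre - fim*pim
                out_im[a,b,k] += fre*pim + fim*pre
            end
        end
    end
    return complex.(out_re, out_im)
end
``````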

Yes. If it sees unsupported data types (like complex numbers), it’ll just run the original loop with `@inbounds @fastmath`.
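If I remember correctly, you can check this up front with `LoopVectorization.check_args`, which is what `@turbo` uses internally to decide whether to vectorize or fall back:

``````
using LoopVectorization

LoopVectorization.check_args(rand(ComplexF64, 8))   # false -> @turbo falls back to the plain loop
LoopVectorization.check_args(rand(Float64, 8))      # true  -> @turbo will vectorize
``````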

Thank you for the solution, and for pointing out that the exp terms need to be handled carefully. I will look more into TensorCast.
My time with cf4() is 72 ms! This is why I love Julia!