Sum operations between arrays

Hello there,

I was wondering if you knew how to efficiently compute the following matrices in Julia. In Matlab I am getting the result 3-6x faster than in Julia, which makes me think that there is room for improvement. Since I need to compute these thousands of times in my code, I am really interested in optimizing this bit…

So, here is the computation in Matlab. I take advantage of the fact that Matlab lets me perform the “all-against-all” summation by adding an Nx1 column vector to a 1xN row vector, which gives the NxN matrix. In Julia, as far as I know, this is not possible.

function h = myfunc(v,vp)
% Both v and vp are column vectors

f = (v.^(1/3)+(vp.^(1/3))').^2.*((v.^(2/9)+(vp.^(2/9))').^(1/2)); 
g = exp(-(((v*vp').^(1/3))./(v.^(1/3)+(vp.^(1/3))')).^4);
h = f.*g;

end

I test the function with the following:

x=linspace(1,1000,1000)';
tic
h = myfunc(x,x);
toc

Result:

Elapsed time is 0.093312 seconds.

Now, in Julia, since I cannot perform the NxN summation this way, I can only think of a for-loop. So I use two functions:

function product(v,vp)
    f = zeros(Float64,length(v),length(vp));
    g = zeros(Float64,length(v),length(vp));
    for i in eachindex(v), j in eachindex(vp)
        @inbounds f[j,i] = (v[i]^(1/3)+(vp[j]^(1/3)))^2 *((v[i]^(2/9)+(vp[j]^(2/9)))^(1/2))*2^(1/3);
        @inbounds g[j,i] = exp(-(((v[i]*vp[j])^(1/3))/(v[i]^(1/3)+(vp[j]^(1/3))))^4);
    end
    return f, g
end
function myfun(v,vp)
    f, g = product(v,vp);
    h = @. f*g;
    return h
end

I test with the same input I used in Matlab (x=collect(1:1000)*1.0) and, using BenchmarkTools, I get:

596.179 ms (8 allocations: 22.89 MiB)

So Matlab is 6.3 times faster.

Now, I realize that, since I am evaluating this function using the same vector for both inputs, the result is a symmetric matrix. So I can fill only one triangle and use Symmetric from the LinearAlgebra standard library:

using LinearAlgebra: Symmetric

function product2(v,vp)
    f = zeros(Float64,length(v),length(vp));
    g = zeros(Float64,length(v),length(vp));
    for i in eachindex(v)
        for j = 1:i  # fill only the upper triangle
            @inbounds f[j,i] = (v[i]^(1/3)+(vp[j]^(1/3)))^2 *((v[i]^(2/9)+(vp[j]^(2/9)))^(1/2))*2^(1/3);
            @inbounds g[j,i] = exp(-(((v[i]*vp[j])^(1/3))/(v[i]^(1/3)+(vp[j]^(1/3))))^4);
        end
    end
    f = Symmetric(f);
    g = Symmetric(g);
    return f, g
end

which gives:

293.099 ms (10 allocations: 22.89 MiB)

Better, but still 3x slower than Matlab.

I would really appreciate any thoughts on this.

3 Likes

Sure it is. That’s broadcasting! Just use .+, or simply

f = @. (v^(1/3)+(vp^(1/3))')^2 * ((v^(2/9)+(vp^(2/9))')^(1/2))
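
For example, broadcasting a vector against a transposed vector gives exactly the “all-against-all” matrix:

julia> (1:3) .+ (1:3)'
3×3 Array{Int64,2}:
 2  3  4
 3  4  5
 4  5  6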

However, this is one of those cases where loop fusion may hurt you (see also #20875) — you end up computing things like v^(1/3) N² times instead of N times, whereas Matlab’s non-fused vector operations compute these powers only once per element. Your nested-loop implementation has the same problem (but it’s even more obvious there because the loops are explicit) — you are computing all of your v and vp powers N² times.

So it might be faster to hoist those computations out of the inner loops, even at the expense of allocating some temporary arrays:

function product3(v, vp)
    v_1_3 = cbrt.(v)
    vp_1_3 = cbrt.(vp)
    v_2_9 = cbrt.(v_1_3).^2
    vp_2_9 = cbrt.(vp_1_3).^2
    f = @. (v_1_3 + vp_1_3')^2 * sqrt(v_2_9 + vp_2_9')
    g = @. exp(-((v_1_3 * vp_1_3') / (v_1_3 + vp_1_3'))^4)
    return f, g
end

(Note that I would tend to use sqrt and cbrt rather than ^(1/2) and ^(1/3), respectively; the v^(2/9) terms are computed as cbrt.(v_1_3).^2, since (v^(1/9))^2 = v^(2/9). Note also that your Julia implementation includes a * 2^(1/3) factor that is not in your Matlab code.)

On my machine, this is 5x faster than your initial two-loop version:

julia> @btime product($x,$x);
  202.658 ms (5 allocations: 15.26 MiB)

julia> @btime product3($x,$x);
  42.112 ms (21 allocations: 15.29 MiB)

By the way, is Matlab using multiple threads? Try running Matlab with -singleCompThread to see what its single-threaded performance is, so that you are comparing apples to apples.

(Conversely, you can use threads in Julia too with Threads.@threads for ... end, but it’s good to optimize the single-threaded performance before you start messing with parallelism.)
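
A minimal sketch of what that could look like, threading the outer loop of your original product (product_threaded is an illustrative name; Julia must be started with multiple threads, e.g. JULIA_NUM_THREADS=4, for it to help):

function product_threaded(v, vp)
    # Sketch: the original nested-loop product with Threads.@threads on the
    # outer loop. Each thread gets a contiguous chunk of columns.
    f = zeros(Float64, length(vp), length(v))
    g = zeros(Float64, length(vp), length(v))
    Threads.@threads for i in eachindex(v)
        @inbounds for j in eachindex(vp)
            f[j,i] = (v[i]^(1/3) + vp[j]^(1/3))^2 * (v[i]^(2/9) + vp[j]^(2/9))^(1/2) * 2^(1/3)
            g[j,i] = exp(-((v[i]*vp[j])^(1/3) / (v[i]^(1/3) + vp[j]^(1/3)))^4)
        end
    end
    return f, g
end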

11 Likes

I was able to save another 2 ms (5%) on my machine by writing out the loops in order to eliminate the v_1_3 and v_2_9 array allocations (the v powers become scalars hoisted into the outer loop):

function product4(v, vp)
    vp_1_3 = cbrt.(vp)
    vp_2_9 = cbrt.(vp_1_3).^2
    # compute the output type rather than hard-coding Float64, to be more type-generic
    f = Matrix{typeof(zero(eltype(vp_1_3)) + cbrt(zero(eltype(v))))}(undef, length(vp),length(v))
    g = similar(f)
    @inbounds for i in eachindex(v)
        v_1_3 = cbrt(v[i]); v_2_9 = cbrt(v_1_3)^2
        for j in eachindex(vp)
            f[j,i] = (v_1_3 + vp_1_3[j])^2 * sqrt(v_2_9 + vp_2_9[j])
            g[j,i] = exp(-((v_1_3 * vp_1_3[j]) / (v_1_3 + vp_1_3[j]))^4)
        end
    end
    return f, g
end

(and of course you can still save another factor of 2 by specializing for the symmetric case).
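
A minimal sketch of that symmetric specialization, assuming both inputs are the same vector (product4_sym is an illustrative name):

function product4_sym(v)
    # Sketch: fill only the upper triangle (j <= i in column-major storage)
    # and let Symmetric mirror it, roughly halving the inner-loop work.
    v_1_3 = cbrt.(v)
    v_2_9 = cbrt.(v_1_3).^2
    f = Matrix{Float64}(undef, length(v), length(v))
    g = similar(f)
    @inbounds for i in eachindex(v), j in 1:i
        f[j,i] = (v_1_3[i] + v_1_3[j])^2 * sqrt(v_2_9[i] + v_2_9[j])
        g[j,i] = exp(-((v_1_3[i] * v_1_3[j]) / (v_1_3[i] + v_1_3[j]))^4)
    end
    return Symmetric(f), Symmetric(g)  # default uplo = :U matches the filled triangle
end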

7 Likes

This also gets a nice boost from LoopVectorization. (Although perhaps this should come with a warning that it’s slightly experimental, unlike the above. For instance it appears not to like powers, right now.)

julia> using LoopVectorization

julia> function product6(v, vp)
           vp_1_3 = cbrt.(vp)
           vp_2_9 = cbrt.(vp_1_3).^2
           f = Matrix{Float64}(undef, length(vp),length(v))
           g = similar(f)
           @avx for i in eachindex(v)
               v_1_3 = cbrt(v[i])
               c = cbrt(v_1_3)
               v_2_9 = c * c
               for j in eachindex(vp)
                       d = (v_1_3 + vp_1_3[j])
                   f[j,i] = d * d * sqrt(v_2_9 + vp_2_9[j])
                   a = (v_1_3 * vp_1_3[j]) / (v_1_3 + vp_1_3[j])
                   b = a * a
                   g[j,i] = exp(-b * b)
               end
           end
           return f, g
       end

julia> @btime product4($x, $x); # stevengj's function above
  40.186 ms (7 allocations: 15.27 MiB)

julia> @btime product6($x, $x); # using LoopVectorization
  3.881 ms (7 allocations: 15.27 MiB)

julia> product4(x, x)[1] ≈ product6(x, x)[1]
true

In fact, Julia’s own vectorisation gets you half-way there. Writing out the powers (b = a * a etc.) as in product6, but with plain @inbounds instead of @avx (a sketch of this product8 is below), I get this:

julia> @btime product8($x, $x);
  13.986 ms (7 allocations: 15.27 MiB)
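
(product8 wasn’t shown above; a plausible sketch, assuming it is simply product6 with @inbounds in place of @avx:)

# Hypothetical reconstruction of product8: the hand-expanded powers of
# product6, but with plain @inbounds instead of LoopVectorization's @avx.
function product8(v, vp)
    vp_1_3 = cbrt.(vp)
    vp_2_9 = cbrt.(vp_1_3).^2
    f = Matrix{Float64}(undef, length(vp), length(v))
    g = similar(f)
    @inbounds for i in eachindex(v)
        v_1_3 = cbrt(v[i])
        c = cbrt(v_1_3)
        v_2_9 = c * c
        for j in eachindex(vp)
            d = v_1_3 + vp_1_3[j]
            f[j,i] = d * d * sqrt(v_2_9 + vp_2_9[j])
            a = (v_1_3 * vp_1_3[j]) / d
            b = a * a
            g[j,i] = exp(-b * b)
        end
    end
    return f, g
end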
7 Likes

Hey, thanks a lot.

Sure it is. That’s broadcasting! Just use .+,

I must say, I am surprised… I just assumed that .+ would only do the elementwise sum, not broadcast out to the entire matrix! Great to know!

However, this is one of those cases where loop fusion may hurt you (see also #20875) — you end up computing things like v^(1/3) N² times instead of N times, whereas Matlab’s non-fused vector operations compute these powers only once per element. Your nested-loop implementation has the same problem (but it’s even more obvious there because the loops are explicit) — you are computing all of your v and vp powers N² times.

Great! Thanks for the info!

So it might be faster to hoist those computations out of the inner loops, even at the expense of allocating some temporary arrays:

Fantastic! Just tried it out and worked perfectly (30 ms).

Thank you a lot!

1 Like

You don’t need to write out the powers, just use @fastmath. I did this on the for j loop, i.e. writing @fastmath for j in eachindex(vp), and I got

julia> @btime product4($x,$x);
  13.455 ms (7 allocations: 15.27 MiB)

The reason for this is that computing x^4 by (x^2)^2 is slightly less accurate, so it is not done by default. @fastmath switches over to faster (but less accurate) small-integer powers. (See also the discussion in #20637.)
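
Concretely, the change amounts to this (a sketch under an illustrative name, product4_fm; only the inner loop gains @fastmath):

function product4_fm(v, vp)
    # Sketch: product4 with @fastmath on the inner loop, so literal powers
    # like ^2 and ^4 become Base.FastMath.pow_fast (power-by-squaring).
    vp_1_3 = cbrt.(vp)
    vp_2_9 = cbrt.(vp_1_3).^2
    f = Matrix{Float64}(undef, length(vp), length(v))
    g = similar(f)
    @inbounds for i in eachindex(v)
        v_1_3 = cbrt(v[i]); v_2_9 = cbrt(v_1_3)^2
        @fastmath for j in eachindex(vp)
            f[j,i] = (v_1_3 + vp_1_3[j])^2 * sqrt(v_2_9 + vp_2_9[j])
            g[j,i] = exp(-((v_1_3 * vp_1_3[j]) / (v_1_3 + vp_1_3[j]))^4)
        end
    end
    return f, g
end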

3 Likes

Anyone know why the @avx macro chokes on literal powers? Is it just ‘not implemented yet’?

mcabbot filed this issue:

https://github.com/chriselrod/LoopVectorization.jl/issues/85

It is a case of “not implemented yet”, but it’s also not the trivial problem of needing to add missing methods. LoopVectorization will need special-case handling for ^, just like Julia’s lowering has (ordinary Julia lowers a ^ 2 to Base.literal_pow(^, a, Val(2)), which is what lets the compiler specialize on a literal exponent).

3 Likes

Did you try it with @fastmath as noted above?

I added support in LoopVectorization 0.6.29, so it should work now. Please file new issues if any more problems come up!

That @fastmath produces pow_fast(x, ::Val{p}) is neat:

julia> @macroexpand @fastmath a ^ 2
:(Base.FastMath.pow_fast(a, Val{2}()))

I implemented the special handling of ^ to directly insert the power-by-squaring if the exponent is a literal, to improve cost modeling. Modeling that one pow_fast call has a different cost from another would require a larger change to the package.
a ^ b is expensive in terms of time and register use (you probably wouldn’t want to unroll the loops at all), but a ^ 0, a ^ 1, and a ^ 2 are cheap, and a ^ -1 isn’t all that expensive (and doesn’t cost any extra registers).

6 Likes

I tried out the new Threads.@spawn functionality, and made a special implementation for the symmetric case:


using Base.Threads: @spawn, @sync
using LinearAlgebra: Symmetric

function tproduct(x, y)
    Y3 = cbrt.(y)
    Y9 = cbrt.(Y3).^2
    H = Matrix{eltype(x)}(undef, length(y), length(x))  # H[j,i]: j runs over y, i over x
    @fastmath @sync for i in eachindex(x)
        @spawn begin
            x3 = cbrt(x[i])
            x9 = cbrt(x3)^2
            for j in eachindex(y)
                A = x3 + Y3[j]
                B = x3 * Y3[j]
                C = x9 + Y9[j]
                f = A^2 * sqrt(C)
                g = exp(-(B / A)^4)
                H[j, i] = f * g
            end
        end
    end
    return H
end

# symmetric case
function tproduct(x)
    X3 = cbrt.(x)
    X9 = cbrt.(X3).^2
    H = Matrix{eltype(x)}(undef, length(x), length(x))
    @fastmath @sync for i in eachindex(x)
        @spawn begin
            x3 = X3[i]
            x9 = X9[i]
            for j in firstindex(x):i
                A = x3 + X3[j]
                B = x3 * X3[j]
                C = x9 + X9[j]
                f = A^2 * sqrt(C)
                g = exp(-(B / A)^4)
                H[j, i] = f * g
            end
        end
    end
    return Symmetric(H)
end

Comparing it to the @avx version, using 4 threads:

julia> @btime product6(x,x) setup=(x=rand(1000));
  6.499 ms (4 allocations: 7.64 MiB)

julia> @btime tproduct(x, x) setup=(x=rand(1000));
  3.689 ms (6015 allocations: 8.36 MiB)

julia> @btime tproduct(x) setup=(x=rand(1000));
  1.990 ms (6016 allocations: 8.35 MiB)

The last, symmetric, case is almost 30x as fast as the Matlab code, also with 4 threads, though the Matlab code does not exploit symmetry. Using @spawn rather than @threads is especially advantageous for the symmetric case: the columns do very different amounts of work (column i runs only i inner iterations), and the scheduler can balance many small spawned tasks across threads, whereas @threads would split the outer range into equal-length (and thus unequal-work) chunks.

BTW: how’s the interaction between the @avx macro and multithreading? Can they co-exist, or does @avx already exploit multiple threads?

4 Likes

Yep, @avx plays nice with threads. Here’s an example on 6 threads:

using Base.Threads: @spawn, @sync, nthreads
using LinearAlgebra: Symmetric
using LoopVectorization: @avx

function tproduct_avx(x, y)
    @fastmath begin
        Y3 = cbrt.(y)
        Y9 = cbrt.(Y3).^2
    end
    lx = length(x)
    ly = length(y)
    H = Matrix{eltype(x)}(undef, lx, ly)
    
    @sync for x_index_chunk in Iterators.partition(eachindex(x), lx ÷ 2nthreads()) 
        @spawn begin
            @avx for i ∈ x_index_chunk
                x3 = cbrt(x[i])
                x9 = cbrt(x3)^2
                for j in eachindex(y)
                    A = x3 + Y3[j]
                    B = x3 * Y3[j]
                    C = x9 + Y9[j]
                    f = A^2 * sqrt(C)
                    g = exp(-(B/A)^4)
                    H[j, i] = f * g
                end
            end
        end
    end
    return H
end
julia> nthreads()
6

julia> @btime tproduct(x, x) setup=(x=rand(1000));
  1.821 ms (6015 allocations: 8.41 MiB)

julia> @btime tproduct_avx(x, x) setup=(x=rand(1000));
  761.404 μs (100 allocations: 7.66 MiB)

LoopVectorization.jl doesn’t yet know how to do triangular loops, so the symmetric case would currently require some gymnastics, but this is on the issue tracker as something that is intended to be supported eventually: Support for ‘triangular’ loops · Issue #77 · JuliaSIMD/LoopVectorization.jl · GitHub


Edit: I changed an error in the implementation of tproduct_avx

3 Likes

Very cool.

Unfortunately, it doesn’t yet handle ‘triangular’ iteration, as far as I understand, so we can’t leverage the advantages of the symmetric case. Edit: what you said, I mean.

Okay, so it turns out that for this case, one doesn’t really need to care about vectorizing the outer loop, so this implementation of the Symmetric case with LoopVectorization is able to get the best of both worlds:

using Base.Threads: @spawn, @sync, nthreads
using LinearAlgebra: Symmetric
using LoopVectorization: @avx

function tproduct_avx(x)
    X3 = @avx cbrt.(x)
    X9 = @avx cbrt.(X3).^2
    lx = length(x)
    H = Matrix{eltype(x)}(undef, lx, lx)
    @sync for x_index_chunk in Iterators.partition(eachindex(x), lx ÷ 2nthreads()) 
        @spawn begin
            for i ∈ x_index_chunk
                x3 = X3[i]
                x9 = X9[i]
                js = firstindex(x):i
                @avx for j in js
                    A = x3 + X3[j]
                    B = x3 * X3[j]
                    C = x9 + X9[j]
                    f = A^2 * sqrt(C)
                    g = exp(-(B/A)^4)
                    H[j, i] = f * g
                end
            end
        end
    end
    return Symmetric(H)
end
julia> @btime tproduct(x) setup=(x=rand(1000));
  1.041 ms (6016 allocations: 8.39 MiB)

julia> @btime tproduct_avx(x) setup=(x=rand(1000));
  404.577 μs (88 allocations: 7.66 MiB)

Hats off to @Elrod, the man is a wizard.

13 Likes

boy, that escalated quickly

15 Likes

One question, though. What’s the purpose of the explicit chunking? I’m getting better performance (>10%) with just ordinary eachindex:


function tproduct(x)
    X3 = cbrt.(x)
    X9 = cbrt.(X3).^2
    H = Matrix{eltype(x)}(undef, length(x), length(x))
    @sync for i in eachindex(x)
        @spawn begin
            x3 = X3[i]
            x9 = X9[i]
            @avx for j in firstindex(x):i
                A = x3 + X3[j]
                B = x3 * X3[j]
                C = x9 + X9[j]
                f = A^2 * sqrt(C)
                g = exp(-(B / A)^4)
                H[j, i] = f * g
            end
        end
    end
    return Symmetric(H)
end

Edit: BTW, I’ve realized that, even though there are eachindex and firstindex in here, this won’t work too well for inputs with non-standard indexing, since I’m using the ordinary Matrix type, and using the same indices for inputs and outputs. So the code is not as generic as it looks.

Because the early iterations of for i in eachindex(x) are much faster than the later iterations, evenly splitting them up is probably not the way to go. However, that still seems to be a lot faster on my computer than @spawn on every iteration.
Your tproduct:

julia> @benchmark tproduct($x)
BenchmarkTools.Trial:
  memory estimate:  8.39 MiB
  allocs estimate:  6015
  --------------
  minimum time:     744.784 μs (0.00% GC)
  median time:      831.465 μs (0.00% GC)
  mean time:        1.072 ms (16.54% GC)
  maximum time:     2.916 ms (69.94% GC)
  --------------
  samples:          4654
  evals/sample:     1

This is hardly faster than not using threads at all:

julia> @benchmark tproduct_nothreads($x)
BenchmarkTools.Trial:
  memory estimate:  7.65 MiB
  allocs estimate:  5
  --------------
  minimum time:     833.721 μs (0.00% GC)
  median time:      855.760 μs (0.00% GC)
  mean time:        1.016 ms (8.90% GC)
  maximum time:     2.251 ms (26.65% GC)
  --------------
  samples:          4912
  evals/sample:     1

while Mason’s was much faster, thanks to the lower overhead of fewer @spawns:

julia> @benchmark tproduct_avx($x)
BenchmarkTools.Trial:
  memory estimate:  7.67 MiB
  allocs estimate:  240
  --------------
  minimum time:     85.435 μs (0.00% GC)
  median time:      100.729 μs (0.00% GC)
  mean time:        188.287 μs (43.70% GC)
  maximum time:     901.825 μs (76.78% GC)
  --------------
  samples:          10000
  evals/sample:     1

We can improve on this by trying to address the issue of uneven numbers of iterations on the inner loop. Basically, we want to partition the outer loop into nthreads() pieces that have similar numbers of total inner loop iterations.
This means that the early outer loop chunks must be longer than the later outer loop chunks.

The total number of triangular iterations is div(N * (N + 1), 2).

julia> binomial(length(x) + 1,2)
500500

julia> bin2p1(x::Integer) = (x * (x + 1)) >>> 1
bin2p1 (generic function with 2 methods)

julia> bin2p1(1000)
500500

We can invert this and solve: setting k(k + 1)/2 = n and taking the positive root of the quadratic gives k = (sqrt(1 + 8n) - 1)/2, which is what outer_iter computes (with rounding).

julia> outer_iter(n) = round(Int, @fastmath 0.5sqrt(1 + 8n) - 0.5)
outer_iter (generic function with 1 method)

julia> @. outer_iter(500500 * (0:8) ÷ 8)
9-element Array{Int64,1}:
    0
  353
  500
  612
  707
  790
  866
  935
 1000

Now these are boundaries for the outer loop that should give roughly similar numbers of total loop iterations, if we were to use 8 chunks and length(x) == 1000.

Putting this together:

using Base.Threads: @spawn, @sync, nthreads
using LinearAlgebra: Symmetric
using LoopVectorization: @avx

outer_iter(n) = round(Int, @fastmath 0.5sqrt(1 + 8n) - 0.5)
bin2p1(x::AbstractVector) = bin2p1(length(x))
bin2p1(x::Integer) = (x * (x + 1)) >>> 1

function tproduct_threads(x)
    X3 = @avx cbrt.(x)
    X9 = @avx cbrt.(X3).^2
    H = Matrix{eltype(x)}(undef, length(x), length(x))
    niter = bin2p1(length(x))
    nchunks = Threads.nthreads()
    @sync for tid in 1:nchunks
        @spawn begin
            start = outer_iter(((tid - 1) * niter) ÷ nchunks) + 1
            stop = outer_iter((tid * niter) ÷ nchunks)
            for i in start:stop
                x3 = X3[i]
                x9 = X9[i]
                @avx for j in firstindex(x):i
                    A = x3 + X3[j]
                    B = x3 * X3[j]
                    C = x9 + X9[j]
                    f = A^2 * sqrt(C)
                    g = exp(-(B / A)^4)
                    H[j, i] = f * g
                end
            end
        end
    end
    return Symmetric(H)
end

This yields

julia> @benchmark tproduct_threads($x)
BenchmarkTools.Trial:
  memory estimate:  7.66 MiB
  allocs estimate:  119
  --------------
  minimum time:     57.702 μs (0.00% GC)
  median time:      70.051 μs (0.00% GC)
  mean time:        234.578 μs (46.76% GC)
  maximum time:     1.574 ms (56.24% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> versioninfo()
Julia Version 1.5.0-DEV.585
Commit 4213cc651f* (2020-04-07 08:45 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 18
2 Likes

Weird. I saw a 3x speedup from four threads, and also tproduct was 10% faster than tproduct_avx on my laptop.

Rerunning on Julia 1.4, where threading is in drastically better shape than Julia master, and using @avx for the cbrt broadcasts in each version:

julia> x = rand(1000);

julia> @benchmark tproduct_threads($x)
BenchmarkTools.Trial:
  memory estimate:  7.66 MiB
  allocs estimate:  119
  --------------
  minimum time:     51.276 μs (0.00% GC)
  median time:      82.728 μs (0.00% GC)
  mean time:        296.117 μs (49.80% GC)
  maximum time:     1.547 ms (64.31% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark tproduct_avx($x)
BenchmarkTools.Trial:
  memory estimate:  7.67 MiB
  allocs estimate:  240
  --------------
  minimum time:     62.949 μs (0.00% GC)
  median time:      97.996 μs (0.00% GC)
  mean time:        177.327 μs (45.42% GC)
  maximum time:     899.084 μs (81.76% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark tproduct($x)
BenchmarkTools.Trial:
  memory estimate:  8.39 MiB
  allocs estimate:  6015
  --------------
  minimum time:     566.186 μs (0.00% GC)
  median time:      681.781 μs (0.00% GC)
  mean time:        915.924 μs (19.56% GC)
  maximum time:     2.739 ms (43.76% GC)
  --------------
  samples:          5444
  evals/sample:     1

julia> @benchmark tproduct_nothreads($x)
BenchmarkTools.Trial:
  memory estimate:  7.65 MiB
  allocs estimate:  5
  --------------
  minimum time:     662.203 μs (0.00% GC)
  median time:      672.920 μs (0.00% GC)
  mean time:        740.551 μs (8.30% GC)
  maximum time:     1.306 ms (44.12% GC)
  --------------
  samples:          6727
  evals/sample:     1

Seems I’m seeing all around much better performance on Julia 1.4 than Julia master. [EDIT: that is because I had configured SLEEFPirates differently on Julia master; reverting to the default results in roughly equivalent single threaded performance.]
But I am still seeing comparable performance between single threaded, and 18 threads with an @spawn on each iteration.

Which, given that we have 1000 iterations, and the overhead of @spawn:

julia> @benchmark @spawn nothing
BenchmarkTools.Trial:
  memory estimate:  719 bytes
  allocs estimate:  4
  --------------
  minimum time:     485.450 ns (0.00% GC)
  median time:      572.517 ns (0.00% GC)
  mean time:        615.971 ns (6.23% GC)
  maximum time:     4.161 μs (77.13% GC)
  --------------
  samples:          9812
  evals/sample:     825

means it seems like it’s to be expected. 1000 @spawns look like they’d take at least 485 microseconds. Even with just 18 @spawns, they’d make up about 20% of the runtime of the 51.276 μs minimum time from tproduct_threads. Minimizing the number of @spawns seems key.

But that’s just because of how fast this code is. If it was taking milliseconds, we wouldn’t really care about half a microsecond.

2 Likes

that’s interesting, I see 648.043 μs (6016 allocations: 8.39 MiB) on your version and 401.126 μs (88 allocations: 7.66 MiB) on mine. Of course, when I quickly whipped that code up, I didn’t think about the fact that each iteration of the triangular loop takes a different amount of time, so Chris is right that it would be best to be more intelligent about how the threads are spawned.

What sort of processor does your laptop have? Maybe the difference is that it’s unable to take advantage of any AVX instructions?

Interestingly, your code is actually slightly slower than my naive loop code on my machine:

julia> @btime tproduct_threads(x)  setup=(x=rand(1000));
  430.115 μs (45 allocations: 7.65 MiB)

but I suspect that the differences matter more on your system because the code is running so much faster on it and you have so many more threads available. My code just spawns twice as many tasks as the number of threads and hopes for the best, which probably isn’t so great if you have 18 threads.