Threads.@spawn performance

I am trying to perform a complex mathematical operation over the elements of two arrays. It’s a kind of broadcasting operation. I am trying to achieve a performance gain from Threads.@spawn and @simd at the same time. In my code the inner loop uses @simd and the outer loop uses Threads.@spawn. It looks like I am not achieving any significant gain from the multithreading. I am attaching my code here:

julia> using Base.Threads, BenchmarkTools
julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
julia> function complex_function!(X::StridedVector{Float64}, Y::StridedVector{Float64}, Z::StridedMatrix{Float64}, k::Int64, l::Int64, m::Int64)
    for j in k:l
        @simd for i in 1:m
            @inbounds Z[i,j] = -(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001)
        end
    end
end
julia> function thread_spwan!(A::StridedVector{Float64}, B::StridedVector{Float64}, C::StridedMatrix{Float64}, n::Int64, m::Int64, lo::Int64=1, hi::Int64=n, ntasks=3000)
    if hi - lo > n/ntasks - 1
        mid = (hi + lo) >>> 1
        finish = Threads.@spawn thread_spwan!(A, B, C, n, m, lo, mid, ntasks)
        thread_spwan!(A, B, C, n, m, mid+1, hi, ntasks)
        wait(finish)
        return
    end
    complex_function!(A, B, C, lo, hi, m)
end
julia> a=rand(20000);
julia> b=rand(20000);
julia> c=fill(0.0,(20000,20000));
julia> @btime complex_function!(a,b,c,1,20000,20000);
  256.195 ms (0 allocations: 0 bytes)
julia> @btime thread_spwan!(a,b,c,20000,20000);
  229.582 ms (24576 allocations: 3.19 MiB)

Also, I am not getting much detail from @code_llvm thread_spwan!(a,b,c,20000,20000), whereas @code_llvm complex_function!(a,b,c,1,20000,20000) shows a lot of detail.

Hi, I tried out some things, but couldn’t get it much faster either. Unfortunately I don’t have the time to dig into optimizations. But: 1) You can save yourself a bit of writing by using the Threads.@threads macro for automatic loop range partitioning, as sketched below. 2) Take a look at Tullio.jl or Strided.jl.
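A minimal sketch of what point 1 could look like with your kernel (the function name is mine, and I haven’t benchmarked this variant):

function complex_function_threads!(X, Y, Z, k, l, m)
    Threads.@threads for j in k:l   # @threads splits the j range into nthreads() chunks
        @simd for i in 1:m
            @inbounds Z[i,j] = -(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001)
        end
    end
end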

Hi, thanks for referring to these tools. In the first place, my motivation is to write optimized multi-threaded code and understand what is going on under the hood. I did initially look into Tullio.jl while writing my code (as far as I could follow it from a beginner’s perspective). As I mentioned earlier, I am not getting much information from @code_llvm thread_spwan!(a,b,c,20000,20000); maybe there are other tools for this purpose.

I think your function is just too simple to benefit much from threading; it’s mostly waiting for memory.
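A back-of-the-envelope check (the ~13 GB/s effective write bandwidth is my guess for this machine, not a measurement):

bytes = 20_000 * 20_000 * 8   # 3.2 GB stored into c on every call
bytes / 13e9                  # ≈ 0.25 s at the assumed bandwidth, close to the measured ~250 ms

So a single core already comes close to saturating the memory bus, and extra threads have little left to win.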

The recursive spawning you are doing is actually pretty similar to what Tullio does. One difference is that it won’t launch more than Threads.nthreads() tasks, but once it’s used those up, it continues to recursively sub-divide, because this speeds up memory access in many problems. Although not in this one. Here’s how you could (as an exercise) call its threading routine on your kernel:

using Tullio

@inline function complex_function!(::Type, Z, X, Y, is::AbstractRange, js::AbstractRange, _, _)
    for j in js
        @simd for i in is
            @inbounds Z[i,j] = -(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001)
        end
    end
end
function thread_run!(C, A, B, is=eachindex(B), js=eachindex(A); block=10^6)
    # No guarantee this function won't change! But right now:
    Tullio.threader(complex_function!, Array, C, (A,B), (is, js), (), +, block)
    C
end

a = rand(10); b = rand(10); c = zeros(10,10);

thread_run!(c, a, b)

c ≈ @tullio Z[i,j] := -(a[j]-b[i])*0.01/((a[j]-b[i])^2+0.0001) # true

a=rand(20000); b=rand(20000); c=fill(0.0,(20000,20000)); # original size

@btime complex_function!($a,$b,$c,1,20000,20000); # 280.252 ms (0 allocations: 0 bytes)
@btime thread_spwan!($a,$b,$c,20000,20000);       # 255.132 ms (24585 allocations: 3.00 MiB)
@btime thread_run!($c, $a,$b);                    # 263.643 ms (45 allocations: 2.88 KiB)
Tullio.TILE[] = 20_000 * 8; # huge tiles to disable further sub-division
@btime thread_run!($c, $a,$b);                    # 246.462 ms (44 allocations: 2.84 KiB)

ntasks=3000 sounds large. It looks like your problem doesn’t require load-balancing, so there is not much point in spawning more than Threads.nthreads() tasks. Why not use ntasks=Threads.nthreads()?
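For comparison, that would just be (same arrays as above; the explicit lo and hi are needed to reach the ntasks argument):

@btime thread_spwan!($a, $b, $c, 20000, 20000, 1, 20000, Threads.nthreads());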

Thinking about it, I would second that statement. A quick check with a more expensive loop body and a 1000×1000 problem:

function complex_function!(X,Y,Z,k,l,m)
    for j in k:l
        @simd for i in 1:m
            @inbounds Z[i,j]=log(exp(sin(-(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001))^2)+1)
        end
    end
end
function complex_function_thr!(X,Y,Z,k,l,m)
    Threads.@threads for j in k:l
        @simd for i in 1:m
            @inbounds Z[i,j]=log(exp(sin(-(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001))^2)+1)
        end
    end
end

n = 1000; a = rand(n); b = rand(n); c = zeros(n, n);  # smaller problem than above
@btime complex_function!($a,$b,$c, 1, $n, $n)      # 33ms
@btime complex_function_thr!($a,$b,$c, 1, $n, $n)  # 4ms -> much faster using 8 threads

Given that threading doesn’t help much in the memory-bound case, you could rewrite your code more concisely:

...
foo(x,y) = -(x-y)*0.01 / ((x-y)^2 + 1E-4)
@. c = foo(a',b)

which is equivalent to your complex_function!. @. is the macro for automatic broadcasting, and the transposed vector a' makes the result a matrix. The compiler takes care of SIMD, inbounds, inlining etc. in this simple case, so there is no need for verbose annotations. If you want to take a closer look at what macros do, you can use the @macroexpand macro 🙂
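For example, expanding the broadcast above should print something like this (the exact form can differ between Julia versions):

julia> @macroexpand @. c = foo(a', b)
:(c .= foo.(a', b))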

It’s working fine. It gives me slightly better performance than my handwritten complex_function!. For some simple additions or subtractions my complex_function! gives a better result than map or broadcast, but I had never used the @. macro; instead I tried

c = broadcast(foo, a', b)

The difference between @. c = foo(a',b) and c=broadcast(foo,a',b) is…

c .= foo.(a',b)  # result of @.
c = foo.(a',b)   # equivalent of broadcast

… the tiny dot in front of =. In the former case the result of foo is written element-wise into the already existing array c, so no allocations happen here. In the latter case, foo allocates a new matrix, writes into it, and then the result is assigned to c, no matter what c was beforehand. So the tiny dot makes the difference between the slow allocating and the fast non-allocating version, and writing the ~3.2 GB in this case takes time. Note that your program should be faster if you use Float32, given you can live with fewer bits.
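If you want to check the allocation difference yourself (timings omitted, since they depend on the machine):

@btime $c .= foo.($a', $b);   # in-place: 0 allocations
@btime foo.($a', $b);         # allocates a fresh 20000×20000 matrix, ~3.2 GB per call

And a sketch of the Float32 variant (foo32 is a hypothetical Float32 version of foo; the literals need to be Float32 too, otherwise the computation still promotes to Float64):

foo32(x, y) = -(x-y)*0.01f0 / ((x-y)^2 + 1f-4)  # Float32 literals: 0.01f0, 1f-4
a32 = rand(Float32, 20000); b32 = rand(Float32, 20000);
c32 = zeros(Float32, 20000, 20000);
@. c32 = foo32(a32', b32)   # half the bytes to write compared to Float64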