Threads.@spawn performance

I am trying to perform a complex mathematical operation over the elements of two arrays. It’s a kind of broadcasting operation. I am trying to achieve a performance gain from Threads.@spawn and @simd at the same time. In my code the inner loop uses @simd and the outer loop uses Threads.@spawn. It looks like I am not achieving any significant gain from the multithreading. I am attaching my code here:

julia> using Base.Threads, BenchmarkTools
julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
julia> function complex_function!(X::StridedVector{Float64}, Y::StridedVector{Float64}, Z::StridedMatrix{Float64}, k::Int64, l::Int64, m::Int64)
    for j in k:l
        @simd for i in 1:m
            @inbounds Z[i,j] = -(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001)
        end
    end
end
julia> function thread_spwan!(A::StridedVector{Float64}, B::StridedVector{Float64}, C::StridedMatrix{Float64}, n::Int64, m::Int64, lo::Int64=1, hi::Int64=n, ntasks=3000)
    if hi - lo > n/ntasks - 1
        mid = (hi + lo) >>> 1
        finish = Threads.@spawn thread_spwan!(A, B, C, n, m, lo, mid, ntasks)
        thread_spwan!(A, B, C, n, m, mid+1, hi, ntasks)
        wait(finish)
        return
    end
    complex_function!(A, B, C, lo, hi, m)
end
julia> a=rand(20000);
julia> b=rand(20000);
julia> c=fill(0.0,(20000,20000));
julia> @btime complex_function!(a,b,c,1,20000,20000);
  256.195 ms (0 allocations: 0 bytes)
julia> @btime thread_spwan!(a,b,c,20000,20000);
  229.582 ms (24576 allocations: 3.19 MiB)

Also, I am not getting much detail from @code_llvm thread_spwan!(a,b,c,20000,20000), whereas @code_llvm complex_function!(a,b,c,1,20000,20000) shows a lot of detail.

Hi, I tried out some things, but couldn’t get it much faster either. Unfortunately I don’t have the time to dig into optimizations. But: 1) You can save yourself a bit of writing by using the Threads.@threads macro for automatic loop range partitioning, as sketched below. 2) Take a look at Tullio.jl or Strided.jl.
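A minimal sketch of what point 1 could look like with your kernel (the function name is mine, and I haven’t benchmarked this variant):

function complex_function_threads!(X, Y, Z, k, l, m)
    Threads.@threads for j in k:l   # @threads splits the j range into nthreads() chunks
        @simd for i in 1:m
            @inbounds Z[i,j] = -(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001)
        end
    end
end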

Hi, thanks for referring to these tools. In the first place, my motivation is to write optimized multi-threaded code and understand what is going on under the hood. I did initially look into Tullio.jl while writing my code (as far as I could follow it from a beginner’s perspective). As I mentioned earlier, I am not getting much information from @code_llvm thread_spwan!(a,b,c,20000,20000); maybe there are other tools for this purpose.

I think your function is just too simple to benefit much from threading; it’s mostly waiting for memory.
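A back-of-the-envelope check (the ~13 GB/s effective write bandwidth is my guess for this machine, not a measurement):

bytes = 20_000 * 20_000 * 8   # 3.2 GB stored into c on every call
bytes / 13e9                  # ≈ 0.25 s at the assumed bandwidth, close to the measured ~250 ms

So a single core already comes close to saturating the memory bus, and extra threads have little left to win.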

The recursive spawning you are doing is actually pretty similar to what Tullio does. One difference is that it won’t launch more than Threads.nthreads() tasks, but once it’s used those up, it continues to recursively sub-divide, because this speeds up memory access in many problems. Although not in this one. Here’s how you could (as an exercise) call its threading routine on your kernel:

using Tullio

@inline function complex_function!(::Type, Z, X, Y, is::AbstractRange, js::AbstractRange, _, _)
    for j in js
        @simd for i in is
            @inbounds Z[i,j] = -(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001)
        end
    end
end
function thread_run!(C, A, B, is=eachindex(B), js=eachindex(A); block=10^6)
    # No guarantee this function won't change! But right now:
    Tullio.threader(complex_function!, Array, C, (A,B), (is, js), (), +, block)
    C
end

a = rand(10); b = rand(10); c = zeros(10,10);

thread_run!(c, a, b)

c ≈ @tullio Z[i,j] := -(a[j]-b[i])*0.01/((a[j]-b[i])^2+0.0001) # true

a=rand(20000); b=rand(20000); c=fill(0.0,(20000,20000)); # original size

@btime complex_function!($a,$b,$c,1,20000,20000); # 280.252 ms (0 allocations: 0 bytes)
@btime thread_spwan!($a,$b,$c,20000,20000);       # 255.132 ms (24585 allocations: 3.00 MiB)
@btime thread_run!($c, $a,$b);                    # 263.643 ms (45 allocations: 2.88 KiB)
Tullio.TILE[] = 20_000 * 8; # huge tiles to disable further sub-division
@btime thread_run!($c, $a,$b);                    # 246.462 ms (44 allocations: 2.84 KiB)

ntasks=3000 sounds large. It looks like your problem doesn’t require load-balancing, so there is not much point in spawning more than Threads.nthreads() tasks. Why not use ntasks=Threads.nthreads()?
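For comparison, that would just be (same arrays as above; the explicit lo and hi are needed to reach the ntasks argument):

@btime thread_spwan!($a, $b, $c, 20000, 20000, 1, 20000, Threads.nthreads());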

Thinking about it, I would second that statement. A quick check with a more expensive loop body and a 1000×1000 problem:

function complex_function!(X,Y,Z,k,l,m)
    for j in k:l
        @simd for i in 1:m
            @inbounds Z[i,j]=log(exp(sin(-(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001))^2)+1)
        end
    end
end
function complex_function_thr!(X,Y,Z,k,l,m)
    Threads.@threads for j in k:l
        @simd for i in 1:m
            @inbounds Z[i,j]=log(exp(sin(-(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001))^2)+1)
        end
    end
end

n = 1000; a = rand(n); b = rand(n); c = zeros(n, n);  # smaller problem than above
@btime complex_function!($a,$b,$c, 1, $n, $n)      # 33ms
@btime complex_function_thr!($a,$b,$c, 1, $n, $n)  # 4ms -> much faster using 8 threads

Given that threading doesn’t help much in the memory-bound case, you could rewrite your code more concisely:

...
foo(x,y) = -(x-y)*0.01 / ((x-y)^2 + 1E-4)
@. c = foo(a',b)

which is equivalent to your complex_function!. @. is the macro for automatic broadcasting, and the transposed vector a' makes the result a matrix. The compiler takes care of SIMD, inbounds, inlining etc. in this simple case, so there is no need for verbose annotations. If you want to take a closer look at what macros do, you can use the @macroexpand macro 🙂
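For example, expanding the broadcast above should print something like this (the exact form can differ between Julia versions):

julia> @macroexpand @. c = foo(a', b)
:(c .= foo.(a', b))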

It’s working fine. It gives me slightly better performance than my handwritten complex_function!. For some simple additions or subtractions my complex_function! gives a better result than map or broadcast, but I had never used the @. macro; instead I tried

c = broadcast(foo, a', b)

The difference between @. c = foo(a',b) and c=broadcast(foo,a',b) is…

c .= foo.(a',b)  # result of @.
c = foo.(a',b)   # equivalent of broadcast

… the tiny dot in front of =. In the former case the result of foo is written element-wise into the already existing array c, so no allocations happen here. In the latter case, foo allocates a new matrix, writes into it, and then the result is assigned to c, no matter what c was beforehand. So the tiny dot makes the difference between the slow allocating and the fast non-allocating version, and writing the ~3.2 GB in this case takes time. Note that your program should be faster if you use Float32, given you can live with fewer bits.
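If you want to check the allocation difference yourself (timings omitted, since they depend on the machine):

@btime $c .= foo.($a', $b);   # in-place: 0 allocations
@btime foo.($a', $b);         # allocates a fresh 20000×20000 matrix, ~3.2 GB per call

And a sketch of the Float32 variant (foo32 is a hypothetical Float32 version of foo; the literals need to be Float32 too, otherwise the computation still promotes to Float64):

foo32(x, y) = -(x-y)*0.01f0 / ((x-y)^2 + 1f-4)  # Float32 literals: 0.01f0, 1f-4
a32 = rand(Float32, 20000); b32 = rand(Float32, 20000);
c32 = zeros(Float32, 20000, 20000);
@. c32 = foo32(a32', b32)   # half the bytes to write compared to Float64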