I am trying to perform a complex mathematical operation over the elements of two arrays. It is a kind of broadcasting operation. I want to get a performance gain from `Threads.@spawn` and `@simd` at the same time: the inner loop uses `@simd`, and the outer loop is split recursively into tasks with `Threads.@spawn`. However, I do not seem to get any significant gain from multithreading (about 230 ms versus 256 ms with `JULIA_NUM_THREADS = 4`). My code is below:

```
julia> using Base.Threads,BenchmarkTools
julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
julia> function complex_function!(X::StridedVector{Float64}, Y::StridedVector{Float64}, Z::StridedMatrix{Float64}, k::Int64, l::Int64, m::Int64)
           # Fill columns k:l of Z; the inner loop over rows is the SIMD candidate.
           for j in k:l
               @simd for i in 1:m
                   @inbounds Z[i,j] = -(X[j]-Y[i])*0.01/((X[j]-Y[i])^2+0.0001)
               end
           end
       end
julia> function thread_spwan!(A::StridedVector{Float64}, B::StridedVector{Float64}, C::StridedMatrix{Float64}, n::Int64, m::Int64, lo::Int64=1, hi::Int64=n, ntasks=3000)
           # Recursively halve the column range lo:hi until it holds about n/ntasks columns.
           if hi - lo > n/ntasks-1
               mid = (hi+lo)>>>1
               finish = Threads.@spawn thread_spwan!(A,B,C,n,m,lo,mid,ntasks)
               thread_spwan!(A,B,C,n,m,mid+1,hi,ntasks)
               wait(finish)
               return
           end
           # Base case: run the serial SIMD kernel on this chunk of columns.
           complex_function!(A,B,C,lo,hi,m)
       end
julia> a=rand(20000);
julia> b=rand(20000);
julia> c=fill(0.0,(20000,20000));
julia> @btime complex_function!(a,b,c,1,20000,20000);
256.195 ms (0 allocations: 0 bytes)
julia> @btime thread_spwan!(a,b,c,20000,20000);
229.582 ms (24576 allocations: 3.19 MiB)
```
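For reference, the same computation can be written as a smaller, self-contained sketch that splits the columns with `Threads.@threads` and compares the result against a plain broadcast. The names `kernel_column!`, `threaded_columns!`, and `broadcast_version` are hypothetical placeholders, not part of the code above, and this sketch has not been benchmarked; it is only meant to show what the kernel computes and to give a way to check correctness on small inputs.

```julia
using Base.Threads

# Hypothetical helper: the same formula as complex_function!, for a single column j.
function kernel_column!(Z, X, Y, j)
    @inbounds @simd for i in eachindex(Y)
        d = X[j] - Y[i]
        Z[i, j] = -d * 0.01 / (d^2 + 0.0001)
    end
    return Z
end

# Hypothetical alternative to the recursive @spawn split:
# the column range is divided among the available threads by @threads.
function threaded_columns!(Z, X, Y)
    @threads for j in eachindex(X)
        kernel_column!(Z, X, Y, j)
    end
    return Z
end

# Plain broadcast form of the same formula (allocates a temporary matrix d).
function broadcast_version(X, Y)
    d = X' .- Y                      # d[i, j] = X[j] - Y[i]
    return (-d) .* 0.01 ./ (d .^ 2 .+ 0.0001)
end

# Small inputs so the check runs quickly.
x = rand(200)
y = rand(300)
z = zeros(length(y), length(x))
threaded_columns!(z, x, y)
println(z ≈ broadcast_version(x, y))
```

Each column of `Z` is written by exactly one loop iteration, so the threaded loop has no write conflicts between threads.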

Also, `@code_llvm thread_spwan!(a,b,c,20000,20000)` gives me very little detail, whereas `@code_llvm complex_function!(a,b,c,1,20000,20000)` shows a lot of detail.