@threads vs @parallel, a simple fail case for @threads


#1

I’m new to Julia. I was experimenting with simple parallel code using threads and processes. Why is

a = SharedArray{Int64,1}(4)
@time @sync @parallel for i = 1:60000000
    a[myid() - 1] = i
end

#   1.663024 seconds (70.65 k allocations: 3.814 MiB)

so much faster and more efficient than

a = SharedArray{Int64,1}(4) # also using a SharedArray for fairness
@time Threads.@threads for i = 1:60000000
    a[Threads.threadid()] = i
end

#   7.977596 seconds (142.59 M allocations: 2.483 GiB, 2.39% gc time)

? It seems the latter allocates a lot. Where do those allocations come from?

I’ve started the REPL with env JULIA_NUM_THREADS=4 julia -p 4.
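For reference, a quick sanity check that both parallel backends are actually active with that invocation:

```julia
# With `env JULIA_NUM_THREADS=4 julia -p 4`, both of these should report 4:
Threads.nthreads()  # number of threads available to @threads
nworkers()          # number of worker processes available to @parallel
```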


#2

First, you need to wrap your code in a function before benchmarking is meaningful. See https://docs.julialang.org/en/stable/manual/performance-tips.html.

I think the @time macro may be interfering with @threads in some way. I initially ran the following, copied from your code:

function test2()
    a = SharedArray{Int64,1}(4) # also using a SharedArray for fairness
    @time Threads.@threads for i = 1:60000000
        a[Threads.threadid()] = i
    end
end

This is really slow, and @code_warntype complains about a Core.Box variable. However, taking the @time out of the function fixes this. The following works:

function test1()
    a = SharedArray{Int64,1}(4)
    @sync @parallel for i = 1:60000000
        a[myid() - 1] = i
    end
end

function test2()
    a = SharedArray{Int64,1}(4) # also using a SharedArray for fairness
    Threads.@threads for i = 1:60000000
        a[Threads.threadid()] = i
    end
end

test1() # Warmup 
test2()

The output is

julia> @time test1()
  0.018457 seconds (1.25 k allocations: 48.188 KiB)
4-element Array{Future,1}:
 Future(2, 1, 111, #NULL)
 Future(3, 1, 112, #NULL)
 Future(4, 1, 113, #NULL)
 Future(5, 1, 114, #NULL)

julia> @time test2()
  0.011668 seconds (501 allocations: 18.484 KiB)
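The Core.Box complaint refers to Julia's boxing of captured variables that the compiler cannot prove are never reassigned. A minimal, self-contained illustration of the same symptom (the function name here is just for demonstration, unrelated to the code above):

```julia
function make_counter()
    n = 0
    inc() = (n += 1)  # `n` is reassigned inside the closure, so it gets boxed
    inc()
    inc()
    return n
end

make_counter()  # returns 2
```

Running `@code_warntype make_counter()` shows `n` typed as `Core.Box`, and every access to it allocates; that is the same kind of problem @code_warntype reports for the @time-inside-the-function version.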

#3

Thank you! That might give a lead on where the allocations come from.

As is often the case with such things, I realized my mistake of not wrapping the code in a function right after posting. But the post was pending moderation, so I couldn’t change it. :man_shrugging:

But I wonder: is there a way to precompile the function without running it? Do I need to wrap it in a module and add __precompile__() at the top?


#4

Rather than trying to precompile by hand, you can just use https://github.com/JuliaCI/BenchmarkTools.jl, which runs your function repeatedly and provides a robust estimate of its actual runtime, excluding compilation.
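A minimal sketch of that workflow (the function here is a stand-in, just to keep the example self-contained; BenchmarkTools must be installed):

```julia
using BenchmarkTools

# Any function you want to time; @btime runs it many times and
# reports the minimum time, so compilation overhead is excluded.
f(n) = sum(i for i in 1:n)

@btime f(1_000_000)
```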