Hi, it's the Schönauer vector triad again; see also Georg Hager's benchmarking site.
This time we compare multithreaded vs. scalar performance, and also compare to gcc.
Here is the generating code.
We see that for scalar performance, Julia shines vs. gcc -Ofast. For multithreaded performance (4 threads), the picture appears to be different: it seems that the bookkeeping overhead for handling threading is still larger than what we see for gcc. For small array sizes we see this overhead problem for gcc as well. For large array sizes, performance is limited by memory access (it's a laptop…).
This is the @threads based loop:

Threads.@threads for i = 1:N
    @inbounds @fastmath d[i] = a[i] + b[i] * c[i]
end
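For context, a minimal self-contained sketch of the @threads variant of the triad is below. The function name `triad_threads!`, the array size, and the setup code are my assumptions; only the inner loop comes from the post.

```julia
using Base.Threads

# Schönauer vector triad d = a + b*c, parallelized with @threads.
# The loop body follows the post; everything around it is an assumed harness.
function triad_threads!(d, a, b, c)
    Threads.@threads for i = 1:length(d)
        @inbounds @fastmath d[i] = a[i] + b[i] * c[i]
    end
    return d
end

N = 1_000_000
a = rand(N); b = rand(N); c = rand(N); d = zeros(N)
triad_threads!(d, a, b, c)
```

In a real benchmark one would wrap the call in a repetition loop and time it, so that for small N the per-call threading overhead becomes visible.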
And this is the gist of the faster, @spawn based implementation:

mapreduce(fetch, +,
          [Threads.@spawn _kernel(a, b, c, d, loop_begin[i], loop_end[i]) for i = 1:ntasks])
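To make the gist runnable, here is one possible sketch of the surrounding pieces. The names `_kernel`, `loop_begin`, and `loop_end` come from the post; the chunking scheme, the kernel's return value, and the wrapper function `triad_spawn!` are my assumptions.

```julia
using Base.Threads

# Serial kernel working on the contiguous index range i0:i1.
# Returning the number of processed elements is an assumption, chosen
# so the mapreduce over tasks has something meaningful to sum.
function _kernel(a, b, c, d, i0, i1)
    @inbounds @fastmath for i = i0:i1
        d[i] = a[i] + b[i] * c[i]
    end
    return i1 - i0 + 1
end

function triad_spawn!(d, a, b, c; ntasks = Threads.nthreads())
    N = length(d)
    # Split 1:N into ntasks contiguous chunks (the post's loop_begin/loop_end).
    loop_begin = [1 + (t - 1) * N ÷ ntasks for t = 1:ntasks]
    loop_end   = [t * N ÷ ntasks for t = 1:ntasks]
    # Spawn one task per chunk, then fetch and sum their results.
    mapreduce(fetch, +,
              [Threads.@spawn _kernel(a, b, c, d, loop_begin[i], loop_end[i])
               for i = 1:ntasks])
end

N = 100_000
a = rand(N); b = rand(N); c = rand(N); d = zeros(N)
nprocessed = triad_spawn!(d, a, b, c)
```

The point of this variant is that the tasks synchronize only through the final `fetch` calls, avoiding a full barrier after every parallel loop.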
One finds a very similar picture in a 2013 post by G. Hager comparing the Intel and GNU compilers:
He also suspects gcc's barrier performance to be the reason for the performance difference.
I am aware that it is still early days for multithreading in Julia, and things are clearly marked as experimental, so this post is meant as an encouragement to continue the endeavour of getting on par with C performance-wise (and thanks anyway for starting this!).