For spectralnorm, I get 0.457363 seconds with 16 threads, only 11% slower than the C program (assuming it uses all 16 threads I have).
With 4 threads I'm 64% slower, not 4x slower, so it isn't scaling well (and I guess neither is the C program):
$ time julia -t4 -O2 --cpu-target=core2
julia> @timev main(5500);
0.753225 seconds (979 allocations: 246.938 KiB)
elapsed time (ns): 753225326
bytes allocated: 252864
pool allocs: 976
malloc() calls: 3
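(The thread count is fixed at julia startup, so comparing scaling means relaunching with a different -t. A minimal sanity check to run inside each session, assuming main is the benchmark's entry point as above:)
using Base.Threads

# Confirm how many threads -tN actually gave this session,
# then time the benchmark entry point.
println("threads: ", nthreads())
@timev main(5500)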
Looking at a profile of the program, assuming I’m reading it correctly, the single “line” taking the most time is line 8:
Base.:+(a::m128d, b::m128d) =
    Base.llvmcall("""
        %res = fadd <2 x double> %0, %1
        ret <2 x double> %res
        """, m128d, Tuple{m128d, m128d}, a, b)
but “line 13” takes only a small amount of time:
Base.:/(a::m128d, b::m128d) =
    Base.llvmcall("""
        %res = fdiv <2 x double> %0, %1
        ret <2 x double> %res
        """, m128d, Tuple{m128d, m128d}, a, b)
That seems very odd; wouldn’t division be slower? [Probably the profiler is inaccurate here: fdiv has long latency, but the CPU gets to issue other instructions soon after, before the result comes out, so samples get attributed elsewhere.]
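For context, these definitions presumably rely on m128d being a pair of VecElements; that's my assumption of how the benchmark program defines it, but with it the snippets above become a self-contained sketch you can paste into a REPL:
# Assumed definition: a 128-bit SIMD vector of two Float64s, which Julia
# lowers to LLVM's <2 x double> when passed to llvmcall.
const m128d = NTuple{2, VecElement{Float64}}

Base.:+(a::m128d, b::m128d) =
    Base.llvmcall("""
        %res = fadd <2 x double> %0, %1
        ret <2 x double> %res
        """, m128d, Tuple{m128d, m128d}, a, b)

# Elementwise add: expect VecElement(4.0) and VecElement(6.0).
x = (VecElement(1.0), VecElement(2.0))
y = (VecElement(3.0), VecElement(4.0))
@show x + y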
[EDIT:
It’s helpful to hide the C frames (which include the threading runtime) to see what’s going on; otherwise threading seems to take the single most time, which is also good to know about:
ProfileView.view(C=false, fontsize=40)
The rest of my post is answered, thanks; I assume the answer is yes, just not for the machine the benchmark runs on.]
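For anyone reproducing this, a rough sketch of the profiling steps (assuming main is defined as in the benchmark program and ProfileView is installed):
using Profile, ProfileView

main(500)            # warm up first so compilation isn't what gets profiled
Profile.clear()
@profile main(5500)  # collect samples from the real run
ProfileView.view(C=false, fontsize=40)  # C=false hides the C/runtime frames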
And can you vectorize to do 4, 8, etc. additions (or divisions) at a time?
I get no error changing “<2 x double>” to “<4 x double>”, or even to 160 wide, but I didn’t actually run the benchmark with either of those definitions.
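To be concrete, here is an untested sketch of a hypothetical 4-wide variant; m256d is my name, not something from the benchmark program, and it needs an AVX-capable target, so the --cpu-target=core2 flag from above would have to go:
# Hypothetical 4-wide type (untested in the benchmark): four Float64s
# at a time, lowered to LLVM's <4 x double>.
const m256d = NTuple{4, VecElement{Float64}}

Base.:+(a::m256d, b::m256d) =
    Base.llvmcall("""
        %res = fadd <4 x double> %0, %1
        ret <4 x double> %res
        """, m256d, Tuple{m256d, m256d}, a, b)

Base.:/(a::m256d, b::m256d) =
    Base.llvmcall("""
        %res = fdiv <4 x double> %0, %1
        ret <4 x double> %res
        """, m256d, Tuple{m256d, m256d}, a, b)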