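function parallel_spawn_t(y,x,lo=1,hi=length(x)-1)
    # Base case: small ranges run sequentially (n is assumed to be a global defined elsewhere in the thread)
    if hi-lo < n/16-1
        a = 1.2f0
        b = 1-a
        for i=lo:hi
            @inbounds y[i] = x[i]*a + x[i+1]*b
        end
        return nothing
    end
    # Otherwise split the range and hand the lower half to a new task
    mid = (lo+hi)>>>1
    finish = Threads.@spawn parallel_spawn_t(y,x,lo,mid)
    # Reconstructed tail (the original is cut off here): do the upper half, then wait for the task
    parallel_spawn_t(y,x,mid+1,hi)
    wait(finish)
end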
Here’s how the three versions compare on my compute node:
julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)
julia> @btime sequential_t($y,$x)
6.051 ms (0 allocations: 0 bytes)
julia> @btime parallel_t($y,$x)
3.264 ms (58 allocations: 6.00 KiB)
julia> @btime parallel_spawn_t($y,$x)
1.570 ms (256 allocations: 23.69 KiB)
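For anyone who wants to try the recursive splitting in isolation, here is a minimal sketch of the same divide-and-conquer pattern. It is not the code that was benchmarked above: `spawn_blend!` and its `cutoff` keyword are made-up names, and the cutoff replaces the `n/16-1` threshold so that the function does not depend on a global. It needs Julia 1.3 or newer for `Threads.@spawn`.

```julia
# Hypothetical variant of parallel_spawn_t with an explicit cutoff
# instead of a threshold derived from a global n.
function spawn_blend!(y, x, lo=1, hi=length(x)-1; cutoff=1024)
    if hi - lo < cutoff
        # Small range: plain sequential loop
        a = 1.2f0
        b = 1 - a
        for i in lo:hi
            @inbounds y[i] = x[i]*a + x[i+1]*b
        end
        return nothing
    end
    mid = (lo + hi) >>> 1
    # Lower half goes to a new task, upper half runs on the current one
    left = Threads.@spawn spawn_blend!(y, x, lo, mid; cutoff=cutoff)
    spawn_blend!(y, x, mid + 1, hi; cutoff=cutoff)
    wait(left)
    return nothing
end

x = rand(Float32, 10_001)
y = zeros(Float32, 10_000)
spawn_blend!(y, x)

# Compare against a straightforward sequential evaluation
expected = [x[i]*1.2f0 + x[i+1]*(1 - 1.2f0) for i in 1:10_000]
println(y ≈ expected)
```

A smaller `cutoff` creates more tasks, which helps load balancing but adds allocations, in line with the allocation counts in the `@btime` output above.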
Reading the comments above about limited memory bandwidth, could you please let me know what memory type and speed you used for the comparison benchmark? I know Ryzen benchmarks run faster with fast RAM. Thanks.
julia> using LoopVectorization
help?> vmapntt!
Couldn't find vmapntt!
Perhaps you meant vmap! or vmap
No documentation found.
Binding vmapntt! does not exist.
I think you should try restarting Julia. I just updated to the latest release for LoopVectorization and the function is exported and available for use.
@wizebt The memory I have in my machine is Corsair Vengeance LPX 32 GB (2 x 16 GB) DDR4-3600
The 3950X also has 64 MB of total L3 cache. Would you mind benchmarking vmapntt!?
It should avoid writing y to your cache. The entirety of x should then fit in L3.
Sure, I copied your code and ran it. Code with output is here:
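using BenchmarkTools
using LoopVectorization

# n (the number of output elements) is assumed to be a global defined elsewhere in the thread
function sequential_t(y,x)
    a = 1.2f0
    b = 1-a
    for i=1:n
        @inbounds y[i] = x[i]*a + x[i+1]*b
    end
    return nothing
end

function parallel_t(y,x)
    a = 1.2f0
    b = 1-a
    Threads.@threads for i=1:n
        @inbounds y[i] = x[i]*a + x[i+1]*b
    end
    return nothing
end

# Same kernel written as a two-argument function
f(x, y) = 1.2f0x - 0.2f0y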
Thanks. I was hoping for a much bigger improvement.
The fastest chips/setups here have been those with 4 or more memory channels. The 3950X, with only 2, is at a disadvantage there, but I was hoping that, by obviating the need to read from RAM at all, it would blow past them.
IIRC, on Zen 2 the L3 is only shared within CCXes (sets of 4 cores), meaning it has 4x 16 MiB rather than one large, monolithic 64 MiB cache.
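To put numbers on that, here is a small back-of-the-envelope sketch. It assumes the 4x 16 MiB layout recalled above is right and that the data is `Float32` (4 bytes per element); `float32_capacity` is a made-up helper name.

```julia
MiB = 2^20

# Number of Float32 elements that fit in a cache of the given size in bytes
float32_capacity(cache_bytes) = cache_bytes ÷ sizeof(Float32)

per_ccx  = float32_capacity(16 * MiB)  # one CCX's L3 slice
combined = float32_capacity(64 * MiB)  # all four slices added together

println("per CCX:  ", per_ccx, " Float32 values")
println("combined: ", combined, " Float32 values")
```

So an `x` with more than about 4.2 million elements cannot sit entirely in the L3 slice of a single CCX, even if it is well under the 16.8 million elements that 64 MiB would hold in total.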
The @threads macro uses the thread id to assign a chunk of the iteration space to a specific thread, so the chunk of x a thread iterates over will be the same across benchmark runs.
From there, it’s the OS’s job to keep that thread pinned to the same physical core.
Do you (or anyone else) have an idea whether this or something else may be the problem?
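The static chunk assignment described above can be checked with a minimal sketch like the one below. It assumes Julia 1.5 or newer, where the `:static` schedule can be requested explicitly (as far as I know, on 1.3 static scheduling was the only behaviour of `@threads`). `chunk_owners` is a made-up helper name.

```julia
using Base.Threads

# Record which thread id handles each iteration under static scheduling.
function chunk_owners(n)
    owners = zeros(Int, n)
    @threads :static for i in 1:n
        owners[i] = threadid()
    end
    return owners
end

first_run  = chunk_owners(16)
second_run = chunk_owners(16)

println(first_run)                 # contiguous blocks, one per thread
println(first_run == second_run)   # same mapping on every call
```

This only shows that the iteration-to-thread mapping is stable. Whether a given thread stays on the same physical core (and therefore near the same L3 slice) is up to the OS scheduler and is not visible from this code.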
No clue. I just built this PC in January and chose the Ryzen 9 3950X because of all the cores/threads. It was much more of a splurge than a need; I really don’t have much code that I’m running in parallel.