V1.3.1: no gain using multithreading

One can get even better scaling using @spawn:

function parallel_spawn_t(y, x, lo=1, hi=length(x)-1)
    n = length(x) - 1
    if hi - lo < n/16 - 1
        # base case: chunk is small enough, run the stencil serially
        a = 1.2f0
        b = 1 - a
        for i = lo:hi
            @inbounds y[i] = x[i]*a + x[i+1]*b
        end
        return nothing
    end
    # otherwise halve the range and recurse, spawning one half as a task
    mid = (lo + hi) >>> 1
    finish = Threads.@spawn parallel_spawn_t(y, x, lo, mid)
    parallel_spawn_t(y, x, mid+1, hi)
    wait(finish)
end
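
For reference, the timings below assume a setup along the lines of the full listing further down the thread (an array of 10^7 Float32s); a minimal sketch:

using BenchmarkTools

N = 10_000_000
x = rand(Float32, N)
y = Array{Float32}(undef, N)

@btime parallel_spawn_t($y, $x)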

Here’s how the three versions compare on my compute node:

julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)
Environment:
  JULIA_NUM_THREADS = 8

julia> @btime sequential_t($y,$x)
  6.051 ms (0 allocations: 0 bytes)

julia> @btime parallel_t($y,$x)
  3.264 ms (58 allocations: 6.00 KiB)

julia> @btime parallel_spawn_t($y,$x)
  1.570 ms (256 allocations: 23.69 KiB)

Reading the comments above about limited memory bandwidth, could you please let me know what memory type/speed you are using to run the comparison benchmarks? I know Ryzen benchmarks run faster with fast RAM. Thanks!

Quite interesting, I will study your snippet carefully.
On mine it is just a little bit faster:

julia> @btime parallel_spawn_t($y,$x)
  5.675 ms (272 allocations: 23.94 KiB)

Conclusion: I’m hitting the memory bandwidth limit of my old laptop :frowning:

I am not seeing an improvement.

julia> @btime parallel_t($y, $x)
  734.471 μs (123 allocations: 17.14 KiB)

julia> @btime parallel_spawn_t($y2, $x)
  890.116 μs (188 allocations: 22.63 KiB)

julia> @btime vmapntt!(f, $yv, $xv1, $xv2);
  468.208 μs (123 allocations: 17.16 KiB)
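
(Here f, yv, xv1, and xv2 are presumably set up as in the listing later in the thread, i.e. something like:)

f(x, y) = 1.2f0x - 0.2f0y
yv  = @view(similar(y)[1:end-1])
xv1 = @view(x[1:end-1])
xv2 = @view(x[2:end])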

julia> versioninfo()
Julia Version 1.5.0-DEV.380
Commit a523fcf (2020-03-01 22:55 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 24

Eager to learn about vmapntt!. I searched for help but couldn’t find it; here is my status:

(v1.3) pkg> up
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `~/.julia/environments/v1.3/Project.toml`
  [31c24e10] ↑ Distributions v0.22.4 ⇒ v0.22.5
  [bdcacae8] ↑ LoopVectorization v0.6.19 ⇒ v0.6.20
  [91a5bcdd] ↑ Plots v0.29.2 ⇒ v0.29.4
  Updating `~/.julia/environments/v1.3/Manifest.toml`
  [31c24e10] ↑ Distributions v0.22.4 ⇒ v0.22.5
  [28b8d3ca] ↑ GR v0.46.0 ⇒ v0.47.0
  [bdcacae8] ↑ LoopVectorization v0.6.19 ⇒ v0.6.20
  [91a5bcdd] ↑ Plots v0.29.2 ⇒ v0.29.4

help?> vmapntt!
search:

Couldn't find vmapntt!
Perhaps you meant vmap! or vmap
  No documentation found.

  Binding vmapntt! does not exist.

Did you do using LoopVectorization before trying that REPL search?

julia> using LoopVectorization

help?> vmapntt!
search:

Couldn't find vmapntt!
Perhaps you meant vmap! or vmap
  No documentation found.

  Binding vmapntt! does not exist.

I think you should try restarting Julia. I just updated to the latest release of LoopVectorization, and the function is exported and available for use.

You win :wink:

manu@m6800:~/julia$ julia 
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.1 (2019-12-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using LoopVectorization
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

help?> vmapntt!
search: vmapntt! vmapntt vmapnt! vmapnt

  No documentation found.

  LoopVectorization.vmapntt! is a Function.

  # 1 method for generic function "vmapntt!":
  [1] vmapntt!(f::F, y::AbstractArray{T,1}, args::Vararg{Any,A}) where {F, T, A} in LoopVectorization at /home/manu/.julia/packages/LoopVectorization/3Q73l/src/map.jl:52

julia> 


@wizebt The memory I have in my machine is Corsair Vengeance LPX 32 GB (2 x 16 GB) DDR4-3600.


The 3950X also has 64 MB of total L3 cache. Would you mind benchmarking vmapntt!?
It uses non-temporal stores, so it should avoid writing y through your cache; the entirety of x should then fit in L3.
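
For scale, assuming the N = 10^7 setup from the listing below, x alone is about 38 MiB:

julia> 4 * 10_000_000 / 2^20   # 10^7 Float32s at 4 bytes each, in MiB
38.14697265625

so x fits in the 64 MB of L3, but x plus a conventionally written y (another ~38 MiB) would not.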

Sure, I copied your code and ran it. Code with output is here:

Code:

using BenchmarkTools
using LoopVectorization

N=10000000
x=rand(Float32,N)
y=Array{Float32}(undef,N)

function sequential_t(y, x)
    n = length(x) - 1
    a = 1.2f0
    b = 1 - a
    for i = 1:n
        @inbounds y[i] = x[i]*a + x[i+1]*b
    end
    return nothing
end

function parallel_t(y, x)
    n = length(x) - 1
    a = 1.2f0
    b = 1 - a
    Threads.@threads for i = 1:n
        @inbounds y[i] = x[i]*a + x[i+1]*b
    end
    return nothing
end

# same stencil as above: a = 1.2f0, b = 1 - a = -0.2f0
f(x, y) = 1.2f0x - 0.2f0y

Output:

julia> @btime sequential_t($y, $x)
  5.014 ms (0 allocations: 0 bytes)

julia> yv = @view(similar(y)[1:end-1]); xv1 = @view(x[1:end-1]);  xv2 = @view(x[2:end])
9999999-element view(::Array{Float32,1}, 2:10000000) with eltype Float32:

julia> @btime vmapnt!(f, $yv, $xv1, $xv2)
  4.834 ms (0 allocations: 0 bytes)

julia> yv ≈ @view(y[1:end-1])
true

julia> fill!(y, NaN); fill!(yv, NaN)
9999999-element view(::Array{Float32,1}, 1:9999999) with eltype Float32:

julia> @btime parallel_t($y, $x)
  2.043 ms (226 allocations: 27.31 KiB)

julia> @btime vmapntt!(f, $yv, $xv1, $xv2)
  1.943 ms (226 allocations: 27.33 KiB)

Thanks. I was hoping for a much bigger improvement.
The fastest chips/setups here have been those with 4 or more memory channels. The 3950X, with only 2, is at a disadvantage there, but I was hoping that by obviating the need to read from RAM at all it would blow past them.
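
For a rough sense of the bandwidth involved, a back-of-the-envelope estimate (assuming one streaming read of x and one write of y per call, and taking the 2.043 ms parallel_t time from above):

N = 10_000_000
bytes_moved = 2 * 4 * N    # read x once + write y once, 4 bytes per Float32
t = 2.043e-3               # parallel_t time from the benchmark above, in seconds
bytes_moved / t / 1e9      # ≈ 39 GB/s effective bandwidth

That is already a sizeable fraction of dual-channel DDR4-3600’s theoretical peak of roughly 57.6 GB/s, consistent with the benchmark being bandwidth-bound.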

IIRC, on Zen2 the L3 is only shared within CCXes (sets of 4 cores), meaning the 3950X has 4x 16 MiB rather than one large, monolithic, 64 MiB cache, so the whole of x cannot sit in any single CCX’s slice.
The @threads macro uses the thread id to assign a chunk of the iteration space to a specific thread, so the chunk of x a thread iterates over will be the same across benchmark runs.
From there, it’s the OS’s job to keep that thread pinned to the same physical core.
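
As a rough illustration of that static split (a hypothetical helper, not the actual Base.Threads code):

# sketch of a static partition like the one @threads performs:
# thread tid gets a fixed, contiguous chunk of 1:n
function chunk_for(tid, n, nt = Threads.nthreads())
    len, rem = divrem(n, nt)
    lo = (tid - 1) * len + min(tid - 1, rem) + 1
    hi = lo + len - 1 + (tid <= rem ? 1 : 0)
    return lo:hi
end

# e.g. chunk_for.(1:4, 10, 4) == [1:3, 4:6, 7:8, 9:10]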

Do you (or anyone else) have an idea whether this, or something else, may be the problem?

No clue. I just built this PC in January and chose the Ryzen 9 3950X because of all the cores/threads. It was much more of a splurge than a need; I really don’t have much code that I run in parallel. :wink: