V1.3.1 no gain using multithread

eaubanel · March 3, 2020, 3:53pm

One can get even better scaling using @spawn:

function parallel_spawn_t(y,x,lo=1,hi=length(x)-1)
	n=length(x)-1
	if hi-lo < n/16-1
		n=length(x)-1
		a = 1.2f0
		b = 1-a
		for i=lo:hi 
                @inbounds y[i] = x[i]a + x[i+1]b 
                end
		return nothing
	end
	mid = (lo+hi)>>>1
	finish = Threads.@spawn parallel_spawn_t(y,x,lo,mid)
	parallel_spawn_t(y,x,mid+1,hi)
	wait(finish)
end

Here’s how the three versions compare on my compute node:

julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)
Environment:
  JULIA_NUM_THREADS = 8

julia> @btime sequential_t($y,$x)
  6.051 ms (0 allocations: 0 bytes)

julia> @btime parallel_t($y,$x)
  3.264 ms (58 allocations: 6.00 KiB)

julia> @btime parallel_spawn_t($y,$x)
  1.570 ms (256 allocations: 23.69 KiB))

wizebt · March 3, 2020, 11:00pm

Reading above comments on limited memory bandwidth, can you please let me know what memory type speed you are using to run the comparison bench. I know Ryzen bench run faster with fast RAM. Thks

wizebt · March 3, 2020, 11:09pm

Quite interesting I will study your snippet carefully.
On mine just a little bit faster

julia> @btime parallel_spawn_t($y,$x)
  5.675 ms (272 allocations: 23.94 KiB)

Conclusion I’m hitting the memory bandwidth of my old laptop

Elrod · March 3, 2020, 11:26pm

I am not seeing an improvement.

julia> @btime parallel_t($y, $x)
734.471 μs (123 allocations: 17.14 KiB)

julia> @btime parallel_spawn_t($y2, $x)
890.116 μs (188 allocations: 22.63 KiB)

julia> @btime vmapntt!(f, $yv, $xv1, $xv2);
468.208 μs (123 allocations: 17.16 KiB)

julia> versioninfo()
Julia Version 1.5.0-DEV.380
Commit a523fcf (2020-03-01 22:55 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)                                                                                                                                                                                                                                                                                                                                                                                                       Environment:
  JULIA_NUM_THREADS = 24

wizebt · March 3, 2020, 11:45pm

Eager to learn on vmapnt! I searched for help but coudn’t find it
here is my status

[(v1.3) pkg> up
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `~/.julia/environments/v1.3/Project.toml`
  [31c24e10] ↑ Distributions v0.22.4 ⇒ v0.22.5
  [bdcacae8] ↑ LoopVectorization v0.6.19 ⇒ v0.6.20
  [91a5bcdd] ↑ Plots v0.29.2 ⇒ v0.29.4
  Updating `~/.julia/environments/v1.3/Manifest.toml`
  [31c24e10] ↑ Distributions v0.22.4 ⇒ v0.22.5
  [28b8d3ca] ↑ GR v0.46.0 ⇒ v0.47.0
  [bdcacae8] ↑ LoopVectorization v0.6.19 ⇒ v0.6.20
  [91a5bcdd] ↑ Plots v0.29.2 ⇒ v0.29.4

help?> vmapntt!
search:

Couldn't find vmapntt!
Perhaps you meant vmap! or vmap
  No documentation found.

  Binding vmapntt! does not exist.

tbeason · March 3, 2020, 11:49pm

Did you do using LoopVectorization before trying that REPL search?

wizebt · March 3, 2020, 11:51pm

julia> using LoopVectorization

help?> vmapntt!
search:

Couldn't find vmapntt!
Perhaps you meant vmap! or vmap
  No documentation found.

  Binding vmapntt! does not exist.

tbeason · March 3, 2020, 11:54pm

I think you should try restarting Julia. I just updated to the latest release for LoopVectorization and the function is exported and available for use.

wizebt · March 3, 2020, 11:55pm

You win

jjulia> 
manu@m6800:~/julia$ julia 
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.1 (2019-12-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using LoopVectorization
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

help?> vmapntt!
search: vmapntt! vmapntt vmapnt! vmapnt

  No documentation found.

  LoopVectorization.vmapntt! is a Function.

  # 1 method for generic function "vmapntt!":
  [1] vmapntt!(f::F, y::AbstractArray{T,1}, args::Vararg{Any,A}) where {F, T, A} in LoopVectorization at /home/manu/.julia/packages/LoopVectorization/3Q73l/src/map.jl:52

julia>

mthelm85 · March 4, 2020, 12:57pm

Reading above comments on limited memory bandwidth, can you please let me know what memory type speed you are using to run the comparison bench. I know Ryzen bench run faster with fast RAM. Thks

@wizebt The memory I have in my machine is Corsair Vengeance LPX 32 GB (2 x 16 GB) DDR4-3600

Elrod · March 4, 2020, 1:51pm

The 3950X also has 64 MB of total L3 cache. Would you mind benchmarking vmapntt!?
It should avoid writing y to your cache. The entirety of x should then fit in L3.

mthelm85 · March 4, 2020, 2:10pm

Sure, I copied your code and ran it. Code with output is here:

Code:

using BenchmarkTools
using LoopVectorization

N=10000000
x=rand(Float32,N)
y=Array{Float32}(undef,N)

function sequential_t(y,x)
    n=length(x)-1
    a = 1.2f0
    b = 1-a
    for i=1:n @inbounds y[i] = x[i]a + x[i+1]b end
    return nothing
end

function parallel_t(y,x)
    n=length(x)-1
    a = 1.2f0
    b = 1-a
    Threads.@threads for i=1:n @inbounds y[i] = x[i]a + x[i+1]b end
    return nothing
end

f(x, y) = 1.2f0x - 0.2f0y

Output:

julia> @btime sequential_t($y, $x)
  5.014 ms (0 allocations: 0 bytes)

julia> yv = @view(similar(y)[1:end-1]); xv1 = @view(x[1:end-1]);  xv2 = @view(x[2:end])
9999999-element view(::Array{Float32,1}, 2:10000000) with eltype Float32:

julia> @btime vmapnt!(f, $yv, $xv1, $xv2)
  4.834 ms (0 allocations: 0 bytes)

julia> yv ≈ @view(y[1:end-1])
true

julia> fill!(y, NaN); fill!(yv, NaN)
9999999-element view(::Array{Float32,1}, 1:9999999) with eltype Float32:

julia> @btime parallel_t($y, $x)
  2.043 ms (226 allocations: 27.31 KiB)

julia> @btime vmapntt!(f, $yv, $xv1, $xv2)
  1.943 ms (226 allocations: 27.33 KiB)

Elrod · March 4, 2020, 2:29pm

Thanks. I was hoping for a much bigger improvement.
The fastest chips/setups here have been those with 4 or more memory channels. The 3950x, with only 2, is at a disadvantage there, but I was hoping that by obviating the need to read from RAM at all it would blow past them.

IIRC, on Zen2 the L3 is only shared within CCXes (sets of 4 cpres), meaning it has 4x 16MiB rather than one large, monolithic, 64 MiB cache.
The @threads macro uses the thread id to assign a chunk of the iteration space to a specific thread, so the chunk of x a thread iterates over will be the same across benchmark runs.
From there, it’s the OS’s job to keep that thread pinned to the same physical core.

You (or anyone else) have an idea if this or something else may be the problem?

mthelm85 · March 4, 2020, 3:04pm

No clue. I just built this PC in January and chose the Ryzen 9 3950X because of all the cores/threads. It was much more of a splurge than it was a need, I really don’t have much code that I’m running in parallel.

Topic		Replies	Views
Multi-threading doesn't seem to scale well on this counting problem Performance multithreading	15	2150	January 14, 2018
Same multi-threaded code, scaling observed only on some machines Performance	2	72	August 14, 2024
Embarrassingly parallel multi-threading doesn't scale Performance multithreading	17	1600	October 16, 2021
Speed up for-loop with multithreading Performance question , multithreading	10	1834	April 22, 2022
Threads maxing out all cores, but no performance increase General Usage performance , threads	16	1826	April 6, 2021

V1.3.1 no gain using multithread

Related topics