V1.3.1: no gain using multithreading

Hi, I’m desperately trying to understand what is blocking scaling when using multithreading in a simple for loop, taken from a real linear interpolation function I’m using. I need multithreading to substantially reduce the time spent processing images.
I can’t understand what I’m doing wrong.
Thank you

julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4940MX CPU @ 3.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4

The code having the issue:

using Base.Threads

N=10000000;
x=rand(Float32,N);
y=Array{Float32}(undef,N);

function sequential_t(y,x)
	n=length(x)-1
	a = 1.2f0
	b = 1-a
	for i=1:n @inbounds y[i] = x[i]*a + x[i+1]*b end
	return nothing
end

function parallel_t(y,x)
	n=length(x)-1
	a = 1.2f0
	b = 1-a
	@threads for i=1:n @inbounds y[i] = x[i]*a + x[i+1]*b end
	return nothing
end

Benchmark results show almost no gain from using 4 threads over one:

julia> @btime sequential_t($y,$x)
  5.779 ms (0 allocations: 0 bytes)

julia> @btime parallel_t($y,$x)
  5.561 ms (30 allocations: 3.03 KiB)

Maybe it’s the same problem as in How to run tasks in parallel? Your calculation is too simple and you have a memory access slowdown?

I often hear that these for loops are memory bound and threading does not speed them up. You can probably speed things up with [LoopVectorization](https://github.com/chriselrod/LoopVectorization.jl) or @simd.


Thanks for the @simd idea; here is the outcome:

function sequential_t(y,x)
	n=length(x)-1
	a = 1.2f0
	b = 1-a
	@inbounds for i=1:n y[i] = x[i]*a + x[i+1]*b end
	return nothing
end

function simd_t(y,x)
	n=length(x)-1
	a = 1.2f0
	b = 1.0f0 - a
	@simd for i=1:n @inbounds y[i] = x[i]*a + x[i+1]*b end
	return nothing
end

No gain :frowning_face:

julia> @btime simd_t($y,$x)
  5.837 ms (0 allocations: 0 bytes)

julia> @btime sequential_t($y,$x)
  5.830 ms (0 allocations: 0 bytes)

As a baseline comparison, the loops are much faster than the broadcast approach (whose slices allocate temporary arrays):

function t(y,x)
	a = 1.2f0
	b = 1-a
	y[1:end-1] .= x[1:end-1] .* a .+ x[2:end] .* b
	return nothing
end

julia> @btime t($y,$x)
  37.733 ms (4 allocations: 76.29 MiB)

@simd doesn’t do anything here because vectorizing this loop does not change the result, so the non-@simd version is already using SIMD. @simd is needed when you have to give the compiler extra liberty to reorder computations (which gives a different answer and is therefore not eligible as a normal optimization).
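
For contrast, a minimal sketch (illustrative only, not from the code above) of a loop where @simd does matter: a sum reduction, where vectorizing requires reassociating the additions and therefore changes rounding, so the compiler needs explicit permission.

function mysum(x)
	s = zero(eltype(x))
	# Without @simd the compiler must keep the exact left-to-right summation
	# order, which blocks vectorization; @simd allows it to reassociate.
	@simd for i in eachindex(x)
		@inbounds s += x[i]
	end
	return s
end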

Thank you for the suggestion on @avx, here is the outcome:

using LoopVectorization

function avx_t(y,x)
	n=length(x)-1
	a = 1.2f0
	b = 1.0f0 - a
	@inbounds @avx for i=1:n y[i] = x[i]*a + x[i+1]*b end
	return nothing
end
julia> @btime avx_t($y,$x)
  5.841 ms (0 allocations: 0 bytes)

@Kristoffer agreed and verified.
Any suggestion to distribute the workload of this simple loop over multiple threads?

So far the best result is using the available NVidia device, even this old one. The CPU → GPU array transfer is the bottleneck.

julia> using CUDAdrv; CUDAdrv.name(CuDevice(0))
"Quadro K4100M"

julia> using CUDAnative, CuArrays

y_d=CuArray(y);
x_d=CuArray(x);

function gpu_t!(y, x)
  index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  stride = blockDim().x * gridDim().x
  d = 1.2f0
  d1 = 1 - d
  for i = index:stride:length(y)-1
      @inbounds y[i] = x[i]*d + x[i+1]*d1
  end
  return
end

function bench_gpu_t!(y, x)
    numblocks = ceil(Int, length(y)/256)
    CuArrays.@sync begin
        @cuda threads=256 blocks=numblocks gpu_t!(y, x)
    end
end

julia> @btime bench_gpu_t!($y_d,$x_d)
  3.426 ms (42 allocations: 1.13 KiB)

You’ll run into the same memory bandwidth limitations with CPU → GPU → CPU transfers. Your arrays are each 4 × 10^7 bytes, which is too large to fit in cache. You can test your computer’s memory bandwidth with this tool; I found that I get about 15 GB/s when reading (and similar results when writing) a 40 MB array, which comes out to ~67 ps/byte. You’re reading and writing a total of 80 MB: 80e6 bytes × 67 ps/byte ≈ 5.3 milliseconds, which is right in line with your benchmarks. It’s not possible to go much faster on a CPU.
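
The same estimate as a quick sanity check in Julia (reusing the numbers above, which are from my machine, so treat them as illustrative):

bytes = 2 * 10_000_000 * sizeof(Float32)   # read x once, write y once: 8.0e7 bytes
bytes / 15e9                               # ≈ 5.3e-3 s at 15 GB/s, in line with the ~5.8 ms benchmarks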


@simd isn’t needed for this loop, because it is already SIMD vectorized:

; julia> @code_llvm debuginfo=:none sequential_t(y, x)

define nonnull %jl_value_t addrspace(10)* @japi1_sequential_t_17699(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %3 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*** %3, align 8
  %4 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %1, i64 1
  %5 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %4, align 8
  %6 = addrspacecast %jl_value_t addrspace(10)* %5 to %jl_value_t addrspace(11)*
  %7 = bitcast %jl_value_t addrspace(11)* %6 to %jl_array_t addrspace(11)*
  %8 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %7, i64 0, i32 1
  %9 = load i64, i64 addrspace(11)* %8, align 8
  %10 = add i64 %9, -1
  %11 = icmp sgt i64 %10, 0
  %12 = select i1 %11, i64 %10, i64 0
  br i1 %11, label %L14.preheader, label %L37

L14.preheader:                                    ; preds = %top
  %13 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %1, align 8
  %14 = bitcast %jl_value_t addrspace(11)* %6 to double addrspace(13)* addrspace(11)*
  %15 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %14, align 8
  %16 = addrspacecast %jl_value_t addrspace(10)* %13 to %jl_value_t addrspace(11)*
  %17 = bitcast %jl_value_t addrspace(11)* %16 to double addrspace(13)* addrspace(11)*
  %18 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %17, align 8
  %min.iters.check = icmp ult i64 %12, 16
  br i1 %min.iters.check, label %scalar.ph, label %vector.memcheck

vector.memcheck:                                  ; preds = %L14.preheader
  %scevgep = getelementptr double, double addrspace(13)* %18, i64 %12
  %19 = add nuw i64 %12, 1
  %scevgep10 = getelementptr double, double addrspace(13)* %15, i64 %19
  %bound0 = icmp ult double addrspace(13)* %18, %scevgep10
  %bound1 = icmp ult double addrspace(13)* %15, %scevgep
  %found.conflict = and i1 %bound0, %bound1
  br i1 %found.conflict, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.memcheck
  %n.vec = and i64 %12, 9223372036854775792
  %ind.end = or i64 %n.vec, 1
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = or i64 %index, 1
  %20 = getelementptr inbounds double, double addrspace(13)* %15, i64 %index
  %21 = bitcast double addrspace(13)* %20 to <4 x double> addrspace(13)*
  %wide.load = load <4 x double>, <4 x double> addrspace(13)* %21, align 8
  %22 = getelementptr inbounds double, double addrspace(13)* %20, i64 4
  %23 = bitcast double addrspace(13)* %22 to <4 x double> addrspace(13)*
  %wide.load15 = load <4 x double>, <4 x double> addrspace(13)* %23, align 8
  %24 = getelementptr inbounds double, double addrspace(13)* %20, i64 8
  %25 = bitcast double addrspace(13)* %24 to <4 x double> addrspace(13)*
  %wide.load16 = load <4 x double>, <4 x double> addrspace(13)* %25, align 8
  %26 = getelementptr inbounds double, double addrspace(13)* %20, i64 12
  %27 = bitcast double addrspace(13)* %26 to <4 x double> addrspace(13)*
  %wide.load17 = load <4 x double>, <4 x double> addrspace(13)* %27, align 8
  %28 = fmul <4 x double> %wide.load, <double 0x3FF3333340000000, double 0x3FF3333340000000, double 0x3FF3333340000000, double 0x3FF3333340000000>
  %29 = fmul <4 x double> %wide.load15, <double 0x3FF3333340000000, double 0x3FF3333340000000, double 0x3FF3333340000000, double 0x3FF3333340000000>
  %30 = fmul <4 x double> %wide.load16, <double 0x3FF3333340000000, double 0x3FF3333340000000, double 0x3FF3333340000000, double 0x3FF3333340000000>
  %31 = fmul <4 x double> %wide.load17, <double 0x3FF3333340000000, double 0x3FF3333340000000, double 0x3FF3333340000000, double 0x3FF3333340000000>
  %32 = getelementptr inbounds double, double addrspace(13)* %15, i64 %offset.idx
  %33 = bitcast double addrspace(13)* %32 to <4 x double> addrspace(13)*
  %wide.load18 = load <4 x double>, <4 x double> addrspace(13)* %33, align 8
  %34 = getelementptr inbounds double, double addrspace(13)* %32, i64 4
  %35 = bitcast double addrspace(13)* %34 to <4 x double> addrspace(13)*
  %wide.load19 = load <4 x double>, <4 x double> addrspace(13)* %35, align 8
  %36 = getelementptr inbounds double, double addrspace(13)* %32, i64 8
  %37 = bitcast double addrspace(13)* %36 to <4 x double> addrspace(13)*
  %wide.load20 = load <4 x double>, <4 x double> addrspace(13)* %37, align 8
  %38 = getelementptr inbounds double, double addrspace(13)* %32, i64 12
  %39 = bitcast double addrspace(13)* %38 to <4 x double> addrspace(13)*
  %wide.load21 = load <4 x double>, <4 x double> addrspace(13)* %39, align 8
  %40 = fmul <4 x double> %wide.load18, <double 0x3FC9999A00000000, double 0x3FC9999A00000000, double 0x3FC9999A00000000, double 0x3FC9999A00000000>
  %41 = fmul <4 x double> %wide.load19, <double 0x3FC9999A00000000, double 0x3FC9999A00000000, double 0x3FC9999A00000000, double 0x3FC9999A00000000>
  %42 = fmul <4 x double> %wide.load20, <double 0x3FC9999A00000000, double 0x3FC9999A00000000, double 0x3FC9999A00000000, double 0x3FC9999A00000000>
  %43 = fmul <4 x double> %wide.load21, <double 0x3FC9999A00000000, double 0x3FC9999A00000000, double 0x3FC9999A00000000, double 0x3FC9999A00000000>
  %44 = fsub <4 x double> %28, %40
  %45 = fsub <4 x double> %29, %41
  %46 = fsub <4 x double> %30, %42
  %47 = fsub <4 x double> %31, %43
  %48 = getelementptr inbounds double, double addrspace(13)* %18, i64 %index
  %49 = bitcast double addrspace(13)* %48 to <4 x double> addrspace(13)*
  store <4 x double> %44, <4 x double> addrspace(13)* %49, align 8
  %50 = getelementptr inbounds double, double addrspace(13)* %48, i64 4
  %51 = bitcast double addrspace(13)* %50 to <4 x double> addrspace(13)*
  store <4 x double> %45, <4 x double> addrspace(13)* %51, align 8
  %52 = getelementptr inbounds double, double addrspace(13)* %48, i64 8
  %53 = bitcast double addrspace(13)* %52 to <4 x double> addrspace(13)*
  store <4 x double> %46, <4 x double> addrspace(13)* %53, align 8
  %54 = getelementptr inbounds double, double addrspace(13)* %48, i64 12
  %55 = bitcast double addrspace(13)* %54 to <4 x double> addrspace(13)*
  store <4 x double> %47, <4 x double> addrspace(13)* %55, align 8
  %index.next = add i64 %index, 16
  %56 = icmp eq i64 %index.next, %n.vec
  br i1 %56, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %cmp.n = icmp eq i64 %12, %n.vec
  br i1 %cmp.n, label %L37, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %vector.memcheck, %L14.preheader
  %bc.resume.val = phi i64 [ %ind.end, %middle.block ], [ 1, %L14.preheader ], [ 1, %vector.memcheck ]
  br label %L14

L14:                                              ; preds = %scalar.ph, %L14
  %value_phi3 = phi i64 [ %67, %L14 ], [ %bc.resume.val, %scalar.ph ]
  %57 = add nsw i64 %value_phi3, -1
  %58 = getelementptr inbounds double, double addrspace(13)* %15, i64 %57
  %59 = load double, double addrspace(13)* %58, align 8
  %60 = fmul double %59, 0x3FF3333340000000
  %61 = getelementptr inbounds double, double addrspace(13)* %15, i64 %value_phi3
  %62 = load double, double addrspace(13)* %61, align 8
  %63 = fmul double %62, 0x3FC9999A00000000
  %64 = fsub double %60, %63
  %65 = getelementptr inbounds double, double addrspace(13)* %18, i64 %57
  store double %64, double addrspace(13)* %65, align 8
  %66 = icmp eq i64 %value_phi3, %12
  %67 = add nuw i64 %value_phi3, 1
  br i1 %66, label %L37, label %L14

L37:                                              ; preds = %L14, %middle.block, %top
  ret %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140616413300176 to %jl_value_t*) to %jl_value_t addrspace(10)*)
}

Because vectorizing this loop does not require rearranging operations in a way that changes rounding, LLVM doesn’t need @simd or @fastmath to vectorize it.

I do see a performance improvement from threading:

julia> @btime parallel_t($y, $x)
  842.107 μs (124 allocations: 17.16 KiB)

julia> @btime sequential_t($y, $x)
  7.035 ms (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.0-DEV.380
Commit a523fcf (2020-03-01 22:55 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 24

Increasing the number of cores should generally help memory-bandwidth constrained problems.

In terms of memory requirements

julia> (sizeof(y) + sizeof(x)) / 2^20
76.2939453125

It requires about 76 MiB. This CPU has 12 cores, about 2 MiB of L2 cache per core, and 20 MiB of L3 cache shared between the cores.
Is there any good resource to learn more about how all of this works? I would guess that the different L2 caches can stream from the L3 in parallel, but I’m not really sure.


I’m seeing a significant speedup on my machine:

julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, znver1)
Environment:
  JULIA_PKG_DEVDIR = C:/Users/mthel/Julia/MyPackages
  JULIA_EDITOR = "C:\Users\mthel\AppData\Local\Programs\Microsoft VS Code\Code.exe"
  JULIA_NUM_THREADS = 32

And the results:

julia> @btime sequential_t($y,$x)
  4.672 ms (0 allocations: 0 bytes)

julia> @btime parallel_t($y,$x)
  1.710 ms (226 allocations: 27.31 KiB)

Cache size shouldn’t matter much for a once-through computation that exceeds the size of L3. There’s going to be a lot of machine-to-machine variability here, depending on the number of cores, memory channels, DIMM slot configuration, and RAM performance. Your Xeon chip has more cores (eight) & memory channels (four) than the OP’s Haswell (four & two, respectively), but neither has access to AVX512.

Results from a server with 2x Xeon E5-2695 v4 (36 cores):

julia> @btime sequential_t($y, $x)
  7.249 ms (0 allocations: 0 bytes)
julia> @btime parallel_t($y, $x)
  267.669 μs (237 allocations: 24.31 KiB)

FWIW, one can also run the simple STREAM benchmark in Julia to measure/estimate the maximal memory bandwidth: example notebook
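
For reference, a minimal STREAM-triad-style sketch (my own, not the linked notebook; the array length and the GB/s accounting are illustrative choices):

using BenchmarkTools

function triad!(a, b, c, s)
	@inbounds @simd for i in eachindex(a, b, c)
		a[i] = b[i] + s * c[i]
	end
	return nothing
end

n = 10^7                              # 80 MB per Float64 array, well beyond L3
a, b, c = zeros(n), rand(n), rand(n)
t = @belapsed triad!($a, $b, $c, 2.0)
3 * n * sizeof(Float64) / t / 1e9     # GB/s, counting two reads + one write as STREAM does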


Are you asking about general resources on things like memory bandwidth limited computations and cache hierarchy effects etc.? (I’m slightly confused given that you’re the LoopVectorization.jl hero :D)

Although this might be a misunderstanding, let me point you to the Roofline model (julia notebook example) and “Introduction to High Performance Computing for Scientists and Engineers” (in particular chapter 3.1).
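
For the bandwidth-bound loop in this thread, the roofline boils down to one line; the numbers below are illustrative placeholders, not measurements:

peak_gflops = 200.0      # assumed compute peak, just a placeholder
bw_gbs      = 15.0       # the memory bandwidth figure quoted earlier in the thread
intensity   = 3 / 12     # this loop: ~3 flops per ~12 bytes of traffic per element
min(peak_gflops, bw_gbs * intensity)   # ≈ 3.75 GFLOP/s: firmly bandwidth bound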


Not that it matters too much, but isn’t there actually 1 READ + 1 WRITE ALLOCATE “=” 2 READ + 1 WRITE, i.e. a memory transfer of 3*10_000_000*sizeof(Float32) ≈ 114 MiB?
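
For reference, the same arithmetic in the REPL (in MiB, to match the sizeof calculation earlier in the thread):

julia> 3 * 10_000_000 * sizeof(Float32) / 2^20
114.44091796875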

I’m assuming that there’s virtually no additional cost imposed by the second adjacent ‘read’, since it’ll almost certainly be loaded in the same cache line as the first.

Perhaps I misunderstand, but I meant to say that one READ is for x and another READ + WRITE is necessary for y (I’m not talking about x[i] and x[i+1]). AFAIU, the additional READ for y is necessary because y never appears on the r.h.s. of the assignment, yet with a write-allocate policy each of its cache lines still has to be fetched before it can be partially written.

I’m no expert, but x86-64 uses register-memory architecture, not RISC-style load/store, so the vmovups instruction should write register values directly to memory without needing to read anything.

Ah, yeah, the entire cache line would just be evicted rather than reused.

Which won’t help when memory bandwidth is the constraint anyway. FWIW, I also have:

julia> versioninfo()
Julia Version 1.5.0-DEV.362
Commit 6dd9f418ef (2020-02-28 22:34 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 36

julia> @btime sequential_t($y, $x)
  4.318 ms (0 allocations: 0 bytes)

julia> @btime parallel_t($y, $x)
  1.187 ms (183 allocations: 25.67 KiB)

This CPU does substantially better with more compute-heavy tasks like BLAS, but it only has 4 memory channels. However, Intel Ark claims that Xeon also only has 4 channels, not 6.
Furthermore, the Cascadelake-X chip’s memory should be clocked much higher. I’ll boot into the BIOS later to make sure the XMP settings are still enabled. It should be at 3200 MHz (and rather low latency, but I don’t recall the timings at the moment; 14 CAS?).
Yet the Xeon did much better.

Your dual socket wins by a landslide. That’s an easy way to double your memory channels.

The applied projects I’ve worked on have all generally been of the form of hundreds or thousands of MCMC chains. Each runs for a relatively long time on the same, not especially large, amount of data. That’s a fairly ideal situation in terms of memory bandwidth and how to parallelize.

All I’ve been thinking about has been optimizations at the register and instruction level, so that’s the only thing LoopVectorization models at the moment. It doesn’t do any memory-based optimizations at all yet. All my benchmarks were over the size range of 2,…,256, and many of them peaked (in terms of GFLOPS) at less than 100.
With 1000x1000 matrices, single-threaded Gaius was about 3x faster than LoopVectorization by simply performing recursive calls, recursive calls that terminated in LoopVectorization kernels.

I plan on modeling memory eventually, once I can sideline a few other projects and obligations. When I do, I plan on modeling the 3 cache layers, memory bandwidth, the TLB(s), the fact that certain cache lines compete, etc. These are things I haven’t paid as much attention to; they seem more opaque and more variable. The more resources the merrier.
But regardless, I love to learn, so more information is always welcome. :slight_smile:

That said, I’ve added vmapnt! and vmapntt! to the master branch of LoopVectorization:

julia> using BenchmarkTools, LoopVectorization, Base.Threads

julia> N=10000000;

julia> x=rand(Float32,N);

julia> y=Array{Float32}(undef,N);

julia> function sequential_t(y,x)
               n=length(x)-1
               a = 1.2f0
               b = 1-a
               for i=1:n @inbounds y[i] = x[i]*a + x[i+1]*b end
               return nothing
       end
sequential_t (generic function with 1 method)

julia> function parallel_t(y,x)
               n=length(x)-1
               a = 1.2f0
               b = 1-a
               @threads for i=1:n @inbounds y[i] = x[i]*a + x[i+1]*b end
               return nothing
       end
parallel_t (generic function with 1 method)

julia> f(x, y) = 1.2f0x - 0.2f0y
f (generic function with 1 method)

julia> @btime sequential_t($y, $x)
  4.320 ms (0 allocations: 0 bytes)

julia> yv = @view(similar(y)[1:end-1]); xv1 = @view(x[1:end-1]);  xv2 = @view(x[2:end]);

julia> @btime vmapnt!(f, $yv, $xv1, $xv2);
  3.657 ms (0 allocations: 0 bytes)

julia> yv ≈ @view(y[1:end-1])
true

julia> fill!(y, NaN); fill!(yv, NaN);

julia> @btime parallel_t($y, $x)
  1.188 ms (183 allocations: 25.67 KiB)

julia> @btime vmapntt!(f, $yv, $xv1, $xv2);
  679.659 μs (183 allocations: 25.69 KiB)

julia> yv ≈ @view(y[1:end-1])
true

The vmapntt! function is threaded. Both use nontemporal stores, which in this case emit the vmovntps instruction:

L288:
        vmulps  -192(%rdx,%rsi,4), %zmm1, %zmm2
        vfmsub231ps     -192(%rbx,%rsi,4), %zmm0, %zmm2 # zmm2 = (zmm0 * mem) - zmm2
        vmovntps        %zmm2, -192(%rcx,%rsi,4)
        vmulps  -128(%rdx,%rsi,4), %zmm1, %zmm2
        vfmsub231ps     -128(%rbx,%rsi,4), %zmm0, %zmm2 # zmm2 = (zmm0 * mem) - zmm2
        vmovntps        %zmm2, -128(%rcx,%rsi,4)
        vmulps  -64(%rdx,%rsi,4), %zmm1, %zmm2
        vfmsub231ps     -64(%rbx,%rsi,4), %zmm0, %zmm2 # zmm2 = (zmm0 * mem) - zmm2
        vmovntps        %zmm2, -64(%rcx,%rsi,4)
        vmulps  (%rdx,%rsi,4), %zmm1, %zmm2
        vfmsub231ps     (%rbx,%rsi,4), %zmm0, %zmm2 # zmm2 = (zmm0 * mem) - zmm2
        vmovntps        %zmm2, (%rcx,%rsi,4)
        addq    $64, %rsi
        cmpq    %r8, %rsi
        jl      L288
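
To tie this back to the write-allocate discussion above, a back-of-the-envelope traffic estimate (my own numbers, reusing the ~15 GB/s bandwidth figure quoted earlier for the OP’s machine):

# Normal stores: read x, write-allocate read of y, then write y ≈ 3 × 40 MB of traffic.
3 * 4e7 / 15e9   # ≈ 8.0e-3 s at 15 GB/s
# Nontemporal stores skip the read of y: 2 × 40 MB of traffic.
2 * 4e7 / 15e9   # ≈ 5.3e-3 s at 15 GB/s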

EDIT: I decided to go ahead and tag it so that folks can try it out without checking out the master branch. v0.6.20 should be released shortly.
