How to perform parallel vector addition?

Hi guys~

I’m completely new to Julia (I just installed it this morning), so this question might seem a bit stupid.

I’ve learned that Julia has a JIT and well-supported multithreading, so I want to try it out. My first experiment is to add two vectors in parallel, like this:

a = rand(10000000)
b = rand(10000000)
c = a + b

We know that NumPy has SIMD support, and if you use the Vector Math Library (VML) in MKL, you can even get multithreaded vector addition.

When I try this in Julia, I’m not sure whether it uses SIMD, and I also don’t know how to make it multithreaded. @threads seems to work only on for loops, and most of what I can find online is about parallel reduce functions.

Did I miss something? I believe there must be a good way to do this simple piece of vector arithmetic.

In general, something as simple as adding two vectors won’t benefit from multithreading (you’ll be bottlenecked by memory bandwidth). Also, you’ll get roughly 3x better performance from a .+= b, which updates a in place with b added to it, rather than allocating a new array and filling it with the sum.
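
For reference, here is a rough sketch of what the two versions look like (the threaded_add! name is just for illustration; as said, a plain addition is memory-bound, so don’t expect much from the threaded loop):

using Base.Threads  # provides @threads

a = rand(10_000_000)
b = rand(10_000_000)

# In-place broadcast: overwrites a with a + b, no new array allocated
a .+= b

# Hand-written threaded loop doing the same update
function threaded_add!(x, y)
    @threads for i in eachindex(x, y)
        @inbounds x[i] += y[i]
    end
    return x
end

threaded_add!(a, b)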

If I understand correctly, I would get 2x speed from parallelism if I have dual-channel memory, right?

It’s not that simple. For example, dual-channel DDR4-3200 has roughly 50 GB/s of bandwidth. A single Zen 2 core can do two 256-bit adds per cycle, which at a 4 GHz clock speed is 256 GB/s, so a single core could roughly keep up with ten channels of DDR4 memory. For smaller vectors the speed will depend on caches, though.
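
(Spelled out: 2 adds per cycle × 32 bytes per 256-bit add × 4 × 10^9 cycles per second ≈ 256 GB/s of add throughput, versus about 25 GB/s per DDR4-3200 channel.)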

That said, if you’re coming from NumPy, there are likely a lot of 10x speed improvements you can get even without looking into multithreading. You should only multithread once your program is otherwise fully optimized. For example, rather than doing individual vector additions, doing more work per pass through memory (and allocating less) is the first step to improving single-core performance. Only once you’ve done that and reduced memory pressure does multithreading start to be worth it.
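
As a tiny illustration of “more work per pass” (a sketch; the names are arbitrary):

a = rand(10^7); b = rand(10^7)

# Two passes over memory, plus a temporary array:
tmp = a .+ b
result = tmp .* 2.0

# One fused pass (a single loop over memory), still allocating the output:
result = @. (a + b) * 2.0

# One fused pass and no allocation, writing into the existing result:
@. result = (a + b) * 2.0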

7 Likes

It’s true that we won’t get too much benefit, but on my machine vector addition in C++ is about 2x faster with multithreading, and multithreaded numexpr is about 4x faster than single-threaded NumPy (with float arrays of size 1e8, on an AMD EPYC 7763 64-core processor).

Just for a + b, where a and b are float arrays of size 1e8:

NumPy takes 100 ms
numexpr takes 25 ms
Julia a .+= b takes 120 ms

I must be missing something, because I’m not seeing the performance I expected.

I find this interesting, as my impression was that since NumPy calls optimised C functions, it’s unlikely that Julia will perform significantly better for simple vector or BLAS operations. Where can one find a 10x improvement? Is it through loop fusion instead of re-allocating, which is presumably what NumPy does?

Yeah. Numpy has a roughly 1 microsecond overhead per operation and encourages writing code in styles that lead to multiple passes over memory (and allocating).

I’m interested in this too. I’m curious whether Julia can fuse the loops instead of creating a lot of temporary arrays.

I’ve tested this expression: (a - b) / (a + b) + (a + b)

NumPy takes 350 ms,
Julia (a .- b) ./ (a .+ b) .+ (a .+ b) takes 450 ms, for float arrays of size 1e8.

The advantage of using dotted operations is that they fuse. For code like this, you might want to look at LoopVectorization.jl.

f(a, b) = @. (a - b) / (a + b) + (a + b)

using LoopVectorization, BenchmarkTools
g(a, b) = @turbo @. (a - b) / (a + b) + (a + b)
# threaded LoopVectorization
h(a, b) = @tturbo @. (a - b) / (a + b) + (a + b)

julia> @btime f(a,b);
  297.224 ms (2 allocations: 762.94 MiB)

julia> @btime g(a,b);
  198.549 ms (2 allocations: 762.94 MiB)

julia> @btime h(a,b);
  96.334 ms (2 allocations: 762.94 MiB)

As you can see, now that you are doing something more complicated, multithreading actually does pay off. I’m getting a 3x boost from my 6-core CPU.

4 Likes

WOW, thanks a lot! I will play with this example.

Also note that your big AMD CPU likely has multiple memory domains, which might lead to NUMA effects in the multithreaded case. It can matter a lot (easily a factor of 2-4x) how and where you allocate the vectors for your computation when trying to utilize the entire CPU.

3 Likes

Oh, thanks for the reminder. How can I determine or change where the vectors are allocated so that I can fully utilize the CPU? For example, how do I make sure the data used by a given thread is allocated in memory local to its NUMA node?

1 Like

Some benchmarking on my laptop:

julia> using LoopVectorization, BenchmarkTools

julia> a = rand(10^8); b = rand(10^8);

julia> @btime @. a+b;
  167.266 ms (4 allocations: 762.94 MiB)

julia> @btime @turbo @. a+b;
  167.140 ms (4 allocations: 762.94 MiB)

julia> @btime @tturbo @. a+b;
  50.413 ms (4 allocations: 762.94 MiB)

More complex expression:

julia> f(a, b) = @. (a - b) / (a + b) + (a + b)
f (generic function with 1 method)

julia> g(a, b) = @turbo @. (a - b) / (a + b) + (a + b)
g (generic function with 1 method)

julia> h(a, b) = @tturbo @. (a - b) / (a + b) + (a + b)
h (generic function with 1 method)

julia> @btime f(a,b);
  472.276 ms (2 allocations: 762.94 MiB)

julia> @btime g(a,b);
  186.344 ms (2 allocations: 762.94 MiB)

julia> @btime h(a,b);
  51.559 ms (2 allocations: 762.94 MiB)

Version info:

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d739 (2022-11-15 14:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 20 on 20 virtual cores
3 Likes

Thanks for your benchmark. May I ask what @. is for? Is it something related to the loop fusion macro?

I know that when you want to fuse the loops, you can write something like this:

X .= (a .- b) ./ (a .+ b) .+ (a .+ b)

Does it have anything to do with @.?

It’s just a way to turn every function call into a broadcasting “dot” call. :wink:

To get help on a function or macro, you can type ? followed by its name. For example:

help?> @.
  @. expr

  Convert every function call or operator in expr into a "dot call" (e.g. convert f(x) to f.(x)), and convert every
  assignment in expr to a "dot assignment" (e.g. convert += to .+=).

  If you want to avoid adding dots for selected function calls in expr, splice those function calls in with $. For
  example, @. sqrt(abs($sort(x))) is equivalent to sqrt.(abs.(sort(x))) (no dot for sort).

  (@. is equivalent to a call to @__dot__.)

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  julia> x = 1.0:3.0; y = similar(x);

  julia> @. y = x + 3 * sin(x)
  3-element Vector{Float64}:
   3.5244129544236893
   4.727892280477045
   3.4233600241796016

Using .= just performs an element-wise assignment: it takes the collection on the left and overwrites its entries with the values of the collection on the right. By contrast, using = rebinds the identifier, so it discards the old collection and requires allocating a whole new one.

For example:

julia> @time c = a .+ b;
  0.359202 seconds (4 allocations: 762.940 MiB, 8.45% gc time)

julia> @time c .= a .+ b;
  0.120703 seconds (2 allocations: 64 bytes)

Of course, using .= requires that a collection of the correct dimensions already exists; if c has not been defined yet, it doesn’t work.
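
For example, one way to set that up (a small sketch) is to preallocate the output first:

julia> c = similar(a);   # uninitialized array with the same element type and size as a

julia> c .= a .+ b;      # fused, non-allocating element-wise assignment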

2 Likes

AHH I’ve learned a lot from all the replies here. So much thanks!

It’s difficult to search the Internet for “@.”, maybe because it’s two special characters. help?> is extremely useful.

1 Like

A simpler search term might be “Julia dot macro”, which seems to lead me to the right pages.

1 Like

Relying on the first-touch policy, a generic strategy is to allocate and initialize arrays in the same multithreaded manner in which they will later be used by the computational kernel. E.g., instead of using the serial rand or zeros functions, initialize the arrays using, say, @threads :static. Example: the stream benchmark.
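
A minimal sketch of the first-touch idea (assuming Julia was started with several threads, e.g. julia -t 8, and that the kernel later uses the same :static chunking over the indices):

using Base.Threads

n = 10^8
a = Vector{Float64}(undef, n)   # allocates, but the pages are not touched yet
b = Vector{Float64}(undef, n)

# First touch happens here: each thread writes its own chunk, so those pages
# end up in the memory domain local to the core running that thread.
@threads :static for i in eachindex(a, b)
    a[i] = rand()
    b[i] = rand()
end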

If you don’t want to rely on first-touch, you can use tools like numactl or ArrayAllocators.jl.

Update: @27rabbitlt Of course, in any case, you should also pin your Julia threads; see here.

2 Likes

The first version requires allocating c, which is, say, one extra allocation. Even adding the ~10% GC time, the difference is still huge, given that the same amount of memory needs to be read and written either way. What explains it? I think better measurement is needed.

Benchmarks can be a bit finicky, and you don’t include your benchmarking code. Did you use BenchmarkTools.jl, and follow its manual? And what about the numpy benchmark?

It may not make a difference in this case, but I am curious whether this discrepancy is real or just an artefact of improper benchmarking.
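
For what it’s worth, something along these lines (a sketch using BenchmarkTools, with the global arrays interpolated via $ so they are not benchmarked as untyped globals) would make the comparison cleaner:

using BenchmarkTools

a = rand(10^8); b = rand(10^8);
c = similar(a);

@btime $c .= $a .+ $b;   # in-place: output reused, no large allocation
@btime $a .+ $b;         # out-of-place: allocates a fresh ~763 MiB array each sample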