How to perform parallel vector addition?

I can replicate the measurements (although the length chosen seems 10x that in the OP, where N=10_000_000 leads to an array size of 76.29MiB).

julia> N = 10_000_000
10000000

julia> a = rand(N);

julia> b = rand(N);

julia> c = zero(a);

julia> @btime $a + $b;
  31.829 ms (2 allocations: 76.29 MiB)

julia> @btime $c .= $a .+ $b;
  14.550 ms (0 allocations: 0 bytes)

Oddly, I find

julia> @btime copyto!(similar($a), $a);
  36.947 ms (2 allocations: 76.29 MiB)

julia> @btime similar($a);
  1.926 μs (2 allocations: 76.29 MiB)

julia> @btime copyto!($(similar(a)), $a);
  5.902 ms (0 allocations: 0 bytes)

I’m unsure why the first variant is much slower, given that allocating the destination is cheap.
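One possible explanation, offered here as an assumption and not something established in this thread: similar only reserves memory, and the operating system maps the physical pages lazily on first write, so the first copyto! into a fresh buffer also pays for page faults. The sketch below separates the two cases using only Base. The helper name time_copy_fresh_vs_warm is hypothetical, and the timings will depend on the machine and allocator.

```julia
# Compare copying into a freshly allocated buffer with copying into a
# buffer whose memory has already been written once.
function time_copy_fresh_vs_warm(a::Vector{Float64})
    fresh = similar(a)                   # allocated, never written
    t_fresh = @elapsed copyto!(fresh, a)

    warm = similar(a)
    fill!(warm, 0.0)                     # write to the whole buffer before timing
    t_warm = @elapsed copyto!(warm, a)

    return (fresh = t_fresh, warm = t_warm, ok = fresh == a && warm == a)
end

a = rand(10_000_000)
time_copy_fresh_vs_warm(a)               # warm-up call so compilation is not timed
result = time_copy_fresh_vs_warm(a)
println("fresh: ", result.fresh, " s, warm: ", result.warm, " s")
```

If the fresh copy is consistently slower than the warm one, that would be consistent with the first-touch hypothesis, though it does not prove it.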

Sorry for the late reply. Here is the environment I ran this in:

Julia Version 1.8.3 (2022-11-14)
Python Version 3.6.5 :: Anaconda, Inc.
Numpy Version 1.19.5
IPython Version 6.4.0

Full NumPy code, run in IPython:

import numpy as np
a = np.random.rand(int(1e8))
b = np.random.rand(int(1e8))
%timeit a + b

99.1 ms ± 173 µs per loop

Full Julia code, run in the REPL:

n = 100000000
a = rand(n)
b = rand(n)
@time a + b
# 0.296798 seconds (262.64 k allocations: 776.461 MiB, 1.29% gc time, 15.76% compilation time)
c = zero(a)
@time c = a + b  # rebinds c to a newly allocated array; the buffer from zero(a) is not reused
# 0.259420 seconds (2 allocations: 762.939 MiB)

Sorry, I couldn’t find a clean way to install packages on this air-gapped PC (it is a compute server at my company with no Internet access), so I didn’t use packages like LoopVectorization or BenchmarkTools. These results might therefore not be very convincing.
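Where BenchmarkTools and LoopVectorization are unavailable, a rough Base-only substitute can be sketched as follows. The helpers best_time, threaded_add! and main are hypothetical names introduced for illustration; best_time keeps the fastest of a few runs, which is much cruder than what @btime does. Vector addition is typically limited by memory bandwidth, so the threaded version may show far less than linear speedup.

```julia
# Hypothetical helper: run `f` several times and keep the fastest run.
function best_time(f; samples::Int = 5)
    f()                                  # warm-up run so compilation is not timed
    best = Inf
    for _ in 1:samples
        t = @elapsed f()
        best = min(best, t)
    end
    return best
end

# Hypothetical helper: in-place addition split across threads.
# Start Julia with `julia -t N` (or set JULIA_NUM_THREADS) to use N threads.
function threaded_add!(c, a, b)
    # eachindex throws a DimensionMismatch if the arrays differ in size,
    # which is what makes the @inbounds below safe.
    Threads.@threads for i in eachindex(a, b, c)
        @inbounds c[i] = a[i] + b[i]
    end
    return c
end

# Keep everything inside a function so no non-constant globals are timed.
function main(n::Int = 10_000_000)
    a = rand(n)
    b = rand(n)
    c = zero(a)
    println("threads:       ", Threads.nthreads())
    println("a + b:         ", best_time(() -> a + b), " s")
    println("c .= a .+ b:   ", best_time(() -> (c .= a .+ b)), " s")
    println("threaded_add!: ", best_time(() -> threaded_add!(c, a, b)), " s")
    return c == a .+ b                   # sanity check on the last result
end

main()
```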

I’ve also always heard that in Julia you should time code inside a function rather than in global scope, so I then wrote this:

n = 100000000
a = rand(n)
b = rand(n)
function test()
    @time a + b
end
test()
# 0.239909 seconds (2 allocations: 762.939 MiB)

It’s slightly faster.

I will try to install these packages in my WSL environment, which has Internet access, then use PackageCompiler to create a sysimage, and finally have Julia on the server start up with that sysimage. I don’t know whether this will work, but I will update my results as soon as possible.

You should also avoid using global variables, so

function test(a, b)
    @time a + b
end
test(a, b)

is how I’d do it.
However, @time forces top-level compilation (code behind dynamic dispatches still won’t be compiled until after the call starts), so using @time within a function generally isn’t necessary.

Thanks for the advice! In this case it makes basically no difference, but I think I now understand why we should avoid globals when benchmarking. Thanks again.

What you should generally avoid is accessing non-constant global variables. Defining a and b as const should solve that gotcha, but it may not affect timing much in this case, if the cost of the computation is much higher than accessing non-constant global variables.
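The difference can be made visible by asking the compiler what it infers in each case. This is only a sketch: the names x_plain, X_CONST and add_* are hypothetical, and Base.return_types is an internal reflection helper, not a stable public API.

```julia
x_plain = rand(1_000)          # non-constant global: its type may change at any time
const X_CONST = rand(1_000)    # constant global: the compiler can rely on its type

add_plain() = x_plain + x_plain    # return type cannot be inferred
add_const() = X_CONST + X_CONST    # inferred as Vector{Float64}
add_args(x) = x + x                # inferred from the argument type

# Print the return type the compiler infers for each method.
println(Base.return_types(add_plain, ()))
println(Base.return_types(add_const, ()))
println(Base.return_types(add_args, (Vector{Float64},)))
```

When the return type is not inferred, the call to + is dispatched at run time. For arrays this large, that overhead is small next to the addition itself, which matches the observation that the timing barely changed.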
