That is an impressive speedup! I'm not seeing that magnitude of improvement, though, and not until much larger sizes. First, to get it out of the way, here is the NumPy timing on my machine:
>>> timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=1000)/1000
0.000551134126028046
>>> timeit.timeit('a + b', setup='import numpy as np; a=np.random.rand(500_000); b = np.copy(a)', number=1000)/1000
0.000551150968996808
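As a rough sanity check on that NumPy number, here is the implied effective memory bandwidth. This is a back-of-the-envelope estimate that counts two reads and one write of 8-byte doubles and ignores write-allocate traffic:

```python
n = 500_000          # elements per array, as in the timeit setup above
secs = 0.000551      # seconds per `a + b`, from the timeit runs above
traffic = 3 * n * 8  # bytes moved: read a, read b, write the result
print(f"~{traffic / secs / 1e9:.1f} GB/s effective bandwidth")
```

Around 22 GB/s, which is consistent with these loops being bandwidth-bound rather than compute-bound.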
I actually see slightly worse performance at this size for vmapnt!, though performance is pretty similar to Python either way:
julia> using LoopVectorization, BenchmarkTools
julia> a = rand(500_000); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime $a .+ $b;
563.405 μs (2 allocations: 3.81 MiB)
julia> @btime vmapnt!(+, $c1, $a, $b);
522.161 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
511.718 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
But if I make the arrays 10x smaller, I do see an improvement:
julia> a = rand(50_000); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
27.126 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
31.831 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
Of course, the time improves by more than 10x because at 500_000 we're far into the realm of memory bottlenecks. But that also means the nontemporal stores weren't helping for vectors of length 500_000.
On my computer, the L3 cache can fit over 3,200,000 doubles.
julia> L = LoopVectorization.VectorizationBase.CACHE_SIZE[3] ÷ sizeof(Float64)
3244032
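The same capacity arithmetic, sketched in Python (the byte sizes here are inferred from the 3244032-double figure above, not measured independently):

```python
l3_doubles = 3_244_032     # CACHE_SIZE[3] ÷ sizeof(Float64), from above
l3_bytes = l3_doubles * 8  # 25_952_256 bytes, i.e. 24.75 MiB
# three arrays of n Float64s fit in the L3 when 3 * 8 * n <= l3_bytes:
n_max = l3_bytes // (3 * 8)
print(n_max, n_max > 500_000)
```

So a, b, and the output at length 500_000 all fit with room to spare.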
julia> versioninfo()
Julia Version 1.5.0-DEV.696
Commit e09b3d6d1a* (2020-04-26 16:15 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-10.0.0 (ORCJIT, skylake)
So it can comfortably hold all three arrays. Let's see what happens when two of the arrays just fit in the L3 cache:
julia> L = LoopVectorization.VectorizationBase.CACHE_SIZE[3] ÷ (2sizeof(Float64))
1622016
julia> a = rand(L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
1.616 ms (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
1.706 ms (0 allocations: 0 bytes)
julia> c1 == c2
true
Only a small difference. Now make the arrays 50% longer:
julia> a = rand(3L >>> 1); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
2.668 ms (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
3.060 ms (0 allocations: 0 bytes)
julia> c1 == c2
true
Letting one array fill the cache:
julia> a = rand(2L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
3.411 ms (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
4.135 ms (0 allocations: 0 bytes)
julia> c1 == c2
true
Only half an array fits in the L3:
julia> a = rand(4L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
7.380 ms (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
9.140 ms (0 allocations: 0 bytes)
julia> c1 == c2
true
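Collecting the four L3-regime runs above (times in ms, transcribed from the @btime output), the nontemporal advantage grows as the working set outgrows the cache:

```python
# (size, vmapnt! ms, in-place broadcast ms), transcribed from the runs above
l3_runs = [("L", 1.616, 1.706), ("1.5L", 2.668, 3.060),
           ("2L", 3.411, 4.135), ("4L", 7.380, 9.140)]
for label, nt, bc in l3_runs:
    print(f"{label}: broadcast / vmapnt! = {bc / nt:.2f}")
```

About a 6% edge at L, climbing to roughly 24% at 4L.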
Dumps for the L1 and L2 caches
For the L1 cache (with L redefined from CACHE_SIZE[1] ÷ (2sizeof(Float64)), analogous to the computation above):
julia> a = rand(L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
827.247 ns (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
393.866 ns (0 allocations: 0 bytes)
julia> c1 == c2
true
julia> a = rand(2L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
1.864 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
720.099 ns (0 allocations: 0 bytes)
julia> c1 == c2
true
julia> a = rand(4L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
3.767 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
1.391 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
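In the L1 regime the pattern flips: the nontemporal stores are a consistent 2-3x penalty, presumably because they bypass the cache that would otherwise absorb the writes (times in ns, transcribed from the runs above):

```python
# (size, vmapnt! ns, in-place broadcast ns), transcribed from the runs above
l1_runs = [("L", 827.247, 393.866), ("2L", 1864.0, 720.099),
           ("4L", 3767.0, 1391.0)]
for label, nt, bc in l1_runs:
    print(f"{label}: vmapnt! / broadcast = {nt / bc:.1f}")
```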
And for the L2 cache:
julia> L = LoopVectorization.VectorizationBase.CACHE_SIZE[2] ÷ (2sizeof(Float64))
65536
julia> a = rand(L >>> 1); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
15.050 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
6.783 μs (0 allocations: 0 bytes)
julia> a = rand(3L >>> 2); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
27.858 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
25.433 μs (0 allocations: 0 bytes)
julia> a = rand(L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
43.216 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
59.119 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
julia> a = rand(2L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
137.887 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
131.966 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
julia> a = rand(4L); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapnt!(+, $c1, $a, $b);
277.129 μs (0 allocations: 0 bytes)
julia> @btime $c2 .= $a .+ $b;
267.334 μs (0 allocations: 0 bytes)
julia> c1 == c2
true
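To make "just barely fit" concrete, here is the working-set arithmetic for the L2 runs above, assuming the 1 MiB L2 implied by L = 65536:

```python
L = 65_536            # CACHE_SIZE[2] ÷ (2 * sizeof(Float64)), from above
l2_bytes = 2 * 8 * L  # implies a 1 MiB L2
for n in (L // 2, 3 * L // 4, L, 2 * L, 4 * L):
    ws = 2 * 8 * n    # bytes for a and b together (the output not counted)
    print(f"n = {n}: a+b = {ws / l2_bytes:.2f} of L2")
```

The only size in that dump where vmapnt! wins is n = L, exactly where a and b together fill the L2.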
On my computer, when streaming single-threaded additions, the nontemporal stores seem to lead to better performance when either:
- a and b are much larger than the L3 cache, or
- a and b just barely fit in the L2 cache,
and to equal or worse performance otherwise. I never see anything close to your 5x gain.
I need to use multithreading to just barely beat your time:
julia> using LoopVectorization, BenchmarkTools
julia> a = rand(500_000); b = rand(length(a));
julia> c1 = similar(a); c2 = similar(a);
julia> @btime vmapntt!(+, $c1, $a, $b);
64.925 μs (93 allocations: 12.89 KiB)
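For scale, comparing that multithreaded vmapntt! time against the single-threaded NumPy baseline from the top of the post:

```python
numpy_s = 0.000551     # NumPy a + b at length 500_000, from above
vmapntt_s = 64.925e-6  # multithreaded vmapntt! at the same length
print(f"{numpy_s / vmapntt_s:.1f}x")
```

Roughly an 8.5x speedup, but that comes from threading, not from the nontemporal stores themselves.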