Numpy 10x faster than Julia?! What am I doing wrong?! [solved - Julia faster now]

With numpy I get ~100 ns per element.
With Julia, ~1200 ns.
I must be missing something…
How can I reach the same speed with Julia?

Code below:
Python:

import time
import numpy as np
from numpy import sin, cos

def f(p):
    t0,t1 = p
    m0 = np.array([[cos(t0) - 1j*sin(t0), 0], [0, cos(t0) + 1j*sin(t0)]])
    m1 = np.array([[cos(t1) - 1j*sin(t1), 0], [0, cos(t1) + 1j*sin(t1)]])
    r = m1*m0*np.array([1., 0.])
    return np.abs(r[0])**2

n= 10**6
p = 2 * np.pi * np.random.rand(2,n)

t0 = time.time()
x = f(p)
time_taken = time.time()-t0
print("time per unit in ns:",time_taken*10**9/n)

Julia:

function f(p)
    t0,t1 = p
    m0 = [[cos(t0) - 1im*sin(t0)  0]; [0  cos(t0) + 1im*sin(t0)]]
    m1 = [[cos(t1) - 1im*sin(t1)  0]; [0  cos(t1) + 1im*sin(t1)]]
    r = m1*m0*[1. ; 0.]
    return abs(r[1])^2
end

function g(p,n)
    return [f(p[:,i]) for i=1:n]
end

g(rand(2,3),3)  # call to force jit compilation

n = 10^6
p = 2*pi*rand(2,n)

t0 = time_ns()
g(p,n)
time_taken = time_ns() - t0
println("time per unit in ns: ",time_taken/n)

Actually, did you want a matrix multiplication for each inner pair of m0, m1 (i.e., as true 2×2 matmuls per element)?

also notice in python:

>>> t0, t1 = 2 * np.pi * np.random.rand(2,10)
>>> t0
array([5.73602205e+00, 4.01387168e+00, 1.06611118e+00, 2.49225193e+00,
       1.97764814e+00, 3.55630974e-03, 6.14979820e+00, 5.12547165e+00,
       2.09145452e+00, 4.69748937e+00])

but in julia:

julia> t0,t1 = 2*pi*rand(2,10)
2×10 Array{Float64,2}:
 6.11896  1.04278  3.54547  5.70549  1.45539  3.53701  4.84305  2.51523  0.497157  1.66334
 3.27382  4.43759  5.66304  3.79966  3.79987  5.34328  5.43598  5.03836  3.27738   5.1513 

julia> t0
6.118956878728962

Yes, this is matrix multiplication.
Yes, exactly: in Python there is no explicit loop, just the element-wise numpy operations. In Julia the loop is explicit (in the g function).
The two pieces of code do exactly the same thing, though.
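For readers comparing the two snippets, here is a tiny illustration (my own, not tied to the code above) of the operator difference that comes up repeatedly below: numpy’s * is element-wise, while Julia’s * on matrices is a true matrix product; element-wise multiplication in Julia is spelled .*.

A = [1 2; 3 4]
B = [5 6; 7 8]

A * B    # matrix product (Julia's *):        [19 22; 43 50]
A .* B   # element-wise product (numpy's *):  [5 12; 21 32]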

The reason @jling is asking is because there’s quite a difference between the implementations. The numpy version is using vector and matrix multiplication and is done in one fell swoop, whereas the Julia version is computing each element of the result vector sequentially (and allocating a bunch of memory during each run, at that).

For example:

using BenchmarkTools

function f_orig(p)
    t0, t1 = p
    m0 = [[cos(t0) - 1im*sin(t0)  0]; [0  cos(t0) + 1im*sin(t0)]]
    m1 = [[cos(t1) - 1im*sin(t1)  0]; [0  cos(t1) + 1im*sin(t1)]]
    r = m1*m0*[1. ; 0.]
    return abs(r[1])^2
end

function g_orig(p,n)
    return [f_orig(p[:,i]) for i in 1:n]
end

n = 10^6
p = 2 * pi * rand(2,n)

@btime g_orig($p,$n)
# result: 1.123 s (25000004 allocations: 1.63 GiB)

1.63 GiB in allocations! And tiny ones at that, since there are so many. Just rewriting those functions (albeit keeping the slower sequential version) makes it much faster:

using LinearAlgebra

function f!(t0,t1, m0, m1, res)
    # Just fill the slots in the preallocated buffers
    m0[1,1] = cos(t0) - 1im*sin(t0)
    m0[2,2] = cos(t0) + 1im*sin(t0)
    m1[1,1] = cos(t1) - 1im*sin(t1)
    m1[2,2] = cos(t1) + 1im*sin(t1)
    mul!(res, m1, m0) # multiply in-place
    return abs(res[1,1])^2
end

function g(p,n)
    # Since the size of the result is known, just allocate it once
    res = zeros(Float64, n)
    
    # Same goes for the intermediaries - we can reuse their memory in this sequential version
    buf1 = zeros(ComplexF64, 2, 2)
    buf2 = zeros(ComplexF64, 2, 2)
    buf3 = zeros(ComplexF64, 2, 2)
    for i in 1:n
        res[i] = f!(p[1,i], p[2,i], buf1, buf2, buf3)
    end
    return res
end

# check so we haven't done away with anything
all(g(p,n) .≈ g_orig(p,n)) # approx since we're dealing with floating point
# true

@btime g($p,$n);
# result:   102.123 ms (5 allocations: 7.63 MiB)

Much faster, but still slower for me than the numpy version (which claims to run in ~500 ns), since the algorithm is different.

I haven’t toyed around with reworking the functions completely to take even more advantage of the structure of your toy problem, use more matmuls, and avoid the outer g function, but I think we can easily get to the same speeds as the numpy version here. It’s BLAS all the way down anyway, and using those features should get us there. I’ll have to eat breakfast first, though.
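For the curious, here is a rough sketch (my own; the name g_vec is made up) of what a numpy-style, broadcast formulation could look like in Julia. It exploits the fact that both matrices are diagonal, so only the [1,1] entry of the product survives:

function g_vec(p)
    t0 = @view p[1, :]
    t1 = @view p[2, :]
    a = cos.(t0) .- im .* sin.(t0)   # m0[1,1] for every column of p
    b = cos.(t1) .- im .* sin.(t1)   # m1[1,1] for every column of p
    return abs.(a .* b) .^ 2         # |r[1]|^2, since r[1] == m1[1,1] * m0[1,1]
end

# sanity check against the original: all(g_vec(p) .≈ g_orig(p, n))

Whether this beats the buffer-reusing loop would need benchmarking; the point is only that the numpy-style formulation maps directly onto broadcasting.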


Sukera has explained in detail the problem of allocating the comprehension list (in the function g). Notice that this also happens in Python. If I rewrite the code in a way very similar to your Julia code, that is:

import time
import numpy as np
from numpy import sin, cos

def f(p):
    t0,t1 = p
    m0 = np.array([[cos(t0) - 1j*sin(t0), 0], [0, cos(t0) + 1j*sin(t0)]])
    m1 = np.array([[cos(t1) - 1j*sin(t1), 0], [0, cos(t1) + 1j*sin(t1)]])
    r = m1*m0*np.array([1., 0.])
    return np.abs(r[0])**2


def g(p,n):
    return [f(p[:,i]) for i in range(n)]

n= 10**6
p = 2 * np.pi * np.random.rand(2,n)

t0 = time.time()
x = g(p,n)
time_taken = time.time()-t0
print("time per unit in ns:",time_taken*10**9/n)

Then I get

time per unit in ns: 44241.082191467285

The julia code takes

time per unit in ns: 1370.263327

Therefore, Python code similar to the Julia code you provided takes about 33x longer than the Julia code.


Yes; still, the Julia code can very easily be written like the numpy code, taking advantage of the structure and using proper BLAS routines for everything, which should speed it up even further than the ~10x over the original we got from simply reusing buffers.

Oh yes, I am running the Python 3 Anaconda distribution that uses MKL for BLAS; that might explain the difference.
In Julia I also added MKL, but it didn’t make any difference.

We’re still a factor of 1000x away from the numpy version, which can hardly be explained by the difference in BLAS 🙂 Notice that the only BLAS call in my Julia version should be the mul! for the matrix multiply - and since the matrices are tiny, there’s not going to be much of a difference between OpenBLAS and MKL.
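A quick way to convince yourself of that (my own sketch, not one of the measurements above) is to time mul! on a tiny complex matrix versus a large one; the backend only really matters for the latter:

using LinearAlgebra, BenchmarkTools

A2 = rand(ComplexF64, 2, 2); B2 = rand(ComplexF64, 2, 2); C2 = similar(A2)
@btime mul!($C2, $A2, $B2)   # tiny: mostly call overhead, BLAS backend barely matters

A = rand(ComplexF64, 500, 500); B = rand(ComplexF64, 500, 500); C = similar(A)
@btime mul!($C, $A, $B)      # large: this is where OpenBLAS vs MKL differences show up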

I’m a newbie to Julia and wasn’t able to replicate the Python/numpy algorithm in Julia… I’d love to have something faster.
Having it in Julia will also make it easier to do parallelism (and GPU).

Yes, I agree: Python is painfully slow. That’s why I am moving to Julia 🙂

Maybe try something like this. I don’t know what the last r should be; at the moment r is an array of [complex1, complex2]:

function f(p)
    t0,t1 = p[1,:], p[2,:]
    r = Array{Array{Complex}}(undef, length(t0))
    for i=1:length(t0)
        r[i] = [cos(t0[i]) - im*sin(t0[i]) 0; 0 cos(t0[i]) + im*sin(t0[i])] *
        [cos(t1[i]) - im*sin(t1[i]) 0; 0 cos(t1[i]) + im*sin(t1[i])] *
        [1; 0]
    end
    return r  # a vector of 2-element complex vectors; see the next post for extracting the scalar
end

Thanks ! This looks nice !
I’ll give it a try in a moment (I need to take a break first)

Yeah, also, maybe you can compute the abs(...)^2 directly inside the for loop, since we’re already in there… Though I don’t think this will be particularly fast; my fear is that since the matrix computation is kinda complicated, some of the numpy optimization is lost. Anyway, let us know how it turns out.

If I’m understanding correctly, you can do this to get the float out directly (making r = Array{Float64}(undef, length(t0)) instead):

for i = 1:length(t0)
    x = [cos(t0[i]) - im*sin(t0[i]) 0; 0 cos(t0[i]) + im*sin(t0[i])] *
        [cos(t1[i]) - im*sin(t1[i]) 0; 0 cos(t1[i]) + im*sin(t1[i])] *
        [1; 0]
    r[i] = abs(x[1])^2
end

There are usually no better optimizations than not doing something:


@inline function f!(t0,t1, m0, m1, res)
    # Just fill the slots in the preallocated buffers
    cos_t0 = cos(t0)
    sin_t0 = 1im*sin(t0)
    cos_t1 = cos(t1)
    sin_t1 = 1im*sin(t1)
    
    m0[1,1] = cos_t0 - sin_t0
    m0[2,2] = cos_t0 + sin_t0
    m1[1,1] = cos_t1 - sin_t1
    m1[2,2] = cos_t1 + sin_t1
    mul!(res, m1, m0) # multiply in-place
    return abs(res[1,1])^2
end

function g(p,n)
    # Since the size of the result is known, just allocate it once
    res = zeros(Float64, n)
    
    # Same goes for the intermediaries - we can reuse their memory in this sequential version
    buf1 = zeros(ComplexF64, 2, 2)
    buf2 = zeros(ComplexF64, 2, 2)
    buf3 = zeros(ComplexF64, 2, 2)
    @inbounds for i in 1:n
        res[i] = f!(p[1,i], p[2,i], buf1, buf2, buf3)
    end
    return res
end


# check so we haven't done away with anything
#all(g(p,n) .≈ g_orig(p,n)) # approx since we're dealing with floating point
# true

@btime g($p,$n);

Reusing the cos and sin computations, I get 83.915 ms vs. the 126.427 ms it takes if I don’t reuse them.

I think another point here is to write a concise program… this looks more like C++ than what we wish Julia could deliver (the sin/cos reuse is 100% valid).

Actually, I was measuring the time in an incorrect way. There are always details that can be a bit slippery when comparing implementations across programming languages:

The python version

%%timeit
f(p)
103 ms ± 37.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The julia version reusing sin and cos

BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  5
  --------------
  minimum time:     85.433 ms (0.00% GC)
  median time:      88.648 ms (0.00% GC)
  mean time:        89.843 ms (1.60% GC)
  maximum time:     148.585 ms (38.16% GC)
  --------------
  samples:          56
  evals/sample:     1

The Julia version is then at about 89 ms mean time vs. the 103 ms mean time of the Python version.


Perhaps this isn’t the real computation, but it appears to be multiplying two diagonal matrices and keeping only the top-left element: abs(res[1,1])^2 == abs2((cos_t0 - sin_t0) * (cos_t1 - sin_t1)) == abs2(cis(-t0) * cis(-t1)), which in fact simplifies to 1!
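A quick numerical spot-check of that identity (my own snippet, using the fact that cos(x) - im*sin(x) == cis(-x)):

t0, t1 = 2π * rand(), 2π * rand()
lhs = abs((cos(t0) - im*sin(t0)) * (cos(t1) - im*sin(t1)))^2
lhs ≈ abs2(cis(-t0) * cis(-t1))   # true
lhs ≈ 1.0                         # true: |cis(x)| == 1 for any real x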


I got a factor of 16 improvement compared to your original Julia code by going to the following code, which uses StaticArrays to avoid allocating zillions of 2x2 matrices etcetera. A lot easier than writing in-place code on pre-allocated arrays, and probably faster because StaticArray operations are completely unrolled (which speeds up operations on tiny arrays).

using StaticArrays, BenchmarkTools

function f(t0,t1)
    m0 = @SMatrix [ cis(-t0) 0 ; 0 cis(t0)]
    m1 = @SMatrix [ cis(-t1) 0 ; 0 cis(t1)]
    r = m1 * (m0 * @SVector [1. , 0.])
    return abs2(r[1])
end

g(p) = [f(p[1,i],p[2,i]) for i in axes(p,2)]

time_taken = @belapsed g($p)
println("time per unit in ns: ",1e9 * time_taken/n)

Another 30% improvement by factoring out the repeated trig calculations:

function f(t0,t1)
    cis0, cis1 = cis(t0), cis(t1)
    m0 = @SMatrix [ conj(cis0) 0 ; 0 cis0]
    m1 = @SMatrix [ conj(cis1) 0 ; 0 cis1]
    r = m1 * (m0 * @SVector [1. , 0.])
    return abs2(r[1])
end

As others have noted, the actual computation can be simplified even further to g(p) = ones(size(p,2)), but that kind of defeats the point of the benchmark, so I’m not 100% sure what simplifications are “allowed” here. Definitely using StaticArrays for small array literals is a good choice, though.


In fact, the shape of the numpy arrays is (2,2), with the top-right and bottom-left elements being simply 0, and the top-left and bottom-right being the vectors obtained by mapping x -> cos(x) - im*sin(x) and x -> cos(x) + im*sin(x) over t0 (for m0) and t1 (for m1). Doing the “matmul” and then the multiplication with [1.,0.] amounts to only the upper-left corner needing to be calculated.

Exactly, this is not the real computation, just a silly example going through the same steps. And you spotted that it was doing nothing.

I am writing a quantum computer simulator, for something like this: https://qml.entropicalabs.com/ (that one is in JS).