Sum(log.(p * C)) is 2 to 4 times slower than in NumPy

I’m coming from Python and trying to port my code to Julia to gain more performance.

I want to compute a lot of log-likelihoods in a loop, so I need a fast function that computes the log-likelihood. The computation is essentially sum(log.(p * C)), where C is a constant matrix and p is a row vector.
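
Written out (my notation, just to be explicit): with C of size Np × Nx and p a row vector of length Np, the quantity is

$$\ell(p) = \sum_{j=1}^{N_x} \log\left(\sum_{i=1}^{N_p} p_i \, C_{ij}\right).$$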

Here’s my Python code:

import numpy as np

def log_likelihood(p, C):
	return np.log(p @ C).sum()

Nx = 5500
Np = 200

p = np.linspace(0, 1000, Np)
C = np.random.rand(Np, Nx)

print(f"Python with Nx={Nx}, Np={Np}")
%timeit log_likelihood(p, C)

And Julia code:

using BenchmarkTools
using LinearAlgebra

log_likelihood_types(p::Array{Float64, 2}, C::Array{Float64, 2})::Float64 = sum(log.(p * C))
log_likelihood_no_types(p, C) = sum(log.(p * C))

Nx = 5500
Np = 200

p = range(0, 1000, length=Np)' |> collect
C = rand(Np, Nx)

# Compile
log_likelihood_types(p, C)
log_likelihood_no_types(p, C)

println("Julia with Nx=$Nx, Np=$Np; BLAS threads: $(BLAS.get_num_threads())")

@btime log_likelihood_types($p, $C)
@btime log_likelihood_no_types($p, $C)

Test results

forcebru ~/t/julia_numpy_bench> ipython row_mul.ipy
Python with Nx=5500, Np=200
349 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
forcebru ~/t/julia_numpy_bench> julia-1.6 row_mul.jl
Julia with Nx=5500, Np=200; BLAS threads: 4
  1.240 ms (4 allocations: 86.16 KiB)
  1.238 ms (4 allocations: 86.16 KiB)

So that’s 349 µs with NumPy and 1238 µs with Julia: Julia is 3.5 times (!) slower. Also see this benchmark (https://gist.github.com/ForceBru/2da87f80725d62eb8f62454d53e00118), which shows similar results.

For Nx == 500 I’m getting:

  • Julia: 101.575 μs
  • NumPy: 36.9 µs ± 152 ns

Here Julia is 2.7 times slower!

Versions

  • julia version 1.6.0-beta1
  • np.__version__ == '1.19.5'
  • Python 3.7.1

Questions

Am I doing something wrong? Am I maybe measuring something incorrectly? Why the 3x speed difference? Is there a way to “fix” it?

Better performance can be achieved by using

p = collect(range(0, 1000, length=Np))'
1 Like

What makes a big difference is to keep your row vector a row vector type, not a Matrix:

julia> @btime sum(log.($p * $C));
  717.158 μs (4 allocations: 86.16 KiB)

julia> pa = collect(range(0, 1000, length=Np))' # type Adjoint, not 1xN Matrix
1×200 adjoint(::Vector{Float64}) with eltype Float64:
 0.0  5.02513  10.0503  15.0754  20.1005  25.1256  …  979.899  984.925  989.95  994.975  1000.0

julia> @btime sum(log.($pa * $C));
  239.876 μs (4 allocations: 86.16 KiB)

julia> @btime sum(log, $pa * $C);
  234.405 μs (2 allocations: 43.08 KiB)

julia> BLAS.vendor()
:openblas64

Same computer:

Python with Nx=5500, Np=200
236 µs ± 8.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Using sum(log, x) saves allocating another array for the result, but it’s a small effect here. Using MKL.jl instead of the default OpenBLAS library might also help.
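
(For reference, here’s a minimal sketch of swapping in MKL. This assumes Julia 1.7+ and a recent MKL.jl, which replace the BLAS backend via libblastrampoline; on 1.6 the setup is different.)

using MKL            # load first so it can replace the default OpenBLAS backend (Julia 1.7+)
using LinearAlgebra

BLAS.get_config()    # should now list MKL among the active BLAS libraries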

Further, transposing the data in C (which, if I’m thinking correctly, ought to bring us closer to what NumPy is actually doing) gives:

julia> Ct = transpose(permutedims(C)); # same values, row major

julia> Ct == C
true

julia> @btime sum(log, $pa * $Ct);
  210.394 μs (2 allocations: 43.08 KiB)
7 Likes

Julia uses column-major order. Hence you should consider a column vector p and a transposed C, and compute C * p in your log-likelihood function:

using BenchmarkTools
using LinearAlgebra

log_likelihood_no_types(p, C) = sum(log.(p * C))
log_likelihood_no_types_transpose(p, C) = sum(log.(C * p))

Nx = 5500
Np = 200

p = range(0, 1000, length=Np)' |> collect
C = rand(Np, Nx)
pt = collect(p')
Ct = Matrix(C')

# Compile
log_likelihood_no_types(p, C)
log_likelihood_no_types_transpose(pt, Ct)

println("Julia with Nx=$Nx, Np=$Np; BLAS threads: $(BLAS.get_num_threads())")

@btime log_likelihood_no_types($p, $C)
@btime log_likelihood_no_types_transpose($pt, $Ct)

I get:

julia> @btime log_likelihood_no_types($p, $C)
  538.149 μs (4 allocations: 86.16 KiB)
59499.619126650425

julia> @btime log_likelihood_no_types_transpose($pt, $Ct)
  245.567 μs (4 allocations: 86.16 KiB)
59499.619126650425
2 Likes

Comparisons with NumPy seem to trigger immediate replies :rofl:.

12 Likes

Woah, I’m now getting 12.213 μs (2 allocations: 8.12 KiB) instead of 101! So adjoint(::Vector{Float64}) is apparently faster than a row-vector… Makes sense, since Julia uses column-major order. But such a drastic difference, holy cow!

Yep, just changing to adjoint and column-major order makes a huge difference, wow!

Just to give you a feeling:

M = rand(1000,1000);

function fcol(M)   # inner loop over rows: walks memory contiguously (column-major)
    for col in 1:size(M, 2)
        for row in 1:size(M, 1)
            M[row, col] = 42
        end
    end
    nothing
end

function frow(M)   # inner loop over columns: strided, cache-unfriendly access
    for row in 1:size(M, 1)
        for col in 1:size(M, 2)
            M[row, col] = 42
        end
    end
    nothing
end

gives

julia> @btime fcol($M)

  673.501 μs (0 allocations: 0 bytes)

julia> @btime frow($M)
  1.422 ms (0 allocations: 0 bytes)

Need to show off Julia’s performance against NumPy? Pythonistas hate this one simple trick:

p = collect(range(0, 1000, length=Np))'
3 Likes

Duuuuude… Wait, I’m gonna test if NumPy’s performance on column vectors will be as slow as Julia’s with row vectors…

1 Like

Well, if you write the same loops in Python (i.e., without NumPy), you’re doomed anyway :slight_smile:

2 Likes

OK, so I changed p and C to column-major order in NumPy:

p = np.array(np.linspace(0, 1000, Np), order='F', copy=True)
C = np.array(np.random.rand(Np, Nx), order='F', copy=True)

And even got a speedup:

Python with Nx=5500, Np=200
288 µs ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But Julia got 251.563 μs anyway, so cool! Still strange that Julia got such a speedup from simple reordering. Yes, cache locality and ummm contiguity, but NumPy is always fast, while Julia depends on such quirks…

1 Like

Note that the timings you compare are not the same. In Julia @btime reports the fastest timing, whereas Python seems to report the mean.

1 Like

Huh, I was initially using @benchmark, but then saw @btime somewhere and switched to that without reading the docs.

With println(mean(@benchmark log_likelihood_no_types($p, $C))) I’m now getting:

Julia with Nx=5500, Np=200; BLAS threads: 4
TrialEstimate(394.292 μs)

…which is still waaaaay better than the best time of 1238 μs I had before.

1 Like

Transposing the data here is, I think, like CtC in my answer, and for me it slows down Python (and Julia).

I don’t think order='F' changes p at all; it’s still a (row) vector, so you still call a routine for matrix-vector products. The original big slowdown came from calling a (more complicated) routine for matrix-matrix multiplication with one row. I can’t seem to make Python make this mistake; I guess it checks the sizes at run-time, while Julia goes by the types.
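
To make that type distinction concrete, here is a quick check; the gemm/gemv notes in the comments are my reading of which BLAS routine each form should end up calling:

using LinearAlgebra

Np, Nx = 200, 5500
C = rand(Np, Nx)

p_mat = collect(range(0, 1000, length=Np)')   # 1×200 Matrix{Float64}
p_adj = collect(range(0, 1000, length=Np))'   # 1×200 adjoint(::Vector{Float64})

typeof(p_mat * C)   # Matrix{Float64}: handled as matrix*matrix (gemm), even though it has one row
typeof(p_adj * C)   # adjoint of a Vector: handled as a matrix-vector product (gemv)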

6 Likes

It probably would be smart to automatically handle this. It should be pretty cheap.
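
Something like this hypothetical helper (the name and the dispatch check are mine, not a proposal for Base) would cover the 1×N case:

using LinearAlgebra

# If p is a 1×N Matrix, reinterpret it as an adjoint of a Vector so that the
# product goes through the matrix-vector path rather than matrix-matrix.
row_times(p::AbstractMatrix, C::AbstractMatrix) =
    size(p, 1) == 1 ? vec(p)' * C : p * C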

3 Likes

Does log SIMD nowadays, or do you still need LoopVectorization or SIMDPirates etc. for that?

At the very least, you need @simd ivdep, and you might need LoopVectorization. Most of our special functions should be relatively SIMD-able, but the problem is that they need to get inlined to SIMD, and the functions aren’t marked @inline because you only want to inline them when you can vectorize them. Furthermore, log and 64-bit exp are table-based, so you need ivdep for @simd to realize that the table accesses don’t break things. @avx is probably easier at this point.
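
For concreteness, a minimal sketch of the LoopVectorization route, assuming LoopVectorization.jl is installed (@avx was the macro name at the time; newer releases call it @turbo):

using LoopVectorization

# Sum of logs over a vector, letting @avx vectorize the log calls.
function sum_log_avx(x::Vector{Float64})
    s = 0.0
    @avx for i in eachindex(x)
        s += log(x[i])
    end
    return s
end

log_likelihood_avx(pt, Ct) = sum_log_avx(Ct * pt)   # Ct is Nx×Np, pt a length-Np Vector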

2 Likes

I can’t find it at the moment, but there is a paper about BenchmarkTools in which the authors show that the minimum time is actually a better estimator of performance than the mean. This is from the docs:

  • The minimum is a robust estimator for the location parameter of the time distribution, and should not be considered an outlier.
  • The median, as a robust measure of central tendency, should be relatively unaffected by outliers.
  • The mean, as a non-robust measure of central tendency, will usually be positively skewed by outliers.
  • The maximum should be considered a primarily noise-driven outlier, and can change drastically between benchmark trials.

So you might want to just consider the minimum runtime for each benchmark.
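
For example, a single trial gives you all three estimators at once (a sketch, assuming the p and C from the original post are in scope):

using BenchmarkTools, Statistics

trial = @benchmark log_likelihood_no_types($p, $C)

minimum(trial)   # the estimator recommended for comparisons
median(trial)
mean(trial)      # roughly what IPython's %timeit reports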

2 Likes

If you’re doing this repeatedly, it pays to preallocate your output vector, and you can get some easy bonus gains with vmapreduce from LoopVectorization:

julia> @btime sum(log, $Ct*$pt)
  199.300 μs (2 allocations: 43.08 KiB)
59500.644576618346

julia> tmp = similar(pt, size(Ct, 1));

julia> @btime vmapreduce(log, +, mul!($tmp, $Ct, $pt))
  148.600 μs (0 allocations: 0 bytes)
59500.64457661832
5 Likes