Sum(log.(p * C)) is 2 to 4 times slower than in NumPy

I’m coming from Python and trying to port my code to Julia to gain more performance.

I want to compute a lot of log-likelihoods in a loop, so I need a fast function that computes the log-likelihood. The computation is essentially sum(log.(p * C)), where C is a constant matrix and p is a row vector.
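
Written out (my notation, just to be explicit): with C of size Np × Nx and p a row vector of length Np, the quantity is

$$\ell(p) = \sum_{j=1}^{N_x} \log\left(\sum_{i=1}^{N_p} p_i \, C_{ij}\right).$$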

Here’s my Python code:

import numpy as np

def log_likelihood(p, C):
	return np.log(p @ C).sum()

Nx = 5500
Np = 200

p = np.linspace(0, 1000, Np)
C = np.random.rand(Np, Nx)

print(f"Python with Nx={Nx}, Np={Np}")
%timeit log_likelihood(p, C)

And Julia code:

using BenchmarkTools
using LinearAlgebra

log_likelihood_types(p::Array{Float64, 2}, C::Array{Float64, 2})::Float64 = sum(log.(p * C))
log_likelihood_no_types(p, C) = sum(log.(p * C))

Nx = 5500
Np = 200

p = range(0, 1000, length=Np)' |> collect
C = rand(Np, Nx)

# Compile
log_likelihood_types(p, C)
log_likelihood_no_types(p, C)

println("Julia with Nx=$Nx, Np=$Np; BLAS threads: $(BLAS.get_num_threads())")

@btime log_likelihood_types($p, $C)
@btime log_likelihood_no_types($p, $C)

Test results

forcebru ~/t/julia_numpy_bench> ipython row_mul.ipy
Python with Nx=5500, Np=200
349 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
forcebru ~/t/julia_numpy_bench> julia-1.6 row_mul.jl
Julia with Nx=5500, Np=200; BLAS threads: 4
  1.240 ms (4 allocations: 86.16 KiB)
  1.238 ms (4 allocations: 86.16 KiB)

So that’s 349 µs with NumPy and 1238 µs with Julia: Julia is 3.5 times (!) slower. Also see this benchmark (https://gist.github.com/ForceBru/2da87f80725d62eb8f62454d53e00118), which shows similar results.

For Nx == 500 I’m getting:

  • Julia: 101.575 μs
  • NumPy: 36.9 µs ± 152 ns

Here Julia is 2.7 times slower!

Versions

  • julia version 1.6.0-beta1
  • np.__version__ == '1.19.5'
  • Python 3.7.1

Questions

Am I doing something wrong? Am I maybe measuring something incorrectly? Why the 3x speed difference? Is there a way to “fix” it?

Better performance can be achieved by using

p = collect(range(0, 1000, length=Np))'
1 Like

What makes a big difference is to keep your row vector a row vector type, not a Matrix:

julia> @btime sum(log.($p * $C));
  717.158 μs (4 allocations: 86.16 KiB)

julia> pa = collect(range(0, 1000, length=Np))' # type Adjoint, not 1xN Matrix
1×200 adjoint(::Vector{Float64}) with eltype Float64:
 0.0  5.02513  10.0503  15.0754  20.1005  25.1256  …  979.899  984.925  989.95  994.975  1000.0

julia> @btime sum(log.($pa * $C));
  239.876 μs (4 allocations: 86.16 KiB)

julia> @btime sum(log, $pa * $C);
  234.405 μs (2 allocations: 43.08 KiB)

julia> BLAS.vendor()
:openblas64

Same computer:

Python with Nx=5500, Np=200
236 µs ± 8.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Using sum(log, x) saves allocating another array for the result, but it’s a small effect here. Using MKL.jl instead of the default OpenBLAS library might also help.
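
(For reference, here’s a minimal sketch of swapping in MKL. This assumes Julia 1.7+ and a recent MKL.jl, which replace the BLAS backend via libblastrampoline; on 1.6 the setup is different.)

using MKL            # load first so it can replace the default OpenBLAS backend (Julia 1.7+)
using LinearAlgebra

BLAS.get_config()    # should now list MKL among the active BLAS libraries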

Further, transposing the data in C (which, if I’m thinking correctly, ought to bring us closer to what NumPy is actually doing) gives:

julia> Ct = transpose(permutedims(C)); # same values, row major

julia> Ct == C
true

julia> @btime sum(log, $pa * $Ct);
  210.394 μs (2 allocations: 43.08 KiB)
7 Likes

Julia uses column-major order. Hence you should consider a column vector p and a transposed C, and compute C * p in your log-likelihood function:

using BenchmarkTools
using LinearAlgebra

log_likelihood_no_types(p, C) = sum(log.(p * C))
log_likelihood_no_types_transpose(p, C) = sum(log.(C * p))

Nx = 5500
Np = 200

p = range(0, 1000, length=Np)' |> collect
C = rand(Np, Nx)
pt = collect(p')
Ct = Matrix(C')

# Compile
log_likelihood_no_types(p, C)
log_likelihood_no_types_transpose(pt, Ct)

println("Julia with Nx=$Nx, Np=$Np; BLAS threads: $(BLAS.get_num_threads())")

@btime log_likelihood_no_types($p, $C)
@btime log_likelihood_no_types_transpose($pt, $Ct)

I get:

julia> @btime log_likelihood_no_types($p, $C)
  538.149 μs (4 allocations: 86.16 KiB)
59499.619126650425

julia> @btime log_likelihood_no_types_transpose($pt, $Ct)
  245.567 μs (4 allocations: 86.16 KiB)
59499.619126650425
2 Likes

Comparisons with NumPy seem to trigger immediate replies :rofl:.

12 Likes

Woah, I’m now getting 12.213 μs (2 allocations: 8.12 KiB) instead of 101! So adjoint(::Vector{Float64}) is apparently faster than a row-vector… Makes sense, since Julia uses column-major order. But such a drastic difference, holy cow!

Yep, just changing to adjoint and column-major order makes a huge difference, wow!

Just to give you a feeling:

M = rand(1000,1000);

function fcol(M)   # inner loop over rows: walks memory contiguously (column-major)
    for col in 1:size(M, 2)
        for row in 1:size(M, 1)
            M[row, col] = 42
        end
    end
    nothing
end

function frow(M)   # inner loop over columns: strided, cache-unfriendly access
    for row in 1:size(M, 1)
        for col in 1:size(M, 2)
            M[row, col] = 42
        end
    end
    nothing
end

gives

julia> @btime fcol($M)

  673.501 μs (0 allocations: 0 bytes)

julia> @btime frow($M)
  1.422 ms (0 allocations: 0 bytes)

Need to show off Julia’s performance against NumPy? Pythonistas hate this one simple trick:

p = collect(range(0, 1000, length=Np))'
3 Likes

Duuuuude… Wait, I’m gonna test if NumPy’s performance on column vectors will be as slow as Julia’s with row vectors…

1 Like

Well, if you write the same loops in Python (i.e., without NumPy), you’re doomed anyway :slight_smile:

2 Likes

OK, so I changed p and C to column-major order in NumPy:

p = np.array(np.linspace(0, 1000, Np), order='F', copy=True)
C = np.array(np.random.rand(Np, Nx), order='F', copy=True)

And even got a speedup:

Python with Nx=5500, Np=200
288 µs ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But Julia got 251.563 μs anyway, so cool! Still strange that Julia got such a speedup from simple reordering. Yes, cache locality and ummm contiguity, but NumPy is always fast, while Julia depends on such quirks…

1 Like

Note that the timings you compare are not the same. In Julia @btime reports the fastest timing, whereas Python seems to report the mean.

1 Like

Huh, I was initially using @benchmark, but then saw @btime somewhere and switched to that without reading the docs.

With println(mean(@benchmark log_likelihood_no_types($p, $C))) I’m now getting:

Julia with Nx=5500, Np=200; BLAS threads: 4
TrialEstimate(394.292 μs)

…which is still waaaaay better than the best time of 1238 μs I had before.

1 Like

Transposing the data here is, I think, like CtC in my answer, and for me it slows down Python (and Julia).

I don’t think order='F' changes p at all; it’s still a (row) vector, so you still call a routine for matrix-vector products. The original big slowdown came from calling a (more complicated) routine for matrix-matrix multiplication with one row. I can’t seem to make Python make this mistake; I guess it checks the sizes at run-time, while Julia goes by the types.
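
To make that type distinction concrete, here is a quick check; the gemm/gemv notes in the comments are my reading of which BLAS routine each form should end up calling:

using LinearAlgebra

Np, Nx = 200, 5500
C = rand(Np, Nx)

p_mat = collect(range(0, 1000, length=Np)')   # 1×200 Matrix{Float64}
p_adj = collect(range(0, 1000, length=Np))'   # 1×200 adjoint(::Vector{Float64})

typeof(p_mat * C)   # Matrix{Float64}: handled as matrix*matrix (gemm), even though it has one row
typeof(p_adj * C)   # adjoint of a Vector: handled as a matrix-vector product (gemv)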

6 Likes

It probably would be smart to automatically handle this. It should be pretty cheap.
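
Something like this hypothetical helper (the name and the dispatch check are mine, not a proposal for Base) would cover the 1×N case:

using LinearAlgebra

# If p is a 1×N Matrix, reinterpret it as an adjoint of a Vector so that the
# product goes through the matrix-vector path rather than matrix-matrix.
row_times(p::AbstractMatrix, C::AbstractMatrix) =
    size(p, 1) == 1 ? vec(p)' * C : p * C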

3 Likes

Does log SIMD nowadays, or do you still need LoopVectorization or SIMDPirates etc. for that?

At the very least, you need @simd ivdep, and you might need LoopVectorization. Most of our special functions should be relatively SIMD-able, but the problem is that they need to get inlined to SIMD, and the functions aren’t marked @inline because you only want to inline them when you can vectorize them. Furthermore, log and 64-bit exp are table-based, so you need ivdep for @simd to realize that the table accesses don’t break things. @avx is probably easier at this point.
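
For concreteness, a minimal sketch of the LoopVectorization route, assuming LoopVectorization.jl is installed (@avx was the macro name at the time; newer releases call it @turbo):

using LoopVectorization

# Sum of logs over a vector, letting @avx vectorize the log calls.
function sum_log_avx(x::Vector{Float64})
    s = 0.0
    @avx for i in eachindex(x)
        s += log(x[i])
    end
    return s
end

log_likelihood_avx(pt, Ct) = sum_log_avx(Ct * pt)   # Ct is Nx×Np, pt a length-Np Vector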

2 Likes

I can’t find it at the moment, but there is a paper about BenchmarkTools in which the authors show that the minimum time is actually a better estimator of performance than the mean. This is from the docs:

  • The minimum is a robust estimator for the location parameter of the time distribution, and should not be considered an outlier.
  • The median, as a robust measure of central tendency, should be relatively unaffected by outliers.
  • The mean, as a non-robust measure of central tendency, will usually be positively skewed by outliers.
  • The maximum should be considered a primarily noise-driven outlier, and can change drastically between benchmark trials.

So you might want to just consider the minimum runtime for each benchmark.
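
For example, a single trial gives you all three estimators at once (a sketch, assuming the p and C from the original post are in scope):

using BenchmarkTools, Statistics

trial = @benchmark log_likelihood_no_types($p, $C)

minimum(trial)   # the estimator recommended for comparisons
median(trial)
mean(trial)      # roughly what IPython's %timeit reports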

2 Likes

If you’re doing this repeatedly, it pays to preallocate your output vector, and you can get some easy bonus gains with vmapreduce from LoopVectorization:

julia> @btime sum(log, $Ct*$pt)
  199.300 μs (2 allocations: 43.08 KiB)
59500.644576618346

julia> tmp = similar(pt, size(Ct, 1));

julia> @btime vmapreduce(log, +, mul!($tmp, $Ct, $pt))
  148.600 μs (0 allocations: 0 bytes)
59500.64457661832
5 Likes