Sum(log.(p * C)) is 2 to 4 times slower than in NumPy

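The timings below assume a setup along the lines of the question; the sizes (Np=200, Nx=5500) are taken from the thread, but the exact contents of p and C here are my guess:

using BenchmarkTools, LinearAlgebra

Np, Nx = 200, 5500
p = reshape(collect(range(0, 1000, length=Np)), 1, Np)  # 1×Np Matrix{Float64} (the point below is to avoid this type)
C = rand(Np, Nx)                                        # Np×Nx data, column-major
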
What makes a big difference is to keep your row vector as an actual row-vector type (an Adjoint), not a 1×N Matrix:

julia> @btime sum(log.($p * $C));
  717.158 μs (4 allocations: 86.16 KiB)

julia> pa = collect(range(0, 1000, length=Np))' # type Adjoint, not 1xN Matrix
1×200 adjoint(::Vector{Float64}) with eltype Float64:
 0.0  5.02513  10.0503  15.0754  20.1005  25.1256  …  979.899  984.925  989.95  994.975  1000.0

julia> @btime sum(log.($pa * $C));
  239.876 μs (4 allocations: 86.16 KiB)

julia> @btime sum(log, $pa * $C);
  234.405 μs (2 allocations: 43.08 KiB)

julia> BLAS.vendor()
:openblas64

Same computer:

Python with Nx=5500, Np=200
236 µs ± 8.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Using sum(log, x) saves allocating another array for the result, but that is a small effect here. Using MKL.jl instead of the default OpenBLAS library might also help.
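
A sketch of that swap, assuming Julia ≥ 1.7, where loading MKL.jl replaces the BLAS backend at runtime (I have not timed it on this machine):

using MKL            # swaps the BLAS backend from OpenBLAS to Intel MKL
using LinearAlgebra
BLAS.get_config()    # should now list MKL among the loaded BLAS libraries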

Further transposing the data in C, which (if I'm thinking correctly) ought to bring us closer to what NumPy is actually doing, gives:

julia> Ct = transpose(permutedims(C)); # same values, row major

julia> Ct == C
true

julia> @btime sum(log, $pa * $Ct);
  210.394 μs (2 allocations: 43.08 KiB)
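
To spell out what that transformation does (my gloss, not part of the benchmark): permutedims(C) makes a physical Nx×Np copy, so C's values end up stored row by row, and wrapping it in transpose restores the original Np×Nx indexing without another copy. That is the same row-major (C-order) layout NumPy uses by default. A quick way to see it:

typeof(Ct)          # Transpose{Float64, Matrix{Float64}}: a lazy wrapper around the copy
size(parent(Ct))    # (5500, 200): the stored data is Nx×Np, so C's rows are contiguous in memory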