What makes a big difference is keeping your row vector as a row-vector type (an Adjoint of a Vector), not a 1×N Matrix; with the Adjoint, the product can dispatch to a matrix-vector BLAS path rather than a general matrix-matrix multiply:
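For anyone running this standalone, here is a minimal setup matching the sizes used below. The actual contents of C come from earlier in the thread, so random positive entries stand in for them here:

julia> using LinearAlgebra, BenchmarkTools

julia> Nx, Np = 5500, 200;

julia> C = rand(Np, Nx) .+ 1;  # stand-in data, 200×5500, positive so log is defined

julia> p = reshape(collect(range(0, 1000, length=Np)), 1, Np);  # 1×200 Matrix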
julia> @btime sum(log.($p * $C));
717.158 μs (4 allocations: 86.16 KiB)
julia> pa = collect(range(0, 1000, length=Np))' # type Adjoint, not 1xN Matrix
1×200 adjoint(::Vector{Float64}) with eltype Float64:
0.0 5.02513 10.0503 15.0754 20.1005 25.1256 … 979.899 984.925 989.95 994.975 1000.0
julia> @btime sum(log.($pa * $C));
239.876 μs (4 allocations: 86.16 KiB)
julia> @btime sum(log, $pa * $C);
234.405 μs (2 allocations: 43.08 KiB)
julia> BLAS.vendor()
:openblas64
Same computer:
Python with Nx=5500, Np=200
236 µs ± 8.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using sum(log, x) saves allocating another array for the result, but that is a small effect here. Using MKL.jl instead of the default OpenBLAS library might also help.
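A sketch of trying that, assuming a recent Julia (1.7 or later) where loading MKL.jl is enough to swap the BLAS backend:

julia> using MKL  # assumption: Julia ≥ 1.7, where MKL.jl swaps BLAS at load time

julia> BLAS.get_config()  # should now report MKL instead of OpenBLAS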
Further, transposing the data in C, which (if I'm thinking correctly) ought to bring us closer to what NumPy is actually doing, gives:
julia> Ct = transpose(permutedims(C)); # same values, row major
julia> Ct == C
true
julia> @btime sum(log, $pa * $Ct);
210.394 μs (2 allocations: 43.08 KiB)
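To unpack that one-liner: permutedims(C) materializes a copy with the storage order flipped, and transpose wraps it lazily so the indices point back the original way. Since Julia arrays are column major, Ct compares equal to C while storing C's rows contiguously, i.e. the row-major layout NumPy defaults to. A small check with a hypothetical 2×3 matrix:

julia> A = [1 2 3; 4 5 6];

julia> At = transpose(permutedims(A));

julia> At == A  # same logical values
true

julia> parent(At)  # but the underlying storage is the 3×2 copy
3×2 Matrix{Int64}:
 1  4
 2  5
 3  6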