The reason @jling is asking is because there’s quite a difference between the implementations. The numpy version is using vector and matrix multiplication and is done in one fellow swoop, whereas the julia version is doing each element of the result vector sequentially (and allocating a bunch of memory at that during each run.)

For example:

```
using BenchmarkTools
function f_orig(p)
t0, t1 = p
m0 = [[cos(t0) - 1im*sin(t0) 0]; [0 cos(t0) + 1im*sin(t0)]]
m1 = [[cos(t1) - 1im*sin(t1) 0]; [0 cos(t1) + 1im*sin(t1)]]
r = m1*m0*[1. ; 0.]
return abs(r[1])^2
end
function g_orig(p,n)
return [f_orig(p[:,i]) for i in 1:n]
end
n = 10^6
p = 2 * pi * rand(2,n)
@btime g_orig($p,$n)
# result: 1.123 s (25000004 allocations: 1.63 GiB)
```

1.63GiB in allocations! And tiny ones at that, since there’s so many. Just rewriting those functions (albeit keeping the slower sequential version) makes it much faster:

```
using LinearAlgebra
function f!(t0,t1, m0, m1, res)
# Just fill the slots in the preallocated buffers
m0[1,1] = cos(t0) - 1im*sin(t0)
m0[2,2] = cos(t0) + 1im*sin(t0)
m1[1,1] = cos(t1) - 1im*sin(t1)
m1[2,2] = cos(t1) + 1im*sin(t1)
mul!(res, m1, m0) # multiply in-place
return abs(res[1,1])^2
end
function g(p,n)
# Since the size of the result is known, just allocate it once
res = zeros(Float64, n)
# Same goes for the intermediaries - we can reuse their memory in this sequential version
buf1 = zeros(ComplexF64, 2, 2)
buf2 = zeros(ComplexF64, 2, 2)
buf3 = zeros(ComplexF64, 2, 2)
for i in 1:n
res[i] = f!(p[1,i], p[2,i], buf1, buf2, buf3)
end
return res
end
# check so we haven't done away with anything
all(g(p,n) .≈ g_orig(p,n)) # approx since we're dealing with floating point
# true
@btime g($p,$n);
# result: 102.123 ms (5 allocations: 7.63 MiB)
```

Much faster, but still slower for me (the numpy version claims to run in ~500ns) since the algorithm is different.

I haven’t toyed around with reworking the functions completely to take even more advantage of the structure of your toy problem to use more matmuls and to avoid having the outer `g`

function, but I think we can easily get to the same speeds as the numpy version here. It’s BLAS all the way down anyway, and using those features should get us there. I’ll have to eat breakfast first though.