Kron vs scalar product speed difference: is the Python code faster?

I can’t reproduce this.

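# Fused: one broadcast loop, no temporary arrays
foo(A,a,x1,y1,x2,y2) = A .+= a .* (x1 .* y1' .- x2 .* y2')
# Unfused: the outer products are first written into the preallocated T1 and T2
function bar(A,a,x1,y1,x2,y2, T1,T2)
    A_mul_Bt!(T1, x1, y1)  # T1 = x1 * y1'
    A_mul_Bt!(T2, x2, y2)  # T2 = x2 * y2'
    A .+= a .* (T1 .- T2)
end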
m, n = 784, 225
A = zeros(m,n); a = 1.0; x1 = zeros(m); y1 = zeros(n); x2 = copy(x1); y2 = copy(y1); T1 = copy(A); T2 = copy(A);

using BenchmarkTools
@benchmark foo($A, $a, $x1, $y1, $x2, $y2)
@benchmark bar($A, $a, $x1, $y1, $x2, $y2, $T1, $T2)

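On my machine, foo takes about 360µs and bar about 514µs, so fusing the loops and avoiding the T1 and T2 arrays gave me roughly a 40% speedup (514/360 ≈ 1.43). (This is only on Julia 0.6, of course, since that is the release in which the dot operators fuse into a single loop.)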
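
For anyone trying this on Julia 0.7 or later, where A_mul_Bt! no longer exists, here is a minimal sketch of the same two functions. It assumes `mul!` from LinearAlgebra as the replacement, uses random inputs instead of zeros so the result is non-trivial, and renames the functions to `foo!`/`bar!` (my naming, not from the code above). It only checks that both variants compute the same update; it makes no claim about timings on newer versions.

```julia
using LinearAlgebra  # provides mul!, assumed here as the replacement for A_mul_Bt!

# Fused version: a single broadcast loop, no temporaries.
foo!(A, a, x1, y1, x2, y2) = A .+= a .* (x1 .* y1' .- x2 .* y2')

# Unfused version: the outer products are materialized into T1 and T2 first.
function bar!(A, a, x1, y1, x2, y2, T1, T2)
    mul!(T1, x1, transpose(y1))  # T1 = x1 * y1'
    mul!(T2, x2, transpose(y2))  # T2 = x2 * y2'
    A .+= a .* (T1 .- T2)
end

m, n = 784, 225
x1, x2 = rand(m), rand(m)
y1, y2 = rand(n), rand(n)
A1 = zeros(m, n); A2 = zeros(m, n)   # separate accumulators for each variant
T1 = zeros(m, n); T2 = zeros(m, n)   # preallocated temporaries for bar!

foo!(A1, 1.0, x1, y1, x2, y2)
bar!(A2, 1.0, x1, y1, x2, y2, T1, T2)

# Both variants should agree up to floating-point rounding.
@assert A1 ≈ A2
```

To time them, the same `@benchmark foo!($A1, ...)` and `@benchmark bar!($A2, ..., $T1, $T2)` calls from BenchmarkTools should work unchanged.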