Sum over LazyArray is slower than over regular array

julia> using LazyArrays, BenchmarkTools

julia> N = 64; A = randn(N,N); B = randn(N,N); C = similar(A,1,N);

Using LazyArrays:

julia> @btime sum!($C,LazyArray(@~ $A.*$B))
  3.047 μs (0 allocations: 0 bytes)

Without LazyArrays:

julia> @btime sum!($C,$A.*$B)
  1.163 μs (3 allocations: 32.08 KiB)

Even with the allocations, the regular sum! is about 2.6x faster on my machine (1.163 μs vs. 3.047 μs). Is this expected behavior?

Interestingly, using map! with eachcol outperforms both approaches:

julia> using LinearAlgebra

julia> @btime map!(⋅,$C,eachcol($A),eachcol($B))
  788.132 ns (0 allocations: 0 bytes)

Here’s the output of versioninfo:

julia> versioninfo()
Julia Version 1.12.6
Commit 15346901f00 (2026-04-09 19:20 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m2)
  GC: Built with stock GC
Threads: 4 default, 1 interactive, 4 GC (on 4 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_VSCODE_REPL = 1
  JULIA_PKG_USE_CLI_GIT = true

It’s not at all surprising that Array outperforms LazyArray. Something like sum(::Array) is a very highly-trafficked method and thus has had a lot of attention for optimization. The nature of LazyArray means it’s probably going to have to do more work on each element access, might not be able to vectorize as well, etc.
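To illustrate the point about per-access work and vectorization, here is a minimal sketch of the same reduction written as an explicit loop, so that nothing lazy sits between the loop and the data. The name colsum_prod! is a hypothetical helper made up for this example (it is not part of Base or LazyArrays), and I am assuming plain Matrix inputs with matching sizes. I have not benchmarked it, so treat it as something to try rather than a claim about speed.

```julia
# Hypothetical helper (not from Base or LazyArrays): column-wise sum of the
# elementwise product A .* B, computed without a temporary array.
# Assumes ordinary 1-based Matrix inputs; sizes are checked up front so the
# @inbounds below is safe.
function colsum_prod!(C::Matrix{T}, A::Matrix{T}, B::Matrix{T}) where {T}
    size(A) == size(B) || throw(DimensionMismatch("A and B must have the same size"))
    size(C) == (1, size(A, 2)) || throw(DimensionMismatch("C must be 1×size(A, 2)"))
    @inbounds for j in axes(A, 2)
        acc = zero(T)
        # Inner loop walks down one column, which is contiguous in memory.
        @simd for i in axes(A, 1)
            acc += A[i, j] * B[i, j]
        end
        C[1, j] = acc
    end
    return C
end

N = 64
A = randn(N, N); B = randn(N, N); C = similar(A, 1, N)
colsum_prod!(C, A, B)
C ≈ sum(A .* B, dims=1)   # compare against the allocating version
```

It can be timed the same way as the other variants, e.g. @btime colsum_prod!($C, $A, $B).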

I’m not sure about your map! with eachcol form. If I had to guess, it’s because LinearAlgebra.dot gets to use BLAS while Base.sum is all-Julia.
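Whatever the reason for the speed difference, the two forms compute the same thing: entry j of the result is the sum over i of A[i,j]*B[i,j], which is exactly the dot product of column j of A with column j of B. A small sketch to confirm that (the names C_sum and C_dot are just for this example):

```julia
using LinearAlgebra

N = 64
A = randn(N, N); B = randn(N, N)
C_sum = similar(A, 1, N)
C_dot = similar(A, 1, N)

sum!(C_sum, A .* B)                        # materializes A .* B, then reduces each column
map!(dot, C_dot, eachcol(A), eachcol(B))   # one dot product per pair of columns

C_sum ≈ C_dot   # same values up to floating-point rounding
```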

Tullio.jl is about 2x faster than the allocating sum! above:

julia> using Tullio

julia> @btime @tullio $C[1, j] = $A[i, j] * $B[i, j]
  622.337 ns (0 allocations: 0 bytes)

Wow! I’ll check that package out. Those are impressive numbers.