Why is LLVM missing this constant in a loop?

using BenchmarkTools
using LinearAlgebra

function f(x,y)
    z = 0.0
    for i in 1:10
        xy = x .* y
        z += norm(xy)
    end
    z
end

function fopt(x,y)
    z = 0.0
    xy = x .* y
    for i in 1:10
        z += norm(xy)
    end
    z
end

x=rand(2); y=rand(2)
@btime f($x,$y)
@btime fopt($x,$y)
  464.467 ns (10 allocations: 800 bytes)
  146.643 ns (1 allocation: 80 bytes)

Is it a constant?
I’m not familiar with LLVM, but x and y are references to arrays whose contents could change during execution of the loop.

See:

using BenchmarkTools
using LinearAlgebra

function f(x,y)
    z = 0.0 
    for i in 1:10
        xy = x .* y
        @show norm(xy) 
        z += norm(xy)
    end 
    z   
end

function g(x,y)
    for i = 1:10000000
        x .= rand(2)
        y .= rand(2)
    end
end

function f()
    x=rand(2); y=rand(2)
    Threads.@spawn g(x,y)
    f(x,y)
end


julia> f()
norm(xy) = 0.26590433662552576
norm(xy) = 0.603214499437178
norm(xy) = 1.0494104231262562
norm(xy) = 0.6612645779226971
norm(xy) = 0.3619931631859387
norm(xy) = 0.59579143483913
norm(xy) = 0.02516422097763711
norm(xy) = 0.6709932485823076
norm(xy) = 0.4531504941054873
norm(xy) = 0.19392321674706448
4.880809615549222

If you change those to Tuples (which are immutable), you’ll see the desired behaviour:

function main()
    x=Tuple(rand(2)); y=Tuple(rand(2))
    @btime f($x,$y)
    @btime fopt($x,$y)
end 


# REPL
julia> main()
  42.424 ns (0 allocations: 0 bytes)
  41.396 ns (0 allocations: 0 bytes)

In my code, yes:

using LinearAlgebra

function f(x,y)
    z = 0.0
    for i in 1:10
        @show xy = x .* y
        z += norm(xy)
    end
    z
end

x=rand(2); y=rand(2)
f(x,y)
xy = x .* y = [0.6666109828718713, 0.35500370844922313]
xy = x .* y = [0.6666109828718713, 0.35500370844922313]
xy = x .* y = [0.6666109828718713, 0.35500370844922313]
xy = x .* y = [0.6666109828718713, 0.35500370844922313]
xy = x .* y = [0.6666109828718713, 0.35500370844922313]
xy = x .* y = [0.6666109828718713, 0.35500370844922313]
xy = x .* y = [0.6666109828718713, 0.35500370844922313]
xy = x .* y = [0.6666109828718713, 0.35500370844922313]
xy = x .* y = [0.6666109828718713, 0.35500370844922313]
xy = x .* y = [0.6666109828718713, 0.35500370844922313]

It may well be in practice, but, to @roflmaostc’s point, the compiler can’t prove that it is. Arrays in particular are a bit of a black box right now (there has been a lot of work recently on changing this, but it hasn’t made it to stable Julia yet).
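To make the hazard concrete without threads, here is a hypothetical variant (the function `h` and the callback argument `g!` are made up for illustration) where the loop body itself mutates x, so hoisting `x .* y` out of the loop would change the result:

```julia
# Hypothetical example: the loop body mutates x through an arbitrary
# callback, so `x .* y` is provably NOT loop-invariant here. The
# compiler must assume any array it can't fully analyze behaves this way.
function h(x, y, g!)
    z = 0.0
    for i in 1:10
        xy = x .* y     # cannot be hoisted: g! may have changed x
        z += sum(xy)
        g!(x)           # mutates x in place through the same reference
    end
    z
end

h([1.0], [1.0], v -> v .*= 2)   # each iteration sees a doubled x → 1023.0
```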


In theory it could optimize that to:

function fopt2(x,y)
    xy = x .* y
    z = 10*norm(xy)
    z
end;

But it doesn’t. (And, by hand, you can avoid computing and allocating xy altogether, if that is performance-critical.)
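One way to do that by hand, as a sketch (`fopt3` is a made-up name): fold the elementwise product directly into the norm computation, so the intermediate array `xy` is never allocated, then multiply by the trip count once.

```julia
using LinearAlgebra

# Sketch of the hand optimization: accumulate |x[i]*y[i]|^2 in a scalar,
# take the square root once, and multiply by the 10 loop iterations.
# No intermediate array is ever materialized.
function fopt3(x, y)
    s = 0.0
    @inbounds for i in eachindex(x, y)
        s += abs2(x[i] * y[i])
    end
    return 10 * sqrt(s)
end
```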

Thanks to both of you!
One more reason to use tuples whenever possible.

The spawn example is very interesting and educational. Before seeing it, I would have thought that the compiler should optimize that out, and that the fact that it doesn’t was an implementation detail.

Now, however unlikely it is that any functional code does what you have shown, such a change would be breaking.

What is the opinion of core developers on this? Can such optimizations be implemented in 1.x?

Could a new immutable Array type fix that?
That would just be a new, non-breaking feature, wouldn’t it?

I cannot test right now, but what happens with the mutable static arrays (MArray) from StaticArrays?

If the optimization occurs in that case, the behavior is already inconsistent.

(The immutable static arrays will behave like tuples here of course)

using BenchmarkTools
using ReadOnlyArrays
using StaticArrays

function main()
    x=rand(2); y=rand(2)

    xR, yR = ReadOnlyArray.((x,y))
    xS, yS = SVector(x...), SVector(y...)
    xM, yM = MVector(x...), MVector(y...)
    @btime f($xR,$yR)
    @btime fopt($xR,$yR)
    @btime f($xS,$yS)
    @btime fopt($xS,$yS)
    @btime f($xM,$yM)
    @btime fopt($xM,$yM)
end


# REPL
julia> main()
  597.750 ns (10 allocations: 800 bytes)
  126.113 ns (1 allocation: 80 bytes)
  3.245 ns (0 allocations: 0 bytes)
  3.245 ns (0 allocations: 0 bytes)
  4.823 ns (0 allocations: 0 bytes)
  4.851 ns (0 allocations: 0 bytes)
3.523747450734117

Strange that MVector does not make a difference, no?

It’s being worked on: https://github.com/JuliaLang/julia/pull/41777

MArray uses a backing NTuple for data storage, so presumably the optimizer has far more transparency to work with.
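That storage scheme can be sketched with a toy type (`MyMVec` is made up; the real MArray lives in StaticArrays.jl): mutation replaces a plain NTuple field wholesale, so the element data is always an immutable isbits value the optimizer can see through.

```julia
# Toy mutable struct backed by an NTuple, mimicking MArray's storage.
# Writing one element rebuilds the whole tuple; reads are plain tuple
# indexing on immutable data.
mutable struct MyMVec{N,T} <: AbstractVector{T}
    data::NTuple{N,T}
end
Base.size(v::MyMVec{N}) where {N} = (N,)
Base.getindex(v::MyMVec, i::Int) = v.data[i]
function Base.setindex!(v::MyMVec{N,T}, x, i::Int) where {N,T}
    v.data = ntuple(j -> j == i ? convert(T, x) : v.data[j], Val(N))
    return v
end

v = MyMVec((1.0, 2.0))
v[2] = 5.0        # rewrites v.data as the tuple (1.0, 5.0)
```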