Why can't the compiler optimize this simple code?

In the following minimal example, the function f1 is optimized, but the function f2 is not. Is there any essential difficulty that prevents the compiler from doing so? Can I make the setfield!(a, a.i, v) pattern faster?
And here is my actual code, which needs that pattern.

The minimal example is as follows:

mutable struct A
	val::Int
	i::Int
end
function f1(v)
	a = A(v, 1)
	i = 1 # <- 
	setfield!(a, i, v)
	a.val
end
function f2(v)
	a = A(v, 1)
	# a.i = 1 # optional
	i = a.i # <- 
	setfield!(a, i, v)
	a.val
end
@code_llvm f1(1)

; Function Signature: f1(Int64)
; @ path within f1
; Function Attrs: uwtable
define i64 @julia_f1_28558(i64 signext %“v::Int64”) #0 {
top:
ret i64 %“v::Int64”
}

@code_llvm f2(1)

; Function Signature: f2(Int64)
; @ path within f2
; Function Attrs: uwtable
define i64 @julia_f2_28289(i64 signext %“v::Int64”) #0 {
top:
%jlcallframe1 = alloca [3 x ptr], align 8
%gcframe2 = alloca [5 x ptr], align 16
call void @llvm.memset.p0.i64(ptr align 16 %gcframe2, i8 0, i64 40, i1 true)
%pgcstack = call ptr inttoptr (i64 140735906416704 to ptr)() #9
store i64 12, ptr %gcframe2, align 16
%frame.prev = getelementptr inbounds ptr, ptr %gcframe2, i64 1
%task.gcstack = load ptr, ptr %pgcstack, align 8
store ptr %task.gcstack, ptr %frame.prev, align 8
store ptr %gcframe2, ptr %pgcstack, align 8
; @ path within f2
; ┌ @ path within A
%ptls_field = getelementptr inbounds ptr, ptr %pgcstack, i64 2
%ptls_load = load ptr, ptr %ptls_field, align 8
%“new::A” = call noalias nonnull align 8 dereferenceable(32) ptr @ijl_gc_pool_alloc_instrumented(ptr %ptls_load, i32 800, i32 32, i64 1468019289872) #7
%“new::A.tag_addr” = getelementptr inbounds i64, ptr %“new::A”, i64 -1
store atomic i64 1468019289872, ptr %“new::A.tag_addr” unordered, align 8
store i64 %“v::Int64”, ptr %“new::A”, align 8
%“new::A.i_ptr” = getelementptr inbounds i8, ptr %“new::A”, i64 8
store i64 1, ptr %“new::A.i_ptr”, align 8
%gc_slot_addr_2 = getelementptr inbounds ptr, ptr %gcframe2, i64 4
store ptr %“new::A”, ptr %gc_slot_addr_2, align 16
; └
; @ path within f2
%box_Int64 = call nonnull align 8 dereferenceable(8) ptr @ijl_box_int64(i64 signext 1) #2
%gc_slot_addr_1 = getelementptr inbounds ptr, ptr %gcframe2, i64 3
store ptr %box_Int64, ptr %gc_slot_addr_1, align 8
%box_Int642 = call nonnull align 8 dereferenceable(8) ptr @ijl_box_int64(i64 signext %“v::Int64”) #2
%gc_slot_addr_0 = getelementptr inbounds ptr, ptr %gcframe2, i64 2
store ptr %box_Int642, ptr %gc_slot_addr_0, align 16
store ptr %“new::A”, ptr %jlcallframe1, align 8
%0 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
store ptr %box_Int64, ptr %0, align 8
%1 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 2
store ptr %box_Int642, ptr %1, align 8
%jl_f_setfield_ret = call nonnull ptr @jl_f_setfield(ptr null, ptr nonnull %jlcallframe1, i32 3)
; @ path within f2
; ┌ @ Base.jl:49 within getproperty
%“new::A.val” = load i64, ptr %“new::A”, align 8
%frame.prev11 = load ptr, ptr %frame.prev, align 8
store ptr %frame.prev11, ptr %pgcstack, align 8
ret i64 %“new::A.val”
; └
}

1 Like

The compiler can't propagate fields of a mutable struct as constants unless they are declared const. Marking the i::Int field as const would allow the compiler to optimize it.

5 Likes

We are constant propagating (or concretely evaluating?) through non-const fields in f1 though, so I don’t think it’s that unreasonable to be surprised that f1 worked but f2 didn’t.

There's not really any reason why we'd need to promise to the compiler that i is a const field, since the compiler could just see that nothing can mutate that field and propagate through it.

3 Likes

The second argument of setfield! in f1 is a literal constant (after constant propagation), but in f2 it isn't. We would need escape analysis to prove that the field still holds that literal constant in f2.

What I mean is that in f1, the compiler is able to know that

a = A(v, 1)
setfield!(a, 1, v)
a.val

returns v, which also requires the compiler to know that nothing else is modifying a.val during the function’s execution.

This means that there already is some level of analysis about what can and cannot change inside a mutable struct without explicitly using const.
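
To illustrate when that guarantee breaks down, here is a minimal sketch of my own (the escape helper and GLOBAL_REF are hypothetical names, not from the thread): once a is reachable from outside the function, the compiler can no longer simply assume that a.val still holds v.

const GLOBAL_REF = Ref{Any}(nothing)
@noinline escape(a) = (GLOBAL_REF[] = a; nothing)  # makes `a` reachable from elsewhere

function f1_escaped(v)
    a = A(v, 1)
    setfield!(a, 1, v)
    escape(a)   # `a` has escaped; the compiler generally must reload a.val
    a.val
end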

3 Likes

I tried

mutable struct A
	const val::Ref{Int}
	const i::Ref{Int}
end
function f3(v)
	a = A(0, 1)
	@inbounds i = a.i[]
	@inbounds getfield(a, i)[]=v
	@inbounds a.val[]
end

It seems to work. The LLVM code is much shorter (maybe faster than f2?). It simply does ret i64 %“v::Int64” at the end, just like f1 does, but it still has some other checking code.

@code_llvm f3(1)

; Function Signature: f3(Int64)
; @ path within f3
; Function Attrs: uwtable
define i64 @julia_f3_33901(i64 signext %“v::Int64”) #0 {
pass:
%gcframe1 = alloca [4 x ptr], align 16
call void @llvm.memset.p0.i64(ptr align 16 %gcframe1, i8 0, i64 32, i1 true)
%pgcstack = call ptr inttoptr (i64 140735906416704 to ptr)() #6
store i64 8, ptr %gcframe1, align 16
%frame.prev = getelementptr inbounds ptr, ptr %gcframe1, i64 1
%task.gcstack = load ptr, ptr %pgcstack, align 8
store ptr %task.gcstack, ptr %frame.prev, align 8
store ptr %gcframe1, ptr %pgcstack, align 8
; @ path within f3
call void @llvm.assume(i1 icmp ne (ptr @“+Main.Base.RefValue#33903.jit”, ptr @“+Core.GenericMemoryRef#33907.jit”))
%frame.prev35 = load ptr, ptr %frame.prev, align 8
store ptr %frame.prev35, ptr %pgcstack, align 8
; @ path within f3
; ┌ @ refvalue.jl:59 within getindex
; │┌ @ Base.jl:49 within getproperty
ret i64 %“v::Int64”
; └└
}

That's not what @aviatesk was suggesting; you've just shunted the non-const-ness off behind an extra layer of indirection (also, Ref is an abstract type).

His suggestion was rather that you should write

mutable struct A
    val::Int
    const i::Int
end

and then your f2 will optimize away.
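
Putting it together, a minimal sketch (the names A2 and f2_const are mine, renamed so it can be defined alongside the original A):

mutable struct A2
    val::Int
    const i::Int   # `i` can never change after construction
end

function f2_const(v)
    a = A2(v, 1)
    i = a.i              # the compiler may fold this to 1, since `i` is const
    setfield!(a, i, v)   # field 1 is :val, which is still mutable
    a.val
end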

2 Likes

I think a = A(0, 1) is a local variable within f2, so nothing outside f2 can modify it. (Right? Is there any edge case? Multithreading? I'm not sure…)
Is it theoretically impossible, or is it just a limitation of the compiler?

Why isn’t LLVM escape analysis able to prove that A doesn’t need to exist?

5 Likes

Yes, I tried that, and f2 is optimized away. But a mutable A.i is what I need, and I want to make it faster.

For your type, I'd change the implementation altogether to

mutable struct SVector4{T}<:AbstractVector{T}
    len::Int8
    vec::NTuple{4,T}
    SVector4{T}() where {T} = new{T}(0)
    SVector4{T}(x) where T = new{T}(1, (x, x, x, x))
    SVector4{T}(x, y) where T = new{T}(2, (x, y, y, y))
    SVector4{T}(x, y, z) where T = new{T}(3, (x, y, z, z))
    SVector4{T}(e1, e2, e3, e4) where T = new{T}(4, (e1, e2, e3, e4))
end

SVector4(e1::T, e2::T, e3::T, e4::T) where T = SVector4{T}(e1, e2, e3, e4)
function SVector4(x, y, z, q)
    e1, e2, e3, e4 = promote(x, y, z, q)
    return SVector4(e1, e2, e3, e4)
end

function Base.getindex(v::SVector4, i::Integer)
    i in Base.OneTo(v.len) || throw(BoundsError(v, i))
    return v.vec[i]
end

function Base.setindex!(v::SVector4{T}, x, i) where T
    i in Base.OneTo(v.len) || throw(BoundsError(v, i))
    v.vec = Base.setindex(v.vec, convert(T, x), i)
    return x
end

function Base.push!(v::SVector4{T}, x) where T
    v.len < 4 || error("Cannot `push!`: vector reached maximum capacity")
    y = convert(T, x)
    if isdefined(v, :vec)
        v.len += 1
        v[v.len] = y
    else
        v.vec = (y, y, y, y)
        v.len = 1
    end
    return v
end

Base.length(v::SVector4) = v.len
Base.size(v::SVector4) = (v.len,)
Base.iterate(v::SVector4, i=1) = i in Base.OneTo(v.len) ? (v[i], i+1) : nothing
Base.eltype(::Type{SVector4{T}}) where T = T
Base.empty!(v::SVector4) = (v.len = 0; v)
Base.isempty(v::SVector4) = v.len == 0

On my machine, push! for this implementation is ~40% faster than yours; indexing is a bit slower due to bounds checks, and the same if the checks are removed.
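
For reference, a minimal benchmark sketch along these lines (the pushes! helper is just for illustration, not the exact measurement), using BenchmarkTools:

using BenchmarkTools

# Illustrative helper: refill a 4-element SVector4 so that `push!` dominates the timing.
function pushes!(v)
    empty!(v)
    for k in 1:4
        push!(v, k)
    end
    return v
end

v = SVector4(0, 0, 0, 0)
@btime pushes!($v)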

This looks very much like MutableSmallVector{N,T} from my package SmallCollections.jl. While the immutable SmallVector{N,T} is already in the published version, MutableSmallVector is only in master. I’m planning to publish a new release soon. If you want to try out the mutable variant, you need to say add SmallCollections#master in the Julia package manager.

julia> @code_typed f1(3)
CodeInfo(
1 ─     return v
) => Int64

julia> @code_typed f2(3)
CodeInfo(
1 ─ %1 = %new(Main.A, v, 1)::A
│   %2 = Base.getfield(%1, :i)::Int64
│        Main.setfield!(%1, %2, v)::Int64
│   %4 = Base.getfield(%1, :val)::Int64
└──      return %4
) => Int64

Yeah, it really feels like getfield(a, :i) right after a = A(v, 1) could've been known to be 1 and propagated, the same way a.val is known to be v by the end. Strictly speaking, v wasn't a constant either. Maybe the difference is that v isn't reassigned, whereas a was mutated (one of its fields was reassigned), so there's extra work to check for effective constants that isn't being done now? There is a balance between the amount (read: latency) of compiler optimization and the trimming of source code it achieves, and it's not satisfactory in all cases.

Wonderful! The compiler can optimize Base.setindex correctly, and there are no redundant copies in the resulting LLVM code!
I can't find the documentation for Base.setindex on the docs website. Is it a public API?


My test code:

function fv1(v)
	a = SVector4(1,2,3,4)
	a[2] = 1
	a[a[2]] = v
	a[a[2]]
end

; Function Signature: fv1(Int64)
; @ path within fv1
define i64 @julia_fv1_9799(i64 signext %“v::Int64”) #0 {
L44:
; @ path within fv1
; ┌ @ path within getindex @ tuple.jl:31
ret i64 %“v::Int64”
; └
}

(I also added two @inbounds annotations within the definitions of getindex and push! in @Vasily_Pisarev's SVector4.)

Julia v1.11 introduces the public keyword to mark API vs. internals, so you can use reflection instead of poring over the documentation:

julia> Base.ispublic(Base, :setindex)
false

You also get a warning in help mode:

help?> Base.setindex
  │ Warning
  │
  │  The following bindings may be internal; they may change or be removed in future versions:
  │
  │    •  Base.setindex

It only says “may be internal” because some packages may not have caught up to using public, so their documentation is still the last word.
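
For contrast, an exported name such as setindex! should report as public:

julia> Base.ispublic(Base, :setindex!)
true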

1 Like

It uses the unsafe_load trick to get around the type system, which is another way to make fast small collections. I am hoping for a smarter compiler instead.

That's not what it's for; in fact, it's documented to specifically take a pointer's type into account. It's just a low-level function for when you need to implement a type and its getindex from scratch instead of using Base types that have their own low-level code. In this case, a vector with a known maximum number of elements cannot be built on Vectors with an unfixed number of elements, and compilers won't do that for you, although mutables with known maximum sizes do share possible optimizations from escape analysis. Heap-to-stack allocation isn't happening here, though:

julia> @btime f2($3)
  23.631 ns (1 allocation: 32 bytes)
3

Not yet, but soon.

2 Likes

It probably could, but this is a phase-ordering problem. In setfield!(a, field, val), if field is non-const, inference and codegen don't know which field is being targeted, and thus this will likely emit a runtime call. LLVM could remove A and then propagate the value there, but at that point it is too late. We would have needed to codegen a “union split” over the possible fields.
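
To make the “union split” idea concrete, here is a hypothetical source-level version (f2_split is my name, not from the thread): branching on the field index gives each branch a literal index, so codegen can emit a direct store instead of a dynamic jl_f_setfield call.

function f2_split(v)
    a = A(v, 1)
    i = a.i
    if i == 1
        setfield!(a, 1, v)   # literal field index: field 1 is :val
    elseif i == 2
        setfield!(a, 2, v)   # literal field index: field 2 is :i
    else
        throw(BoundsError(a, i))
    end
    a.val
end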

3 Likes

I was a bit confused by this at first, but now I think I get it. The argument is basically that LLVM could understand it in theory, but by the time LLVM sees it, we've already inserted a runtime dispatch on the Julia side, so it's too late for LLVM to optimize it. Is that right?