Why does RefValue cause an allocation?

How can adding just one extra line, r[] (i.e. getfield(r, :x)), introduce one more allocation? I don’t understand.
You can ignore the packages; just focus on the functions _grbval and grbval, and on r. Thanks.

import JuMP, Gurobi
_grbval(o, x, r) = Gurobi.GRBgetdblattrelement(o, "X", gcc(o, x), r); # calling a C-API
function grbval(o, x, r)
    _grbval(o, x, r)
    getfield(r, :x)
end;

gcc(o, x) = Gurobi.c_column(o, JuMP.index(x));
const r = Ref{Float64}();
const m = JuMP.direct_model(Gurobi.Optimizer());
const o = JuMP.backend(m);
JuMP.@variable(m, x);
JuMP.optimize!(m)
_grbval(o, x, r)
grbval(o, x, r)
@allocated _grbval(o, x, r) # 0
@allocated grbval(o, x, r) # 0
@time _grbval(o, x, r) # 0.000008 seconds
@time grbval(o, x, r) # 0.000010 seconds (1 allocation: 16 bytes)

The question is: why does the last line show 1 allocation of 16 bytes?

The definition used in _grbval is

function GRBgetdblattrelement(model, attrname, element, valueP)
    ccall((:GRBgetdblattrelement, libgurobi), Cint, (Ptr{GRBmodel}, Ptr{Cchar}, Cint, Ptr{Cdouble}), model, attrname, element, valueP)
end

You’re more likely to get help if you make your example as self-contained as possible.

You say I can neglect the packages, but they’re pretty intertwined with your example. If I construct a simple analogue of your example, though, I see no such allocations.
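If it helps, a minimal Gurobi-free analogue might look like this sketch: it uses libm’s frexp, which writes its integer result through a pointer, as a hypothetical stand-in for GRBgetdblattrelement and its valueP argument.

```julia
# Hypothetical Gurobi-free analogue of _grbval/grbval:
# frexp(x) returns the mantissa and stores the binary exponent
# through the pointer, much like the C API fills valueP.
_cval(x, r) = ccall(:frexp, Cdouble, (Cdouble, Ref{Cint}), x, r)
function cval(x, r)
    _cval(x, r)
    getfield(r, :x)   # read back what the C call stored
end

const rc = Ref{Cint}()
cval(8.0, rc)   # 8.0 == 0.5 * 2^4, so the stored exponent is 4
```

The same @time / @allocated comparison from the original example can then be run on cval without any solver installed.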


Yes, I cannot reproduce with simple examples either. But mine was indeed a special case.

I think there was already a function barrier, so _grbval should act as a black box, right?

Some related details. Should I conclude that @time is misleading, after seeing

julia> @time grbval(o, x, r)
  0.000010 seconds (1 allocation: 16 bytes)
0.0

julia> using BenchmarkTools

julia> @btime grbval($o, $x, $r)
  38.031 ns (0 allocations: 0 bytes)
0.0

?

A very old section buried in the Performance Tips:

julia> @time sum_arg(x)
  0.007551 seconds (3.98 k allocations: 200.548 KiB, 99.77% compilation time)
523.0007221951678

julia> @time sum_arg(x)
  0.000006 seconds (1 allocation: 16 bytes)
523.0007221951678

The 1 allocation seen is from running the @time macro itself in global scope. If we instead run the timing in a function, we can see that indeed no allocations are performed:

julia> time_sum(x) = @time sum_arg(x);

julia> time_sum(x)
  0.000002 seconds
523.0007221951678

You can manually make a forwarding method like time_sum to verify that’s the case for your code. BenchmarkTools puts the input expression in a function so it has the same effect. Note that despite the exact wording of the section, this extra allocation still shows up in let blocks and other (local) scope-introducing blocks except functions:

julia> let; @time sum_arg(x) end
  0.000007 seconds (1 allocation: 16 bytes)
500.0699540425248
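As a self-contained sketch of that forwarding trick (sum_abs and time_it are made-up stand-ins for the sum_arg and time_sum above):

```julia
sum_abs(v) = sum(abs, v)         # stand-in workload
time_it(v) = @time sum_abs(v)    # forwarding method: @time runs in local scope

v = rand(10)
sum_abs(v)          # warm up so compilation is not measured
@time sum_abs(v)    # at top level: typically reports the spurious 1 allocation
time_it(v)          # inside a function: typically reports 0 allocations
```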

Yes, that makes sense.

But I’m still confused here:

julia> f() = Ref{Cdouble}();

julia> tf() = @time f();

julia> tf()
  0.000000 seconds (1 allocation: 16 bytes)
Base.RefValue{Float64}(0.0)

julia> tf()
  0.000000 seconds (1 allocation: 16 bytes)
Base.RefValue{Float64}(0.0)

julia> using BenchmarkTools

julia> @btime f()
  8.560 ns (1 allocation: 16 bytes)
Base.RefValue{Float64}(6.41014581546824e-310)

julia> @allocated f()
0

My question is: does constructing Ref{Cdouble}() entail an allocation or not?

My practical usage is the following 3-line function, in which r is a temporary object

function get_from_C_API()
    r = Ref{Cdouble}()
    a_C_api!(r) # fill in its value
    getfield(r, :x)
end

I wonder whether this function allocates when it is executed. Can the compiler optimize it so that there are 0 allocations?
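To make the question concrete without the real C API, here is a sketch that swaps in libm’s modf (which stores the integer part of its argument through a pointer) as a hypothetical stand-in for a_C_api!:

```julia
# modf stores the integer part through the pointer, standing in
# for a_C_api! filling the temporary Ref.
function get_from_C()
    r = Ref{Cdouble}()
    ccall(:modf, Cdouble, (Cdouble, Ref{Cdouble}), 2.75, r)
    getfield(r, :x)   # 2.0, the integer part stored by modf
end

check_alloc() = @allocated get_from_C()   # measure inside a function

get_from_C()    # warm up
check_alloc()
```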


I have no idea which helper macro I should trust, e.g. in this example:

julia> function f(x)
           r = Ref{Cdouble}()
           r.x > -9e100 && setindex!(x, 0, 1)
           r.x
       end
f (generic function with 1 method)

julia> x = [1];

julia> @allocated f(x)
0

Yes and no, depending on the context:

julia> @code_llvm f()
; Function Signature: f()
;  @ REPL[1]:1 within `f`
define nonnull ptr @julia_f_3678() local_unnamed_addr #0 {
top:
  %pgcstack = call ptr inttoptr (i64 4315955464 to ptr)(i64 4315955500) #7
; ┌ @ refpointer.jl:146 within `Ref`
; │┌ @ refvalue.jl:7 within `RefValue`
    %ptls_field = getelementptr inbounds nuw i8, ptr %pgcstack, i64 16
    %ptls_load = load ptr, ptr %ptls_field, align 8
    %"new::RefValue" = call noalias nonnull align 8 dereferenceable(16) ptr @ijl_gc_small_alloc(ptr %ptls_load, i32 424, i32 16, i64 4642896336) #4
    %"new::RefValue.tag_addr" = getelementptr inbounds i8, ptr %"new::RefValue", i64 -8
    store atomic i64 4642896336, ptr %"new::RefValue.tag_addr" unordered, align 8
    ret ptr %"new::RefValue"
; └└
}

See the @ijl_gc_small_alloc? That’s a memory allocation.

julia> @macroexpand @allocated f()
quote
    #= timing.jl:561 =#
    $(Expr(:meta, :force_compile))
    #= timing.jl:562 =#
    Base.allocated(f)
end

julia> @code_llvm Base.allocated(f)
; Function Signature: allocated(typeof(Main.f))
;  @ timing.jl:517 within `allocated`
define i64 @julia_allocated_4009() local_unnamed_addr #0 {
top:
;  @ timing.jl:518 within `allocated`
; ┌ @ refpointer.jl:147 within `Ref`
; │┌ @ refvalue.jl:8 within `RefValue`
    %"new::RefValue" = alloca i64, align 16
; └└
;  @ timing.jl:519 within `allocated`
; ┌ @ refpointer.jl:147 within `Ref`
; │┌ @ refvalue.jl:8 within `RefValue`
    %"new::RefValue4" = alloca i64, align 16
    call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %"new::RefValue")
; └└
;  @ timing.jl:518 within `allocated`
; ┌ @ refpointer.jl:147 within `Ref`
; │┌ @ refvalue.jl:8 within `RefValue`
    store i64 0, ptr %"new::RefValue", align 16
    call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %"new::RefValue4")
; └└
;  @ timing.jl:519 within `allocated`
; ┌ @ refpointer.jl:147 within `Ref`
; │┌ @ refvalue.jl:8 within `RefValue`
    store i64 0, ptr %"new::RefValue4", align 16
; └└
;  @ timing.jl:520 within `allocated`
; ┌ @ timing.jl:509 within `gc_bytes`
   call void @jlplt_ijl_gc_get_total_bytes_4013_got.jit(ptr nonnull %"new::RefValue")
; └
;  @ timing.jl:522 within `allocated`
; ┌ @ timing.jl:509 within `gc_bytes`
   call void @jlplt_ijl_gc_get_total_bytes_4013_got.jit(ptr nonnull %"new::RefValue4")
; └
;  @ timing.jl:523 within `allocated`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base_compiler.jl:57 within `getproperty`
    %"new::RefValue4.x" = load i64, ptr %"new::RefValue4", align 8
    %"new::RefValue.x" = load i64, ptr %"new::RefValue", align 8
; └└
; ┌ @ int.jl:86 within `-`
   %0 = sub i64 %"new::RefValue4.x", %"new::RefValue.x"
; └
  ret i64 %0
}

@allocated f() is actually calling Base.allocated(f), and inside that function the compiler replaced the heap allocation with a stack-based alloca.
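In other words, whether Ref{Cdouble}() allocates depends on whether the Ref escapes. A minimal sketch (hypothetical names; whether the optimization actually fires depends on the Julia version and the surrounding code):

```julia
escaping() = Ref{Cdouble}(1.0)      # the Ref is returned, so it must live on the heap
contained() = Ref{Cdouble}(1.0)[]   # the Ref never escapes; the compiler is free to
                                    # turn the heap allocation into a stack slot
```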


The problem with benchmarking small “isolated” code snippets is that they’re isolated. They may behave differently when inlined somewhere else, or when constants are propagated, and the compiler then finds or loses opportunities for optimization.
