Rules when box/unbox applies

Aiming to be a good citizen, I’ve followed the PSA: Make it easier to help you and prepared an example that is stripped down as much as possible.

Not sure if it matters, but I’m on Win10 using Julia 1.5.3.

abstract type AbstractField{T} end

pos(this::AbstractField) ::UnitRange{Int} = this.pos
pos(this::AbstractField, start::Int) ::UnitRange{Int} = (start -= 1; start + pos(this).start : start + pos(this).stop)
start_pos(this::AbstractField) ::Int = pos(this).start
length(this::AbstractField) ::Int = pos(this).stop - pos(this).start + 1
get_bytes(this::AbstractField, buf::AbstractVector{UInt8}, start::Int=1) = view(buf, pos(this, start))
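# For illustration (hypothetical numbers): with pos(field) == 1:10 and a record
# length of 105, the second record starts at byte 106, so pos(field, 106) == 106:115
# and get_bytes(field, buf, 106) is a view of bytes 106..115 of buf.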

# -----------------------------------------------------------------------------
struct AsciiField <: AbstractField{AbstractString}
    name::AbstractString
    pos::UnitRange{Int}
end

# --------------------------------------------------------------------
struct Record
    fields::Vector{AbstractField}
end

get_field(this::Record, idx::Union{AbstractString,Int}) = this.fields[idx]

# --------------------------------------------------------------------
struct FWFile
    reclen::Int
    buf::Vector{UInt8}
    recdef::Record
end

function FWFile(; recdef::Record)
    buf = repeat(b"x", 1_000_000_000)   # ~1 GB dummy buffer filled with 'x' bytes
    return FWFile(105, buf, recdef)     # record length hard-coded to 105 for this example
end

function get_field(file::FWFile, idx::Union{AbstractString,Int})
    return get_field(file.recdef, 1)    # stripped-down: always returns the first field
end

# --------------------------------------------------------------------
values_2(this::FWFile, field::AbstractString) = values_2(this, get_field(this, field))

function values_2(this::FWFile, field::AbstractField)
    start::Int = start_pos(field)
    reclen::Int = this.reclen
    stop::Int = sizeof(this.buf) - reclen
    range = start : reclen : stop       # start offset of the field in each consecutive record
    flen::Int = length(field)
    @show range
    return (get_bytes(field, this.buf, i) for i in range)  # lazy generator of byte views
end

function iter_loop(file::FWFile, field::Union{AbstractField,AbstractString})
    for x in values_2(file, field)
        # empty
    end
end

function my_test()
    f1 = AsciiField("A", 1:10)
    rec = Record([f1])
    fwf = FWFile(recdef=rec)

    @time iter_loop(fwf, f1)                    # (1)
    @time iter_loop(fwf, get_field(fwf, "A"))   # (2)
    @time iter_loop(fwf, "A")                   # (3)

    GC.gc()
end

The output on my laptop looks as follows:

julia> my_test()
range = 1:105:999999841
  0.105971 seconds (59.44 k allocations: 3.010 MiB)
range = 1:105:999999841
  0.037077 seconds (37 allocations: 1.531 KiB)
range = 1:105:999999841
  3.323590 seconds (38.10 M allocations: 1.419 GiB, 14.19% gc time)

(1) I fully understand why this version can be best optimized/specialised by the compiler.
(2) Can only be optimized at runtime, but obviously Julia does well and specialises it.
(3) From my point of view, not really different from (2); I would think it is even easier to optimize. But something is clearly wrong here. I must have crossed some borderline, but I don’t know what it is.

My question: I did not find documentation or blog posts that explain the rules (or maybe I didn’t understand them) for if and when boxing/unboxing applies, or when specialisation isn’t possible, leading to more allocations. I did read about function barriers, but as you can see the functions are really tiny, I’m not re-using local variables, and pretty much everything is typed. If somebody could please help me and explain the exact rules, or point me to the right documentation, that would be very much appreciated.

Does your loop in iter_loop literally do nothing? What are you timing then? Also you should check out @btime from BenchmarkTools.

What it looks like to me is that the first two versions of iter_loop are calling the signature

iter_loop(::FWFile, ::AsciiField)

and the third version is calling

iter_loop(::FWFile, ::String)

In the first two versions, values_2 can specialize on AsciiField, and the whole loop can be better optimized. In the 3rd case this isn’t possible. It’s true that the second case is doing dynamic dispatch, but it’s only doing it once, after it gets the value from get_field. The function barrier there helps substantially.

Rather than the version of iter_loop you have now for AbstractString, having a separate method that just forwards to the AbstractField version can provide you with a simpler and more general function barrier.
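For instance, something along these lines (just a sketch reusing the names from your example; it would take over the String half of the current Union method):

iter_loop(file::FWFile, field::AbstractString) = iter_loop(file, get_field(file, field))

The String method then does the lookup and the single dynamic dispatch, and the loop itself runs in a method specialized on the concrete field type.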

Does your loop in iter_loop literally do nothing?

It does something useful in our real code, but the best practices described in PSA: Make it easier to help you suggest stripping examples down as much as possible, and the iter_loop body is not relevant to my question.

What it looks like to me is that the first two versions of iter_loop are calling the signature
iter_loop(::FWFile, ::AsciiField)
and the third version is calling
iter_loop(::FWFile, ::String)

Well, yes and no. The 3rd option calls iter_loop(::FWFile, ::String), which, after resolving the String, calls iter_loop(::FWFile, ::AbstractField). And like option 2, it should be able to specialise to iter_loop(::FWFile, ::AsciiField). Options 2 and 3 are only marginally different.

Also you should check out @btime from BenchmarkTools

Yes, I did. How exactly does it help answer my question?

This is not true, unfortunately. In the best case, you would hope that constant propagation would figure this out, but in general it is not possible. Only the first call, iter_loop(::FWFile, ::String), gets a specialized method; the inner lookup cannot be resolved statically. See my addition below for how to get the effect you describe.
Note that the string "A" could just as well be "B", which might map to a field that is not an AsciiField when it is looked up in get_field, and type inference has no idea whether those should lead to the same code or not. We might hope that the compiler sees that the string “A” is a compile-time constant and therefore figures out the type it leads to, but that seems to be too complicated considering recdef holds a mutable Vector{AbstractField} (maybe in a much stricter context it could do it). It therefore gives up on type inference and goes with something safe (and slow).
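If you want to see this for yourself, one quick check (just a sketch, assuming the definitions above are loaded) is to ask inference what the field lookup can return:

Base.return_types(get_field, Tuple{FWFile, String})

I would expect this to report AbstractField rather than AsciiField: the lookup goes through Record.fields, which is a Vector{AbstractField}, so the abstract element type is all that inference can know. That abstract result is exactly what prevents values_2 from specializing in version 3.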

To run the benchmark, I changed iter_loop to accumulate whatever values are in the field, so that I’m not timing a do-nothing function (just in case):

function iter_loop(file::FWFile, field::Union{AbstractField,AbstractString})
    acc = 0x00
    for x in values_2(file, field)
        acc += sum(x)
    end
    acc
end

To demonstrate the function barrier, I also added this method:

iter_loop_barrier(file::FWFile, field::AbstractString) = iter_loop(file, get_field(file, field))

And added it to your benchmark:

function my_test()
    f1 = AsciiField("A", 1:10)
    fwf = FWFile(recdef = Record([f1]))

    @time iter_loop(fwf, f1)                    # (1)
    @time iter_loop(fwf, get_field(fwf, "A"))   # (2)
    @time iter_loop(fwf, "A")                   # (3)
    @time iter_loop_barrier(fwf, "A")           # (4)
end

Here are the timings in Julia 1.5:

julia> FunctionBarrierMod.my_test()
  0.638949 seconds
  0.133524 seconds (2 allocations: 48 bytes)
  8.238473 seconds (57.14 M allocations: 1.703 GiB, 6.99% gc time)
  0.937081 seconds (2 allocations: 48 bytes)

In 1.6, the difference is even more pronounced (because everything else got much faster!):

julia> FunctionBarrierMod.my_test()
  0.184903 seconds
  0.144452 seconds (2 allocations: 48 bytes)
  6.162244 seconds (57.14 M allocations: 1.703 GiB, 5.97% gc time)
  0.236389 seconds (2 allocations: 48 bytes)

By the way, this function seems to be giving the compiler a hard time beyond just the end result being slow (that is, it’s taking a really long time to compile). The benchmark, as well as whatever intuition I have about Julia, suggests that’s due to the Union{AbstractString,AbstractField} signatures. Defining everything for AbstractField and then forwarding to those methods with thin wrappers should be simpler all around.
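Roughly what I mean (a sketch only, reusing the names from above; the AbstractField method does the work and the AbstractString method is a thin wrapper, instead of the Union signature):

# work happens in methods that dispatch on AbstractField
function iter_loop(file::FWFile, field::AbstractField)
    acc = zero(UInt)                    # sum over UInt8 views widens to UInt
    for x in values_2(file, field)
        acc += sum(x)
    end
    return acc
end

# the AbstractString entry point only does the lookup and forwards
iter_loop(file::FWFile, field::AbstractString) = iter_loop(file, get_field(file, field))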

Yes, the new method does the same thing as version 2, but the point is I get to call it like version 3, and it’s about as fast as version 1, without relying on constant folding that may or may not happen.

I don’t recall any resources about function barriers off the top of my head, but I know they’re out there. I remember reading an old Tim Holy article about indexing that talked about them in some detail. Understanding type stability and how and when dispatch happens is also useful, and covered in the manual. I’ll look for a source.

As for BenchmarkTools, it’s a good idea to use it because the timing is more reliable. It wasn’t an answer to your question, just a tip :slight_smile:
I still used @time above to keep things consistent, and because once you get to milliseconds, @time is “fine”.
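For the record, a typical invocation would look like this (a sketch; it assumes fwf and f1 have been constructed at the REPL as in my_test, and the $ interpolation keeps the globals from being timed as part of the call):

using BenchmarkTools

@btime iter_loop($fwf, $f1)
@btime iter_loop($fwf, "A")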


Thanks a lot for the explanation. Is there some compiler flag or tooling that prints an info message or a warning when this happens, i.e. when the compiler is not able to generate efficient code?

Yes! E.g. @code_warntype my_test() (you can apply it to any Julia function call). Everything printed in red is “bad” (not concretely typed).
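For example (a sketch, assuming fwf is built as in my_test; calling it on the problematic variant directly is usually more informative than on my_test as a whole):

julia> fwf = FWFile(recdef = Record([AsciiField("A", 1:10)]));

julia> @code_warntype iter_loop(fwf, "A")   # look for Any / red entries in the body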

Edit: I should amend to say two things:

  1. Tools like ProfileView.jl are useful for this also, as they highlight type instabilities in the stack with a red color (see the sketch below this list).
  2. Type instability is not inherently bad, and can be just fine if used judiciously. What there isn’t a tool for examining is whether a function is “easy for the compiler” or “hard for the compiler”. That sort of functionality might exist some day, but at the moment it doesn’t.
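For point 1, a minimal ProfileView run could look like this (a sketch, assuming ProfileView.jl is installed and my_test is defined as above):

using ProfileView

my_test()             # run once first so compilation is not profiled
@profview my_test()   # red bars in the flame graph typically indicate runtime dispatch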

Thanks a lot for your time and effort.
