Why does arrayref throw?

Here is a small example (initializing part of a vector with every second element of another vector) that seems to demonstrate that the check is not always completely free.
I’m not claiming this is an important case. I’m not even sure the code is correct (or that my use of BenchmarkTools is correct).

using BenchmarkTools

# Load a[i] via raw pointers, bypassing both the bounds check and the
# undefined-reference check.
@inline function unsafe_get(a::Vector{Vector{Int}}, i::Int)
    x = unsafe_load(Ptr{Ptr{Int}}(pointer(a, i)))
    ccall(:jl_value_ptr, Vector{Int}, (Ptr{Cvoid},), x) # unsafe_pointer_to_objref()
end

function f_a!(dst::Vector{Vector{Int}}, src::Vector{Vector{Int}})
    for i = 1:div(length(src), 2)
        @inbounds dst[i] = src[2i]
    end
    nothing
end

function f_b!(dst::Vector{Vector{Int}}, src::Vector{Vector{Int}})
    for i = 1:div(length(src), 2)
        @inbounds dst[i] = unsafe_get(src, 2i)
    end
    nothing
end

function bench(n::Int)
    a = [[i] for i=1:2n]
    b = similar(a)

    ta = @belapsed f_a!($b, $a)
    tb = @belapsed f_b!($b, $a)
    ta/tb, round(ta, sigdigits=3), round(tb, sigdigits=3)
end

for n=200:20:300
    @show n, bench(n)
end

On my machine I get the following results (with Julia 1.9.3 and --optimize=3)

(n, bench(n)) = (200, (1.0590968621611438, 2.46e-7, 2.33e-7))
(n, bench(n)) = (220, (1.114640410958904, 2.85e-7, 2.55e-7))
(n, bench(n)) = (240, (1.0520847573479153, 2.93e-7, 2.79e-7))
(n, bench(n)) = (260, (1.1019163763066202, 3.61e-7, 3.28e-7))
(n, bench(n)) = (280, (1.049052479029267, 3.42e-7, 3.26e-7))
(n, bench(n)) = (300, (1.085224878870866, 3.89e-7, 3.58e-7))

I don’t see anything wrong with the benchmark, but you’re seeing a very small difference in the minimum time there, and the minimum is subject to variation from run to run (you can see it across your printout lines) because the machine’s performance varies for many reasons over time. So it’s important to look at the distribution that @benchmark shows, as it captures more of the machine variation (though not over longer time scales).

Running this on my machine, I got a fairly consistent difference the other way around. I’m sure that if I kept going, I could get something like your result by pure chance.

julia> for n=200:20:300
           @show n, bench(n)
       end
(n, bench(n)) = (200, (0.9910646693459995, 1.34e-7, 1.35e-7))
(n, bench(n)) = (220, (0.9954595955681143, 1.46e-7, 1.47e-7))
(n, bench(n)) = (240, (0.9914619076992319, 1.58e-7, 1.6e-7))
(n, bench(n)) = (260, (0.9721958890266656, 1.74e-7, 1.79e-7))
(n, bench(n)) = (280, (0.9884146378915103, 1.83e-7, 1.85e-7))
(n, bench(n)) = (300, (1.1108821051267095, 2.39e-7, 2.15e-7))

Looking at @benchmark, the full distributions are not significantly different; the spread is much larger than the variation in the minimum times above:

julia> @benchmark f_a!($b, $a)
BenchmarkTools.Trial: 10000 samples with 875 evaluations.
 Range (min … max):  133.383 ns … 59.799 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     136.080 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   292.160 ns ±  2.918 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▅▅▃▁▁▁     ▁▂▂▁▁▁                                          ▁
  █████████████████████▇█▇▆▆▅▆▆▆▆▇▅▆▇▆▆▅▆▄▅▆█▇▆▆▆▆▆▆▅▆▅▅▆▂▂▅▅▄ █
  133 ns        Histogram: log(frequency) by time       240 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f_b!($b, $a)
BenchmarkTools.Trial: 10000 samples with 873 evaluations.
 Range (min … max):  135.006 ns … 59.885 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     138.683 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   295.075 ns ±  2.934 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄██▅▅▃▁▁▁   ▁▂▂▂▁▁▁                   ▂                      ▂
  ███████████████████████▇▇▆▅▆▅▆▆▇▅▅▆▅▆██▇▆▆▅▆▇▄▅▆▆█▇▄▅▅▅▅▄▅▅▆ █
  135 ns        Histogram: log(frequency) by time       230 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thanks for the brief on BenchmarkTools!

Actually, the numbers I get from @belapsed appear to be stable. The ratio always seems to land close to either 1.05 or 1.10; the numbers I posted were not cherry-picked.

For example, changing to

for i = 1:10
    @show i
    for n=200:20:300
        @show n, bench(n)
    end
end

I get

i = 1
(n, bench(n)) = (200, (1.059105419604759, 2.47e-7, 2.33e-7))
(n, bench(n)) = (220, (1.1169674264306346, 2.85e-7, 2.55e-7))
(n, bench(n)) = (240, (1.069635126777984, 3.07e-7, 2.87e-7))
(n, bench(n)) = (260, (1.115670074219297, 3.35e-7, 3.0e-7))
(n, bench(n)) = (280, (1.0594082334289328, 3.45e-7, 3.26e-7))
(n, bench(n)) = (300, (1.103136412176504, 3.87e-7, 3.51e-7))
i = 2
(n, bench(n)) = (200, (1.0582788259641385, 2.46e-7, 2.33e-7))
(n, bench(n)) = (220, (1.1114801668710947, 2.85e-7, 2.56e-7))
(n, bench(n)) = (240, (1.0708995589581427, 3.1e-7, 2.9e-7))
(n, bench(n)) = (260, (1.1171171171171173, 3.35e-7, 3.0e-7))
(n, bench(n)) = (280, (1.0556340121896834, 3.41e-7, 3.23e-7))
(n, bench(n)) = (300, (1.1086707410236822, 3.86e-7, 3.48e-7))
i = 3
(n, bench(n)) = (200, (1.057655254187046, 2.46e-7, 2.33e-7))
(n, bench(n)) = (220, (1.1151592455163886, 2.85e-7, 2.55e-7))
(n, bench(n)) = (240, (1.0550129230029452, 3.08e-7, 2.92e-7))
(n, bench(n)) = (260, (1.115670074219297, 3.35e-7, 3.0e-7))
(n, bench(n)) = (280, (1.055371768699117, 3.41e-7, 3.23e-7))
(n, bench(n)) = (300, (1.0986128024567163, 3.88e-7, 3.53e-7))
i = 4
(n, bench(n)) = (200, (1.0596853679833065, 2.47e-7, 2.33e-7))
(n, bench(n)) = (220, (1.1106500691562933, 2.85e-7, 2.56e-7))
(n, bench(n)) = (240, (1.0704536925477495, 3.1e-7, 2.9e-7))
(n, bench(n)) = (260, (1.1172220085544793, 3.36e-7, 3.01e-7))
(n, bench(n)) = (280, (1.0602796213156354, 3.42e-7, 3.23e-7))
(n, bench(n)) = (300, (1.1026798679867988, 3.9e-7, 3.54e-7))
i = 5
(n, bench(n)) = (200, (1.0583630967352484, 2.47e-7, 2.33e-7))
(n, bench(n)) = (220, (1.1109747721467211, 2.84e-7, 2.56e-7))
(n, bench(n)) = (240, (1.0617967276929627, 3.09e-7, 2.91e-7))
(n, bench(n)) = (260, (1.1148175429981175, 3.36e-7, 3.01e-7))
(n, bench(n)) = (280, (1.0604426839779206, 3.43e-7, 3.23e-7))
(n, bench(n)) = (300, (1.099881188118812, 3.89e-7, 3.54e-7))
i = 6
(n, bench(n)) = (200, (1.0554535294204532, 2.46e-7, 2.33e-7))
(n, bench(n)) = (220, (1.1143280519776586, 2.85e-7, 2.56e-7))
(n, bench(n)) = (240, (1.0622056764668448, 3.1e-7, 2.92e-7))
(n, bench(n)) = (260, (1.1133150934792522, 3.35e-7, 3.01e-7))
(n, bench(n)) = (280, (1.0656216229644626, 3.44e-7, 3.23e-7))
(n, bench(n)) = (300, (1.109878305889888, 3.87e-7, 3.48e-7))
i = 7
(n, bench(n)) = (200, (1.0540566398775353, 2.46e-7, 2.33e-7))
(n, bench(n)) = (220, (1.1132480092307926, 2.85e-7, 2.56e-7))
(n, bench(n)) = (240, (1.066147859922179, 3.09e-7, 2.89e-7))
(n, bench(n)) = (260, (1.1114110244264885, 3.36e-7, 3.02e-7))
(n, bench(n)) = (280, (1.0601851851851853, 3.46e-7, 3.26e-7))
(n, bench(n)) = (300, (1.0902104780014426, 3.9e-7, 3.58e-7))
i = 8
(n, bench(n)) = (200, (1.0571566797981893, 2.46e-7, 2.33e-7))
(n, bench(n)) = (220, (1.1163452772756068, 2.85e-7, 2.56e-7))
(n, bench(n)) = (240, (1.0534270759129059, 3.1e-7, 2.95e-7))
(n, bench(n)) = (260, (1.1091549091247044, 3.36e-7, 3.03e-7))
(n, bench(n)) = (280, (1.0609731926069073, 3.45e-7, 3.25e-7))
(n, bench(n)) = (300, (1.0962927618063012, 3.88e-7, 3.54e-7))
i = 9
(n, bench(n)) = (200, (1.060763288366194, 2.47e-7, 2.33e-7))
(n, bench(n)) = (220, (1.1128090925961693, 2.84e-7, 2.56e-7))
(n, bench(n)) = (240, (1.0528404658610353, 2.94e-7, 2.79e-7))
(n, bench(n)) = (260, (1.1114110244264885, 3.36e-7, 3.02e-7))
(n, bench(n)) = (280, (1.0486630148609362, 3.43e-7, 3.27e-7))
(n, bench(n)) = (300, (1.07691203902999, 3.91e-7, 3.63e-7))
i = 10
(n, bench(n)) = (200, (1.055011655011655, 2.46e-7, 2.34e-7))
(n, bench(n)) = (220, (1.106808983977469, 2.85e-7, 2.58e-7))
(n, bench(n)) = (240, (1.0547503571697119, 2.96e-7, 2.81e-7))
(n, bench(n)) = (260, (1.107662103055896, 3.36e-7, 3.03e-7))
(n, bench(n)) = (280, (1.0569057789561545, 3.43e-7, 3.25e-7))
(n, bench(n)) = (300, (1.0881835847136114, 3.88e-7, 3.57e-7))

The more low-tech benchmark

function bench(f!::Function, n::Int, m::Int)
    a = [[i] for i=1:2n]
    b = similar(a)

    for i = 1:m
        f!(b, a)
    end
    return
end

@time bench(f_a!, 300, 10_000_000)
@time bench(f_a!, 300, 10_000_000)
@time bench(f_a!, 300, 10_000_000)

@time bench(f_b!, 300, 10_000_000)
@time bench(f_b!, 300, 10_000_000)
@time bench(f_b!, 300, 10_000_000)

gives me

  3.242679 seconds (602 allocations: 47.125 KiB)
  3.230558 seconds (602 allocations: 47.125 KiB)
  3.236618 seconds (602 allocations: 47.125 KiB)
  3.023927 seconds (602 allocations: 47.125 KiB)
  3.019529 seconds (602 allocations: 47.125 KiB)
  3.027147 seconds (602 allocations: 47.125 KiB)
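Dividing by the number of calls (my own arithmetic, not part of the timings above) gives per-call times consistent with the @belapsed ratios:

t_a = 3.236618 / 10_000_000   # ≈ 324 ns per f_a! call
t_b = 3.027147 / 10_000_000   # ≈ 303 ns per f_b! call
t_a / t_b                     # ≈ 1.07, in line with the 1.05–1.10 ratios above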

I ran all of this from scripts, like julia --project=. --optimize=3 benchmark.jl, whereas I notice that you ran from the REPL. Can that make any difference?
Perhaps more likely is that the difference is caused by differences in hardware and/or Julia version.

From (in the REPL)

function bench(f::Function, n::Int)
    a = [[i] for i=1:2n]
    b = similar(a)

    @benchmark $f($b, $a)
end

bench(f_a!, 200)
bench(f_b!, 200)

I have

julia> bench(f_a!, 200)
BenchmarkTools.Trial: 10000 samples with 390 evaluations.
 Range (min … max):  245.897 ns … 596.667 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     247.179 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   252.319 ns ±   9.886 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▇█▅▃         ▁▃▂ ▁▂▁               ▃▆▆▂▂        ▂            ▂
  █████▇▅▄▅▅▄▃▃▇███▆███▆▅▃▅▄▇█▆▆▄▁▄▄▄▇██████▆▄▄▄▄▆▇████▇█▆▅▄▄▆▄ █
  246 ns        Histogram: log(frequency) by time        273 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> bench(f_b!, 200)
BenchmarkTools.Trial: 10000 samples with 460 evaluations.
 Range (min … max):  233.043 ns … 475.435 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     234.130 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   239.150 ns ±   9.005 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▇█▄            ▂▃▁▁     ▁       ▂▆▆▃▂▁          ▂▁           ▂
  ████▇▆▄▄▃▁▁▁▁▄▁▅████▅▄▅▇███▆▆▅▄▄▆██████▇▅▅▅▄▅▆▇▇▇███▇▅▁▇▇█▆▆▆ █
  233 ns        Histogram: log(frequency) by time        257 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Interesting. I had expected that, by chance, your comparative benchmarks would sometimes swap numbers and go the other way, e.g. 2.46e-7, 2.33e-7 swapping with 3.1e-7, 2.95e-7 to make 2.46e-7, 2.95e-7. For fun, you could swap the order of the benchmarks in bench, but still calculate ta/tb, and see whether the difference persists, as sketched below.
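Concretely, something like this (a sketch reusing the definitions from earlier in the thread, with only the measurement order swapped):

using BenchmarkTools

# Same as bench above, but with f_b! measured first and f_a! second,
# while still reporting ta/tb.
function bench_swapped(n::Int)
    a = [[i] for i = 1:2n]
    b = similar(a)

    tb = @belapsed f_b!($b, $a)
    ta = @belapsed f_a!($b, $a)
    ta/tb, round(ta, sigdigits=3), round(tb, sigdigits=3)
end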

The possibility that is consistent with indistinguishable performance is that the machine’s performance happens to vary periodically in sync with the benchmark lengths, but that’s unlikely.

This seems more reasonable. When we’re talking about <2x performance differences, they can often be outweighed and even reversed by other non-random factors like this. My benchmark timings were already faster to begin with, and the distributions’ shapes look different, so the conditions can’t be the same.

This might be relevant: a package, UnsafeAssume, is being registered. Example code for preventing undefined-reference checks using the new package:

using UnsafeAssume: unsafe_assume_condition

function unsafe_array_access(a)
  @inline begin
    # Promise the compiler that a[1] is assigned, so the undefined-reference
    # check (and, together with @inbounds, the bounds check) can be removed.
    unsafe_assume_condition(isassigned(a, 1))
    @inbounds a[1]
  end
end
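Applied to the benchmark earlier in the thread, usage might look like this (my own hypothetical adaptation, not code from the package):

# Hypothetical variant of f_a!/f_b! from above: keep ordinary indexing, but promise
# the compiler that every src[2i] is assigned, so the undefined-reference check can
# be dropped. Assumes UnsafeAssume is loaded as above.
function f_c!(dst::Vector{Vector{Int}}, src::Vector{Vector{Int}})
    for i = 1:div(length(src), 2)
        @inline begin
            unsafe_assume_condition(isassigned(src, 2i))
            @inbounds dst[i] = src[2i]
        end
    end
    nothing
end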

EDIT:

Without having thought too carefully about the assembly generated for unsafe_array_access (see my repo on GitLab):

  1. It doesn’t include any branches or other conditional operations

  2. The only instructions are push, mov, mov, mov, pop, and ret

While I’m only guessing, I’d say the code is optimal (assuming no undefined references).
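(For anyone wanting to reproduce that check locally, one way, assuming UnsafeAssume is installed, is:)

using InteractiveUtils  # for @code_native when running from a script

# Inspect the native code for a Vector{Vector{Int}} argument; with the assumption
# in place there should be no conditional branch to an undef-ref error path.
@code_native debuginfo=:none unsafe_array_access([[1], [2]])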


Yes, sure. Tried it. The results did not change in any noticeable way.

Thanks for making this package!

From README.md · 52b79427f84619a4ebcbdf30069df1216d129e0f · Neven Sajko / UnsafeAssume.jl · GitLab

The calls of UnsafeAssume functions, along with all the code that establishes the relevant assumptions, together with the code where applying the assumptions is necessary for performance, should be within an @inline block. This ensures all relevant information is available to the LLVM optimizer when the optimizer needs it.

So, I gather the use of call-site @inline is important here. Is it possible to explain why it is needed?
And, should I be using it in other places too (unrelated to unsafe_assume_condition), to enable some LLVM optimizations? Examples?


I’m not well-versed in LLVM, but I suppose it comes down to the fact that compiler optimizer passes are divided into interprocedural (between procedures) and intraprocedural (within a procedure). Intraprocedural passes include optimizations that would be cost-prohibitive (in compilation time) as part of interprocedural optimization, because many of the underlying problems are NP-complete, so decreasing the problem size prevents an explosion of compilation-time costs.

Inlining literally transforms interprocedural problems into intraprocedural ones, because what it does is paste the body of one procedure into another, thus enabling new optimizations. This is why inlining is called an enabling transformation.

So in this specific example (I suppose) we need inlining to enable dead-code elimination of:

  1. the code that establishes the assumptions, isassigned(a, 1) in this case
  2. the code that we wanted to eliminate in the first place, namely the undefined reference check

I suppose the relevant optimizer passes simply don’t see the information needed to infer that some code may be eliminated unless all of the relevant code is within the same procedure.
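As a toy illustration (a hypothetical example of mine, not from UnsafeAssume), the same mechanism shows up with an ordinary branch: only after the callee’s body has been inlined into the caller can the optimizer see that the branch is dead and remove it.

# Hypothetical example: on its own, `callee` must keep the isodd check.
callee(x::Int) = isodd(x) ? error("odd argument") : x ÷ 2

# Once the call is inlined here, the optimizer can see that 2k is always even,
# so the error branch becomes dead code and may be eliminated.
function caller(k::Int)
    @inline callee(2k)   # call-site @inline, as in the UnsafeAssume README advice
end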

The Julia compiler should usually be able to take care of everything more or less on its own. What complicates things here is that, to implement the UnsafeAssume functions, it’s necessary to use inline (LLVM) assembly: llvmcall. I think that when the Julia compiler sees an inline assembly call, that call is basically opaque to it. Among other things (like giving up on effect inference), Julia by default doesn’t like inlining llvmcall calls. You can see that in the UnsafeAssume README, in the sqrt example section:

julia> Base.print_statement_costs(stdout, unsafe_sqrt, Tuple{Float64})
unsafe_sqrt(x) @ Main REPL[2]:1
    0 1 ─ %1  = Base.lt_float(_2, 0.0)::Bool
    0 │   %2  = Base.not_int(%1)::Bool
    0 └──       goto #3 if not %2
    0 2 ─       goto #4
 1000 3 ─       (Core.Intrinsics.llvmcall)("unreachable", UnsafeAssume.Cvoid, Tuple{})::Nothing

The numbers on the left are the inlining “costs”, which are how the Julia compiler models what should and shouldn’t be inlined by default. You can see that llvmcall has a huge inlining cost, so the Julia authors evidently wanted to prevent inlining llvmcall calls by default.

It might make sense for something like unsafe_assume_condition to be included in Julia as an intrinsic. This could presumably make using it less finicky, and it could also, hypothetically, enable Julia-level optimizations in addition to LLVM-level ones. I suppose the Julia authors would only consider including such an intrinsic in the Julia implementation if the UnsafeAssume package proves useful and popular enough to justify the cost.

An alternative approach may be to teach Julia about LLVM instructions and intrinsics like unreachable and llvm.assume when they appear in inline assembly.
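For reference, the pattern visible in the print_statement_costs output above can be written roughly like this (my own sketch of the mechanism, not necessarily the package’s actual implementation):

# Sketch: if the condition is false, execute an `unreachable` via llvmcall, so the
# optimizer may assume the condition holds on every path that continues.
@inline function my_assume_condition(cond::Bool)
    if !cond
        Base.llvmcall("unreachable", Cvoid, Tuple{})
    end
    nothing
end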


OK, that makes sense! Thanks for the clear and detailed write-up!


For posterity, this got changed in codegen: remove UB from uninitialized bitstypes in new by vtjnash · Pull Request #52169 · JuliaLang/julia · GitHub.
