Making @benchmark outputs statistically meaningful, and actionable

While it’s possible (though IMO relatively unlikely) that there’s stuff like what Emery discussed in the video going on (I didn’t click your link because I already watched the video last year), I feel it’s important to note that just because you didn’t consciously do anything on your machine doesn’t mean it wasn’t doing things in the background.

Ultimately though, yeah, I agree that the way benchmarking is done in Julia has the potential for large systematic errors, and it’d be great if we got better at characterizing or eliminating those errors, as they can sometimes mislead people during optimization. That said, I think there’s also some quite good evidence that we’re not being totally misled by our benchmarks, as shown by how well a lot of the benchmarking results people share carry over across machines and even architectures (and in the best cases, there’s some additional human reasoning guiding the optimization process instead of just alchemy).

Yes, if someone shows you that they managed to shave 5% off a micro-benchmark, there’s always the possibility that that’s a totally non-reproducible quirk of their benchmarking process, but once we start getting into order of magnitude improvements, especially by iteratively layering improvements, I get a lot less antsy.

This is all quite relevant, though, to how we try to teach people to optimize code. There’s a ton of alchemy out there where newbies try to learn from what experts are doing, and the main thing they pick up is that they should sprinkle magic macros like @inbounds, @simd, @inline, etc. around, instead of actually thinking about what the code is doing and whether there’s a way to do it faster.
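
To make that concrete, here’s a toy illustration of my own (not from anyone’s code in this thread): both functions compute the sum of the first n squares, but only the second comes from thinking about the problem rather than reaching for macros.

# Toy example: 1^2 + 2^2 + ... + n^2
function sum_squares_macros(n)
    s = 0
    @inbounds @simd for i in 1:n   # the "magic macro" reflex; neither macro helps much here
        s += i^2
    end
    return s
end

sum_squares_closed(n) = n * (n + 1) * (2n + 1) ÷ 6   # thinking about the problem: a closed form, O(1)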

8 Likes

@user664303 The video you linked about address-layout effects on benchmarks is well-known. However, there are many well-known weird things that affect benchmarks.

You have not meaningfully established that the thing that fouls up your benchmarks is specifically address layout.

If you want to do that, I would suggest that you grab addresses and then correlate them with timings. If you end up with plots showing a strong correlation between performance and, e.g., the last 14 bits of some relevant function pointer or of some relevant large array address, then we’re cooking (typically one would next brainstorm for confounders, i.e. things like system-wide memory pressure or CPU pressure that could affect both address layout and performance).

Until then, you have a mystery, and I see no grounds to jump to the conclusion that “address layout” is the specific culprit. It is a plausible reason, and it is easy enough to test (take pointers, store them alongside timings, correlate). Once you have correlation, you can probably take a shot at causality as well.
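
For example (my sketch, not anyone’s actual benchmark from this thread; the workload and the 14-bit mask are just placeholders): re-allocate the input each round, record its heap address alongside the median time, and check for correlation with the low address bits.

using BenchmarkTools, Statistics

f(x) = sum(abs2, x)                  # stand-in workload

addrs = UInt[]
medtimes = Float64[]
for _ in 1:50
    x = rand(10_000)                 # fresh allocation => potentially different heap placement
    push!(addrs, UInt(pointer(x)))   # record the array's address
    b = @benchmark f($x)
    push!(medtimes, median(b.times))
end

lowbits = Float64.(addrs .& ((UInt(1) << 14) - 1))   # e.g. the last 14 bits of the address
@show cor(lowbits, medtimes)         # a strong correlation would point at layout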

7 Likes

I’ve posted a couple of graphs which show the change now, so you can see how the timing changes. It’s very different from any kind of heat/CPU usage effect. Not really sure what you want here.

I’d already mentioned it didn’t have the same effect. But you can also see that from the first boxplot graph I’ve posted.

No. Unfortunately I also note on the Stabilizer GitHub page: “This project is no longer being actively maintained, and only works on quite old versions of LLVM.” What would be good is having some C++ code that can be wrapped in a jll and called at any time to do all the things that Stabilizer does. This could then be called between evaluations in @benchmark.

Maybe the only way to actually check if a modification improved performance (on that level of precision) is to do what we are doing here, running the benchmark many times. What would it mean for the user if a tool stabilized the benchmarks, in terms of actual code performance, if they cannot control that during actual package usage?

1 Like

There’s no data to justify this conclusion.

Are you running an unposted variant of lmiq’s boxplot benchmarks? None of those used setup=(data = MyData()). It’s also not clear exactly which variant is being used for which graph, so it’d be good to include those; they could go in a collapsed block so they don’t obscure the graphs.

Yes this was mentioned earlier. Unfortunately updating Stabilizer would be a project, let alone adapting it to more languages.

Checking whether a modification improved performance doesn’t require more evaluations. It requires that samples be better distributed in the space of code layout and data layout: code address offsets, stack address offsets, different heap address offsets, and different offsets between heap addresses. That’s all.

The benchmark would show a broader distribution. In practice, the run times would not be distributed that way, but they would fall within that distribution. The important thing about the distributions is that they allow you to confirm the statistical significance of a change in general, rather than for one specific memory layout, which you cannot account for.
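
As a hypothetical sketch of what confirming significance across layouts could look like (not something proposed in the thread; the functions are placeholders): collect timings for the old and new variants across many re-allocations, then compare the two distributions with a rank-based test such as MannWhitneyUTest from HypothesisTests.jl, which doesn’t assume normality.

using BenchmarkTools, HypothesisTests, Statistics

f_old(x) = sum(abs2, x)
f_new(x) = mapreduce(abs2, +, x)     # hypothetical "optimized" variant

t_old = Float64[]
t_new = Float64[]
for _ in 1:20                        # re-allocate each round so the heap layout varies across rounds
    x = rand(10_000)
    push!(t_old, median((@benchmark f_old($x)).times))
    push!(t_new, median((@benchmark f_new($x)).times))
end

test = MannWhitneyUTest(t_old, t_new)
@show pvalue(test)                   # small p-value => the difference is unlikely to be layout noise alone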

1 Like

There is. It’s in this thread. Look at my box plots. Then look at the box plot from when @lmiq watched a YouTube video. Totally different pattern.

They didn’t. But my first graph ran 20 times with different data. No change.

Two different behaviors from two different people on two different machines; I can’t really call that an experimental result.

Well, the point of setup= is to change the input data between samples within a benchmark, similarly to Stabilizer’s heap randomization, not just between the 20 benchmarks. The first graph suggests it won’t be interesting to apply it when you weren’t reloading the file, but it could be applied to benchmarks where you are. My thinking is that if samplewise setup eliminates the effect, your experiment would justify heap layout as the cause.

It’d be misleading as an absolute measurement of a single benchmark, because of the layout-randomization overhead, but it’s intended for comparisons between two benchmarks of two program variants, so that you can eliminate memory-layout artifacts as a cause of performance improvements or regressions. There’s no program editing in this case, but the randomization could be repurposed to give consistent, if somewhat skewed, benchmarks.

1 Like

For you:

using StatsPlots, BenchmarkTools

plt = boxplot()
for i in 1:20
    include("/user.jl")                                # reload the definitions, as in the original runs
    data = MyData()
    # setup= rebuilds data for every sample, so the heap layout also varies within each benchmark
    b = @benchmark evalfunc(data) setup=(data=MyData()) seconds=15
    boxplot!(plt, [i], b.times, label=nothing)
    display(plt)
end
boxplot!(plt, xticks=(1:20, 1:20))

1 Like

If these effects were due to address alignment, wouldn’t we expect these results to be more OS dependent, and less CPU make/model dependent?

(That’s an honest question by the way, not rhetorical)

4 Likes

For what it’s worth, here’s what I see:

Definitions
# script.jl
using StaticArrays, Static, BenchmarkTools, Random
import NLLSsolver

# Description of BAL image, and function to transform a landmark from world coordinates to pixel coordinates
struct BALImage{T}
    pose::NLLSsolver.EffPose3D{T}   # Extrinsic parameters (rotation & translation)
    sensor::NLLSsolver.SimpleCamera{T} # Intrinsic sensor parameters (just a single focal length)
    lens::NLLSsolver.BarrelDistortion{T} # Intrinsic lens parameters (k1 & k2 barrel distortion)
end
NLLSsolver.nvars(::BALImage) = static(9) # 9 DoF in total for the image (6 extrinsic, 3 intrinsic)
function NLLSsolver.update(var::BALImage, updatevec, start=1)
    return BALImage(NLLSsolver.update(var.pose, updatevec, start),
                    NLLSsolver.update(var.sensor, updatevec, start+6),
                    NLLSsolver.update(var.lens, updatevec, start+7))
end
function BALImage(rx::T, ry::T, rz::T, tx::T, ty::T, tz::T, f::T, k1::T, k2::T) where T<:Real
    return BALImage{T}(NLLSsolver.EffPose3D(NLLSsolver.Pose3D(rx, ry, rz, tx, ty, tz)), 
                       NLLSsolver.SimpleCamera(f), 
                       NLLSsolver.BarrelDistortion(k1, k2))
end
BALImage(v) = BALImage(v[1], v[2], v[3], v[4], v[5], v[6], v[7], v[8], v[9])
transform(im::BALImage, X::NLLSsolver.Point3D) = NLLSsolver.ideal2image(im.sensor, NLLSsolver.ideal2distorted(im.lens, -NLLSsolver.project(im.pose * X)))

# Residual that defines the reprojection error of a BAL measurement
struct BALResidual{T} <: NLLSsolver.AbstractResidual
    measurement::SVector{2, T} # Landmark image coordinates (x, y)
    varind::SVector{2, Int} # Variable indices for the residual (image, landmark)
end
BALResidual(m, v) = BALResidual(SVector{2}(m[1], m[2]), SVector{2, Int}(v[1], v[2]))
NLLSsolver.ndeps(::BALResidual) = static(2) # Residual depends on 2 variables
NLLSsolver.nres(::BALResidual) = static(2) # Residual vector has length 2
NLLSsolver.varindices(res::BALResidual) = res.varind
NLLSsolver.getvars(res::BALResidual{T}, vars::Vector) where T = vars[res.varind[1]]::BALImage{T}, vars[res.varind[2]]::NLLSsolver.Point3D{T}
NLLSsolver.computeresidual(res::BALResidual, im::BALImage, X::NLLSsolver.Point3D) = transform(im, X) - res.measurement
const balrobustifier = NLLSsolver.HuberKernel(2.)
NLLSsolver.robustkernel(::BALResidual) = balrobustifier
Base.eltype(::BALResidual{T}) where T = T

function makeBALproblem()
    Random.seed!(1)
    problem = NLLSsolver.NLLSProblem(Union{BALImage{Float64}, NLLSsolver.Point3D{Float64}}, BALResidual{Float64})
    numcameras = 16
    for cam in 1:numcameras
        NLLSsolver.addvariable!(problem, BALImage(0., 0., 0., 0., 0., 0., 1., 0., 0.))
    end
    for lm in 1:22106
        NLLSsolver.addvariable!(problem, NLLSsolver.Point3D(0., 0., 1.))
    end
    for cam in 1:numcameras
        for lm in 1:22106
            if rand() < 0.25
                NLLSsolver.addcost!(problem, BALResidual(SVector(0., 0.), SVector(cam, lm + numcameras)))
            end
        end
    end
    return problem
end

struct MyData
    vars::Vector{Union{BALImage{Float64}, NLLSsolver.Point3D{Float64}}}
    costs::NLLSsolver.VectorRepo{BALResidual{Float64}}
    linsystem::NLLSsolver.MultiVariateLSsparse
end
MyData(problem) = MyData(problem.variables, problem.costs, NLLSsolver.makesymmvls(problem, trues(length(problem.variables)), length(problem.variables)))
MyData() = MyData(makeBALproblem())

function evalfunc(data)
    NLLSsolver.zero!(data.linsystem)
    NLLSsolver.costgradhess!(data.linsystem, data.vars, data.costs)
end

data = MyData()

Invocation
julia> using StatsPlots

julia> begin
           plt = boxplot()
           for i in 1:20
               include("./script.jl")
               b = @benchmark evalfunc($data)
               boxplot!(plt, [i], b.times, label=nothing)
               display(plt)
           end
           boxplot!(plt, xticks=(1:20, 1:20))
       end

Version info
julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a* (2023-08-24 14:43 UTC)
Build Info:

    Note: This is an unofficial build, please report bugs to the project
    responsible for this build and not to the Julia project unless you can
    reproduce the issue using official builds available at https://julialang.org/downloads

Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 12 × AMD Ryzen 5 5600X 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 6 on 12 virtual cores
Environment:
  JULIA_NUM_THREADS = 6
3 Likes

It would appear that randomizing the heap doesn’t eliminate the disparate timings. The higher 3rd run from the previous reloading runs is gone (strange that it’s the 3rd both times), but that disappearance could be by chance. I wonder if samplewise randomization of function addresses can be done by putting include("/user.jl") in the setup expression. It might run into world-age issues, so maybe data = @invokelatest MyData() could help there.
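
An untested sketch of that idea, reusing the ./script.jl file from the MWE above (whether re-including actually moves the relevant code addresses is exactly what would need checking):

using BenchmarkTools

# evals=1 so every sample gets a fresh include/setup; @invokelatest dodges world-age issues
b = @benchmark @invokelatest(evalfunc(data)) setup=(include("./script.jl"); data = @invokelatest MyData()) evals=1 seconds=60

If the boxplots widen under this, that would be further evidence for a code-layout component.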

2 Likes

I’m not disbelieving that address-layout is a real issue in benchmarks, nor that it would be worthwhile to improve the tooling situation around this.

But your evidence that your specific performance mystery is caused by this specific issue is not conclusive. In fact, no evidence consisting only of performance measurements, without correlating them to addresses, could ever be conclusive! Indeed, you will find in the papers you yourself linked that the authors take great care to

  1. correlate address layout to performance measurements, in order to establish their thesis that this has a major effect on benchmark reproducibility
  2. control/stabilize address layout and measure the resulting reduced spread of performance measurements, in order to establish their thesis that this is the primary driver of benchmark irreproducibility in their samples.

They don’t simply take performance measurements and claim “yeah, must be address layouts”, nor do they suggest that all performance mysteries are caused by address layouts, nor do they claim that you would be able to determine the cause by glancing at the distribution of performance measurements.
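
For illustration only (my sketch, not from the thread; it assumes a julia binary on PATH and a toy workload): one classic perturbation discussed in that line of work is changing the size of a dummy environment variable, which shifts the initial stack/environment placement, and then benchmarking in a fresh process per setting.

bench_script = """
using BenchmarkTools
f(x) = sum(abs2, x)                        # stand-in workload
x = rand(10_000)
print(minimum((@benchmark f(\$x)).times))  # report the fastest sample, in ns
"""

for pad in (0, 256, 1024, 4096)
    env = copy(ENV)
    env["DUMMY_PADDING"] = "x" ^ pad       # only the *size* of the environment changes
    cmd = setenv(`julia --startup-file=no -e $bench_script`, env)
    t = parse(Float64, read(cmd, String))
    println("padding = $pad bytes: fastest time = $t ns")
end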

You may find that this forum is both more helpful and more welcoming if you make at least a token attempt not to insult people. I’m not a mod, nor very invested here for the last couple years, so I’ll take my leave with that.

11 Likes

No. It’s due to all sorts of CPU hardware things, combined with address offsets.

1 Like

For what it’s worth, the authors of the papers cited here so far do explicitly describe Unix/Linux behaviors that cause memory layout effects. AFAIK caching can be the responsibility of the hardware or OS, depending on what is being cached.

2 Likes

Let’s keep the topic focused on benchmarking, not a meta-discussion about the topic itself or about the people here.

Respect for others underpins our entire community, and insults are never welcome.

15 Likes

I suggested from the beginning, and continue to suggest, that we have this discussion.

I am disappointed that so much of the discussion has been around whether the results I’m seeing are in fact caused by that. If you accept that it’s a factor in general, what does it matter?

This is why I felt that an MWE was not useful. I’m also disappointed that, having provided an MWE and indeed boxplots of the results, there is still scepticism.