SVectors + @reset from Accessors: Strange Benchmarks

Full MWE:

MWE
using StaticArrays, Parameters, Accessors, BenchmarkTools

# Test Functions
function loop1(dat)
    for t in 1:10
        for i in eachindex(dat)
            @reset dat[i].A .+= dat[i].B
        end
    end
    return dat
end

function loop2(dat)
    for t in 1:10
        @reset dat[1].A .+= dat[1].B
        @reset dat[2].A .+= dat[2].B
    end
    return dat
end


# Main
@with_kw struct Str
    A :: SVector{10, Int16}
    B :: SVector{10, Int16}
end

str = Str(A = SVector{10, Int16}(1:10), B = SVector{10, Int16}(11:20));
dat = (str, deepcopy(str));

res1 = loop1(dat); 
res2 = loop2(dat);


# Expected Results
hcat(res1[1].A, res1[2].A)
res1 == res2


# Benchmarks
@btime loop1(x) setup = (x = $dat) evals = 1;
@btime loop2(x) setup = (x = $dat) evals = 1;


@btime loop1($dat);
@btime loop2($dat);

I have two structs of SVectors stored in a tuple, like so:

@with_kw struct Str
    A :: SVector{10, Int16}
    B :: SVector{10, Int16}
end

str = Str(A = SVector{10, Int16}(1:10), B = SVector{10, Int16}(11:20));
dat = (str, deepcopy(str));

I wanted to increment the values in those SVectors and tried two different methods:

function loop1(dat)
    for t in 1:10
        for i in eachindex(dat)
            @reset dat[i].A .+= dat[i].B
        end
    end
    return dat
end

function loop2(dat)
    for t in 1:10
        @reset dat[1].A .+= dat[1].B
        @reset dat[2].A .+= dat[2].B
    end
    return dat
end

Both give the correct answer:

julia> hcat(res1[1].A, res1[2].A)
10×2 SMatrix{10, 2, Int16, 20} with indices SOneTo(10)×SOneTo(2):
 111  111
 122  122
 133  133
 144  144
 155  155
 166  166
 177  177
 188  188
 199  199
 210  210

julia> res1 == res2
true

But I don’t know what to make of these Benchmarks:

julia> @btime loop1(x) setup = (x = $dat) evals = 1;
  709.000 ns (0 allocations: 0 bytes)

julia> @btime loop2(x) setup = (x = $dat) evals = 1;
  28.000 ns (0 allocations: 0 bytes)

julia> @btime loop1($dat);
  581.249 ns (0 allocations: 0 bytes)

julia> @btime loop2($dat);
  5.040 ns (0 allocations: 0 bytes)
  1. Why is loop1 so much slower than loop2?
  2. Why is the number of evals so important when benchmarking this?

I would much prefer to use the loop1 approach, but it is 100x slower despite no allocations. Any ideas? Thanks.

1 Like

The @code_llvm loop2(dat) is much shorter than @code_llvm loop1(dat) to an incredible degree (72 < 491 lines). Replacing for i in eachindex(dat) with for i in 1:2 does nothing for loop1. I think I’m seeing a lot more unrolled loops in loop1 (20-long sequences of getelementptr inbounds, load, store, phi, etc) so it’s not just a matter of more unrolling optimizations from the manual unrolling.

1 Like

The length of loop1 is not statically known, while loop2 is manually unrolled. That’s the difference.

The number of evals is important, because each iteration is too fast to measure directly. You need to run many evals to get a good measurement.

The good news is that you don’t need the setup part of the benchmark code. Since dat is completely immutable (SVectors wrapped in an immutable struct, wrapped in a Tuple), you can just run @btime loopx($dat) without worrying about overwriting your own results multiple times.

This also means that this

dat = (str, deepcopy(str));

is unnecessary, just write dat = (str, str).

Fixing your code to be more general without losing performance isn’t obvious. I think this will work:

function loop4(dat)
   for t in 1:10
       dat = (s -> (@reset s.A .+= s.B)).(dat)
   end
   return dat
end
julia> @btime loop4($dat)
  3.500 ns (0 allocations: 0 bytes)

Broadcasting works well with tuples, while eachindex returns an iterator without static information about length.

2 Likes

Thanks, your fix works but I don’t think your theory is correct. As @Benny said above using for i in 1:2 or SOneTo(2) or SA[1, 2] also won’t help:

Static Loops
function loop1b(dat)
    for t in 1:10
        for i in 1:2
            @reset dat[i].A .+= dat[i].B
        end
    end
    return dat
end


function loop1c(dat)
    for t in 1:10
        for i in SOneTo(2)
            @reset dat[i].A .+= dat[i].B
        end
    end
    return dat
end

function loop1d(dat)
    for t in 1:10
        for i in SA[1, 2]
            @reset dat[i].A .+= dat[i].B
        end
    end
    return dat
end

@btime loop1b($dat);
@btime loop1c($dat);
@btime loop1d($dat);
julia> @btime loop1b($dat);
  578.495 ns (0 allocations: 0 bytes)

julia> @btime loop1c($dat);
  581.132 ns (0 allocations: 0 bytes)

julia> @btime loop1d($dat);
  604.475 ns (0 allocations: 0 bytes)

However! That is on my laptop. On my threadripper, SA[1, 2] is twice as fast as the others (but still slower than the manually unrolled loop):

julia> @btime loop1b($dat);
  414.432 ns (0 allocations: 0 bytes)

julia> @btime loop1c($dat);
  414.281 ns (0 allocations: 0 bytes)

julia> @btime loop1d($dat);
  238.158 ns (0 allocations: 0 bytes)

I agree, something more is going on. I found this from a couple years ago which probably explains it:

In other words, I think @set a[i].s = 1 is interpreted as : “give me a copy of a in which the x field of the i-th element is set to 1”. As opposed to “give me a copy of a[i] in which the x field is set to 1” (which would be the desired semantics here).
Container of mutable structs without 100x slowdown? - #5 by ffevotte

It looks like a proposed fix was rejected?

Well, the theory was that broadcasting unrolls, and the loop doesn’t, though admittedly I emphasized eachindex instead if the for loop itself.

But if that was changed, your code wouldn’t work, since you cannot mutate a tuple, you have to create a new tuple with a new Str with new SVectors.

1 Like

To be honest I just discovered this package and have little understanding of how it does the update. It did magically get rid of all my heap allocations with a huge speedup though.

But it seems like there could be a macro placed before the for loop that would unroll it before @reset is called. This would be more readable than broadcasting over the tuple.

Especially for anything more complex like @reset dat[i].A .+= dat[i].B .* (dat[i].C .- dat[i].D)

However, I assume that would exist already if it was easy to implement.

Eg, here is one actual loop this would be used on. Something like @Unroll_before_reset would be ideal:

function test6(gt)
    for t in 1:10
       @Unroll_before_reset for i in 1:2
            @reset gt[i].min .+= gt[i].act

            r1 = @SVector rand(Uniform(-.003, .003), 16)
            @reset gt[i].fat .-= gt[i].act .* (gt[i].ded .- r1)
            @reset gt[i].fat .= max.(gt[i].fat, 0.1)

            @reset gt[i].sh_cm .= gt[i].sh_c0 .* gt[i].fat .* gt[i].act
            @reset gt[i].ps_cm .= gt[i].ps_c0 .* gt[i].fat .* gt[i].act
            @reset gt[i].tk_cm .= gt[i].tk_c0 .* gt[i].fat .* gt[i].act
        end
    end
    return gt
end

Even better would be @Unroll_then_reset for cases where @reset is being called on every line. I’ve yet to explore writing macros, but this discussion gives the impression it would not be trivial:

The way to achieve this would be to write a separate function (anonymous or not), and the broadcast that:

function loopn(dat)
    foo(x) = (@reset x.A .+= x.B .* (x.C .- x.D))
    for t in 1:10
        dat = foo.(dat)
    end
    return dat
end

Something like this? (warning: untested)

function test6(gt)
	function f(gt, d)
	    @reset gt.min .+= gt.act

		@reset gt.fat .-= gt.act .* (gt.ded .- rand.(d))
		@reset gt.fat .= max.(gt.fat, 0.1)

		@reset gt.sh_cm .= gt.sh_c0 .* gt.fat .* gt.act
		@reset gt.ps_cm .= gt.ps_c0 .* gt.fat .* gt.act
		@reset gt.tk_cm .= gt.tk_c0 .* gt.fat .* gt.act
	end
	d = Uniform(-0.003, 0.003)
    for t in 1:10
		gt = f.(gt, d)
	end
    return gt
end
1 Like

Or maybe try do block syntax:

function test7(gt)
	for _ in 1:10
		gt = map(gt) do x
			@reset x.min .+= x.act

			@reset x.fat .-= x.act .* (x.ded .- rand.(Uniform(-0.003, 0.003)))
			@reset x.fat .= max.(x.fat, 0.1)

			@reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
			@reset x.ps_cm .= x.ps_c0 .* x.fat .* x.act
			@reset x.tk_cm .= x.tk_c0 .* x.fat .* x.act
		end
	end
end

Edit: Oops. Forgot the outer loop.

1 Like

Thanks, assuming you meant:

function test7(gt)
    for t in 1:10
        gt = map(gt) do x
            @reset x.min .+= x.act

            @reset x.fat .-= x.act .* (x.ded .- rand.(Uniform(-0.003, 0.003)))
            @reset x.fat .= max.(x.fat, 0.1)

            @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            @reset x.ps_cm .= x.ps_c0 .* x.fat .* x.act
            @reset x.tk_cm .= x.tk_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end

Both test6 and test7 are much improved but still 3x slower than the manually unrolled loop (both were equally performant).

I also tried a separate function for each line, ie foo1(x) = (@reset x.min .+= x.act) , etc but that was 6x slower.

That’s odd. I cannot benchmark it, though, since I do not have all the code. But I’ll note that these two are equally fast on my laptop:

function loop2(dat)
    for t in 1:10
        @reset dat[1].A .+= dat[1].B
        @reset dat[2].A .+= dat[2].B
    end
    return dat
end

function loop7(dat)
    for t in 1:10
        dat = map(dat) do s
            @reset (s.A .+= s.B)
        end
    end
    return dat
end

There might be some other minor thing preventing full performance, unrelated to map vs manual unroll.

1 Like

The original MWE was the same for me as well.

Ill generate an input tuple for the more involved loop in a bit and share it since the complexity (and for me: cpu) seems to affect the benchmarks.

IMO the most straightforward solution is

map(dat) do d
    @set d.A .+= d.B .* (d.C - d.D)
end

Any issues with it?

1 Like

That PR was accepted, and you can do stuff like @set $(dat[i]).A = ... that returns dat[i] with modified A value in it. But that PR doesn’t change any default behavior, and was never intended to: what else should newdat = @set f(dat[2].a) = 1 mean other than “give me a modification of dat so that f(newdat[2].a) is equal to 1”?.

1 Like

Actually I only tried @reset so far.

Ok, new MWE:

New MWE
using Parameters, Distributions, StaticArrays
using Accessors, Random, BenchmarkTools


@with_kw struct BigStr{N}
    act :: SVector{N, Bool}
    min :: SVector{N, Int16}
    fat :: SVector{N, Float64}
    ded :: SVector{N, Float64}

    sh_c0 :: SVector{N, Float64}
    ps_c0 :: SVector{N, Float64}
    tk_c0 :: SVector{N, Float64}

    sh_cm :: SVector{N, Float64}
    ps_cm :: SVector{N, Float64}
    tk_cm :: SVector{N, Float64}
end


Random.seed!(1234)
bg1 = BigStr(act   = SVector{16}(rand(Bool, 16)),
             min   = SVector{16}(fill(Int16(0), 16)),
             fat   = SVector{16}(fill(Float64(1), 16)),
             ded   = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             ps_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             tk_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)),
             ps_cm = SVector{16}(rand(Uniform(4, 25), 16)),
             tk_cm = SVector{16}(rand(Uniform(4, 25), 16)))


bg2 = BigStr(act   = SVector{16}(rand(Bool, 16)),
             min   = SVector{16}(fill(Int16(0), 16)),
             fat   = SVector{16}(fill(Float64(1), 16)),
             ded   = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             ps_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             tk_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)),
             ps_cm = SVector{16}(rand(Uniform(4, 25), 16)),
             tk_cm = SVector{16}(rand(Uniform(4, 25), 16)))


gt = (bg1, bg2)


function test1(gt)
    for t in 1:10
        @reset gt[1].min .+= gt[1].act
        @reset gt[2].min .+= gt[2].act

        r1 = @SVector rand(Uniform(-.003, .003), 16)
        r2 = @SVector rand(Uniform(-.003, .003), 16)
        @reset gt[1].fat .-= gt[1].act .* (gt[1].ded .- r1)
        @reset gt[2].fat .-= gt[2].act .* (gt[2].ded .- r2)

        @reset gt[1].fat .= max.(gt[1].fat, 0.1)
        @reset gt[2].fat .= max.(gt[2].fat, 0.1)


        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        @reset gt[1].ps_cm .= gt[1].ps_c0 .* gt[1].fat .* gt[1].act
        @reset gt[1].tk_cm .= gt[1].tk_c0 .* gt[1].fat .* gt[1].act

        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act
        @reset gt[2].ps_cm .= gt[2].ps_c0 .* gt[2].fat .* gt[2].act
        @reset gt[2].tk_cm .= gt[2].tk_c0 .* gt[2].fat .* gt[2].act
    end
    return gt
end

function test2(gt)
    for t in 1:10
        gt = map(gt) do x
            @reset x.min .+= x.act

            @reset x.fat .-= x.act .* (x.ded .- rand.(Uniform(-0.003, 0.003)))
            @reset x.fat .= max.(x.fat, 0.1)

            @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            @reset x.ps_cm .= x.ps_c0 .* x.fat .* x.act
            @reset x.tk_cm .= x.tk_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end

Random.seed!(1234)
@btime  test1($gt);
Random.seed!(1234)
@btime  test2($gt);

Random.seed!(1234)
res1 = test1(gt);
Random.seed!(1234)
res2 = test2(gt);

res1 == res2

The input looks like two of these in a tuple:

@with_kw struct BigStr{N}
    act :: SVector{N, Bool}
    min :: SVector{N, Int16}
    fat :: SVector{N, Float64}
    ded :: SVector{N, Float64}

    sh_c0 :: SVector{N, Float64}
    ps_c0 :: SVector{N, Float64}
    tk_c0 :: SVector{N, Float64}

    sh_cm :: SVector{N, Float64}
    ps_cm :: SVector{N, Float64}
    tk_cm :: SVector{N, Float64}
end

Manually unrolled:

function test1(gt)
    for t in 1:10
        @reset gt[1].min .+= gt[1].act
        @reset gt[2].min .+= gt[2].act

        r1 = @SVector rand(Uniform(-.003, .003), 16)
        r2 = @SVector rand(Uniform(-.003, .003), 16)
        @reset gt[1].fat .-= gt[1].act .* (gt[1].ded .- r1)
        @reset gt[2].fat .-= gt[2].act .* (gt[2].ded .- r2)

        @reset gt[1].fat .= max.(gt[1].fat, 0.1)
        @reset gt[2].fat .= max.(gt[2].fat, 0.1)

        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        @reset gt[1].ps_cm .= gt[1].ps_c0 .* gt[1].fat .* gt[1].act
        @reset gt[1].tk_cm .= gt[1].tk_c0 .* gt[1].fat .* gt[1].act

        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act
        @reset gt[2].ps_cm .= gt[2].ps_c0 .* gt[2].fat .* gt[2].act
        @reset gt[2].tk_cm .= gt[2].tk_c0 .* gt[2].fat .* gt[2].act
    end
    return gt
end

Do block:

function test2(gt)
    for t in 1:10
        gt = map(gt) do x
            @reset x.min .+= x.act

            @reset x.fat .-= x.act .* (x.ded .- rand.(Uniform(-0.003, 0.003)))
            @reset x.fat .= max.(x.fat, 0.1)

            @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            @reset x.ps_cm .= x.ps_c0 .* x.fat .* x.act
            @reset x.tk_cm .= x.tk_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end

Benchmarks (Laptop)

julia> @btime test1($gt);
  952.435 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  2.028 μs (0 allocations: 0 bytes)

I tried changing to @set for the do block loop, but it just returned the input. So I’ll have to look into how I am misusing it.

I’m too unfamiliar with LLVM to recognize what they are doing, but I am pretty curious what sort of optimization pass trimmed loop2 so much. Really looked like loop1 was shuffling data around a lot more on the stack.

1 Like

Is this using @set correctly?

function test3(gt)
     for t in 1:90
        gt = map(gt) do x
            x = @set x.min .+= x.act

            x = @set x.fat .-= x.act .* (x.ded .- rand.(Uniform(-0.003, 0.003)))
            x = @set x.fat .= max.(x.fat, 0.1)

            x = @set x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            x = @set x.ps_cm .= x.ps_c0 .* x.fat .* x.act
            x = @set x.tk_cm .= x.tk_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end

It gives the same benchmarks as @reset, so the manually unrolled loop still wins.