SVectors + @reset from Accessors: Strange Benchmarks

Full MWE:

MWE
using StaticArrays, Parameters, Accessors, BenchmarkTools

# Test Functions
function loop1(dat)
    for t in 1:10
        for i in eachindex(dat)
            @reset dat[i].A .+= dat[i].B
        end
    end
    return dat
end

function loop2(dat)
    for t in 1:10
        @reset dat[1].A .+= dat[1].B
        @reset dat[2].A .+= dat[2].B
    end
    return dat
end


# Main
@with_kw struct Str
    A :: SVector{10, Int16}
    B :: SVector{10, Int16}
end

str = Str(A = SVector{10, Int16}(1:10), B = SVector{10, Int16}(11:20));
dat = (str, deepcopy(str));

res1 = loop1(dat); 
res2 = loop2(dat);


# Expected Results
hcat(res1[1].A, res1[2].A)
res1 == res2


# Benchmarks
@btime loop1(x) setup = (x = $dat) evals = 1;
@btime loop2(x) setup = (x = $dat) evals = 1;


@btime loop1($dat);
@btime loop2($dat);

I have two structs of SVectors stored in a tuple, like so:

@with_kw struct Str
    A :: SVector{10, Int16}
    B :: SVector{10, Int16}
end

str = Str(A = SVector{10, Int16}(1:10), B = SVector{10, Int16}(11:20));
dat = (str, deepcopy(str));

I wanted to increment the values in those SVectors and tried two different methods:

function loop1(dat)
    for t in 1:10
        for i in eachindex(dat)
            @reset dat[i].A .+= dat[i].B
        end
    end
    return dat
end

function loop2(dat)
    for t in 1:10
        @reset dat[1].A .+= dat[1].B
        @reset dat[2].A .+= dat[2].B
    end
    return dat
end

Both give the correct answer:

julia> hcat(res1[1].A, res1[2].A)
10Ɨ2 SMatrix{10, 2, Int16, 20} with indices SOneTo(10)ƗSOneTo(2):
 111  111
 122  122
 133  133
 144  144
 155  155
 166  166
 177  177
 188  188
 199  199
 210  210

julia> res1 == res2
true

But I donā€™t know what to make of these Benchmarks:

julia> @btime loop1(x) setup = (x = $dat) evals = 1;
  709.000 ns (0 allocations: 0 bytes)

julia> @btime loop2(x) setup = (x = $dat) evals = 1;
  28.000 ns (0 allocations: 0 bytes)

julia> @btime loop1($dat);
  581.249 ns (0 allocations: 0 bytes)

julia> @btime loop2($dat);
  5.040 ns (0 allocations: 0 bytes)
  1. Why is loop1 so much slower than loop2?
  2. Why is the number of evals so important when benchmarking this?

I would much prefer to use the loop1 approach, but it is 100x slower despite no allocations. Any ideas? Thanks.

1 Like

The @code_llvm loop2(dat) is much shorter than @code_llvm loop1(dat) to an incredible degree (72 < 491 lines). Replacing for i in eachindex(dat) with for i in 1:2 does nothing for loop1. I think Iā€™m seeing a lot more unrolled loops in loop1 (20-long sequences of getelementptr inbounds, load, store, phi, etc) so itā€™s not just a matter of more unrolling optimizations from the manual unrolling.

1 Like

The length of loop1 is not statically known, while loop2 is manually unrolled. Thatā€™s the difference.

The number of evals is important, because each iteration is too fast to measure directly. You need to run many evals to get a good measurement.

The good news is that you donā€™t need the setup part of the benchmark code. Since dat is completely immutable (SVectors wrapped in an immutable struct, wrapped in a Tuple), you can just run @btime loopx($dat) without worrying about overwriting your own results multiple times.

This also means that this

dat = (str, deepcopy(str));

is unnecessary, just write dat = (str, str).

Fixing your code to be more general without losing performance isnā€™t obvious. I think this will work:

function loop4(dat)
   for t in 1:10
       dat = (s -> (@reset s.A .+= s.B)).(dat)
   end
   return dat
end
julia> @btime loop4($dat)
  3.500 ns (0 allocations: 0 bytes)

Broadcasting works well with tuples, while eachindex returns an iterator without static information about length.

2 Likes

Thanks, your fix works but I donā€™t think your theory is correct. As @Benny said above using for i in 1:2 or SOneTo(2) or SA[1, 2] also wonā€™t help:

Static Loops
function loop1b(dat)
    for t in 1:10
        for i in 1:2
            @reset dat[i].A .+= dat[i].B
        end
    end
    return dat
end


function loop1c(dat)
    for t in 1:10
        for i in SOneTo(2)
            @reset dat[i].A .+= dat[i].B
        end
    end
    return dat
end

function loop1d(dat)
    for t in 1:10
        for i in SA[1, 2]
            @reset dat[i].A .+= dat[i].B
        end
    end
    return dat
end

@btime loop1b($dat);
@btime loop1c($dat);
@btime loop1d($dat);
julia> @btime loop1b($dat);
  578.495 ns (0 allocations: 0 bytes)

julia> @btime loop1c($dat);
  581.132 ns (0 allocations: 0 bytes)

julia> @btime loop1d($dat);
  604.475 ns (0 allocations: 0 bytes)

However! That is on my laptop. On my threadripper, SA[1, 2] is twice as fast as the others (but still slower than the manually unrolled loop):

julia> @btime loop1b($dat);
  414.432 ns (0 allocations: 0 bytes)

julia> @btime loop1c($dat);
  414.281 ns (0 allocations: 0 bytes)

julia> @btime loop1d($dat);
  238.158 ns (0 allocations: 0 bytes)

I agree, something more is going on. I found this from a couple years ago which probably explains it:

In other words, I think @set a[i].s = 1 is interpreted as : ā€œgive me a copy of a in which the x field of the i-th element is set to 1ā€. As opposed to ā€œgive me a copy of a[i] in which the x field is set to 1ā€ (which would be the desired semantics here).
Container of mutable structs without 100x slowdown? - #5 by ffevotte

It looks like a proposed fix was rejected?

Well, the theory was that broadcasting unrolls, and the loop doesnā€™t, though admittedly I emphasized eachindex instead if the for loop itself.

But if that was changed, your code wouldnā€™t work, since you cannot mutate a tuple, you have to create a new tuple with a new Str with new SVectors.

1 Like

To be honest I just discovered this package and have little understanding of how it does the update. It did magically get rid of all my heap allocations with a huge speedup though.

But it seems like there could be a macro placed before the for loop that would unroll it before @reset is called. This would be more readable than broadcasting over the tuple.

Especially for anything more complex like @reset dat[i].A .+= dat[i].B .* (dat[i].C .- dat[i].D)

However, I assume that would exist already if it was easy to implement.

Eg, here is one actual loop this would be used on. Something like @Unroll_before_reset would be ideal:

function test6(gt)
    for t in 1:10
       @Unroll_before_reset for i in 1:2
            @reset gt[i].min .+= gt[i].act

            r1 = @SVector rand(Uniform(-.003, .003), 16)
            @reset gt[i].fat .-= gt[i].act .* (gt[i].ded .- r1)
            @reset gt[i].fat .= max.(gt[i].fat, 0.1)

            @reset gt[i].sh_cm .= gt[i].sh_c0 .* gt[i].fat .* gt[i].act
            @reset gt[i].ps_cm .= gt[i].ps_c0 .* gt[i].fat .* gt[i].act
            @reset gt[i].tk_cm .= gt[i].tk_c0 .* gt[i].fat .* gt[i].act
        end
    end
    return gt
end

Even better would be @Unroll_then_reset for cases where @reset is being called on every line. Iā€™ve yet to explore writing macros, but this discussion gives the impression it would not be trivial:

The way to achieve this would be to write a separate function (anonymous or not), and the broadcast that:

function loopn(dat)
    foo(x) = (@reset x.A .+= x.B .* (x.C .- x.D))
    for t in 1:10
        dat = foo.(dat)
    end
    return dat
end

Something like this? (warning: untested)

function test6(gt)
	function f(gt, d)
	    @reset gt.min .+= gt.act

		@reset gt.fat .-= gt.act .* (gt.ded .- rand.(d))
		@reset gt.fat .= max.(gt.fat, 0.1)

		@reset gt.sh_cm .= gt.sh_c0 .* gt.fat .* gt.act
		@reset gt.ps_cm .= gt.ps_c0 .* gt.fat .* gt.act
		@reset gt.tk_cm .= gt.tk_c0 .* gt.fat .* gt.act
	end
	d = Uniform(-0.003, 0.003)
    for t in 1:10
		gt = f.(gt, d)
	end
    return gt
end
1 Like

Or maybe try do block syntax:

function test7(gt)
	for _ in 1:10
		gt = map(gt) do x
			@reset x.min .+= x.act

			@reset x.fat .-= x.act .* (x.ded .- rand.(Uniform(-0.003, 0.003)))
			@reset x.fat .= max.(x.fat, 0.1)

			@reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
			@reset x.ps_cm .= x.ps_c0 .* x.fat .* x.act
			@reset x.tk_cm .= x.tk_c0 .* x.fat .* x.act
		end
	end
end

Edit: Oops. Forgot the outer loop.

1 Like

Thanks, assuming you meant:

function test7(gt)
    for t in 1:10
        gt = map(gt) do x
            @reset x.min .+= x.act

            @reset x.fat .-= x.act .* (x.ded .- rand.(Uniform(-0.003, 0.003)))
            @reset x.fat .= max.(x.fat, 0.1)

            @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            @reset x.ps_cm .= x.ps_c0 .* x.fat .* x.act
            @reset x.tk_cm .= x.tk_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end

Both test6 and test7 are much improved but still 3x slower than the manually unrolled loop (both were equally performant).

I also tried a separate function for each line, ie foo1(x) = (@reset x.min .+= x.act) , etc but that was 6x slower.

Thatā€™s odd. I cannot benchmark it, though, since I do not have all the code. But Iā€™ll note that these two are equally fast on my laptop:

function loop2(dat)
    for t in 1:10
        @reset dat[1].A .+= dat[1].B
        @reset dat[2].A .+= dat[2].B
    end
    return dat
end

function loop7(dat)
    for t in 1:10
        dat = map(dat) do s
            @reset (s.A .+= s.B)
        end
    end
    return dat
end

There might be some other minor thing preventing full performance, unrelated to map vs manual unroll.

1 Like

The original MWE was the same for me as well.

Ill generate an input tuple for the more involved loop in a bit and share it since the complexity (and for me: cpu) seems to affect the benchmarks.

IMO the most straightforward solution is

map(dat) do d
    @set d.A .+= d.B .* (d.C - d.D)
end

Any issues with it?

1 Like

That PR was accepted, and you can do stuff like @set $(dat[i]).A = ... that returns dat[i] with modified A value in it. But that PR doesnā€™t change any default behavior, and was never intended to: what else should newdat = @set f(dat[2].a) = 1 mean other than ā€œgive me a modification of dat so that f(newdat[2].a) is equal to 1ā€?.

1 Like

Actually I only tried @reset so far.

Ok, new MWE:

New MWE
using Parameters, Distributions, StaticArrays
using Accessors, Random, BenchmarkTools


@with_kw struct BigStr{N}
    act :: SVector{N, Bool}
    min :: SVector{N, Int16}
    fat :: SVector{N, Float64}
    ded :: SVector{N, Float64}

    sh_c0 :: SVector{N, Float64}
    ps_c0 :: SVector{N, Float64}
    tk_c0 :: SVector{N, Float64}

    sh_cm :: SVector{N, Float64}
    ps_cm :: SVector{N, Float64}
    tk_cm :: SVector{N, Float64}
end


Random.seed!(1234)
bg1 = BigStr(act   = SVector{16}(rand(Bool, 16)),
             min   = SVector{16}(fill(Int16(0), 16)),
             fat   = SVector{16}(fill(Float64(1), 16)),
             ded   = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             ps_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             tk_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)),
             ps_cm = SVector{16}(rand(Uniform(4, 25), 16)),
             tk_cm = SVector{16}(rand(Uniform(4, 25), 16)))


bg2 = BigStr(act   = SVector{16}(rand(Bool, 16)),
             min   = SVector{16}(fill(Int16(0), 16)),
             fat   = SVector{16}(fill(Float64(1), 16)),
             ded   = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             ps_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             tk_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)),
             ps_cm = SVector{16}(rand(Uniform(4, 25), 16)),
             tk_cm = SVector{16}(rand(Uniform(4, 25), 16)))


gt = (bg1, bg2)


function test1(gt)
    for t in 1:10
        @reset gt[1].min .+= gt[1].act
        @reset gt[2].min .+= gt[2].act

        r1 = @SVector rand(Uniform(-.003, .003), 16)
        r2 = @SVector rand(Uniform(-.003, .003), 16)
        @reset gt[1].fat .-= gt[1].act .* (gt[1].ded .- r1)
        @reset gt[2].fat .-= gt[2].act .* (gt[2].ded .- r2)

        @reset gt[1].fat .= max.(gt[1].fat, 0.1)
        @reset gt[2].fat .= max.(gt[2].fat, 0.1)


        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        @reset gt[1].ps_cm .= gt[1].ps_c0 .* gt[1].fat .* gt[1].act
        @reset gt[1].tk_cm .= gt[1].tk_c0 .* gt[1].fat .* gt[1].act

        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act
        @reset gt[2].ps_cm .= gt[2].ps_c0 .* gt[2].fat .* gt[2].act
        @reset gt[2].tk_cm .= gt[2].tk_c0 .* gt[2].fat .* gt[2].act
    end
    return gt
end

function test2(gt)
    for t in 1:10
        gt = map(gt) do x
            @reset x.min .+= x.act

            @reset x.fat .-= x.act .* (x.ded .- rand.(Uniform(-0.003, 0.003)))
            @reset x.fat .= max.(x.fat, 0.1)

            @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            @reset x.ps_cm .= x.ps_c0 .* x.fat .* x.act
            @reset x.tk_cm .= x.tk_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end

Random.seed!(1234)
@btime  test1($gt);
Random.seed!(1234)
@btime  test2($gt);

Random.seed!(1234)
res1 = test1(gt);
Random.seed!(1234)
res2 = test2(gt);

res1 == res2

The input looks like two of these in a tuple:

@with_kw struct BigStr{N}
    act :: SVector{N, Bool}
    min :: SVector{N, Int16}
    fat :: SVector{N, Float64}
    ded :: SVector{N, Float64}

    sh_c0 :: SVector{N, Float64}
    ps_c0 :: SVector{N, Float64}
    tk_c0 :: SVector{N, Float64}

    sh_cm :: SVector{N, Float64}
    ps_cm :: SVector{N, Float64}
    tk_cm :: SVector{N, Float64}
end

Manually unrolled:

function test1(gt)
    for t in 1:10
        @reset gt[1].min .+= gt[1].act
        @reset gt[2].min .+= gt[2].act

        r1 = @SVector rand(Uniform(-.003, .003), 16)
        r2 = @SVector rand(Uniform(-.003, .003), 16)
        @reset gt[1].fat .-= gt[1].act .* (gt[1].ded .- r1)
        @reset gt[2].fat .-= gt[2].act .* (gt[2].ded .- r2)

        @reset gt[1].fat .= max.(gt[1].fat, 0.1)
        @reset gt[2].fat .= max.(gt[2].fat, 0.1)

        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        @reset gt[1].ps_cm .= gt[1].ps_c0 .* gt[1].fat .* gt[1].act
        @reset gt[1].tk_cm .= gt[1].tk_c0 .* gt[1].fat .* gt[1].act

        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act
        @reset gt[2].ps_cm .= gt[2].ps_c0 .* gt[2].fat .* gt[2].act
        @reset gt[2].tk_cm .= gt[2].tk_c0 .* gt[2].fat .* gt[2].act
    end
    return gt
end

Do block:

function test2(gt)
    for t in 1:10
        gt = map(gt) do x
            @reset x.min .+= x.act

            @reset x.fat .-= x.act .* (x.ded .- rand.(Uniform(-0.003, 0.003)))
            @reset x.fat .= max.(x.fat, 0.1)

            @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            @reset x.ps_cm .= x.ps_c0 .* x.fat .* x.act
            @reset x.tk_cm .= x.tk_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end

Benchmarks (Laptop)

julia> @btime test1($gt);
  952.435 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  2.028 Ī¼s (0 allocations: 0 bytes)

I tried changing to @set for the do block loop, but it just returned the input. So Iā€™ll have to look into how I am misusing it.

Iā€™m too unfamiliar with LLVM to recognize what they are doing, but I am pretty curious what sort of optimization pass trimmed loop2 so much. Really looked like loop1 was shuffling data around a lot more on the stack.

1 Like

Is this using @set correctly?

function test3(gt)
     for t in 1:90
        gt = map(gt) do x
            x = @set x.min .+= x.act

            x = @set x.fat .-= x.act .* (x.ded .- rand.(Uniform(-0.003, 0.003)))
            x = @set x.fat .= max.(x.fat, 0.1)

            x = @set x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            x = @set x.ps_cm .= x.ps_c0 .* x.fat .* x.act
            x = @set x.tk_cm .= x.tk_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end

It gives the same benchmarks as @reset, so the manually unrolled loop still wins.