SVectors + @reset from Accessors: Strange Benchmarks

This is exactly the same as

@reset x.min .+= x.act

and the latter is in fact syntactic sugar for the former, so there is no reason to switch.
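
A quick way to check the equivalence is something like the following (a minimal sketch with made-up field values; it assumes @reset accepts the broadcasted op-assignment form, as in the MWEs further down):

using Accessors, StaticArrays

x = (min = SVector(1.0, 2.0, 3.0), act = SVector(0.1, 0.2, 0.3))
y = x

@reset x.min = x.min .+ x.act   # explicit form
@reset y.min .+= y.act          # the sugared form

x == y   # expected: true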

I think a way forward would be to make this a really minimal example and post it here or submit an issue to Julia. In particular, I suspect stuff like rand() and Distributions are not relevant for this slowdown, and neither is Accessors. Did you try reducing your example?


Looking into this now, I cannot achieve the same speed with map as with manual unrolling. I second @aplavin’s suggestion to try to strip this down. I confirm that this happens even when Accessors is not used, and even if I remove the rand call, there is still a slowdown.

If we get rid of the random number, can’t the compiler just calculate the result? I think that is what I am seeing when I make it simpler.

Results differ on my PC vs laptop (which are also different Julia versions…) and depending on the optimization flag.

Slimmed MWE
using Parameters, StaticArrays, Accessors, BenchmarkTools
using Distributions, Random 

@with_kw struct Str00{N}
    fat   :: SVector{N, Float64}
    sh_c0 :: SVector{N, Float64}
    sh_cm :: SVector{N, Float64}
end


Random.seed!(1234)
bg1 = Str00(fat   = SVector{16}(fill(Float64(1.0), 16)),
            sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
            sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


bg2 = Str00(fat   = SVector{16}(fill(Float64(1.0), 16)),
            sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
            sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


gt = (bg1, bg2)


function test1(gt)
    for t in 1:100
        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat
        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat
    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            @reset x.sh_cm .= x.sh_c0 .* x.fat
            return x
        end
    end
    return gt
end


@btime  test1($gt);
@btime  test2($gt);


res1 = test1(gt);
res2 = test2(gt);

res1 == res2

System info

Laptop (i5-6300u Julia 1.10.4)
PC (2990wx Julia 1.11.0-rc-1)

Laptop

  • No flag, does not depend on outer loop length
julia> @btime  test1($gt);
  23.730 ns (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  24.456 ns (0 allocations: 0 bytes)

Unless the -O1 optimization flag is used (but not -O2/-O3):

julia> @btime  test1($gt);
  12.909 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  12.497 μs (0 allocations: 0 bytes)

PC

  • No flag, test2 depends on outer loop
julia> @btime  test1($gt);
  24.407 ns (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  6.592 μs (0 allocations: 0 bytes)

PC -O1 flag

julia> @btime  test1($gt);
  8.403 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  9.490 μs (0 allocations: 0 bytes)

PC -O2 (or -O3) flag

julia> @btime  test1($gt);
  25.980 ns (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  6.594 μs (0 allocations: 0 bytes)

I had assumed -O0 was the default flag, but now I see it is actually -O2. The results get even stranger: now test2 (the map/do block) is faster on the laptop (but not the PC):
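For reference, the optimization level actually in effect can be checked from within the session:

julia> Base.JLOptions().opt_level   # 2 unless an -O flag was passed
2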

Laptop -O0

julia> @btime  test1($gt);
  16.880 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  4.037 μs (0 allocations: 0 bytes)

PC -O0

julia> @btime  test1($gt);
  6.784 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  10.710 μs (0 allocations: 0 bytes)

So benchmarking this is a mess.

I don’t see how that would work. You need the input values to calculate the output values, surely? Everything isn’t just a bunch of static constants, as far as I understand.

If you look closely at that example, nothing was getting iterated. It was the same calculation over and over.

I.e.:

for t in 1:100
    x = x0*y
end

And still gave different results.
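
One way to check whether the compiler really folds a loop-invariant calculation like that away is to wrap it in a zero-argument function and inspect the generated code (a sketch with placeholder constants x0 and y):

using InteractiveUtils, BenchmarkTools

const x0 = 2.0
const y  = 3.0

function f()
    x = 0.0
    for t in 1:100
        x = x0 * y    # same calculation every iteration, nothing depends on t
    end
    return x
end

@code_llvm f()   # if the loop is folded away, the IR just returns a constant
@btime f()       # a couple of ns, independent of the loop length, is another hint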

Oh yeah. But what is the point of the outer loop? Is it only there for the benchmarking?

In the bigger MWE, gt.fat was being reduced by a constant plus a rand() each iteration. That was then used to update the other sh_cm, ps_cm, and tk_cm vectors.

In the final code the gt.act Bool vector would also change: it determines which elements of the vectors get updated or set to zero in that iteration (roughly as sketched below).

If it helps:
fat = fatigue
act = activity flag
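
Roughly, one update step of the bigger MWE would look like this (a sketch only; the field names follow the thread, but the values and the exact rand() term are placeholders, and it assumes @reset's broadcasted forms as in the MWEs):

using Accessors, StaticArrays

x = (act   = SVector{4}(true, false, true, true),
     fat   = SVector{4}(1.0, 1.0, 1.0, 1.0),
     ded   = SVector{4}(0.003, 0.003, 0.003, 0.003),
     sh_c0 = SVector{4}(10.0, 11.0, 12.0, 13.0),
     sh_cm = SVector{4}(0.0, 0.0, 0.0, 0.0))

@reset x.fat .-= x.act .* (x.ded .+ rand())    # fatigue drops by a constant plus rand, active elements only
@reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act    # active elements updated, inactive ones set to zero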

OK, this appears to be the smallest MWE that actually has to iterate and still shows the difference on both my PC and laptop.

MWE
using Parameters, Distributions, StaticArrays
using Accessors, Random, BenchmarkTools


@with_kw struct BigStr{N}
    act :: SVector{N, Bool}
    fat :: SVector{N, Float64}
    ded :: SVector{N, Float64}

    sh_c0 :: SVector{N, Float64}
    sh_cm :: SVector{N, Float64}
end


Random.seed!(1234)
bg1 = BigStr(act   = SVector{16}(rand(Bool, 16)),
             fat   = SVector{16}(fill(Float64(1), 16)),
             ded   = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


bg2 = BigStr(act   = SVector{16}(rand(Bool, 16)),
             fat   = SVector{16}(fill(Float64(1), 16)),
             ded   = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


gt = (bg1, bg2)


function test1(gt)
    for t in 1:100
        @reset gt[1].fat .-= gt[1].act .* (gt[1].ded)
        @reset gt[2].fat .-= gt[2].act .* (gt[2].ded)

        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act
    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            @reset x.fat  .-= x.act .* x.ded
            @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end


@btime test1($gt);
@btime test2($gt);

res1 = test1(gt);
res2 = test2(gt);

res1 == res2

Both are with no optimization flag (should be -O2 by default).

Laptop

julia> @btime test1($gt);
  781.444 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  6.935 μs (0 allocations: 0 bytes)

PC

julia> @btime test1($gt);
  793.865 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  8.360 μs (0 allocations: 0 bytes)

If I comment out the sh_cm lines, test1 and test2 perform equally on my PC, but not on my laptop.

Laptop:

julia> @btime test1($gt);
  887.312 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  1.842 μs (0 allocations: 0 bytes)

PC:

julia> @btime test1($gt);
  1.905 μs (0 allocations: 0 bytes)

julia> @btime test2($gt);
  1.906 μs (0 allocations: 0 bytes)

So, I dunno. I guess I need to try with normal vectors to remove the Accessors dependency too…

EDIT:
Also, the above all scale as expected with the number of iterations. Another thing is that it's running faster on the laptop than on the PC. The laptop has a much cheaper Intel CPU than the AMD 2990WX and an older version of Julia (1.10.4 vs 1.11.0-rc1), so that seems counterintuitive as well.

Here it is without StaticArrays and Accessors:

Normal Vector MWE
using Parameters, Distributions
using Random, BenchmarkTools


@with_kw struct BigStr
    act :: Vector{Bool}
    fat :: Vector{Float64}
    ded :: Vector{Float64}

    sh_c0 :: Vector{Float64}
    sh_cm :: Vector{Float64}
end


Random.seed!(1234)
bg1 = BigStr(act   = rand(Bool, 16),
             fat   = fill(Float64(1), 16),
             ded   = rand(Uniform(0.002, 0.004), 16),
             sh_c0 = rand(Uniform(4, 25), 16),
             sh_cm = rand(Uniform(4, 25), 16));


bg2 = BigStr(act   = rand(Bool, 16),
             fat   = fill(Float64(1), 16),
             ded   = rand(Uniform(0.002, 0.004), 16),
             sh_c0 = rand(Uniform(4, 25), 16),
             sh_cm = rand(Uniform(4, 25), 16));


gt = (bg1, bg2);


function test1(gt)
    for t in 1:100

        gt[1].fat .-= gt[1].act .* (gt[1].ded)
        gt[2].fat .-= gt[2].act .* (gt[2].ded)

        gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act

    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            x.fat  .-= x.act .* x.ded
            x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end


@btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
@btime test2(x) setup = (x = deepcopy($gt)) evals = 1;

res1 = test1(deepcopy(gt));
res2 = test2(deepcopy(gt));

res1 == res2

### Above says false but the elements all look right and match
# hcat(res1[1].act,   res2[1].act)
# hcat(res1[1].fat,   res2[1].fat)
# hcat(res1[1].ded,   res2[1].ded)
# hcat(res1[1].sh_c0, res2[1].sh_c0)
# hcat(res1[1].sh_cm, res2[1].sh_cm)

# hcat(res1[2].act,   res2[2].act)
# hcat(res1[2].fat,   res2[2].fat)
# hcat(res1[2].ded,   res2[2].ded)
# hcat(res1[2].sh_c0, res2[2].sh_c0)
# hcat(res1[2].sh_cm, res2[2].sh_cm)
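
# A likely explanation for the `false`: `==` on a struct without a custom method
# falls back to `===`, so two BigStr values holding distinct Vector objects compare
# unequal even when every element matches. A field-by-field check (sketch):
all(fieldnames(BigStr)) do f
    getfield(res1[1], f) == getfield(res2[1], f) &&
    getfield(res1[2], f) == getfield(res2[2], f)
end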

Laptop

julia> @btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
  11.078 μs (0 allocations: 0 bytes)

julia> @btime test2(x) setup = (x = deepcopy($gt)) evals = 1;
  6.729 μs (0 allocations: 0 bytes)

PC

julia> @btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
  29.381 μs (0 allocations: 0 bytes)

julia> @btime test2(x) setup = (x = deepcopy($gt)) evals = 1;
  26.00 μs (0 allocations: 0 bytes)

I spot-checked the returned values; they looked the same as in the SVector version, on both computers.

  • In contrast to the SVector version, test1 (the manually unrolled one) is now slower
  • Using SVectors was definitely faster… except for test2 (the map/do block) on the laptop.
  • My laptop is unexpectedly faster than the PC for both versions.

So the original issue does seem to depend on using SVectors/@reset.

EDIT:
Regarding the laptop vs PC issue, I thought --check-bounds=yes was the default, but it turns out it was auto.
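
For reference, the bounds-checking setting in effect can also be checked from within the session (0 = auto, 1 = --check-bounds=yes, 2 = --check-bounds=no):

julia> Base.JLOptions().check_bounds
0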

Once I left that flag off, the normal-vector version took test1: 9.2 μs and test2: 5.9 μs. I.e., 3x faster than before and slightly faster than the laptop.

For the SVector version it was test1: 794 ns and test2: 7.8 μs. I.e., those results were unaffected by the forced bounds checking.

So these benchmarks became a mess due to code complexity and machine-dependent behavior.

But my conclusion is that using SVector with the map/do-block is 2-3x slower than the manually unrolled version (at least for two iterations).

For my purposes I just moved forward with the manually unrolled function, but this would be annoying for a larger loop. Should I file a GitHub issue?