SVectors + @reset from Accessors: Strange Benchmarks

This is exactly the same as

@reset x.min .+= x.act

and the latter is in fact syntactic sugar for the former, so there is no reason to switch.
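If you want to verify the equivalence yourself, comparing the macro expansions of the two spellings is a quick check (a sketch; the exact expansion depends on your Accessors version, and the values below are hypothetical throwaways):

using Accessors, StaticArrays

x = (min = SVector(1.0, 2.0), act = SVector(0.1, 0.2))   # hypothetical values; only the expansions are inspected
@macroexpand @reset x.min .+= x.act
@macroexpand @reset x.min .= x.min .+ x.act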

I think a way forward would be to reduce this to a truly minimal example and post it here or submit an issue to Julia. In particular, I suspect that rand() and Distributions are not relevant to this slowdown, and neither is Accessors. Did you try reducing your example?


Looking into this now, I cannot achieve the same speed with map as with manual unrolling. I second @aplavin's suggestion to try to strip this down. I confirm that this happens even when Accessors is not used, and even if I remove the rand call, there is still a slowdown.

If we get rid of the random number, can't the compiler just calculate the result? I think that is what I am seeing when I make it simpler.
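This is the kind of folding I mean (a toy sketch, not the MWE itself): with a fixed trip count and no external inputs, the compiler can evaluate the whole loop at compile time.

function toy()
    s = 0
    for t in 1:100
        s += t
    end
    return s
end

using InteractiveUtils
@code_llvm toy()   # typically just `ret i64 5050`: the loop was folded away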

Results differ between my PC and laptop (which also run different Julia versions...) and depend on the optimization flag.
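Since the numbers below depend on both the machine and the flags, here is one way to record what a given session is actually running with (assuming a recent Julia; opt_level is the effective -O level):

using InteractiveUtils
versioninfo()                  # Julia version, OS, CPU
Base.JLOptions().opt_level     # effective -O level for this session (2 is the default)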

Slimmed MWE
using Parameters, StaticArrays, Accessors, BenchmarkTools
using Distributions, Random 

@with_kw struct Str00{N}
    fat   :: SVector{N, Float64}
    sh_c0 :: SVector{N, Float64}
    sh_cm :: SVector{N, Float64}
end


Random.seed!(1234)
bg1 = Str00(fat   = SVector{16}(fill(Float64(1.0), 16)),
            sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
            sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


bg2 = Str00(fat   = SVector{16}(fill(Float64(1.0), 16)),
            sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
            sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


gt = (bg1, bg2)


function test1(gt)
    for t in 1:100
        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat
        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat
    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            @reset x.sh_cm .= x.sh_c0 .* x.fat
            return x
        end
    end
    return gt
end


@btime  test1($gt);
@btime  test2($gt);


res1 = test1(gt);
res2 = test2(gt);

res1 == res2
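If anyone wants to dig into where the two versions diverge, comparing the optimized IR of both is one option (a suggestion, not something I have done yet):

using InteractiveUtils
@code_llvm test1(gt)
@code_llvm test2(gt)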

System info

Laptop (i5-6300U, Julia 1.10.4)
PC (2990WX, Julia 1.11.0-rc1)

Laptop

  • No flag; the timing does not depend on the outer loop length
julia> @btime  test1($gt);
  23.730 ns (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  24.456 ns (0 allocations: 0 bytes)

Unless the -O1 optimization flag is used (the slowdown does not appear with -O2/-O3):

julia> @btime  test1($gt);
  12.909 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  12.497 μs (0 allocations: 0 bytes)

PC

  • No flag; test2 depends on the outer loop length
julia> @btime  test1($gt);
  24.407 ns (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  6.592 μs (0 allocations: 0 bytes)

PC -O1 flag

julia> @btime  test1($gt);
  8.403 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  9.490 μs (0 allocations: 0 bytes)

PC -O2 (or -O3) flag

julia> @btime  test1($gt);
  25.980 ns (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  6.594 μs (0 allocations: 0 bytes)

I had assumed the -O0 flag was the default, but now I see it is actually -O2. The results get even stranger: now test2 (the map/do block) is faster on the laptop (but not the PC):

Laptop -O0

julia> @btime  test1($gt);
  16.880 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  4.037 μs (0 allocations: 0 bytes)

PC -O0

julia> @btime  test1($gt);
  6.784 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  10.710 μs (0 allocations: 0 bytes)

So benchmarking this is a mess.

I don't see how that would work. You need the input values to calculate the output values, surely? Everything isn't just a bunch of static constants, as far as I understand.

If you look closely at that example, nothing was getting iterated. It was the same calculation over and over.

I.e.:

for t in 1:100
    x = x0*y
end

And it still gave different results.
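A toy illustration of what I mean (hypothetical function, assuming BenchmarkTools is loaded): when the body is loop-invariant, the compiler can hoist or elide the loop, so the timing barely changes with the trip count.

function f(x0, y, n)
    x = x0
    for t in 1:n
        x = x0 * y        # same value every iteration, so the loop can be elided
    end
    return x
end

using BenchmarkTools
@btime f(2.0, 3.0, 100);
@btime f(2.0, 3.0, 100_000);   # often essentially the same time as above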

Oh yeah. But what is the point of the outer loop? Is it only there for the benchmarking?

In the bigger MWE, gt.fat was being reduced by a constant plus rand each iteration. That was then used to further iterate the other sh_cm, ps_cm, tk_cm vectors.

In the final code the gt.act Bool vector would also change. That determines which elements of the vectors are iterated or set to zero in that iteration.

If it helps:
fat = fatigue
act = activity flag

OK, this appears to be the smallest MWE that actually has to iterate and still shows the difference on both my PC and laptop.

MWE
using Parameters, Distributions, StaticArrays
using Accessors, Random, BenchmarkTools


@with_kw struct BigStr{N}
    act :: SVector{N, Bool}
    fat :: SVector{N, Float64}
    ded :: SVector{N, Float64}

    sh_c0 :: SVector{N, Float64}
    sh_cm :: SVector{N, Float64}
end


Random.seed!(1234)
bg1 = BigStr(act   = SVector{16}(rand(Bool, 16)),
             fat   = SVector{16}(fill(Float64(1), 16)),
             ded   = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


bg2 = BigStr(act   = SVector{16}(rand(Bool, 16)),
             fat   = SVector{16}(fill(Float64(1), 16)),
             ded   = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


gt = (bg1, bg2)


function test1(gt)
    for t in 1:100
        @reset gt[1].fat .-= gt[1].act .* (gt[1].ded)
        @reset gt[2].fat .-= gt[2].act .* (gt[2].ded)

        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act
    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            @reset x.fat  .-= x.act .* x.ded
            @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end


@btime test1($gt);
@btime test2($gt);

res1 = test1(gt);
res2 = test2(gt);

res1 == res2

Both are with no optimization flag (should be -O2 by default).

Laptop

julia> @btime test1($gt);
  781.444 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  6.935 μs (0 allocations: 0 bytes)

PC

julia> @btime test1($gt);
  793.865 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  8.360 μs (0 allocations: 0 bytes)

If I comment out the sh_cm lines, then test1 and test2 performance is equal on my PC, but not on my laptop.

Laptop:

julia> @btime test1($gt);
  887.312 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  1.842 μs (0 allocations: 0 bytes)

PC:

julia> @btime test1($gt);
  1.905 μs (0 allocations: 0 bytes)

julia> @btime test2($gt);
  1.906 μs (0 allocations: 0 bytes)

So, I dunno. I guess I need to try with normal vectors to remove the Accessors dependency too...

EDIT:
Also, the above all scales as expected with the number of iterations. Another thing is that it's running faster on the laptop than on the PC. The laptop has a much cheaper Intel CPU than the AMD 2990WX and an older version of Julia (1.10.4 vs 1.11.0-rc1). So that seems counterintuitive as well.

Here it is without StaticArrays and Accessors:

Normal Vector MWE
using Parameters, Distributions
using Random, BenchmarkTools


@with_kw struct BigStr
    act :: Vector{Bool}
    fat :: Vector{Float64}
    ded :: Vector{Float64}

    sh_c0 :: Vector{Float64}
    sh_cm :: Vector{Float64}
end


Random.seed!(1234)
bg1 = BigStr(act   = rand(Bool, 16),
             fat   = fill(Float64(1), 16),
             ded   = rand(Uniform(0.002, 0.004), 16),
             sh_c0 = rand(Uniform(4, 25), 16),
             sh_cm = rand(Uniform(4, 25), 16));


bg2 = BigStr(act   = rand(Bool, 16),
             fat   = fill(Float64(1), 16),
             ded   = rand(Uniform(0.002, 0.004), 16),
             sh_c0 = rand(Uniform(4, 25), 16),
             sh_cm = rand(Uniform(4, 25), 16));


gt = (bg1, bg2);


function test1(gt)
    for t in 1:100

        gt[1].fat .-= gt[1].act .* (gt[1].ded)
        gt[2].fat .-= gt[2].act .* (gt[2].ded)

        gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act

    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            x.fat  .-= x.act .* x.ded
            x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end


@btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
@btime test2(x) setup = (x = deepcopy($gt)) evals = 1;

res1 = test1(deepcopy(gt));
res2 = test2(deepcopy(gt));

res1 == res2

### Above says false but the elements all look right and match
### (probably because the default == for BigStr falls back to ===, which compares the Vector fields by identity rather than contents)
# hcat(res1[1].act,   res2[1].act)
# hcat(res1[1].fat,   res2[1].fat)
# hcat(res1[1].ded,   res2[1].ded)
# hcat(res1[1].sh_c0, res2[1].sh_c0)
# hcat(res1[1].sh_cm, res2[1].sh_cm)

# hcat(res1[2].act,   res2[2].act)
# hcat(res1[2].fat,   res2[2].fat)
# hcat(res1[2].ded,   res2[2].ded)
# hcat(res1[2].sh_c0, res2[2].sh_c0)
# hcat(res1[2].sh_cm, res2[2].sh_cm)

Laptop

julia> @btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
  11.078 μs (0 allocations: 0 bytes)

julia> @btime test2(x) setup = (x = deepcopy($gt)) evals = 1;
  6.729 μs (0 allocations: 0 bytes)

PC

julia> @btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
  29.381 μs (0 allocations: 0 bytes)

julia> @btime test2(x) setup = (x = deepcopy($gt)) evals = 1;
  26.00 μs (0 allocations: 0 bytes)

I spot-checked the returned values; they looked the same as in the SVector version, on both computers.

  • In contrast to using SVectors, now test1 (manually unrolled) is slower
  • Using SVectors was definitely faster... except for test2 (map/do block) on the laptop.
  • My laptop is unexpectedly faster than the PC for both versions.

So the original issue does seem to depend on using SVectors/@reset.

EDIT:
Regarding the laptop vs PC issue, I thought --check-bounds=yes was the default, but it turns out it was auto.

Once I left off that flag, the normal-Vector functions took test1: 9.2 μs and test2: 5.9 μs. I.e., 3x faster than before and slightly faster than the laptop.

For SVector it was test1: 794 ns and test2: 7.8 μs. I.e., the results were unaffected by the forced bounds checking.
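For anyone double-checking their own session, this reports the effective setting (assuming a recent Julia; to my knowledge 0 = auto, 1 = forced on, 2 = forced off):

Base.JLOptions().check_bounds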

So these benchmarks became a mess due to code complexity and machine dependence.

But my conclusion is that using SVector with the map/do-block is 2-3x slower than the manually unrolled version (at least when mapping over just two elements).

For my purposes I just moved forward with the manually unrolled function, but this would be annoying for a larger loop. File a GitHub issue?
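One possible middle ground, sketched below for the SVector MWE above (untested, so treat it as a hypothetical rather than a fix): ntuple with a Val length usually unrolls for small tuples, which keeps the per-element @reset pattern without writing each element out by hand.

# Hypothetical alternative to manual unrolling, reusing the SVector MWE definitions above.
function test3(gt::NTuple{N,Any}) where {N}
    for t in 1:100
        gt = let g = gt                 # fresh binding so the closure does not box `gt`
            ntuple(Val(N)) do i
                x = g[i]
                @reset x.fat  .-= x.act .* x.ded
                @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
                x
            end
        end
    end
    return gt
end

@btime test3($gt);

Whether this recovers the unrolled speed is exactly the open question, so it would be worth including in a GitHub issue.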