SVectors + @reset from Accessors: Strange Benchmarks

This is exactly the same as

@reset x.min .+= x.act

and the latter is in fact syntactic sugar for the former, so there is no reason to switch.
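
A quick way to check the equivalence is something like the following (a minimal sketch with made-up field values; it assumes @reset accepts the broadcasted op-assignment form, as in the MWEs further down):

using Accessors, StaticArrays

x = (min = SVector(1.0, 2.0, 3.0), act = SVector(0.1, 0.2, 0.3))
y = x

@reset x.min = x.min .+ x.act   # explicit form
@reset y.min .+= y.act          # the sugared form

x == y   # expected: true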

I think a way forward would be to make this a really minimal example and post it here or submit an issue to Julia. In particular, I suspect stuff like rand() and Distributions are not relevant for this slowdown, and neither is Accessors. Did you try reducing your example?


Looking into this now, I cannot achieve the same speed with map as with manual unrolling. I second @aplavin’s suggestion to try to strip this down. I confirm that this happens even when Accessors is not used, and even if I remove the rand call, there is still a slowdown.

If we get rid of the random number, can’t the compiler just calculate the result? I think that is what I am seeing when I make it simpler.

Results differ on my PC vs laptop (which are also different Julia versions…) and depending on the optimization flag.

Slimmed MWE
using Parameters, StaticArrays, Accessors, BenchmarkTools
using Distributions, Random 

@with_kw struct Str00{N}
    fat   :: SVector{N, Float64}
    sh_c0 :: SVector{N, Float64}
    sh_cm :: SVector{N, Float64}
end


Random.seed!(1234)
bg1 = Str00(fat   = SVector{16}(fill(Float64(1.0), 16)),
            sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
            sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


bg2 = Str00(fat   = SVector{16}(fill(Float64(1.0), 16)),
            sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
            sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


gt = (bg1, bg2)


function test1(gt)
    for t in 1:100
        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat
        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat
    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            @reset x.sh_cm .= x.sh_c0 .* x.fat
            return x
        end
    end
    return gt
end


@btime  test1($gt);
@btime  test2($gt);


res1 = test1(gt);
res2 = test2(gt);

res1 == res2

System info

Laptop (i5-6300u Julia 1.10.4)
PC (2990wx Julia 1.11.0-rc-1)

Laptop

  • No flag, does not depend on outer loop length
julia> @btime  test1($gt);
  23.730 ns (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  24.456 ns (0 allocations: 0 bytes)

Unless the -O1 optimization flag is used (but not -O2/-O3):

julia> @btime  test1($gt);
  12.909 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  12.497 μs (0 allocations: 0 bytes)

PC

  • No flag, test2 depends on outer loop
julia> @btime  test1($gt);
  24.407 ns (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  6.592 μs (0 allocations: 0 bytes)

PC -O1 flag

julia> @btime  test1($gt);
  8.403 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  9.490 μs (0 allocations: 0 bytes)

PC -O2 (or -O3) flag

julia> @btime  test1($gt);
  25.980 ns (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  6.594 μs (0 allocations: 0 bytes)

I had assumed -O0 was the default flag, but now I see it is actually -O2. The results get even stranger: now test2 (the map/do block) is faster on the laptop (but not the PC):
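For reference, the optimization level actually in effect can be checked from within the session:

julia> Base.JLOptions().opt_level   # 2 unless an -O flag was passed
2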

Laptop -O0

julia> @btime  test1($gt);
  16.880 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  4.037 μs (0 allocations: 0 bytes)

PC -O0

julia> @btime  test1($gt);
  6.784 μs (0 allocations: 0 bytes)

julia> @btime  test2($gt);
  10.710 μs (0 allocations: 0 bytes)

So benchmarking this is a mess.

I don’t see how that would work. You need the input values to calculate the output values, surely? Everything isn’t just a bunch of static constants, as far as I understand.

If you look closely at that example, nothing was getting iterated. It was the same calculation over and over.

I.e.:

for t in 1:100
    x = x0*y
end

And still gave different results.
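
One way to check whether the compiler really folds a loop-invariant calculation like that away is to wrap it in a zero-argument function and inspect the generated code (a sketch with placeholder constants x0 and y):

using InteractiveUtils, BenchmarkTools

const x0 = 2.0
const y  = 3.0

function f()
    x = 0.0
    for t in 1:100
        x = x0 * y    # same calculation every iteration, nothing depends on t
    end
    return x
end

@code_llvm f()   # if the loop is folded away, the IR just returns a constant
@btime f()       # a couple of ns, independent of the loop length, is another hint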

Oh yeah. But what is the point of the outer loop? Is it only there for the benchmarking?

In the bigger MWE, gt.fat was being reduced by a constant plus a rand() each iteration. That was then used to update the other sh_cm, ps_cm, and tk_cm vectors.

In the final code the gt.act Bool vector would also change: it determines which elements of the vectors get updated or set to zero in that iteration (roughly as sketched below).

If it helps:
fat = fatigue
act = activity flag
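
Roughly, one update step of the bigger MWE would look like this (a sketch only; the field names follow the thread, but the values and the exact rand() term are placeholders, and it assumes @reset's broadcasted forms as in the MWEs):

using Accessors, StaticArrays

x = (act   = SVector{4}(true, false, true, true),
     fat   = SVector{4}(1.0, 1.0, 1.0, 1.0),
     ded   = SVector{4}(0.003, 0.003, 0.003, 0.003),
     sh_c0 = SVector{4}(10.0, 11.0, 12.0, 13.0),
     sh_cm = SVector{4}(0.0, 0.0, 0.0, 0.0))

@reset x.fat .-= x.act .* (x.ded .+ rand())    # fatigue drops by a constant plus rand, active elements only
@reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act    # active elements updated, inactive ones set to zero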

OK, this appears to be the smallest MWE that actually has to iterate and still shows the difference on both my PC and laptop.

MWE
using Parameters, Distributions, StaticArrays
using Accessors, Random, BenchmarkTools


@with_kw struct BigStr{N}
    act :: SVector{N, Bool}
    fat :: SVector{N, Float64}
    ded :: SVector{N, Float64}

    sh_c0 :: SVector{N, Float64}
    sh_cm :: SVector{N, Float64}
end


Random.seed!(1234)
bg1 = BigStr(act   = SVector{16}(rand(Bool, 16)),
             fat   = SVector{16}(fill(Float64(1), 16)),
             ded   = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


bg2 = BigStr(act   = SVector{16}(rand(Bool, 16)),
             fat   = SVector{16}(fill(Float64(1), 16)),
             ded   = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))


gt = (bg1, bg2)


function test1(gt)
    for t in 1:100
        @reset gt[1].fat .-= gt[1].act .* (gt[1].ded)
        @reset gt[2].fat .-= gt[2].act .* (gt[2].ded)

        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act
    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            @reset x.fat  .-= x.act .* x.ded
            @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end


@btime test1($gt);
@btime test2($gt);

res1 = test1(gt);
res2 = test2(gt);

res1 == res2

Both are with no optimization flag (should be -O2 by default).

Laptop

julia> @btime test1($gt);
  781.444 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  6.935 μs (0 allocations: 0 bytes)

PC

julia> @btime test1($gt);
  793.865 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  8.360 μs (0 allocations: 0 bytes)

If I comment out the sh_cm lines, test1 and test2 perform equally on my PC, but not on my laptop.

Laptop:

julia> @btime test1($gt);
  887.312 ns (0 allocations: 0 bytes)

julia> @btime test2($gt);
  1.842 μs (0 allocations: 0 bytes)

PC:

julia> @btime test1($gt);
  1.905 μs (0 allocations: 0 bytes)

julia> @btime test2($gt);
  1.906 μs (0 allocations: 0 bytes)

So, I dunno. I guess I need to try with normal vectors to remove the Accessors dependency too…

EDIT:
Also, the above all scale as expected with the number of iterations. Another thing is that it's running faster on the laptop than on the PC. The laptop has a much cheaper Intel CPU than the AMD 2990WX and an older version of Julia (1.10.4 vs 1.11.0-rc1), so that seems counterintuitive as well.

Here it is without StaticArrays and Accessors:

Normal Vector MWE
using Parameters, Distributions
using Random, BenchmarkTools


@with_kw struct BigStr
    act :: Vector{Bool}
    fat :: Vector{Float64}
    ded :: Vector{Float64}

    sh_c0 :: Vector{Float64}
    sh_cm :: Vector{Float64}
end


Random.seed!(1234)
bg1 = BigStr(act   = rand(Bool, 16),
             fat   = fill(Float64(1), 16),
             ded   = rand(Uniform(0.002, 0.004), 16),
             sh_c0 = rand(Uniform(4, 25), 16),
             sh_cm = rand(Uniform(4, 25), 16));


bg2 = BigStr(act   = rand(Bool, 16),
             fat   = fill(Float64(1), 16),
             ded   = rand(Uniform(0.002, 0.004), 16),
             sh_c0 = rand(Uniform(4, 25), 16),
             sh_cm = rand(Uniform(4, 25), 16));


gt = (bg1, bg2);


function test1(gt)
    for t in 1:100

        gt[1].fat .-= gt[1].act .* (gt[1].ded)
        gt[2].fat .-= gt[2].act .* (gt[2].ded)

        gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act

    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            x.fat  .-= x.act .* x.ded
            x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end


@btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
@btime test2(x) setup = (x = deepcopy($gt)) evals = 1;

res1 = test1(deepcopy(gt));
res2 = test2(deepcopy(gt));

res1 == res2

### Above says false but the elements all look right and match
# hcat(res1[1].act,   res2[1].act)
# hcat(res1[1].fat,   res2[1].fat)
# hcat(res1[1].ded,   res2[1].ded)
# hcat(res1[1].sh_c0, res2[1].sh_c0)
# hcat(res1[1].sh_cm, res2[1].sh_cm)

# hcat(res1[2].act,   res2[2].act)
# hcat(res1[2].fat,   res2[2].fat)
# hcat(res1[2].ded,   res2[2].ded)
# hcat(res1[2].sh_c0, res2[2].sh_c0)
# hcat(res1[2].sh_cm, res2[2].sh_cm)
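
# A likely explanation for the `false`: `==` on a struct without a custom method
# falls back to `===`, so two BigStr values holding distinct Vector objects compare
# unequal even when every element matches. A field-by-field check (sketch):
all(fieldnames(BigStr)) do f
    getfield(res1[1], f) == getfield(res2[1], f) &&
    getfield(res1[2], f) == getfield(res2[2], f)
end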

Laptop

julia> @btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
  11.078 μs (0 allocations: 0 bytes)

julia> @btime test2(x) setup = (x = deepcopy($gt)) evals = 1;
  6.729 μs (0 allocations: 0 bytes)

PC

julia> @btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
  29.381 μs (0 allocations: 0 bytes)

julia> @btime test2(x) setup = (x = deepcopy($gt)) evals = 1;
  26.00 μs (0 allocations: 0 bytes)

I spot-checked the returned values; they looked the same as in the SVector version, on both computers.

  • In contrast to the SVector version, test1 (the manually unrolled one) is now slower
  • Using SVectors was definitely faster… except for test2 (the map/do block) on the laptop.
  • My laptop is unexpectedly faster than the PC for both versions.

So the original issue does seem to depend on using SVectors/@reset.

EDIT:
Regarding the laptop vs PC issue, I thought --check-bounds=yes was the default, but it turns out it was auto.
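
For reference, the bounds-checking setting in effect can also be checked from within the session (0 = auto, 1 = --check-bounds=yes, 2 = --check-bounds=no):

julia> Base.JLOptions().check_bounds
0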

Once I left that flag off, the normal-vector version took test1: 9.2 μs and test2: 5.9 μs. I.e., 3x faster than before and slightly faster than the laptop.

For the SVector version it was test1: 794 ns and test2: 7.8 μs. I.e., those results were unaffected by the forced bounds checking.

So these benchmarks became a mess due to code complexity and machine-dependent behavior.

But my conclusion is that using SVector with the map/do-block is 2-3x slower than the manually unrolled version (at least for two iterations).

For my purposes I just moved forward with the manually unrolled function, but this would be annoying for a larger loop. Should I file a GitHub issue?