This is exactly the same as @reset x.min .+= x.act, and the latter is in fact syntactic sugar for the former, so there is no reason to switch.
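If in doubt, the expansion can be checked directly with @macroexpand (the set/@optic form in the comment is my guess at roughly what it produces; verify in a REPL):
using Accessors
# Prints the generated code without needing `x` to be defined:
@macroexpand @reset x.min .+= x.act
# I would expect something morally equivalent to:
#   x = set(x, @optic(_.min), x.min .+ x.act)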
I think a way forward would be to make it a truly minimal example and post it here / submit an issue to Julia. In particular, I suspect stuff like rand() and Distributions are not relevant for this slowdown, and neither is Accessors. Did you try reducing your example?
Looking into this now, I cannot achieve the same speed with map as with manual unrolling. I second @aplavin's suggestion to try to strip this down. I confirm that this happens even when Accessors is not used, and even if I remove the rand call, there is still a slowdown.
If we get rid of the random number, can't the compiler just calculate the result? I think that is what I am seeing when I make it simpler.
Results differ on my PC vs laptop (which are also on different Julia versions…) and depending on the optimization flag.
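As a side note, here is a minimal sketch (my own toy example, separate from the MWE below) of what I mean by the compiler calculating the result: with no rand() call the loop is a pure function of constants, so the compiler is free to fold it away entirely.
using BenchmarkTools

function folded()
    x = 1.0
    for t in 1:100
        x *= 0.999   # no randomness: the result is a pure function of constants
    end
    return x
end

@btime folded()   # can report ~1 ns when the whole loop is constant-folded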
using Parameters, StaticArrays, Accessors, BenchmarkTools
using Distributions, Random
@with_kw struct Str00{N}
    fat   :: SVector{N, Float64}
    sh_c0 :: SVector{N, Float64}
    sh_cm :: SVector{N, Float64}
end

Random.seed!(1234)

bg1 = Str00(fat = SVector{16}(fill(Float64(1.0), 16)),
            sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
            sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))
bg2 = Str00(fat = SVector{16}(fill(Float64(1.0), 16)),
            sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
            sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))
gt = (bg1, bg2)

function test1(gt)
    for t in 1:100
        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat
        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat
    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            @reset x.sh_cm .= x.sh_c0 .* x.fat
            return x
        end
    end
    return gt
end
@btime test1($gt);
@btime test2($gt);
res1 = test1(gt);
res2 = test2(gt);
res1 == res2
System info
Laptop (i5-6300u Julia 1.10.4)
PC (2990wx Julia 1.11.0-rc1)
Laptop
julia> @btime test1($gt);
23.730 ns (0 allocations: 0 bytes)
julia> @btime test2($gt);
24.456 ns (0 allocations: 0 bytes)
Unless the -O1 optimization flag is used (but not -O2/-O3):
julia> @btime test1($gt);
12.909 μs (0 allocations: 0 bytes)
julia> @btime test2($gt);
12.497 μs (0 allocations: 0 bytes)
PC
julia> @btime test1($gt);
24.407 ns (0 allocations: 0 bytes)
julia> @btime test2($gt);
6.592 μs (0 allocations: 0 bytes)
PC -O1 flag
julia> @btime test1($gt);
8.403 μs (0 allocations: 0 bytes)
julia> @btime test2($gt);
9.490 μs (0 allocations: 0 bytes)
PC -O2 (or -O3) flag
julia> @btime test1($gt);
25.980 ns (0 allocations: 0 bytes)
julia> @btime test2($gt);
6.594 μs (0 allocations: 0 bytes)
I had assumed the -O0 flag was the default, but now I see it is actually -O2. The results get even stranger: now test2 (the map/do block) is faster on the laptop (but not the PC):
Laptop -O0
julia> @btime test1($gt);
16.880 μs (0 allocations: 0 bytes)
julia> @btime test2($gt);
4.037 μs (0 allocations: 0 bytes)
PC -O0
julia> @btime test1($gt);
6.784 μs (0 allocations: 0 bytes)
julia> @btime test2($gt);
10.710 μs (0 allocations: 0 bytes)
So benchmarking this is a mess.
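One sanity check when juggling optimization flags is to ask the session which level it is actually running at (Base.JLOptions is internal, not a stable API, so treat this as a debugging trick only):
Base.JLOptions().opt_level   # 0, 1, 2, or 3, matching the -O flag (default is 2)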
I don't see how that would work. You need the input values to calculate the output values, surely? Everything isn't just a bunch of static constants, as far as I understand.
If you look closely at that example, nothing was getting iterated. It was the same calculation over and over, i.e.:
for t in 1:100
    x = x0 * y
end
And it still gave different results.
Oh yeah. But what is the point of the outer loop? Is it only there for the benchmarking?
In the bigger MWE, gt.fat was being reduced by a constant plus rand() each iteration. That was then used to further update the other sh_cm, ps_cm, tk_cm vectors.
In the final code, the gt.act Bool vector would also change; it determines which elements of the vectors are updated or set to zero in each iteration.
If it helps:
fat = fatigue
act = activity flag
OK, this appears to be the smallest MWE that has to iterate and shows the difference on both my PC and laptop.
using Parameters, Distributions, StaticArrays
using Accessors, Random, BenchmarkTools
@with_kw struct BigStr{N}
    act   :: SVector{N, Bool}
    fat   :: SVector{N, Float64}
    ded   :: SVector{N, Float64}
    sh_c0 :: SVector{N, Float64}
    sh_cm :: SVector{N, Float64}
end

Random.seed!(1234)

bg1 = BigStr(act = SVector{16}(rand(Bool, 16)),
             fat = SVector{16}(fill(Float64(1), 16)),
             ded = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))
bg2 = BigStr(act = SVector{16}(rand(Bool, 16)),
             fat = SVector{16}(fill(Float64(1), 16)),
             ded = SVector{16}(rand(Uniform(0.002, 0.004), 16)),
             sh_c0 = SVector{16}(rand(Uniform(4, 25), 16)),
             sh_cm = SVector{16}(rand(Uniform(4, 25), 16)))
gt = (bg1, bg2)

function test1(gt)
    for t in 1:100
        @reset gt[1].fat .-= gt[1].act .* gt[1].ded
        @reset gt[2].fat .-= gt[2].act .* gt[2].ded
        @reset gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        @reset gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act
    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            @reset x.fat .-= x.act .* x.ded
            @reset x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end
@btime test1($gt);
@btime test2($gt);
res1 = test1(gt);
res2 = test2(gt);
res1 == res2
Both are with no optimization flag (should be -O2 by default).
Laptop
julia> @btime test1($gt);
781.444 ns (0 allocations: 0 bytes)
julia> @btime test2($gt);
6.935 μs (0 allocations: 0 bytes)
PC
julia> @btime test1($gt);
793.865 ns (0 allocations: 0 bytes)
julia> @btime test2($gt);
8.360 μs (0 allocations: 0 bytes)
If I comment out the sh_cm lines, then test1 and test2 perform equally on my PC, but not on my laptop.
Laptop:
julia> @btime test1($gt);
887.312 ns (0 allocations: 0 bytes)
julia> @btime test2($gt);
1.842 μs (0 allocations: 0 bytes)
PC:
julia> @btime test1($gt);
1.905 μs (0 allocations: 0 bytes)
julia> @btime test2($gt);
1.906 μs (0 allocations: 0 bytes)
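It might also help to compare what the compiler actually emits for the two versions (standard introspection macros; just a diagnostic idea, I have not gone through the output for this exact code):
@code_llvm debuginfo=:none test1(gt)
@code_llvm debuginfo=:none test2(gt)   # e.g. check whether the map/do closure gets unrolled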
So, I dunno. I guess I need to try with normal vectors to remove the Accessors dependency too…
EDIT:
Also, the above all scales as expected with the number of iterations. Another thing is that it's running faster on the laptop than on the PC. The laptop has a much cheaper Intel CPU vs the AMD 2990wx, and an older version of Julia (1.10.4 vs 1.11.0-rc1). So that seems counterintuitive as well.
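For the laptop-vs-PC comparisons it is probably worth recording versioninfo() from both machines (built-in; prints the Julia version, OS, CPU, and JULIA_* environment variables):
using InteractiveUtils   # already loaded in the REPL
versioninfo()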
Here it is without StaticArrays and Accessors:
using Parameters, Distributions
using Random, BenchmarkTools
@with_kw struct BigStr
    act   :: Vector{Bool}
    fat   :: Vector{Float64}
    ded   :: Vector{Float64}
    sh_c0 :: Vector{Float64}
    sh_cm :: Vector{Float64}
end

Random.seed!(1234)

bg1 = BigStr(act = rand(Bool, 16),
             fat = fill(Float64(1), 16),
             ded = rand(Uniform(0.002, 0.004), 16),
             sh_c0 = rand(Uniform(4, 25), 16),
             sh_cm = rand(Uniform(4, 25), 16));
bg2 = BigStr(act = rand(Bool, 16),
             fat = fill(Float64(1), 16),
             ded = rand(Uniform(0.002, 0.004), 16),
             sh_c0 = rand(Uniform(4, 25), 16),
             sh_cm = rand(Uniform(4, 25), 16));
gt = (bg1, bg2);

function test1(gt)
    for t in 1:100
        gt[1].fat .-= gt[1].act .* gt[1].ded
        gt[2].fat .-= gt[2].act .* gt[2].ded
        gt[1].sh_cm .= gt[1].sh_c0 .* gt[1].fat .* gt[1].act
        gt[2].sh_cm .= gt[2].sh_c0 .* gt[2].fat .* gt[2].act
    end
    return gt
end

function test2(gt)
    for t in 1:100
        gt = map(gt) do x
            x.fat .-= x.act .* x.ded
            x.sh_cm .= x.sh_c0 .* x.fat .* x.act
            return x
        end
    end
    return gt
end
@btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
@btime test2(x) setup = (x = deepcopy($gt)) evals = 1;
res1 = test1(deepcopy(gt));
res2 = test2(deepcopy(gt));
res1 == res2
### Above says false but the elements all look right and match
# hcat(res1[1].act, res2[1].act)
# hcat(res1[1].fat, res2[1].fat)
# hcat(res1[1].ded, res2[1].ded)
# hcat(res1[1].sh_c0, res2[1].sh_c0)
# hcat(res1[1].sh_cm, res2[1].sh_cm)
# hcat(res1[2].act, res2[2].act)
# hcat(res1[2].fat, res2[2].fat)
# hcat(res1[2].ded, res2[2].ded)
# hcat(res1[2].sh_c0, res2[2].sh_c0)
# hcat(res1[2].sh_cm, res2[2].sh_cm)
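A note on the false from res1 == res2, which I believe is expected here: BigStr defines no == method, so == falls back to ===, and that compares the mutable Vector fields by identity rather than by value (the SVector version compares equal because SVectors are immutable). A field-wise check along these lines should return true:
all(getfield(res1[i], f) == getfield(res2[i], f)
    for i in 1:2, f in fieldnames(BigStr))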
Laptop
julia> @btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
11.078 μs (0 allocations: 0 bytes)
julia> @btime test2(x) setup = (x = deepcopy($gt)) evals = 1;
6.729 μs (0 allocations: 0 bytes)
PC
julia> @btime test1(x) setup = (x = deepcopy($gt)) evals = 1;
29.381 μs (0 allocations: 0 bytes)
julia> @btime test2(x) setup = (x = deepcopy($gt)) evals = 1;
26.00 μs (0 allocations: 0 bytes)
I spot-checked the returned values; they looked the same as the SVector version, on both computers. And test2 (the map/do block) is now faster than test1 on the laptop. So the original issue does seem to depend on using SVectors/@reset.
EDIT:
Regarding the laptop vs PC issue, I thought --check-bounds=yes was the default, but it turns out it was auto. Once I left off that flag, the normal-vector functions took test1: 9.2 μs and test2: 5.9 μs, i.e. 3x faster than before and slightly faster than the laptop.
For SVector it was test1: 794 ns and test2: 7.8 μs, i.e. the results were unaffected by the forced bounds checking (presumably because --check-bounds=yes overrides the @inbounds annotations used in Vector broadcasting, while the SVector code is fully unrolled at compile time).
So these benchmarks became a mess due to code complexity and device dependencies.
But my conclusion is that using SVector with the map/do block is 2-3x slower than manual unrolling (at least for a tuple of two elements).
For my purposes I just moved forward with the manually unrolled function, but this would be annoying for a larger loop. File a GitHub issue?