 A macro to unroll by hand but not by hand?

Hey,
I have a computation that we already discussed here, and it still takes up more than half of my total runtime (so days…).

I did find a very clunky way to cut its runtime in half by manually unrolling a loop, which allowed me to remove some allocations. Unfortunately, this only works for a given value of M, as I have to write the code by hand (writing the loop will prevent @tturbo from working, since its order matters). Maybe there is a way to do it programmatically for all M?

Here is the code:

using StaticArrays, LoopVectorization, BenchmarkTools
function exp_and_mean!(slack, D, N)
    zz = zero(eltype(D))
    @tturbo for i in 1:N
        slack[i] = exp(-D[i])
        zz += slack[i]
    end
    zz /= N
    return zz
end

function prod_and_mean!(slack, D, N)
    zz = zero(eltype(D))
    @tturbo for i in 1:N
        slack[i] *= D[i]
        zz += slack[i]
    end
    zz /= N
    return zz
end

function sensitive_function!(rez, slack, D)
    M = length(rez)
    N = length(D)
    rez[1] = exp_and_mean!(slack, D, N)
    for k in 2:M
        rez[k] = prod_and_mean!(slack, D, N)
    end
    return rez
end

function v2_only_when_M_is_10(rez, D)
    M = length(rez)
    @assert M == 10
    N = length(D)
    r₁ = r₂ = r₃ = r₄ = r₅ = r₆ = r₇ = r₈ = r₉ = r₁₀ = 0.0
    @tturbo for i in 1:N
        Dᵢ = D[i]
        eᵢ = exp(-Dᵢ)
        r₁ += eᵢ
        eᵢ *= Dᵢ
        r₂ += eᵢ
        eᵢ *= Dᵢ
        r₃ += eᵢ
        eᵢ *= Dᵢ
        r₄ += eᵢ
        eᵢ *= Dᵢ
        r₅ += eᵢ
        eᵢ *= Dᵢ
        r₆ += eᵢ
        eᵢ *= Dᵢ
        r₇ += eᵢ
        eᵢ *= Dᵢ
        r₈ += eᵢ
        eᵢ *= Dᵢ
        r₉ += eᵢ
        eᵢ *= Dᵢ
        r₁₀ += eᵢ
    end
    rez .= [r₁, r₂, r₃, r₄, r₅, r₆, r₇, r₈, r₉, r₁₀] ./ N
    return nothing
end

# Simple but typical use case:
M = 10
N = 100000
D = exp.(randn(N))
slack = zero(D)
rez = MVector{M,Float64}(zeros(M))
sensitive_function!(rez,slack,D)
rez2 = MVector{M,Float64}(zeros(M))
v2_only_when_M_is_10(rez2,D)
@assert all(rez .≈ rez2)

b = @benchmark sensitive_function!($rez, $slack, $D);
display(b)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):  152.000 μs …  1.111 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     296.400 μs              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   260.313 μs ± 62.834 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

#    ▃▇▁▃▃▁       ▃▄▃▂▁▁▁                 ▅▄▃▄█▇▆▄▂▁             ▂
#   ▃██████▆▆▅▆▆▅▇████████▇█▆▆▅▆▅▄▅▂▄▄▄▂▃▄██████████████▇▇▇▅▆▆▆▄ █
#   152 μs        Histogram: log(frequency) by time       358 μs <
#
# Memory estimate: 0 bytes, allocs estimate: 0.

b2 = @benchmark v2_only_when_M_is_10($rez, $D)
display(b2)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):   83.300 μs …  22.182 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     130.000 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   132.251 μs ± 283.493 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

#   ▃ ▅                 ▂▄▃▁▅▄ ▅█▃▂▅▃▁▁                         ▁ ▁
#   █▆█▇▇▅▇▆▄▆▆▅▄▃▃▅▇▇███████████████████▇▇█▅▆▆▆▅▅▅▆▆▅▅▂▂▅▄▅▄▄▄▄█ █
#   83.3 μs       Histogram: log(frequency) by time        184 μs <

#  Memory estimate: 160 bytes, allocs estimate: 1.

# How to construct a function like version 2 so that it works for all M?
# Maybe a macro that will unroll the loop by hand and produce many functions?

I would like to construct methods for sensitive_function! that would do the same thing, with the same performance gains, but for any size M of the (static) vector rez. Is that possible? I was thinking about a macro and code generation, but I have never written one, so I have come for help.

What would be perfect is a macro that would produce methods like:

sensitive_function!(rez::MVector{1,Float64},D)
sensitive_function!(rez::MVector{2,Float64},D)
sensitive_function!(rez::MVector{3,Float64},D)
sensitive_function!(rez::MVector{4,Float64},D)
...

until a given maximum, after which the default version would handle the rest. Maybe this hack is not viable for values of M that are too big, when the other loop order is still relevant, in which case we would fall back to the previous version.
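Something like the following @eval loop could generate one specialized method per size. This is only a sketch of the pattern: the name powers_mean! is a placeholder, and the body is the naive O(M·N) computation rather than the fast unrolled one.

```julia
using StaticArrays

# Illustrative only: generate one method per static size via @eval.
# `powers_mean!` is a placeholder name; the body is the naive computation,
# just to keep the sketch self-contained.
for m in 1:4
    @eval function powers_mean!(rez::MVector{$m,Float64}, D)
        N = length(D)
        for k in 1:$m
            rez[k] = sum(exp.(.-D) .* D .^ (k - 1)) / N
        end
        return nothing
    end
end

rez = MVector{3,Float64}(zeros(3))
powers_mean!(rez, [1.0, 2.0])
```

A generic fallback method (without the MVector{m,...} constraint) would then catch all larger sizes.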

Does this help?
help?> Base.Cartesian.@nexprs
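For a quick illustration of what @nexprs does (note the naming convention: only identifiers written with a trailing underscore-index, like r_k, get renamed; a plain rk is left untouched):

```julia
using Base.Cartesian: @nexprs

# @nexprs n k -> expr emits n copies of expr with k = 1, ..., n;
# r_k is renamed to r_1, r_2, r_3 in the three copies.
function nexprs_demo()
    @nexprs 3 k -> r_k = k^2
    return (r_1, r_2, r_3)
end

nexprs_demo()  # (1, 4, 9)
```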

1 Like

It might. First time writing a macro; currently reading the docs. I have the following:

for valM = (2, 3, 4)
    eval(quote

        function test(rez::MVector{$valM,T}, D) where T
            N = length(D)
            r1 = 0.0
            @macroexpand Base.Cartesian.@nexprs $valM i -> ri = 0.0
            @tturbo for i in 1:N
                Dᵢ = D[i]
                eᵢ = exp(-Dᵢ)
                r1 += eᵢ
                @macroexpand Base.Cartesian.@nexprs ($valM-1) j -> eᵢ *= Dᵢ; rj += eᵢ;
            end
            rez .= [r1,r2,r3,r4,r5,r6,r₇,r₈,r₉,r1₀] ./ N
            return nothing
        end
    end)
end

Which errors out yelling ERROR: LoadError: LoadError: TypeError: in typeassert, expected Symbol, got a value of type GlobalRef. I think I am mixing up code generation and macro creation. I do not want a macro exported from my code; I only want the functions to exist.

Second attempt. I think this is better, but still not working:

using Base.Cartesian: @nexprs
function test(rez::MVector{M,T}, D) where {M,T}
    N = length(D)
    r1 = 0.0
    @nexprs M i -> ri = 0.0
    @tturbo for i in 1:N
        Dᵢ = D[i]
        eᵢ = exp(-Dᵢ)
        r1 += eᵢ
        @nexprs (M-1) j -> eᵢ *= Dᵢ; rj += eᵢ;
    end
    rez .= @nexprs M i -> ri/N,
    return nothing
end

Still yells at me, but with a different message: ERROR: LoadError: LoadError: MethodError: no method matching var"@nexprs"(::LineNumberNode, ::Module, ::Symbol, ::Expr). Maybe with @generated? Something copied from our lecture book:

@generated function absmax_auto_pack_N(a, _::Val{N}) where N
    quote
        # Only OK if mod(length(a), $N) == 0
        @assert mod(length(a), $N) == 0
        # Initialization of $N partial results
        Base.Cartesian.@nexprs $N j -> result_j = zero(eltype(a))
        i = 0
        # Computing $N partial results
        @inbounds while i < length(a)
            Base.Cartesian.@nexprs $N j -> begin
                abs(a[i+j]) > result_j && (result_j = abs(a[i+j]))
            end
            i += $N
        end
        # Final computation of the maximum of the partial results
        result = zero(eltype(a))
        Base.Cartesian.@nexprs $N j -> begin
            result_j > result && (result = result_j)
        end
        result
    end
end

But I am never very confident with macro compositions…

1 Like

Perhaps Unrolled.jl can help? It uses @generated under the hood.

julia> using Unrolled, StaticArrays

julia> vec = SVector{3}([100,2,7])
3-element SVector{3, Int64} with indices SOneTo(3):
 100
   2
   7

julia> @unroll function prod_of_later_elts(vec)   # product of all elements except the first
           prod = 1
           @unroll for i in 1:length(vec)
               if i > 1
                   prod *= vec[i]
               end
           end
           return prod
       end
prod_of_later_elts_unrolled_expansion_ (generic function with 1 method)

julia> prod_of_later_elts(vec)
14

julia> @code_unrolled prod_of_later_elts(vec)
quote
    prod = 1
    begin
        let i = 1
            if i > 1
                prod *= vec[i]
            end
        end
        let i = 2
            if i > 1
                prod *= vec[i]
            end
        end
        let i = 3
            if i > 1
                prod *= vec[i]
            end
        end
    end
    return prod
end

julia> @btime prod_of_later_elts($vec)
1.538 ns (0 allocations: 0 bytes)
14

EDIT: Interestingly, even without @unroll I get the same performance and the same @code_llvm, so it looks like the loop was unrolled by the compiler in this case.

The thing is that I need this macro to happen before the resolution of @tturbo, which probably does many things to my code to vectorize it correctly. Can I somehow ensure that my expansions happen before @tturbo?

I'm not sure that @unroll will work in my case, as I want specific variable names that are local, and I do not want to access a vector.

OK, this is clearly not working the way I think it does:

julia> @macroexpand @nexprs 4 i -> begin
           ri = 0.0
       end
quote
    begin
        #= REPL:1 =#
        #= REPL:2 =#
        ri = 0.0
    end
    begin
        #= REPL:1 =#
        #= REPL:2 =#
        ri = 0.0
    end
    begin
        #= REPL:1 =#
        #= REPL:2 =#
        ri = 0.0
    end
    begin
        #= REPL:1 =#
        #= REPL:2 =#
        ri = 0.0
    end
end


I think I kinda need help.

It's a bit hard to understand your code. What is the generic (but very slow) version of v2_only_when_M_is_10 that works for any M? We'll likely have a better time optimizing that function than working backwards from v2_only_when_M_is_10.

1 Like

Of course, I hear you. Here are two different expressions which compute the same thing, but quite slowly:

one_liner(D, m) = [sum(exp.(.-D) .* D.^(k-1))/length(D) for k in 1:m]

function loop_version(D, m)
    rez = zeros(m)
    N = length(D)
    for k in 1:m
        rez[k] = sum(exp.(-D) .* D.^(k-1))/N
    end
    return rez
end

The difference from the version in the first post is that here I recompute the powers of D at each iteration, while in the first post I compute them incrementally.

You can of course check that these return the same thing. The fastest version I have at the moment is v2_only_when_M_is_10, so I wanted to focus on generalizing that one a little bit, but of course if you have a faster one, I'll take it. Tell me if it is still not clear.
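To make the equivalence of the two schemes explicit, here is a small self-contained check (my own illustration, not code from the original posts):

```julia
# The incremental scheme carries e = exp(-d) * d^(k-1) across iterations
# instead of recomputing the power from scratch each time.
function compare_schemes(d, M)
    incremental = Float64[]
    e = exp(-d)
    for k in 1:M
        push!(incremental, e)
        e *= d
    end
    direct = [exp(-d) * d^(k-1) for k in 1:M]
    return incremental ≈ direct
end

compare_schemes(1.7, 4)  # true
```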

1 Like

Here is one way to use Base.Cartesian macros to explicitly unroll your loops in the same way you did manually:

const C = Base.Cartesian
@generated function v2(D, ::Val{M}) where {M}
    quote
        C.@nexprs $M k -> r_k = 0.0
        for i in 1:length(D)
            Di = D[i]
            ei = exp(-Di)
            C.@nexprs $M k -> begin
                r_k += ei
                ei *= Di
            end
        end
        C.@ntuple $M k -> r_k / length(D)
    end
end

We can check that it generates the same kind of code you hand-wrote:

julia> @code_lowered v2(D, Val(4))
lowered code
CodeInfo(
@ REPL:1 within `v2`
┌ @ REPL:1 within `macro expansion`
1 ─│       Core.NewvarNode(:(val))
│  └
│  ┌ @ REPL:3 within `macro expansion`
│  │       r_1 = 0.0
│  │       r_2 = 0.0
│  │       r_3 = 0.0
│  │       r_4 = 0.0
│  └
│  ┌ @ REPL:4 within `macro expansion`
│  │ %7  = Main.length(D)
│  │ %8  = 1:%7
│  │       @_4 = Base.iterate(%8)
│  │ %10 = @_4 === nothing
│  │ %11 = Base.not_int(%10)
└──│       goto #4 if not %11
2 ┄│ %13 = @_4
│  │       i = Core.getfield(%13, 1)
│  │ %15 = Core.getfield(%13, 2)
│  └
│  ┌ @ REPL:5 within `macro expansion`
│  │       Di = Base.getindex(D, i)
│  └
│  ┌ @ REPL:6 within `macro expansion`
│  │ %17 = -Di
│  │       ei = Main.exp(%17)
│  └
│  ┌ @ REPL:8 within `macro expansion`
│  │       r_1 = r_1 + ei
│  └
│  ┌ @ REPL:9 within `macro expansion`
│  │       ei = ei * Di
│  └
│  ┌ @ REPL:8 within `macro expansion`
│  │       r_2 = r_2 + ei
│  └
│  ┌ @ REPL:9 within `macro expansion`
│  │       ei = ei * Di
│  └
│  ┌ @ REPL:8 within `macro expansion`
│  │       r_3 = r_3 + ei
│  └
│  ┌ @ REPL:9 within `macro expansion`
│  │       ei = ei * Di
│  └
│  ┌ @ REPL:8 within `macro expansion`
│  │       r_4 = r_4 + ei
│  └
│  ┌ @ REPL:9 within `macro expansion`
│  │       ei = ei * Di
│  │       @_4 = Base.iterate(%8, %15)
│  │ %28 = @_4 === nothing
│  │ %29 = Base.not_int(%28)
└──│       goto #4 if not %29
3 ─│       goto #2
4 ┄│       val = nothing
│  │       $(Expr(:inbounds, :pop))
│  │       val
│  └
│  ┌ @ REPL:12 within `macro expansion`
│  │ %35 = r_1
│  │ %36 = Main.length(D)
│  │ %37 = %35 / %36
│  │ %38 = r_2
│  │ %39 = Main.length(D)
│  │ %40 = %38 / %39
│  │ %41 = r_3
│  │ %42 = Main.length(D)
│  │ %43 = %41 / %42
│  │ %44 = r_4
│  │ %45 = Main.length(D)
│  │ %46 = %44 / %45
│  │ %47 = Core.tuple(%37, %40, %43, %46)
└──│       return %47
└
)

And we can also check that it gives the same results as a naive version (up to round-off errors):

v1(D, M) = ntuple(M) do k
    sum(exp.(.-D) .* D.^(k-1))/length(D)
end
julia> using BenchmarkTools

julia> D = rand(1000);

julia> @btime v1($D, 4)
27.832 μs (5 allocations: 31.80 KiB)
(0.6285852677117515, 0.2655407488267991, 0.1637959117518052, 0.11703117408527089)

julia> @btime v2($D, Val(4))
6.037 μs (0 allocations: 0 bytes)
(0.6285852677117516, 0.265540748826799, 0.16379591175180525, 0.11703117408527078)

That being said, I'm not sure whether this will compose well with @tturbo… unless LoopVectorization does something to expand your code before it works its magic. If things don't work out too well in combination with @tturbo, there might still be ways to explicitly handle the macro expansion order ourselves, but let's not get into this unless it's necessary.

4 Likes

Thanks a lot for taking the time. I've adapted it a little to keep the same interface I had, and here are the results on my machine (with reworked code so that we can start from the same base):

using StaticArrays
using LoopVectorization
using BenchmarkTools

M = 10
N = 100000
D = exp.(randn(N))
rez = MVector{M,Float64}(zeros(M))

function v1(rez, D)
    rez .= [sum(exp.(.-D) .* D.^(k-1))/length(D) for k in 1:length(rez)]
    return nothing
end

v1(rez,D)
truth = deepcopy(rez)
@benchmark v1($rez, $D)
# BenchmarkTools.Trial: 186 samples with 1 evaluation.
#  Range (min … max):  22.882 ms … 40.312 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     26.545 ms              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   26.874 ms ±  2.500 ms  ┊ GC (mean ± σ):  1.35% ± 2.84%

#      ▂ ▁▂▄  ▁  ▁ ██▇ ▂▂ ▅▂▅▅▂  ▁     ▂
#   ▃▆▃█▆████████████████▆█████▅▅█▅▆▅▃▅█▆▁▁▅▁▃▁▁▃▃▅▁▁▁▃▁▁▁▃▁▁▁▃ ▃
#   22.9 ms         Histogram: frequency by time        34.5 ms <

#  Memory estimate: 7.63 MiB, allocs estimate: 21.

function exp_and_mean!(slack, D, N)
    zz = zero(eltype(D))
    @tturbo for i in 1:N
        slack[i] = exp(-D[i])
        zz += slack[i]
    end
    zz /= N
    return zz
end

function prod_and_mean!(slack, D, N)
    zz = zero(eltype(D))
    @tturbo for i in 1:N
        slack[i] *= D[i]
        zz += slack[i]
    end
    zz /= N
    return zz
end

function original!(rez, slack, D)
    M = length(rez)
    N = length(D)
    rez[1] = exp_and_mean!(slack, D, N)
    for k in 2:M
        rez[k] = prod_and_mean!(slack, D, N)
    end
    return rez
end
slack = similar(D)
original!(rez,slack,D)
@assert rez ≈ truth
@benchmark original!($rez, $slack, $D)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):  104.800 μs … 805.000 μs  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     132.900 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   160.863 μs ±  50.199 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

#     ▂█▆▄                                ▁
#   ▃▇█████▄▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▃▂▃▃▄▄▃▄▄▇▆▅█▄▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
#   105 μs           Histogram: frequency by time          287 μs <

#  Memory estimate: 0 bytes, allocs estimate: 0.

function v2_only_when_M_is_10(rez, D)
    M = length(rez)
    @assert M == 10
    N = length(D)
    r1 = r2 = r3 = r4 = r5 = r6 = r7 = r8 = r9 = r10 = 0.0
    @tturbo for i in 1:N
        Dᵢ = D[i]
        eᵢ = exp(-Dᵢ)
        r1 += eᵢ
        eᵢ *= Dᵢ
        r2 += eᵢ
        eᵢ *= Dᵢ
        r3 += eᵢ
        eᵢ *= Dᵢ
        r4 += eᵢ
        eᵢ *= Dᵢ
        r5 += eᵢ
        eᵢ *= Dᵢ
        r6 += eᵢ
        eᵢ *= Dᵢ
        r7 += eᵢ
        eᵢ *= Dᵢ
        r8 += eᵢ
        eᵢ *= Dᵢ
        r9 += eᵢ
        eᵢ *= Dᵢ
        r10 += eᵢ
    end
    rez .= [r1, r2, r3, r4, r5, r6, r7, r8, r9, r10] ./ N
    return nothing
end
v2_only_when_M_is_10(rez,D)
@assert rez ≈ truth
@benchmark v2_only_when_M_is_10($rez, $D)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max):  53.300 μs … 613.700 μs  ┊ GC (min … max): 0.00% … 0.00%
# Time  (median):     62.900 μs               ┊ GC (median):    0.00%
# Time  (mean ± σ):   71.640 μs ±  16.417 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

#  ▃ ▃ ▂▃▇▄█▂▅▁▃▂▄▁▁ ▁ ▁    ▁ ▁    ▄▂ ▂ ▃▃ ▅▄ ▃▄ ▂▃  ▂▃  ▄▂  ▁▁ ▂
#  ███▅█████████████▇██████▇█████▇▇██▅█████████████▇▅██▇▇██▆▇██ █
#  53.3 μs       Histogram: log(frequency) by time       102 μs <

# Memory estimate: 160 bytes, allocs estimate: 1.

const C = Base.Cartesian
@generated function v2_ffevote(rez::MVector{M,T}, D) where {M,T}
    quote
        N = length(D)
        C.@nexprs $M k -> r_k = 0.0
        for i in 1:length(D)
            Di = D[i]
            ei = exp(-Di)
            C.@nexprs $M k -> begin
                r_k += ei
                ei *= Di
            end
        end
        C.@nexprs $M k -> rez[k] = r_k / N
        return nothing
    end
end
v2_ffevote(rez,D)
@assert rez ≈ truth
@benchmark v2_ffevote($rez, $D)
# BenchmarkTools.Trial: 5767 samples with 1 evaluation.
#  Range (min … max):  790.100 μs …  2.033 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     830.700 μs              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   862.798 μs ± 97.176 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

#   ▂ █▃▆▄▅▄▄▃▃▃▂▂▁▁▁▁▁▁▁▁▁▁                                     ▁
#   █▇█████████████████████████████▇▇▆▇▇▆▆▆▅▆▆▆▅▆▅▅▄▅▃▁▄▅▅▅▄▁▁▃▅ █
#   790 μs        Histogram: log(frequency) by time      1.31 ms <

#  Memory estimate: 0 bytes, allocs estimate: 0.

@generated function v2_ffevote_turbo(rez::MVector{M,T}, D) where {M,T}
    quote
        N = length(D)
        C.@nexprs $M k -> r_k = 0.0
        @tturbo for i in 1:length(D)
            Di = D[i]
            ei = exp(-Di)
            C.@nexprs $M k -> begin
                r_k += ei
                ei *= Di
            end
        end
        C.@nexprs $M k -> rez[k] = r_k / N
        return nothing
    end
end
v2_ffevote_turbo(rez,D)
@assert rez ≈ truth
@benchmark v2_ffevote_turbo($rez, $D)
# Range (min … max):   65.500 μs … 844.100 μs  ┊ GC (min … max): 0.00% … 0.00%
# Time  (median):     119.700 μs               ┊ GC (median):    0.00%
# Time  (mean ± σ):   114.916 μs ±  25.891 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

#  ▄▂▃▁▄▁▄▅▂▃▁     ▂▃ ▃ ▄   ▅▅▆▆▂▇▃▅█▃▃▃▁                        ▂
#  █████████████▇▇▇██▇█████████████████████▇▇▆▇▆▇▇▇▇▅▇▆▅▅▇▆▄▅▄▄▅ █
#  65.5 μs       Histogram: log(frequency) by time        182 μs <

# Memory estimate: 0 bytes, allocs estimate: 0.

In summary, although the @code_lowered output looks like what we wanted, it interferes a little with @tturbo and takes about 50% more time (median and mean).

If there is indeed a way to control the order of macro expansion, and if it is not too painful, maybe there is still something to do here… However, looking at the code_lowered output myself, I would say they are the same, which is clearly not true: I'm missing the difference…

Edit: or maybe they are the same; the benchmark timings change a lot from one run to the next… Is there a way to check whether two functions have the same assembly code?
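For reference, one way to compare is to capture code_native output into strings. This is just a sketch with two trivial stand-in functions (not the ones from this thread); debuginfo=:none drops source-line comments that would otherwise always differ, but the labels still embed the function names, so in practice eyeballing a diff is more robust than string equality.

```julia
using InteractiveUtils

# Two hypothetical functions to compare:
f(x) = x + x
g(x) = 2x

buf1, buf2 = IOBuffer(), IOBuffer()
code_native(buf1, f, Tuple{Int}; debuginfo=:none)
code_native(buf2, g, Tuple{Int}; debuginfo=:none)
asm_f = String(take!(buf1))
asm_g = String(take!(buf2))
# Compare asm_f and asm_g (e.g. print both, or diff them line by line).
```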

Looks like the generated code in the middle of the loop:

C.@nexprs $M k -> begin
    r_k += ei
    ei *= Di
end

does one ei *= Di too many at the end (the loop body should end with an r_k += ei). But this is me being silly…

@Elrod Any ideas how to get an “unroll” macro to compose with @tturbo?

It should already do that. Just looking at the code, they should be the same, but I can double-check in the morning.

1 Like

OK, I ended up with the following to fix the fact that the last ei *= Di was not needed:

const C = Base.Cartesian
@generated function v2_ffevote_turbo(rez::MVector{M,T}, D) where {M,T}
    quote
        N = length(D)
        C.@nexprs $M k -> r_k = 0.0
        @tturbo for i in 1:length(D)
            Di = D[i]
            ei = exp(-Di)
            r_1 += ei
            C.@nexprs $(M-1) k -> begin
                ei *= Di
                r_{k+1} += ei
            end
        end
        C.@nexprs $M k -> rez[k] = r_k / N
        return nothing
    end
end

Works perfectly for M <= 10, with the announced 50% speedup. However, for e.g. M=15 or M=20, I never managed to compile the function: it takes forever and even crashed Julia sometimes. I suspect this is because @tturbo is under a lot of stress due to the number of variables to handle; what do you think @Elrod?

I actually already get segfaults for M=10.

Basically, @tturbo is trying to store more variables in the buffer than actually fit, causing the crashes.
For each variable, it requires a number of bytes equal to the SIMD width. That is 64 for AVX512, or 32 for AVX2.
It also needs some of the storage for other things, so I’d expect only 7 variables for AVX512 and 15 for AVX2. I’d have to double check, but if it’s already crashing for M=15, I guess it’s using a bit more than that.
You could increase the buffer.
Or maybe, because it’s possible to check at compile time if it fits, it could use a check itself.

The fundamental problem with LoopVectorization.jl here is that it currently only handles “loop independent” dependencies, not “loop carried” dependencies (aside from reductions).
Loop-independent dependencies are those that exist only within a single loop iteration, and thus don’t stop you from reordering the iterations arbitrarily.
Loop-carried dependencies are those between iterations.

If it handled loop carried dependencies, you could write the loading/storing to rez as part of the loop and still get correct answers. It should then also be able to optimize the situation better when M is large (by not unrolling it entirely).
I’m working on support for this, but I’d expect it to take a long time, as it is part of a ground up rewrite of the library.
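To illustrate the distinction with a minimal sketch (my own example, not code from LoopVectorization):

```julia
# Loop-independent: everything used in iteration i is produced in iteration i,
# so the iterations can be reordered or vectorized freely.
function independent!(y, x)
    for i in eachindex(x, y)
        t = 2 * x[i]
        y[i] = t + 1
    end
    return y
end

# Loop-carried: `e` produced in one iteration feeds the next one (like the
# running power of D[i] above), so the iteration order matters.
function carried(x)
    e, s = 1.0, 0.0
    for v in x
        e *= v   # carried across iterations
        s += e
    end
    return s
end
```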

3 Likes

Yep, the problem is that here the outer loop has no dependency, and the inner one (that I unroll) has loop-carried dependencies, which is why I am unrolling it. Thanks for the explanations anyway!

I’ll check the buffer size hack tomorrow and come back to you.

I tried deving the package to increase the buffer size, and it works perfectly, with huge speedups:

using Statistics: mean

M = 30
N = 1000
D = exp.(randn(N))
slack = similar(D)
n_reps = 1000
rates = zeros(M)
for m in 1:M
    rr = MVector{m+1,Float64}(zeros(m+1))
    x = mean(@elapsed original!(rr, slack, D) for i in 1:n_reps)
    y = mean(@elapsed v2_ffevote_turbo(rr, D) for i in 1:n_reps)
    rates[m] = x/y
end

using UnicodePlots
lineplot(rates .- 1)
# ┌────────────────────────────────────────┐
# 5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
#   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠀⠀⠀⠀⠀│
#   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣼⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣼⠀⠀⠀⠀⠀│
#   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢰⠉⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡟⡄⠀⠀⠀⠀│
#   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠇⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⡇⠀⠀⠀⠀│
#   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠦⡀⠀⡜⠀⠀⢸⢀⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡸⠀⢇⠀⠀⠀⠀│
#   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢰⠃⠀⠈⠉⠁⠀⠀⠸⠎⢸⠀⠀⠀⡖⢤⠀⠀⠀⠀⠀⠀⡇⠀⢸⠀⠀⠀⠀│
#   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡜⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢇⠀⡸⠀⠸⡀⠀⠀⢀⠀⡸⠀⠀⢸⡀⠀⠀⠀│
#   │⠀⠀⠀⠀⠀⠀⡔⠒⡆⠀⢠⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠢⠇⠀⠀⡇⠀⡎⠁⠉⠃⠀⠀⠀⡇⠀⠀⡠│
#   │⠀⠀⠀⠀⠀⢰⠁⠀⠘⡤⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢣⢰⠁⠀⠀⠀⠀⠀⠀⢱⠀⡰⠁│
#   │⠀⡄⠀⢀⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⡎⠀⠀⠀⠀⠀⠀⠀⠸⣠⠃⠀│
#   │⠀⠸⡀⡜⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠃⠀⠀⠀⠀⠀⠀⠀⠀⠏⠀⠀│
#   │⠀⠀⢳⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
#   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
# 0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
#   └────────────────────────────────────────┘
#   ⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀30

Questions:

• Is there a way to increase the buffer size without deving the ThreadingUtilities package?
• What are the implications of such a buffer increase?
• Would JuliaSIMD consider increasing it in the registered package?

Anyway, thanks to you all, I've won a few days of computations!

Commenting on the original point, you don't need a macro or manually expanded code for unrolling. A recent PR to Julia, Faster reduction for min max. by N5N3 · Pull Request #43725 · JuliaLang/julia · GitHub, is a good self-contained concrete example of how you can write non-trivial “unrolling” with just plain functions.

Since functions compose well, it’s also easy to extend this idea to highly composable frameworks; e.g., GitHub - JuliaFolds/LoopRecipes.jl (experimental)
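As a minimal sketch of the idea, unrolling can come from plain recursion on tuples, which the compiler fully expands for small lengths (my own illustration, simpler than the linked PR):

```julia
# Recursive tuple reduction: each call peels off one element, and for small
# tuples the compiler inlines the whole chain, leaving no runtime loop.
unrolled_sum(t::Tuple{}) = 0.0
unrolled_sum(t::Tuple) = first(t) + unrolled_sum(Base.tail(t))

unrolled_sum((1.0, 2.0, 3.0))  # 6.0
```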

3 Likes

Thanks for the reminder, this is really a good piece of advice.

I actually tried suggesting an ntuple-based solution right from the beginning of the thread, but in this particular instance I couldn’t see how to make it work… The main issue I encountered was how to create an ntuple containing the successive powers of Di (i.e. the values stored in ei in the original code).

I think what we’re missing here is some form of scanl function that would be ntuple-friendly… but I didn’t find one, and didn’t see any way to write one without generated functions and Base.Cartesian macros, which obviously defeats the purpose of avoiding such techniques. Did I miss something?

Since a piece of code is worth a thousand words, my original attempt was along the lines of:

const C = Base.Cartesian

"""
    scanl(op, x, init)

`x` being an N-tuple, return an (N+1)-tuple containing the reductions of all
front sequences of `x` with binary operation `op` and initialization element
`init`:

    scanl(⊙, x, o)
    == (o,
        o ⊙ x[1],
        (o ⊙ x[1]) ⊙ x[2],
        ((o ⊙ x[1]) ⊙ x[2]) ⊙ x[3],
        ...)

For example, `scanl(+, x, 0)` returns the cumulative sum of all elements in `x`:

    julia> scanl(+, (1,2,3,4), 0)
    (0, 1, 3, 6, 10)
"""
@generated function scanl(op, x::NTuple{N,T}, init) where {N, T}
    quote
        r_1 = init
        C.@nexprs $N i -> r_{i+1} = op(r_i, x[i])
        C.@ntuple $(N+1) i -> r_i
    end
end

function v3(D, ::Val{M}) where {M}
    r = ntuple(_->0.0, M)
    for i in 1:length(D)
        Di = ntuple(_->D[i], Val(M-1))
        e = scanl(*, Di, exp(-D[i]))
        r = r .+ e
    end
    r ./ length(D)
end

This seems to be as efficient as unrolling by hand directly, but of course it is only useful if such a scanl function can be found somewhere or re-implemented without macros.

1 Like

The function I was looking for is accumulate, so here is an implementation of the original function that should be unrolled and does not depend on any macro or generated function.

function v3(D, ::Val{M}) where {M}
    r = ntuple(_->0.0, Val(M))
    @inbounds for i in 1:length(D)
        Di = (exp(-D[i]), ntuple(_->D[i], Val(M-1))...)
        e = accumulate(*, Di)
        r = r .+ e
    end
    r ./ length(D)
end

It looks like it is as fast as the @generated version above:

julia> D = rand(1000);

julia> @assert all(v1(D, Val(4)) .≈ v2(D, Val(4)) .≈ v3(D, Val(4)))

julia> using BenchmarkTools

julia> @btime v1($D, Val(4))
  29.818 μs (4 allocations: 31.75 KiB)
(0.6305786671481367, 0.26563175951466034, 0.16097293038735294, 0.1138187586342532)

julia> @btime v2($D, Val(4))
  6.035 μs (0 allocations: 0 bytes)
(0.6305786671481375, 0.26563175951466045, 0.16097293038735302, 0.11381875863425313)

julia> @btime v3($D, Val(4))
  5.891 μs (0 allocations: 0 bytes)
(0.6305786671481375, 0.26563175951466045, 0.16097293038735302, 0.11381875863425313)
complete code for reference
v1(D, M) = ntuple(M) do k
    sum(exp.(.-D) .* D.^(k-1))/length(D)
end

const C = Base.Cartesian
@generated function v2(D, ::Val{M}) where {M}
    quote
        C.@nexprs $M k -> r_k = 0.0
        @inbounds for i in 1:length(D)
            Di = D[i]
            ei = exp(-Di)
            C.@nexprs $M k -> begin
                r_k += ei
                ei *= Di
            end
        end
        C.@ntuple $M k -> r_k / length(D)
    end
end

function v3(D, ::Val{M}) where {M}
    r = ntuple(_->0.0, Val(M))
    @inbounds for i in 1:length(D)
        Di = (exp(-D[i]), ntuple(_->D[i], Val(M-1))...)
        e = accumulate(*, Di)
        r = r .+ e
    end
    r ./ length(D)
end

D = rand(1000);

@assert all(v1(D, Val(4)) .≈ v2(D, Val(4)) .≈ v3(D, Val(4)))

using BenchmarkTools
@btime v1($D, Val(4))
@btime v2($D, Val(4))
@btime v3($D, Val(4))
3 Likes

Thanks @ffevote, but this is awfully slow (three orders of magnitude slower than the others: milliseconds rather than microseconds for M=30 on my machine), because it does not allow @tturbo and does not SIMD correctly.

Replacing @inbounds by @tturbo gives:

julia> function v3(D, ::Val{M}) where {M}
           r = ntuple(_->0.0, Val(M))
           @tturbo for i in 1:length(D)
               Di = (exp(-D[i]), ntuple(_->D[i], Val(M-1))...)
               e = accumulate(*, Di)
               r = r .+ e
           end
           r ./ length(D)
       end
(exp(-(D[i])), ntuple((_->begin
D[i]
end), Val(M - 1))...)
Stacktrace:
 add_operation!(ls::LoopVectorization.LoopSet, LHS::Symbol, RHS::Expr, elementbytes::Int64, position::Int64)
@ LoopVectorization ~\.julia\packages\LoopVectorization\kVenK\src\modeling\graphs.jl:1130
 add_assignment!(ls::LoopVectorization.LoopSet, LHS::Symbol, RHS::Expr, elementbytes::Int64, position::Int64)
@ LoopVectorization ~\.julia\packages\LoopVectorization\kVenK\src\modeling\graphs.jl:1235
 push!(ls::LoopVectorization.LoopSet, ex::Expr, elementbytes::Int64, position::Int64, mpref::Nothing)
@ LoopVectorization ~\.julia\packages\LoopVectorization\kVenK\src\modeling\graphs.jl:1271
in expression starting at c:\Users\u009192\Documents\project_ggchd\ggchd_julia\test_macro.jl:56
in expression starting at c:\Users\u009192\Documents\project_ggchd\ggchd_julia\test_macro.jl:54

which I do not understand.

On the other hand, this is a beautiful piece of code.