# Help optimizing simple example

Hi,

I would appreciate it if anyone could suggest an optimized version of my naïve implementation (via a loop) of the example below, which computes the matrix `W`. This is part of an inner algorithm that is repeated thousands of times, so I would really appreciate any help tuning it.

Here we have dummy data `w` and `x`, and we take the contents of function `transf` as given. The final object is `W`, an `N*T*M` by `k` matrix. Function `transf` is just an example; the only important thing about it is that it takes a number `x` and returns a `k`-element vector.

```julia
using Distributions

N = 1000
T = 3
M = 30
k = 5

w = rand(Normal(0, 1), N, T)
x = rand(Normal(0, 1), N, M)

function transf(x)
    transform = fill(NaN, k)  # note: k is captured as a global here
    for j ∈ 1:k
        transform[j] = x^j
    end
    return transform
end

W_wide = fill(NaN, M, N, T, k)

for m ∈ 1:M
    for i ∈ 1:N
        W_wide[m, i, :, :] = w[i, :] * transpose(transf(x[i, m]))
    end
end

W = reshape(W_wide, N*T*M, k)
```

You should probably reorder everything so that the fastest-changing dimension (`k`) comes first: Julia uses column-major storage, so the first index should vary fastest for good cache access. You might also gain something from StaticArrays, since `k` and `T` are small. And slices like `w[i, :]` should be `view(w, i, :)`; otherwise each slice allocates a copy.
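A minimal sketch of what that reordering could look like (the names `build_W` and `transf_s` are made up for illustration; `transf_s` stands in for the real `transf`, with the static size `k` encoded in the type via `Val` so it can return a fixed-size, stack-allocated tuple, much as an `SVector` from StaticArrays would):

```julia
# Hypothetical k-first layout: the innermost loop walks contiguous memory.
# transf_s stands in for the real transf; with StaticArrays installed you
# could return SVector(ntuple(j -> x^j, Val(k))) instead of the bare tuple.
transf_s(x, ::Val{k}) where {k} = ntuple(j -> x^j, Val(k))

function build_W(w, x, ::Val{k}) where {k}
    N, T = size(w)
    M = size(x, 2)
    W_wide = Array{Float64}(undef, k, T, N, M)  # k varies fastest
    for m in 1:M, i in 1:N
        v = transf_s(x[i, m], Val(k))
        wi = view(w, i, :)                      # view: no copy of the row
        for t in 1:T, j in 1:k
            W_wide[j, t, i, m] = wi[t] * v[j]
        end
    end
    # caveat: these rows come out in t-then-i-then-m order, which is NOT
    # the same row order as reshape(W_wide, N*T*M, k) on the M-first array;
    # permute first if the exact row order matters downstream
    return Matrix(transpose(reshape(W_wide, k, :)))
end
```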


A nice exercise for a benchmark. The MWE:

```julia
using BenchmarkTools
using Random
using Distributions
using LoopVectorization
using Tullio

function transf(x, k)
    transform = fill(NaN, k)
    for j ∈ 1:k
        transform[j] = x^j
    end
    return transform
end

function test1()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0, 1), N, T)
    x = rand(Normal(0, 1), N, M)
    W_wide = fill(NaN, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            W_wide[m, i, :, :] = w[i, :] * transpose(transf(x[i, m], k))
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test2()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0, 1), N, T)
    x = rand(Normal(0, 1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            xj = x1 = x[i, m]
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m, i, t, j] = w[i, t] * xj
                end
                xj *= x1
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test3()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0, 1), N, T)
    x = rand(Normal(0, 1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    @turbo for m ∈ 1:M
        for i ∈ 1:N
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m, i, t, j] = w[i, t] * x[i, m]^j
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test4()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0, 1), N, T)
    x = rand(Normal(0, 1), N, M)
    @tullio W_wide[m, i, t, j] := w[i, t] * x[i, m]^j (j ∈ 1:k)
    W = reshape(W_wide, N*T*M, k)
end

function test5()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0, 1), N, T)
    x = rand(Normal(0, 1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    @tturbo for m ∈ 1:M
        for i ∈ 1:N
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m, i, t, j] = w[i, t] * x[i, m]^j
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

Random.seed!(42)
result1 = test1()
Random.seed!(42)
result2 = test2()
@assert result2 ≈ result1
Random.seed!(42)
result3 = test3()
@assert result3 ≈ result2
Random.seed!(42)
result4 = test4()
@assert result4 ≈ result3
Random.seed!(42)
result5 = test5()
@assert result5 ≈ result4

@btime test1() setup=(Random.seed!(42))
@btime test2() setup=(Random.seed!(42))
@btime test3() setup=(Random.seed!(42))
@btime test4() setup=(Random.seed!(42))
@btime test5() setup=(Random.seed!(42))
```

yields

```
  9.229 ms (120009 allocations: 15.13 MiB)  # baseline
  1.883 ms (8 allocations: 3.69 MiB)        # manually optimized
856.100 μs (8 allocations: 3.69 MiB)        # @turbo
696.600 μs (60 allocations: 3.69 MiB)       # @tullio
465.600 μs (8 allocations: 3.69 MiB)        # @tturbo
```

Using multithreading might be slight overkill here (I’m throwing 5 cores at the problem)…

Edit: I didn’t sort out whether there is a better memory-access order.


Thank you! I forgot to add that I need to call `transf` (in the actual code it is a much more complicated function), so in reality taking `x[i,m]^j` out of the function is off the table for me. If I am not mistaken, this limits the use of `LoopVectorization` and `@tturbo`.

What degrees of freedom do you have regarding the interface of `transf`? Can you compute the `j`-th element independently, or do you need to return a complete vector? If you need a vector, are `SVector`s appropriate (i.e., is size 5 realistic)?

Edit: something like this

```julia
function transf2(x, k, j)
    x^j
end
```

would still be amenable to `LoopVectorization` and `Tullio`.
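To make the idea concrete: if the real `transf` can expose such an elementwise method, the kernel no longer needs a temporary vector per `(i, m)` at all. A sketch with a placeholder `transf2` (the in-place helper `fill_W!` is a made-up name for illustration):

```julia
transf2(x, k, j) = x^j  # placeholder for the real elementwise transform

# Hypothetical kernel using the elementwise interface: nothing allocates
# inside the loops, and the loop nest is exactly the shape @turbo/@tullio
# like to consume.
function fill_W!(W_wide, w, x, k)
    M = size(x, 2)
    N, T = size(w)
    for m in 1:M, i in 1:N, j in 1:k, t in 1:T
        W_wide[m, i, t, j] = w[i, t] * transf2(x[i, m], k, j)
    end
    return W_wide
end
```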

Assuming you need to return at least an `SVector`, I am left with this:

```julia
using BenchmarkTools
using Random
using Distributions
using LoopVectorization
using Tullio
using StaticArrays

function transf1(x, k)
    transform = fill(NaN, k)
    for j ∈ 1:k
        transform[j] = x^j
    end
    return transform
end

function test1()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0, 1), N, T)
    x = rand(Normal(0, 1), N, M)
    W_wide = fill(NaN, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            W_wide[m, i, :, :] = w[i, :] * transpose(transf1(x[i, m], k))
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function transf2(x, k)
    SVector((x^j for j ∈ 1:k)...)
end

function test2()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0, 1), N, T)
    x = rand(Normal(0, 1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            v = transf2(x[i, m], k)
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m, i, t, j] = w[i, t] * v[j]
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test3()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0, 1), N, T)
    x = rand(Normal(0, 1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            v = transf2(x[i, m], k)
            @turbo for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m, i, t, j] = w[i, t] * v[j]
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test4()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0, 1), N, T)
    x = rand(Normal(0, 1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            v = transf2(x[i, m], k)
            @tturbo for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m, i, t, j] = w[i, t] * v[j]
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

Random.seed!(42)
result1 = test1()
Random.seed!(42)
result2 = test2()
@assert result2 ≈ result1
Random.seed!(42)
result3 = test3()
@assert result3 ≈ result2
Random.seed!(42)
result4 = test4()
@assert result4 ≈ result3
```

yielding

```
  9.086 ms (120009 allocations: 15.13 MiB)  # baseline
  2.073 ms (8 allocations: 3.69 MiB)        # manually optimized
  2.188 ms (8 allocations: 3.69 MiB)        # @turbo
  2.810 ms (8 allocations: 3.69 MiB)        # @tturbo
```

Here `LoopVectorization` seemingly allows annotating only the inner loops, which no longer helps. And my `Tullio`-fu is too weak for this construction.
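One more possible gain, independent of the loop macros: `SVector((x^j for j ∈ 1:k)...)` splats a generator whose length the compiler cannot see, so the construction is not type stable when `k` is a runtime value. Encoding `k` in the type should avoid that; a sketch with a plain `NTuple` (which is what an `SVector` wraps), so it runs without StaticArrays:

```julia
# k as a type parameter: the compiler knows the result length statically,
# so the tuple stays on the stack. With StaticArrays you could wrap it:
# SVector(ntuple(j -> x^j, Val(k))).
transf_val(x, ::Val{k}) where {k} = ntuple(j -> x^j, Val(k))
```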
