Help optimizing simple example

Hi,

I would appreciate it if anyone could suggest an optimized version of my naïve implementation (via a loop) of the example below, which computes the matrix W. This is part of an inner algorithm that is repeated thousands of times, so I would really appreciate any help tuning it.

Here we have dummy data w and x, and the contents of the function transf are taken as given. The final object is W, an N*T*M by k matrix. The function transf is just an example; the only important thing about it is that it takes a number x and returns a vector of length k.

  N = 1000
  T = 3
  M = 30
  k = 5

  w = rand(Normal(0,1), N, T)
  x = rand(Normal(0,1), N, M)

  function transf(x)

      transform = fill(NaN, k)

      for j ∈ 1:k
          transform[j] = x^j
      end

      return transform
  end

  W_wide = fill(NaN, M, N, T, k)

  for m ∈ 1:M
  
      for i ∈ 1:N
          W_wide[m,i,:,:] = w[i,:]*transpose(transf(x[i,m]))
      end
  end

  W = reshape(W_wide, N*T*M, k)

You should probably reorder everything so that the fastest-changing dimension (k) comes first: Julia arrays are column-major, so that is the cache-friendly access order. You might also gain something from StaticArrays, since k and T are small. And slices like w[i, :] should be view(w, i, :) (or wrapped in @views), because otherwise every slice allocates a copy.
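The slicing point is easy to see in isolation. A minimal sketch (not the poster's code; the variable names are mine):

```julia
w = randn(1000, 3)

row_copy = w[1, :]        # slicing allocates a fresh 3-element vector
row_view = @view w[1, :]  # a SubArray aliasing w's memory; no copy

old = w[1, 1]
w[1, 1] = old + 1         # mutate the parent array
row_view[1] == old + 1    # the view sees the change
row_copy[1] == old        # the copy does not
```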


Nice exercise for a benchmark. The MWE

using BenchmarkTools
using Random
using Distributions
using LoopVectorization
using Tullio

function transf(x, k)
    transform = fill(NaN, k)
    for j ∈ 1:k
        transform[j] = x^j
    end
    return transform
end

function test1()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = fill(NaN, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            W_wide[m,i,:,:] = w[i,:]*transpose(transf(x[i,m], k))
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test2()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            xj = x1 = x[i,m]
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*xj
                end
                xj *= x1
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test3()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    @turbo for m ∈ 1:M
        for i ∈ 1:N
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*x[i,m]^j
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test4()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    @tullio W_wide[m,i,t,j] := w[i,t]*x[i,m]^j (j ∈ 1:k)
    W = reshape(W_wide, N*T*M, k)
end

function test5()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    @tturbo for m ∈ 1:M
        for i ∈ 1:N
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*x[i,m]^j
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

Random.seed!(42)
result1 = test1()
Random.seed!(42)
result2 = test2()
@assert result2 β‰ˆ result1
Random.seed!(42)
result3 = test3()
@assert result3 β‰ˆ result2
Random.seed!(42)
result4 = test4()
@assert result4 β‰ˆ result3
Random.seed!(42)
result5 = test5()
@assert result5 β‰ˆ result4

@btime test1() setup=(Random.seed!(42))
@btime test2() setup=(Random.seed!(42))
@btime test3() setup=(Random.seed!(42))
@btime test4() setup=(Random.seed!(42))
@btime test5() setup=(Random.seed!(42))

yields

  9.229 ms (120009 allocations: 15.13 MiB) # baseline
  1.883 ms (8 allocations: 3.69 MiB) # manually optimized
  856.100 ΞΌs (8 allocations: 3.69 MiB) # @turbo
  696.600 ΞΌs (60 allocations: 3.69 MiB) # @tullio
  465.600 ΞΌs (8 allocations: 3.69 MiB) # @tturbo

Using multithreading might be slight overkill here (I’m throwing 5 cores at the problem)…

Edit: I didn’t sort out if there is a better memory access order.
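For reference, one candidate for a better order is to make the size-k index the leading (fastest-varying) dimension and permute back before the final reshape. A self-contained sketch of the manually optimized variant in that layout (randn stands in for the Normal(0,1) draws, and the names are mine):

```julia
N, T, M, k = 1000, 3, 30, 5
w = randn(N, T)
x = randn(N, M)

# j first, so consecutive writes in the innermost loop hit
# consecutive memory in Julia's column-major layout
W_kfirst = Array{Float64}(undef, k, T, N, M)
for m in 1:M, i in 1:N
    xim = x[i, m]
    for t in 1:T
        wit = w[i, t]
        xj = one(xim)
        for j in 1:k              # innermost loop varies the leading index
            xj *= xim             # running power: xj == xim^j
            W_kfirst[j, t, i, m] = wit * xj
        end
    end
end
# permutedims restores the (M, N, T, k) order of the original W_wide
W = reshape(permutedims(W_kfirst, (4, 3, 2, 1)), N*T*M, k)
```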


Thank you! I forgot to add that I need to call transf (since in the actual code it is a much more complicated function), so in practice pulling x[i,m]^j out of the function is off the table for me. If I am not mistaken, that rules out LoopVectorization and @tturbo…

What degrees of freedom do you have regarding the interface of transf? Can you compute the j-th element independently, or do you need to return a complete vector? If you need a vector, are SVectors appropriate (i.e. is size 5 realistic)?

Edit: something like this

function transf2(x, k, j)
    x^j
end

still seems improvable with LoopVectorization and Tullio.
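With such an elementwise interface, the whole nest can be written without ever materializing a vector. A dependency-free sketch, where transf2(x, k, j) = x^j stands in for one element of the real, more complicated transform:

```julia
# hypothetical elementwise transform: j-th element only
transf2(x, k, j) = x^j

N, T, M, k = 4, 3, 2, 5          # tiny sizes just for illustration
w = randn(N, T)
x = randn(N, M)

W_wide = Array{Float64}(undef, M, N, T, k)
for m in 1:M, i in 1:N, j in 1:k, t in 1:T
    W_wide[m, i, t, j] = w[i, t] * transf2(x[i, m], k, j)
end
W = reshape(W_wide, N*T*M, k)
```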

Assuming you need to return at least an SVector, that leaves me with this

using BenchmarkTools
using Random
using Distributions
using LoopVectorization
using Tullio
using StaticArrays

function transf1(x, k)
    transform = fill(NaN, k)
    for j ∈ 1:k
        transform[j] = x^j
    end
    return transform
end

function test1()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = fill(NaN, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            W_wide[m,i,:,:] = w[i,:]*transpose(transf1(x[i,m], k))
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function transf2(x, k)
    SVector((x^j for j ∈ 1:k)...)
end

function test2()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            v = transf2(x[i, m], k)
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*v[j]
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test3()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            v = transf2(x[i, m], k)
            @turbo for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*v[j]
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test4()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            v = transf2(x[i, m], k)
            @tturbo for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*v[j]
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

Random.seed!(42)
result1 = test1()
Random.seed!(42)
result2 = test2()
@assert result2 β‰ˆ result1
Random.seed!(42)
result3 = test3()
@assert result3 β‰ˆ result2
Random.seed!(42)
result4 = test4()
@assert result4 β‰ˆ result3

yielding

  9.086 ms (120009 allocations: 15.13 MiB) # baseline
  2.073 ms (8 allocations: 3.69 MiB) # manually optimized
  2.188 ms (8 allocations: 3.69 MiB) # @turbo
  2.810 ms (8 allocations: 3.69 MiB) # @tturbo

Here LoopVectorization seemingly only allows annotating the inner loops, which no longer helps. And my Tullio-fu is too weak for this construction.
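One detail worth checking in transf2 above: splatting a generator into SVector is not type-stable when k is only known at runtime, since the compiler cannot infer the length. A dependency-free sketch of a length-stable alternative, passing k as a Val so the returned tuple is fully inferred:

```julia
# ntuple with Val(k) lets the compiler see the length,
# so the result is an NTuple{k, Float64} rather than an abstract type
transf_tuple(x, ::Val{k}) where {k} = ntuple(j -> x^j, Val(k))

v = transf_tuple(2.0, Val(5))    # (2.0, 4.0, 8.0, 16.0, 32.0)
```

The same Val trick works with SVector's ntuple-style constructors if StaticArrays is kept in the loop.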
