Help optimizing simple example

Hi,

I would appreciate it if anyone could suggest an optimized version of my naïve implementation (via a loop) of the example below, which computes the matrix W. This is part of an inner algorithm that is repeated thousands of times, so I would really appreciate any help tuning it.

Here we have dummy data w and x, and the contents of the function transf are taken as given. The final object is W, an N*T*M by k matrix. The function transf is just an example; the only important thing about it is that it takes a number x and returns a vector of length k.

  N = 1000
  T = 3
  M = 30
  k = 5

  w = rand(Normal(0,1), N, T)
  x = rand(Normal(0,1), N, M)

  function transf(x)

      transform = fill(NaN, k)

      for j ∈ 1:k
          transform[j] = x^j
      end

      return transform
  end

  W_wide = fill(NaN, M, N, T, k)

  for m ∈ 1:M
  
      for i ∈ 1:N
          W_wide[m,i,:,:] = w[i,:]*transpose(transf(x[i,m]))
      end
  end

  W = reshape(W_wide, N*T*M, k)

You should probably reorder everything so that the fastest-changing dimension (k) comes first: Julia arrays are column-major, so that is the cache-friendly access order. You might also gain something from StaticArrays, since k and T are small. And slices like w[i, :] should be view(w, i, :) (or wrapped in @views), because otherwise every slice allocates a copy.
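The slicing point is easy to see in isolation. A minimal sketch (not the poster's code; the variable names are mine):

```julia
w = randn(1000, 3)

row_copy = w[1, :]        # slicing allocates a fresh 3-element vector
row_view = @view w[1, :]  # a SubArray aliasing w's memory; no copy

old = w[1, 1]
w[1, 1] = old + 1         # mutate the parent array
row_view[1] == old + 1    # the view sees the change
row_copy[1] == old        # the copy does not
```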


Nice exercise for a benchmark. The MWE

using BenchmarkTools
using Random
using Distributions
using LoopVectorization
using Tullio

function transf(x, k)
    transform = fill(NaN, k)
    for j ∈ 1:k
        transform[j] = x^j
    end
    return transform
end

function test1()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = fill(NaN, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            W_wide[m,i,:,:] = w[i,:]*transpose(transf(x[i,m], k))
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test2()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            xj = x1 = x[i,m]
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*xj
                end
                xj *= x1
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test3()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    @turbo for m ∈ 1:M
        for i ∈ 1:N
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*x[i,m]^j
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test4()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    @tullio W_wide[m,i,t,j] := w[i,t]*x[i,m]^j (j ∈ 1:k)
    W = reshape(W_wide, N*T*M, k)
end

function test5()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    @tturbo for m ∈ 1:M
        for i ∈ 1:N
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*x[i,m]^j
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

Random.seed!(42)
result1 = test1()
Random.seed!(42)
result2 = test2()
@assert result2 β‰ˆ result1
Random.seed!(42)
result3 = test3()
@assert result3 β‰ˆ result2
Random.seed!(42)
result4 = test4()
@assert result4 β‰ˆ result3
Random.seed!(42)
result5 = test5()
@assert result5 β‰ˆ result4

@btime test1() setup=(Random.seed!(42))
@btime test2() setup=(Random.seed!(42))
@btime test3() setup=(Random.seed!(42))
@btime test4() setup=(Random.seed!(42))
@btime test5() setup=(Random.seed!(42))

yields

  9.229 ms (120009 allocations: 15.13 MiB) # baseline
  1.883 ms (8 allocations: 3.69 MiB) # manually optimized
  856.100 ΞΌs (8 allocations: 3.69 MiB) # @turbo
  696.600 ΞΌs (60 allocations: 3.69 MiB) # @tullio
  465.600 ΞΌs (8 allocations: 3.69 MiB) # @tturbo

Using multithreading might be slight overkill here (I’m throwing 5 cores at the problem)…

Edit: I didn’t sort out if there is a better memory access order.
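For reference, one candidate for a better order is to make the size-k index the leading (fastest-varying) dimension and permute back before the final reshape. A self-contained sketch of the manually optimized variant in that layout (randn stands in for the Normal(0,1) draws, and the names are mine):

```julia
N, T, M, k = 1000, 3, 30, 5
w = randn(N, T)
x = randn(N, M)

# j first, so consecutive writes in the innermost loop hit
# consecutive memory in Julia's column-major layout
W_kfirst = Array{Float64}(undef, k, T, N, M)
for m in 1:M, i in 1:N
    xim = x[i, m]
    for t in 1:T
        wit = w[i, t]
        xj = one(xim)
        for j in 1:k              # innermost loop varies the leading index
            xj *= xim             # running power: xj == xim^j
            W_kfirst[j, t, i, m] = wit * xj
        end
    end
end
# permutedims restores the (M, N, T, k) order of the original W_wide
W = reshape(permutedims(W_kfirst, (4, 3, 2, 1)), N*T*M, k)
```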


Thank you! I forgot to add that I need to call transf (since in the actual code it is a much more complicated function), so in practice pulling x[i,m]^j out of the function is off the table for me. If I am not mistaken, that rules out LoopVectorization and @tturbo…

What degrees of freedom do you have regarding the interface of transf? Can you compute the j-th element independently, or do you need to return a complete vector? If you need a vector, are SVectors appropriate (i.e. is size 5 realistic)?

Edit: something like this

function transf2(x, k, j)
    x^j
end

still seems improvable with LoopVectorization and Tullio.
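With such an elementwise interface, the whole nest can be written without ever materializing a vector. A dependency-free sketch, where transf2(x, k, j) = x^j stands in for one element of the real, more complicated transform:

```julia
# hypothetical elementwise transform: j-th element only
transf2(x, k, j) = x^j

N, T, M, k = 4, 3, 2, 5          # tiny sizes just for illustration
w = randn(N, T)
x = randn(N, M)

W_wide = Array{Float64}(undef, M, N, T, k)
for m in 1:M, i in 1:N, j in 1:k, t in 1:T
    W_wide[m, i, t, j] = w[i, t] * transf2(x[i, m], k, j)
end
W = reshape(W_wide, N*T*M, k)
```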

Assuming you need to return at least an SVector, that leaves me with this

using BenchmarkTools
using Random
using Distributions
using LoopVectorization
using Tullio
using StaticArrays

function transf1(x, k)
    transform = fill(NaN, k)
    for j ∈ 1:k
        transform[j] = x^j
    end
    return transform
end

function test1()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = fill(NaN, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            W_wide[m,i,:,:] = w[i,:]*transpose(transf1(x[i,m], k))
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function transf2(x, k)
    SVector((x^j for j ∈ 1:k)...)
end

function test2()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            v = transf2(x[i, m], k)
            for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*v[j]
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test3()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            v = transf2(x[i, m], k)
            @turbo for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*v[j]
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

function test4()
    N = 1000
    T = 3
    M = 30
    k = 5
    w = rand(Normal(0,1), N, T)
    x = rand(Normal(0,1), N, M)
    W_wide = Array{Float64}(undef, M, N, T, k)
    for m ∈ 1:M
        for i ∈ 1:N
            v = transf2(x[i, m], k)
            @tturbo for j ∈ 1:k
                for t ∈ 1:T
                    W_wide[m,i,t,j] = w[i,t]*v[j]
                end
            end
        end
    end
    W = reshape(W_wide, N*T*M, k)
end

Random.seed!(42)
result1 = test1()
Random.seed!(42)
result2 = test2()
@assert result2 β‰ˆ result1
Random.seed!(42)
result3 = test3()
@assert result3 β‰ˆ result2
Random.seed!(42)
result4 = test4()
@assert result4 β‰ˆ result3

yielding

  9.086 ms (120009 allocations: 15.13 MiB) # baseline
  2.073 ms (8 allocations: 3.69 MiB) # manually optimized
  2.188 ms (8 allocations: 3.69 MiB) # @turbo
  2.810 ms (8 allocations: 3.69 MiB) # @tturbo

Here LoopVectorization seemingly only allows annotating the inner loops, which no longer helps. And my Tullio-fu is too weak for this construction.
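One detail worth checking in transf2 above: splatting a generator into SVector is not type-stable when k is only known at runtime, since the compiler cannot infer the length. A dependency-free sketch of a length-stable alternative, passing k as a Val so the returned tuple is fully inferred:

```julia
# ntuple with Val(k) lets the compiler see the length,
# so the result is an NTuple{k, Float64} rather than an abstract type
transf_tuple(x, ::Val{k}) where {k} = ntuple(j -> x^j, Val(k))

v = transf_tuple(2.0, Val(5))    # (2.0, 4.0, 8.0, 16.0, 32.0)
```

The same Val trick works with SVector's ntuple-style constructors if StaticArrays is kept in the loop.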
