How to speed up these functions?

I have been working on estimating a model via indirect inference. For that to be feasible, I need to speed up the functions below, which jointly comprise the auxiliary model. Can anyone spot optimizations?
Set up:

using LinearAlgebra

n = 1000
t0 = rand(1:200, n)
t1 = rand(1:200, n)
t = vcat(t0, t1)

c = similar(t)
for i in 1:2n
    rand() < 0.5 ? c[i] = t[i] : c[i] = rand(1:t[i])
end

xx0 = 100 .* [rand(n).*rand([-1, 1], n) rand(n).*rand([-1, 1], n)]
xx1 = 100 .* [rand(n).*rand([-1, 1], n) rand(n).*rand([-1, 1], n)]
xx = vcat(xx0, xx1)

α = 1.2
β = [-1.33, -0.006, 0.12]
γ = [0.07, 0.03, 0.5]
ρ = 11.3
ψ0 = 0.046
ψ1 = 0.040

Functions:

# Weibull proportional hazard model
function logℒ(α, β, t, c, x)
    eα = abs(α)
    c = (c .== t)                  # recode c as an indicator for c[i] == t[i]
    n = size(x, 1)
    xb = hcat(ones(n), x) * β      # linear predictor with an intercept column

    sum((1 .- c) .* (log(eα) .+ (eα - 1) .* log.(t) .+ xb)) - sum(t.^eα .* exp.(xb))
end
# ordered logit
function 𝒬(γ, t, x)
    n = size(x, 1)
    d0 = t .> 0
    d1 = t .> 1
    x0 = [x zeros(n)]   # append a constant column of zeros
    x1 = [x ones(n)]    # append a constant column of ones
    xb0 = x0 * γ
    xb1 = x1 * γ

    dot(xb0, d0) + dot(xb1, d1) - sum(log.(1 .+ exp.(xb0))) - sum(log.(1 .+ exp.(xb1)))
end
# covariance of residuals
## removes collinear columns
function removeredundant(x)
    return hcat(unique(eachcol(x))...)
end

function 𝒞(ρ, t0, t1, xx0, xx1)
    xx = removeredundant([xx0 xx1])
    xx = xx * ((transpose(xx) * xx) \ transpose(xx))   # hat matrix P = X(X'X)⁻¹X'

    y1 = log.(t1)
    y0 = log.(t0)
    e1 = y1 - xx * y1   # regression residuals
    e0 = y0 - xx * y0
    -sum(((e1 .* e0) .- ρ).^2)
end
# jumps
## indicator function
function boole(t, age; ll_ret=1, ul_ret=2)
    age - ll_ret .<= t .<= age + ul_ret
end

function 𝒮(ψ0, ψ1, t0, t1; age=4*(65-45), ll_ret=1, ul_ret=2)
    -sum( (boole(t0, age; ll_ret=ll_ret, ul_ret=ul_ret) .- ψ0).^2 ) -
    sum( (boole(t1, age; ll_ret=ll_ret, ul_ret=ul_ret) .- ψ1).^2 )
end

E.g. in 𝒬, I would just accumulate the sum in a loop and not instantiate d0, d1, x0, x1, xb0, xb1. At first glance, you should be able to make this code allocation-free. The same applies to logℒ.

Incidentally, use LogExpFunctions.log1pexp for things like log(1 + exp(...)), it is much better numerically.
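For instance, a minimal sketch of 𝒬 as a single accumulation loop with log1pexp (untested, and 𝒬_loop is just an illustrative name; it assumes γ holds the slope coefficients followed by the coefficient on the constant column):

using LogExpFunctions

function 𝒬_loop(γ, t, x)
    n, k = size(x)
    s = zero(eltype(x))
    for i in 1:n
        xb0 = zero(eltype(x))
        for j in 1:k
            xb0 += x[i, j] * γ[j]   # row of x times the slope coefficients
        end
        xb1 = xb0 + γ[k+1]          # the ones(n) column just adds γ[k+1]
        s += (t[i] > 0) * xb0 + (t[i] > 1) * xb1 - log1pexp(xb0) - log1pexp(xb1)
    end
    return s
end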

6 Likes

How can I prevent instantiation of those arrays?

By not creating them? Again, use a loop.

Of course, I’m just having a hard time visualising how to multiply matrices without creating them, especially ones that have a constant column.

Could you add some code to run here? I see some function definitions, and some setup of parameters, but it’s not clear to me how to call the functions.

2 Likes

I wrote the question so that all objects are defined. So you can do this:

logℒ(α, β, t, c, xx) +
𝒬(γ, t, xx) +
𝒞(ρ, t0, t1, xx0, xx1) +
𝒮(ψ0, ψ1, t0, t1)

Is that what you mean?

1 Like

Yes, this is exactly what I was missing.

Perhaps it was obvious to you how to call them, but there isn’t necessarily any connection between the names of arguments in a function signature and the names of variables in the calling context.

Ideally, a minimal working example should be ‘copy+paste+enter’.

2 Likes

I see, I tried to simplify things a bit. I actually call these functions inside a wrapper that takes a vector of parameters θ (the Greek letters) and data (the Latin letters) and distributes them to these functions, which I then sum. Then I need to optimize that wrapper with respect to θ.
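Roughly like this (a sketch; the exact way θ is unpacked here is just illustrative):

function objective(θ, t, c, t0, t1, xx, xx0, xx1)
    α, β, γ = θ[1], θ[2:4], θ[5:7]
    ρ, ψ0, ψ1 = θ[8], θ[9], θ[10]
    logℒ(α, β, t, c, xx) +
    𝒬(γ, t, xx) +
    𝒞(ρ, t0, t1, xx0, xx1) +
    𝒮(ψ0, ψ1, t0, t1)
end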

Is it cheating to observe that c == t only when c == t == 1, in which case (1 - c) == 0, and then most of the expression

sum((1 .- c).*(log(eα) .+ (eα - 1).*log.(t) .+ xb)) - sum(t.^eα .* exp.(xb))

disappears?

If it is cheating, can you modify your example inputs to avoid this behaviour?

Good catch, this should fix it (I also edited the original post):

c = similar(t)
for i in 1:2n
    rand() < 0.5 ? c[i] = t[i] : c[i] = rand(1:t[i])
end

For example, this implementation is about 3x faster on my machine (for t = randn(2000); x = randn(2000, 2); γ = [0.07, 0.03, 0.5]), and is allocation-free:

using LoopVectorization

# compute log(1 + exp(x)) without spurious overflow
softplus(x) = log1p(exp(-abs(x))) + max(x, 0)

function 𝒬_fast(γ, t, x)
    s = log(1 + zero(eltype(x)))
    n = size(x, 1)
    (n == length(t) && length(γ) == size(x, 2)+1 == 3) || throw(DimensionMismatch())
    @turbo for i = 1:size(x,1)
        xb0 = x[i,1] * γ[1] + x[i,2] * γ[2]
        xb1 = xb0 + γ[3]
        s += (t[i] > 0) * xb0 + (t[i] > 1) * xb1 - softplus(xb0) - softplus(xb1)
    end
    return s
end
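(It should return the same value as 𝒬 up to floating-point roundoff: 𝒬_fast(γ, t, xx) ≈ 𝒬(γ, t, xx).)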

Note that I copied the implementation of softplus from NNlib.jl, and note that the computations of xb0 and xb1 share most of their operations. Computing log(1 + exp(x)) directly will overflow and give an incorrect result of Inf if x is too large; for example:
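julia> log(1 + exp(1000.0))   # exp(1000) overflows to Inf
Inf

julia> softplus(1000.0)       # log1p(exp(-1000)) + 1000
1000.0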

I “cheated” a bit by assuming that x has 2 columns as in your code so that I could inline the loop over columns. There are various games you could play here to speed things up further and to make the code more generic.

You might also want to transpose your x arrays, so that it uses x[1,i] and x[2,i] (contiguous) in the inner loop rather than x[i,1] and x[i,2]. (Or use an array of StaticArrays.)
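For example, a sketch of the transposed version (𝒬_fast_t and xt are my names, same assumptions as above; untested):

function 𝒬_fast_t(γ, t, xt)   # xt = permutedims(x), so xt is k×n
    s = zero(eltype(xt))
    (size(xt, 2) == length(t) && length(γ) == size(xt, 1) + 1 == 3) || throw(DimensionMismatch())
    @turbo for i = 1:size(xt, 2)
        xb0 = xt[1, i] * γ[1] + xt[2, i] * γ[2]   # contiguous loads
        xb1 = xb0 + γ[3]
        s += (t[i] > 0) * xb0 + (t[i] > 1) * xb1 - softplus(xb0) - softplus(xb1)
    end
    return s
end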

4 Likes

I tried to generalize the lines

        xb0 = x[i,1] * γ[1] + x[i,2] * γ[2]
        xb1 = xb0 + γ[3]

for the case with k columns in x this way: first define n, k = size(x) outside the loop, then replace the first 2 lines in the loop with

        xb0 = sum(x[i,j] * γ[j] for j in 1:k)
        xb1 = xb0 + γ[end]

but in that case the macro @turbo throws an error:

ERROR: LoadError: LoadError: Expression not recognized.
(x[i, j] * γ[j] for j = 1:k)

is there any workaround? perhaps another acceleration package?

This is the current status:

n = 1000
t0 = rand(1:200, n)
t1 = rand(1:200, n)
t = vcat(t0, t1)

c = similar(t)
for i in 1:2n
    rand() < 0.5 ? c[i] = t[i] : c[i] = rand(1:t[i])
end

xx0 = 100 .* [rand(n).*rand([-1, 1], n) rand(n).*rand([-1, 1], n)]
xx1 = 100 .* [rand(n).*rand([-1, 1], n) rand(n).*rand([-1, 1], n)]
xx = vcat(xx0, xx1)

α = 1.2
β = [-1.33, -0.006, 0.12]
γ = [0.07, 0.03, 0.5]
ρ = 11.3
ψ0 = 0.046
ψ1 = 0.040

function logℒ_fast(α, β, t, c, x)
    eα = abs(α)
    n, k = size(x)

    (n == length(t) == length(c) && length(β) == k + 1) || throw(DimensionMismatch())
    s = zero(typeof(α))
    for i in 1:n
        xb = sum(x[i, j] * β[j+1] for j in 1:k) + β[1]
        ti = t[i]
        s += (1 - (c[i] == ti)) * (log(eα) + (eα - 1) * log(ti) + xb) - ti^eα * exp(xb)
    end
    return s
end

# compute log(1 + exp(x)) without spurious overflow
softplus(x) = log1p(exp(-abs(x))) + max(x, 0)

function 𝒬_fast(γ, t, x)
    s = log(1 + zero(eltype(x)))
    n, k = size(x)
    (n == length(t) && length(γ) == k + 1) || throw(DimensionMismatch())
    for i = 1:n
        xb0 = sum(x[i, j] * γ[j] for j in 1:k)
        xb1 = xb0 + γ[end]
        s += (t[i] > 0) * xb0 + (t[i] > 1) * xb1 - softplus(xb0) - softplus(xb1)
    end
    return s
end

## removes collinear columns
function removeredundant(x)
    return hcat(unique(eachcol(x))...)
end

function 𝒞_fast(ρ, t0, t1, xx0, xx1)
    xx = removeredundant([xx0 xx1])

    y1 = log.(t1)
    y0 = log.(t0)
    b1 = xx \ y1
    b0 = xx \ y0

    s = zero(eltype(ρ))
    n, k = size(xx)
    (n == length(t0) == length(t1)) || throw(DimensionMismatch())
    for i = 1:n
        e1 = y1[i] - sum(xx[i, j] * b1[j] for j in 1:k)
        e0 = y0[i] - sum(xx[i, j] * b0[j] for j in 1:k)
        s -= (e1 * e0 - ρ)^2
    end
    return s
end

function boole_i(t, age; ll_ret=1, ul_ret=2)
    age - ll_ret <= t <= age + ul_ret
end

function 𝒮_fast(ψ0, ψ1, t0, t1; age=4*(65-45), ll_ret=1, ul_ret=2)
    n = length(t0)
    (n == length(t1)) || throw(DimensionMismatch())
    s = zero(typeof(ψ0))

    for i in 1:n
        s -= (boole_i(t0[i], age; ll_ret=ll_ret, ul_ret=ul_ret) - ψ0)^2 +
            (boole_i(t1[i], age; ll_ret=ll_ret, ul_ret=ul_ret) - ψ1)^2
    end
    return s
end

logℒ_fast(α, β, t, c, xx) +
𝒬_fast(γ, t, xx) +
𝒞_fast(ρ, t0, t1, xx0, xx1) +
𝒮_fast(ψ0, ψ1, t0, t1)

There are some allocations I didn’t know how to avoid in 𝒞_fast:

  1. I need to remove collinear columns in the matrix [xx0 xx1].
  2. I need to compute the OLS coefficients of regressing log(t1) on xx, and log(t0) on xx.

Does anyone have any suggestions on avoiding those allocations?
Does anyone have any suggestions on further optimizations? (I know I can transpose the matrices to exploit the column-major ordering of Julia).
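One direction I have been considering for both points (a sketch, untested; it assumes the regressor matrix is fixed across evaluations of the objective, so the column pruning and the factorization can be hoisted out and reused):

using LinearAlgebra

xxc = removeredundant([xx0 xx1])
F = qr(xxc)                       # factorize once, reuse for every call

function 𝒞_fast2(ρ, t0, t1, xxc, F)
    y1 = log.(t1)
    y0 = log.(t0)
    b1 = F \ y1                   # least-squares coefficients from the cached QR
    b0 = F \ y0
    s = zero(typeof(ρ))
    n, k = size(xxc)
    (n == length(t0) == length(t1)) || throw(DimensionMismatch())
    for i in 1:n
        e1 = y1[i] - sum(xxc[i, j] * b1[j] for j in 1:k)
        e0 = y0[i] - sum(xxc[i, j] * b0[j] for j in 1:k)
        s -= (e1 * e0 - ρ)^2
    end
    return s
end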

Also, as I mentioned above, I get an error when trying to use LoopVectorization in the loops. I think that happens because @turbo does not support more than one index and I want to generalize the program to allow for a larger xx matrix. How can I avoid that error?

LoopVectorization doesn’t support generators.
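The workaround is to write the reduction as an explicit inner loop instead of a generator, as in the versions below: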


function logℒ_fast(α, β, t, c, x)
    eα = abs(α)
    n, k = size(x)

    (n == length(t) == length(c) && length(β) == k + 1) || throw(DimensionMismatch())
    s = zero(typeof(α))
    @turbo for i in 1:n
        xb0 = 0.0
        for j in 1:k
            xb0 += x[i, j] * β[j+1]
        end
        xb = xb0 + β[1]
        ti = t[i]
        s += (1 - (c[i] == ti)) * (log(eα) + (eα - 1) * log(ti) + xb) - ti^eα * exp(xb)
    end
    return s
end
function 𝒮_fast(ψ0, ψ1, t0, t1; age=4*(65-45), ll_ret=1, ul_ret=2)
    n = length(t0)
    (n == length(t1)) || throw(DimensionMismatch())
    s = zero(typeof(ψ0))

    @inbounds @simd for i in 1:n
        s -= (boole_i(t0[i], age; ll_ret=ll_ret, ul_ret=ul_ret) - ψ0)^2 +
            (boole_i(t1[i], age; ll_ret=ll_ret, ul_ret=ul_ret) - ψ1)^2
    end
    return s
end
2 Likes

I get an error that end is not defined in 𝒬_fast. That’s from γ[end]. I can replace that with last(γ), but it’s a little strange.

There is an issue for end support. There also seems to be another problem that I’m fixing at the moment.
@amrods

This is what I’m getting:

julia> @btime logℒ_fast($α, $β, $t, $c, $xx) +
         𝒬_fast($γ, $t, $xx) +
         𝒞_fast($ρ, $t0, $t1, $xx0, $xx1) +
         𝒮_fast($ψ0, $ψ1, $t0, $t1)
  1.140 ms (18092 allocations: 704.41 KiB)
-9.978867771099854e8

julia> @btime logℒ_fast_turbo($α, $β, $t, $c, $xx) +
         𝒬_fast_turbo($γ, $t, $xx) +
         𝒞_fast_turbo($ρ, $t0, $t1, $xx0, $xx1) +
         𝒮_fast_simd($ψ0, $ψ1, $t0, $t1)
  121.907 μs (148 allocations: 299.58 KiB)
-9.978867771099855e8

All those allocations, especially with the original versions, suggest you can probably still optimize them by a lot. But I just wanted to show @turbo on three of them.

Code
using LoopVectorization, BenchmarkTools
n = 1000;
t0 = rand(1:200, n);
t1 = rand(1:200, n);
t = vcat(t0, t1);

c = similar(t);
for i in 1:2n
    rand() < 0.5 ? c[i] = t[i] : c[i] = rand(1:t[i])
end

xx0 = 100 .* [rand(n).*rand([-1, 1], n) rand(n).*rand([-1, 1], n)];
xx1 = 100 .* [rand(n).*rand([-1, 1], n) rand(n).*rand([-1, 1], n)];
xx = vcat(xx0, xx1);

α = 1.2;
β = [-1.33, -0.006, 0.12];
γ = [0.07, 0.03, 0.5];
ρ = 11.3;
ψ0 = 0.046;
ψ1 = 0.040;

function logℒ_fast(α, β, t, c, x)
  eα = abs(α)
  n, k = size(x)

  (n == length(t) == length(c) && length(β) == k + 1) || throw(DimensionMismatch())
  s = zero(typeof(α))
  @inbounds for i in 1:n
    xb = sum(@inbounds(x[i, j] * β[j+1]) for j in 1:k) + β[1]
    ti = t[i]
    s += (1 - (c[i] == ti)) * (log(eα) + (eα - 1) * log(ti) + xb) - ti^eα * exp(xb)
  end
  return s
end

# compute log(1 + exp(x)) without spurious overflow
softplus(x) = log1p(exp(-abs(x))) + max(x, 0)

function 𝒬_fast(γ, t, x)
  s = log(1 + zero(eltype(x)))
  n, k = size(x)
  (n == length(t) && length(γ) == k + 1) || throw(DimensionMismatch())
  @inbounds for i = 1:n
    xb0 = sum(@inbounds(x[i, j] * γ[j]) for j in 1:k)
    xb1 = xb0 + γ[end]
    s += (t[i] > 0) * xb0 + (t[i] > 1) * xb1 - softplus(xb0) - softplus(xb1)
  end
  return s
end

## removes collinear columns
function removeredundant(x)
  return hcat(unique(eachcol(x))...)
end

function 𝒞_fast(ρ, t0, t1, xx0, xx1)
  xx = removeredundant([xx0 xx1])

  y1 = log.(t1)
  y0 = log.(t0)
  b1 = xx \ y1
  b0 = xx \ y0

  s = zero(eltype(ρ))
  n, k = size(xx)
  (n == length(t0) == length(t1)) || throw(DimensionMismatch())
  @inbounds for i = 1:n
    e1 = y1[i] - sum(@inbounds(xx[i, j] * b1[j]) for j in 1:k)
    e0 = y0[i] - sum(@inbounds(xx[i, j] * b0[j]) for j in 1:k)
    s -= (e1 * e0 - ρ)^2
  end
  return s
end

function boole_i(t, age; ll_ret=1, ul_ret=2)
  age - ll_ret <= t <= age + ul_ret
end

function 𝒮_fast(ψ0, ψ1, t0, t1; age=4*(65-45), ll_ret=1, ul_ret=2)
  n = length(t0)
  (n == length(t1)) || throw(DimensionMismatch())
  s = zero(typeof(ψ0))

  for i in 1:n
    s -= (boole_i(t0[i], age; ll_ret=ll_ret, ul_ret=ul_ret) - ψ0)^2 +
      (boole_i(t1[i], age; ll_ret=ll_ret, ul_ret=ul_ret) - ψ1)^2
  end
  return s
end

logℒ_fast(α, β, t, c, xx) +
  𝒬_fast(γ, t, xx) +
  𝒞_fast(ρ, t0, t1, xx0, xx1) +
  𝒮_fast(ψ0, ψ1, t0, t1)


function logℒ_fast_turbo(α, β, t, c, x)
  eα = abs(α)
  n, k = size(x)

  (n == length(t) == length(c) && length(β) == k + 1) || throw(DimensionMismatch())
  s = zero(typeof(α))
  @turbo for i in 1:n
    xb0 = 0.0
    for j in 1:k
      xb0 += x[i,j] * β[j+1]
    end
    xb = xb0 + β[1]
    ti = t[i]
    s += (1 - (c[i] == ti)) * (log(eα) + (eα - 1) * log(ti) + xb) - ti^eα * exp(xb)
  end
  return s
end

function 𝒬_fast_turbo(γ, t, x)
  s = log(1 + zero(eltype(x)))
  n, k = size(x)
  (n == length(t) && length(γ) == k + 1) || throw(DimensionMismatch())
  @turbo for i = 1:n
    xb0 = 0.0
    for j in 1:k
      xb0 += x[i, j] * γ[j]
    end
    xb1 = xb0 + γ[k+1]
    s += (t[i] > 0) * xb0 + (t[i] > 1) * xb1 - softplus(xb0) - softplus(xb1)
  end
  return s
end

function 𝒞_fast_turbo(ρ, t0, t1, xx0, xx1)
  xx = removeredundant([xx0 xx1])

  y1 = log.(t1)
  y0 = log.(t0)
  b1 = xx \ y1
  b0 = xx \ y0

  s = zero(eltype(ρ))
  n, k = size(xx)
  (n == length(t0) == length(t1)) || throw(DimensionMismatch())
  @turbo for i = 1:n
    e1acc = 0.0
    e0acc = 0.0
    for j in 1:k
      e1acc += xx[i, j] * b1[j]
      e0acc += xx[i, j] * b0[j]
    end
    e1 = y1[i] - e1acc
    e0 = y0[i] - e0acc
    s -= (e1 * e0 - ρ)^2
  end
  return s
end

function 𝒮_fast_simd(ψ0, ψ1, t0, t1; age=4*(65-45), ll_ret=1, ul_ret=2)
  n = length(t0)
  (n == length(t1)) || throw(DimensionMismatch())
  s = zero(typeof(ψ0))

  @inbounds @simd for i in 1:n
    s -= (boole_i(t0[i], age; ll_ret=ll_ret, ul_ret=ul_ret) - ψ0)^2 +
      (boole_i(t1[i], age; ll_ret=ll_ret, ul_ret=ul_ret) - ψ1)^2
  end
  return s
end

@btime logℒ_fast($α, $β, $t, $c, $xx) +
  𝒬_fast($γ, $t, $xx) +
  𝒞_fast($ρ, $t0, $t1, $xx0, $xx1) +
  𝒮_fast($ψ0, $ψ1, $t0, $t1)

@btime logℒ_fast_turbo($α, $β, $t, $c, $xx) +
  𝒬_fast_turbo($γ, $t, $xx) +
  𝒞_fast_turbo($ρ, $t0, $t1, $xx0, $xx1) +
  𝒮_fast_simd($ψ0, $ψ1, $t0, $t1)
3 Likes

Thank you for that. The allocations arise in 𝒞_fast for the reasons I mentioned above. I will think about how to devectorize those operations, but I’m not too worried about that.

As an aside, I have to say that naming that macro @turbo is one of the coolest things I have seen in Julia. 🙂

I’m getting some puzzling results. Consider this:

using LoopVectorization

t = [26, 94]
c = [26, 94]
x = collect([25 12]')

α = 2.85
β = [-13.3, -0.006]

function logℒ_fast(α, β, t, c, x)
    eα = abs(α)
    n, k = size(x)
  
    (n == length(t) == length(c) && length(β) == k + 1) || throw(DimensionMismatch())
    s = zero(typeof(α))
    @inbounds for i in 1:n
      xb = sum(@inbounds(x[i, j] * β[j+1]) for j in 1:k) + β[1]
      ti = t[i]
      s += (1 - (c[i] == ti)) * (log(eα) + (eα - 1) * log(ti) + xb) - ti^eα * exp(xb)
    end
    return s
end

function logℒ_fast_turbo(α, β, t, c, x)
    eα = abs(α)
    n, k = size(x)
  
    (n == length(t) == length(c) && length(β) == k + 1) || throw(DimensionMismatch())
    s = zero(typeof(α))
    @turbo for i in 1:n
      xb0 = 0.0
      for j in 1:k
        xb0 += x[i,j] * β[j+1]
      end
      xb = xb0 + β[1]
      ti = t[i]
      s += (1 - (c[i] == ti)) * (log(eα) + (eα - 1) * log(ti) + xb) - ti^eα * exp(xb)
    end
    return s
end

Those are the same functions you defined above, however:

julia> logℒ_fast_turbo(α, β, t, c, x)
NaN

julia> logℒ_fast(α, β, t, c, x)
-0.6702178908853803

I can’t figure out what’s going on here.

1 Like

My bad, fixed and added to the tests.
I get, using n = 2, k = 1:

julia> @btime logℒ_fast_turbo($α, $β, $t, $c, $x)
  38.984 ns (0 allocations: 0 bytes)
-0.67021789088538

julia> @btime logℒ_fast($α, $β, $t, $c, $x)
  180.717 ns (0 allocations: 0 bytes)
-0.6702178908853803

julia> @btime logℒ_fast_loops($α, $β, $t, $c, $x)
  172.168 ns (0 allocations: 0 bytes)
-0.6702178908853803

Where logℒ_fast_loops replaces @turbo with @inbounds @fastmath.
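That is, something like this (a sketch of the swap, reconstructed from the description above):

function logℒ_fast_loops(α, β, t, c, x)
    eα = abs(α)
    n, k = size(x)
    (n == length(t) == length(c) && length(β) == k + 1) || throw(DimensionMismatch())
    s = zero(typeof(α))
    # same body as logℒ_fast_turbo, with @turbo swapped for @inbounds @fastmath
    @inbounds @fastmath for i in 1:n
        xb = β[1]
        for j in 1:k
            xb += x[i, j] * β[j+1]
        end
        ti = t[i]
        s += (1 - (c[i] == ti)) * (log(eα) + (eα - 1) * log(ti) + xb) - ti^eα * exp(xb)
    end
    return s
end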

Using n, k = 100, 10:

julia> @btime logℒ_fast($α, $β, $t, $c, $x)
  9.150 μs (0 allocations: 0 bytes)
-300.9641041610733

julia> @btime logℒ_fast_loops($α, $β, $t, $c, $x)
  9.469 μs (0 allocations: 0 bytes)
-300.96410416107324

julia> @btime logℒ_fast_turbo($α, $β, $t, $c, $x)
  305.966 ns (0 allocations: 0 bytes)
-300.96410416107324

julia> versioninfo()
Julia Version 1.8.0-DEV.320
Commit 500db66766* (2021-08-05 19:57 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)

Roughly 30x faster.

3 Likes