10x performance drop if `missing` is used or two values are returned from a function

Hi, while doing GARCH fitting I discovered a strange performance drop:

Case 1: Returning two values instead of one is ~10x slower (1.5 ms vs 23 μs). See the place marked with CHANGE1. Is this a known issue, and should this pattern be avoided?

Case 2: Using `missing` values slows things down by ~10x (1.5 ms vs 5 μs); see the place marked with CHANGE2. This seems like a known issue; what's the best workaround, using NaN or special float values? (A sketch of what I mean follows after Case 3.)

Case 3: The multi method is ~1.5x slower than the explicit method. I assumed the compiler would optimise both into the same code, but it doesn't.
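To make Case 2 concrete, by "NaN or special float values" I mean converting the vector up front so its element type is concrete (a sketch, not benchmarked):

# Replace `missing` with NaN; eltype becomes Float64 instead of Union{Missing, Float64}
returns_nan = Float64[coalesce(r, NaN) for r in returns2]
# ...then inside the loop, skip with `isnan(rs[t]) && continue`
# instead of `rs[t] === missing && continue`.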

Benchmark for Cases 1 and 2

using Distributions, BenchmarkTools

struct SimpleVol end

@inline predict_explicit(Q, prev, r) = begin
  d = r - Q.μ
  y = sqrt(d*d + 1e-6)

  v = Q.α*y + (1-Q.α)*prev.v

  (; v), 0.0
  # (; v) # CHANGE1
end

vol_llh_explicit(T, Q, rs) = begin
  n = length(rs)
  l = findfirst(!ismissing, rs)
  state = (; v=abs(rs[l]))
  for t in max(l, 2):n
    rs[t] === missing && continue
    state, _ = predict_explicit(Q, state, rs[t])
    # state = predict_explicit(Q, state, rs[t]) # CHANGE1
  end
end;

returns = rand(Normal(0, 0.015), 10_000);

isweekend(i) = (mod(i,7) == 6) || (mod(i,7) == 0);
returns2 = [isweekend(i) ? missing : returns[i] for i in 1:length(returns)];

# Warmup
vol_llh_explicit(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)

# Bench
println("Explicit Method")
@benchmark vol_llh_explicit(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)

# CHANGE2 use returns instead of returns2 in benchmark

Benchmark for Case 3

using Distributions, BenchmarkTools

struct SimpleVol end

@inline predict(::Type{SimpleVol}, Q, prev, r) = begin
  d = r - Q.μ
  y = sqrt(d*d + 1e-6)

  v = Q.α*y + (1-Q.α)*prev.v

  (; v), 0.0
  # (; v) # CHANGE1
end

@inline predict_explicit(Q, prev, r) = begin
  d = r - Q.μ
  y = sqrt(d*d + 1e-6)

  v = Q.α*y + (1-Q.α)*prev.v

  (; v), 0.0
  # (; v) # CHANGE1
end

vol_llh_multi(T, Q, rs) = begin
  n = length(rs)
  l = findfirst(!ismissing, rs)
  state = (; v=abs(rs[l]))
  for t in max(l, 2):n
    rs[t] === missing && continue
    state, _ = predict(T, Q, state, rs[t])
    # state = predict(T, Q, state, rs[t]) # CHANGE1
  end
end;

vol_llh_explicit(T, Q, rs) = begin
  n = length(rs)
  l = findfirst(!ismissing, rs)
  state = (; v=abs(rs[l]))
  for t in max(l, 2):n
    rs[t] === missing && continue
    state, _ = predict_explicit(Q, state, rs[t])
    # state = predict_explicit(Q, state, rs[t]) # CHANGE1
  end
end;

returns = rand(Normal(0, 0.015), 10_000);

isweekend(i) = (mod(i,7) == 6) || (mod(i,7) == 0);
returns2 = [isweekend(i) ? missing : returns[i] for i in 1:length(returns)];

# Warmup
vol_llh_explicit(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)
vol_llh_multi(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)

# Bench
println("Explicit Method")
@benchmark vol_llh_explicit(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)
println("Multi Method")
@benchmark vol_llh_multi(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)

# CHANGE2 use returns instead of returns2 in benchmark

P.S. There is a question on Stack Overflow, but it's a bit unrelated; it only covers the 3rd case.

I'm not sure about the exact internal details here, but it is related to the fact that, as written, the state type cannot be completely inferred, because rs[l] cannot be inferred to not be missing. That may be the cause of the regression.
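A quick way to see this (a sketch; the exact printout depends on the Julia version):

julia> Q = (α = 0.048, μ = 0.0004, ν = 4.203);

julia> @code_warntype vol_llh_explicit(SimpleVol, Q, returns2)

The report flags state with a Union involving Missing, because abs(rs[l]) can be missing as far as the compiler knows.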

In any case, if instead of collecting the state into a named tuple you collect it into a custom struct with a concrete, predefined field type, the issue disappears and both versions become faster:

Old code:

# single return value - without returning `state`:
julia> @b vol_llh_explicit($SimpleVol, $((α = 0.048, μ = 0.0004, ν = 4.203)), $returns2)
8.346 μs

# two return values
julia> @b vol_llh_explicit($SimpleVol, $((α = 0.048, μ = 0.0004, ν = 4.203)), $returns2)
615.610 μs (35719 allocs: 558.109 KiB)

# single return value (return state from function)
julia> @b vol_llh_explicit($SimpleVol, $((α = 0.048, μ = 0.0004, ν = 4.203)), $returns2)
486.399 μs (28576 allocs: 446.500 KiB)

New code:

# 2 return values
julia> @b vol_llh_explicit($SimpleVol, $((α = 0.048, μ = 0.0004, ν = 4.203)), $returns2)
8.346 μs

# single return value
julia> @b vol_llh_explicit($SimpleVol, $((α = 0.048, μ = 0.0004, ν = 4.203)), $returns2)
8.343 μs

Code:

struct SimpleVol end

struct State
    v::Float64
end

function predict_explicit(Q, prev, r)
  d = r - Q.μ
  y = sqrt(d*d + 1e-6)
  v = Q.α*y + (1-Q.α)*prev.v
  #(; v), 0.0
  #(; v) # CHANGE1
  #State(v)
  State(v), 0.0
end

function vol_llh_explicit(T, Q, rs)
  n = length(rs)
  l = findfirst(!ismissing, rs)
  state = State(abs(rs[l]))
  #state = (; v = abs(rs[l]))
  for t in max(l, 2):n
    rs[t] === missing && continue
    state, _ = predict_explicit(Q, state, rs[t])
    #state = predict_explicit(Q, state, rs[t]) # CHANGE1
  end
  return state # addition: makes the results more meaningful, probably
end

Edit: the other thing is that if you return state from the functions, the results change, because otherwise the compiler can eliminate the loop entirely. Even with the one-value-return function, simply returning state from the outer function produces lots of allocations. I've edited the benchmarks above accordingly.
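If you'd rather keep the return-nothing signature while benchmarking, one option (a sketch, assuming Julia ≥ 1.8; the name vol_llh_explicit_nodce is just for illustration) is Base.donotdelete, which marks a value as observable:

function vol_llh_explicit_nodce(T, Q, rs)
  n = length(rs)
  l = findfirst(!ismissing, rs)
  state = State(abs(rs[l]))
  for t in max(l, 2):n
    rs[t] === missing && continue
    state, _ = predict_explicit(Q, state, rs[t])
  end
  Base.donotdelete(state) # prevents the loop from being dead-code eliminated
  nothing
end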


In line with what @lmiq said, a simple type annotation inside predict can also fix the problem.

julia> function predict_explicit_float(Q, prev, r)
         d = r - Q.μ
         y = sqrt(d*d + 1e-6)
         v::Float64 = Q.α*y + (1-Q.α)*prev
         v, 0.0
       end;

julia> function vol_llh_explicit_float(T, Q, rs)
         n = length(rs)
         l = findfirst(!ismissing, rs)
         state = abs(rs[l])
         for t in max(l, 2):n
           rs[t] === missing && continue
           state, _ = predict_explicit_float(Q, state, rs[t])
         end
         return state
       end;

skipmissing might be the best solution, actually.


julia> function predict_explicit(Q, prev, r)
         d = r - Q.μ
         y = sqrt(d*d + 1e-6)
         v = Q.α*y + (1-Q.α)*prev
         v, 0.0
       end;

julia> function vol_llh_explicit_skipmissing(T, Q, rs)
         n = length(rs)
         rs_sm = skipmissing(rs)
         state = abs(first(rs_sm))
         for r in rs_sm
           state, _ = predict_explicit(Q, state, r)
         end
         return state
       end;
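This works because skipmissing narrows the element type, so state is inferred as Float64 from the start (a quick check):

julia> eltype(returns2)
Union{Missing, Float64}

julia> eltype(skipmissing(returns2))
Float64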

Why that one and not `skipmissing` from Base?

(I think you want the Base one; the one from Missings.jl is not exported.)

Oh sorry, I forgot that skipmissing is in Base. skipmissings (note the s) is an interesting experiment, but it is not exported because it doesn't actually solve all the problems it's meant to solve.
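For reference, here's roughly what skipmissings does (a small sketch; it lives in Missings.jl and must be qualified since it's unexported):

julia> using Missings

julia> x = [1.0, missing, 3.0]; y = [missing, 2.0, 3.0];

julia> sx, sy = Missings.skipmissings(x, y);

julia> collect(sx), collect(sy)  # only index 3 survives: both entries non-missing
([3.0], [3.0])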
