Performance drops 10x if `missing` is used or two values are returned from a function

Hi, while doing GARCH fitting I discovered a strange performance drop:

Case 1: Returning two values instead of one is ~10x slower: 1.5 ms vs 23 μs. See the place marked with CHANGE1. Is that a known issue, and should returning multiple values be avoided?

Case 2: Using missing values slows it down by ~10x: 1.5 ms vs 5 μs. See the place marked with CHANGE2. It seems like a known issue; what's the best workaround, using NaN or a special float value?

Case 3: The multi method is ~1.5x slower than the simple method. I assumed the compiler would optimise both into the same code, but it doesn't.

Benchmark for Cases 1 and 2

using Distributions, BenchmarkTools

struct SimpleVol end

@inline predict_explicit(Q, prev, r) = begin
  d = r - Q.μ
  y = sqrt(d*d + 1e-6)

  v = Q.α*y + (1-Q.α)*prev.v

  (; v), 0.0
  # (; v) # CHANGE1
end

vol_llh_explicit(T, Q, rs) = begin
  n = length(rs)
  l = findfirst(!ismissing, rs)
  state = (; v=abs(rs[l]))
  for t in max(l, 2):n
    rs[t] === missing && continue
    state, _ = predict_explicit(Q, state, rs[t])
    # state = predict_explicit(Q, state, rs[t]) # CHANGE1
  end
end;

returns = rand(Normal(0, 0.015), 10_000);

isweekend(i) = (mod(i,7) == 6) || (mod(i,7) == 0);
returns2 = [isweekend(i) ? missing : returns[i] for i in 1:length(returns)];

# Warmup
vol_llh_explicit(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)

# Bench
println("Explicit Method")
@benchmark vol_llh_explicit(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)

# CHANGE2 use returns instead of returns2 in benchmark

Benchmark for Case 3

using Distributions, BenchmarkTools

struct SimpleVol end

@inline predict(::Type{SimpleVol}, Q, prev, r) = begin
  d = r - Q.μ
  y = sqrt(d*d + 1e-6)

  v = Q.α*y + (1-Q.α)*prev.v

  (; v), 0.0
  # (; v) # CHANGE1
end

@inline predict_explicit(Q, prev, r) = begin
  d = r - Q.μ
  y = sqrt(d*d + 1e-6)

  v = Q.α*y + (1-Q.α)*prev.v

  (; v), 0.0
  # (; v) # CHANGE1
end

vol_llh_multi(T, Q, rs) = begin
  n = length(rs)
  l = findfirst(!ismissing, rs)
  state = (; v=abs(rs[l]))
  for t in max(l, 2):n
    rs[t] === missing && continue
    state, _ = predict(T, Q, state, rs[t])
    # state = predict(T, Q, state, rs[t]) # CHANGE1
  end
end;

vol_llh_explicit(T, Q, rs) = begin
  n = length(rs)
  l = findfirst(!ismissing, rs)
  state = (; v=abs(rs[l]))
  for t in max(l, 2):n
    rs[t] === missing && continue
    state, _ = predict_explicit(Q, state, rs[t])
    # state = predict_explicit(Q, state, rs[t]) # CHANGE1
  end
end;

returns = rand(Normal(0, 0.015), 10_000);

isweekend(i) = (mod(i,7) == 6) || (mod(i,7) == 0);
returns2 = [isweekend(i) ? missing : returns[i] for i in 1:length(returns)];

# Warmup
vol_llh_explicit(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)
vol_llh_multi(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)

# Bench
println("Explicit Method")
@benchmark vol_llh_explicit(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)
println("Multi Method")
@benchmark vol_llh_multi(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)

# CHANGE2 use returns instead of returns2 in benchmark

P.S. There is a similar issue on Stack Overflow, but it's a bit unrelated; it's only about the 3D case.

I'm not sure about the exact internal details, but it is related to the fact that, as written, the state type cannot be completely inferred, because rs[l] cannot be inferred to not be missing. That may be what causes the regression.
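One way to confirm the instability (just a diagnostic, assuming the definitions from the first post are loaded) is to inspect the call with @code_warntype and look for the Union{Missing, Float64} and Any entries in the output:

julia> Q = (α = 0.048, μ = 0.0004, ν = 4.203);

julia> @code_warntype vol_llh_explicit(SimpleVol, Q, returns2)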

In any case, if instead of collecting the state into a named tuple you collect it in a custom struct with a concrete, predefined field type, the issue disappears and both versions become faster:

Old code:

# single return value - without returning `state`:
julia> @b vol_llh_explicit($SimpleVol, $((α = 0.048, μ = 0.0004, ν = 4.203)), $returns2)
8.346 μs

# two return values
julia> @b vol_llh_explicit($SimpleVol, $((α = 0.048, μ = 0.0004, ν = 4.203)), $returns2)
615.610 μs (35719 allocs: 558.109 KiB)

# single return value (return state from function)
julia> @b vol_llh_explicit($SimpleVol, $((α = 0.048, μ = 0.0004, ν = 4.203)), $returns2)
486.399 μs (28576 allocs: 446.500 KiB)

New code:

# 2 return values
julia> @b vol_llh_explicit($SimpleVol, $((α = 0.048, μ = 0.0004, ν = 4.203)), $returns2)
8.346 μs

# single return value
julia> @b vol_llh_explicit($SimpleVol, $((α = 0.048, μ = 0.0004, ν = 4.203)), $returns2)
8.343 μs

Code:

struct SimpleVol end

struct State
    v::Float64
end

function predict_explicit(Q, prev, r)
  d = r - Q.μ
  y = sqrt(d*d + 1e-6)
  v = Q.α*y + (1-Q.α)*prev.v
  #(; v), 0.0
  #(; v) # CHANGE1
  #State(v)
  State(v), 0.0
end

function vol_llh_explicit(T, Q, rs)
  n = length(rs)
  l = findfirst(!ismissing, rs)
  state = State(abs(rs[l]))
  #state = (; v = abs(rs[l]))
  for t in max(l, 2):n
    rs[t] === missing && continue
    state, _ = predict_explicit(Q, state, rs[t])
    #state = predict_explicit(Q, state, rs[t]) # CHANGE1
  end
  return state # addition: makes the results more meaningful, probably
end

Edit: the other thing is that if you return state from the function, the results change, because otherwise the compiler can eliminate the loop. Even with the one-value-return function, simply returning state from the outer function introduces lots of allocations. I edited the benchmarks above.


In line with what @lmiq said, a simple type annotation inside predict can also fix the problem.

julia> function predict_explicit_float(Q, prev, r)
         d = r - Q.μ
         y = sqrt(d*d + 1e-6)
         v::Float64 = Q.α*y + (1-Q.α)*prev
         v, 0.0
       end;

julia> function vol_llh_explicit_float(T, Q, rs)
         n = length(rs)
         l = findfirst(!ismissing, rs)
         state = abs(rs[l])
         for t in max(l, 2):n
           rs[t] === missing && continue
           state, _ = predict_explicit_float(Q, state, rs[t])
         end
         return state
       end;

skipmissing might be the best solution, actually.


julia> function predict_explicit(Q, prev, r)
         d = r - Q.μ
         y = sqrt(d*d + 1e-6)
         v = Q.α*y + (1-Q.α)*prev
         v, 0.0
       end;

julia> function vol_llh_explicit_skipmissing(T, Q, rs)
         n = length(rs)
         rs_sm = skipmissing(rs)
         state = abs(first(rs_sm))
         for r in rs_sm
           state, _ = predict_explicit(Q, state, r)
         end
         return state
       end;

Why that one and not skipmissing from Base?

(I think you want the Base one; the one from Missings.jl is not exported.)

Oh sorry, I forgot that skipmissing was in Base. skipmissings (note the s) is an interesting experiment, but it is not exported because it doesn't actually solve all the problems it's meant to solve.
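For reference, a minimal example of the Base skipmissing being discussed (it is a lazy iterator over the non-missing elements):

julia> collect(skipmissing([1.0, missing, 2.0]))
2-element Vector{Float64}:
 1.0
 2.0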


Thanks for the help! Indeed, as you suggested:

The biggest problem seems to be type instability, and adding the type annotation r::Float64 = rs[t] in vol_llh_explicit fixes most of it.

An additional annotation inside predict_explicit, v::Float64 = Q.α*y + (1-Q.α)*prev.v, also speeds it up a little bit.

After these two changes it's fast. The missing and multi-dispatch cases are also resolved and are equally fast.

Fast version: 15 μs, 0 allocations.

using Distributions, BenchmarkTools

struct SimpleVol end

@inline predict(::Type{SimpleVol}, Q, prev, r) = begin
  d = r - Q.μ
  y = sqrt(d*d + 1e-6)
  v::Float64 = Q.α*y + (1-Q.α)*prev.v
  (; v), 0.0
end

vol_llh(T::Type, Q, rs) = begin
  n = length(rs)
  l = findfirst(!ismissing, rs)
  r::Float64 = rs[l]
  state = (; v=abs(r))
  for t in max(l, 2):n
    rs[t] === missing && continue
    r = rs[t]
    state, _ = predict(T, Q, state, r)
  end
  state
end;

returns = rand(Normal(0, 0.015), 10_000);
isweekend(i) = (mod(i,7) == 6) || (mod(i,7) == 0);
returns2 = [isweekend(i) ? missing : returns[i] for i in 1:length(returns)];

# Warmup
vol_llh(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)

# Bench
@benchmark vol_llh(SimpleVol, (α = 0.048, μ = 0.0004, ν = 4.203), returns2)

Just a note that you almost never want to annotate on the left-hand side. It has all sorts of weird semantics. Do this instead:

    v = (Q.α*y + (1-Q.α)*prev)::Float64

Thanks, I see: it's conversion vs. assertion, and assertion is preferable. I guess r = rs[t]::Float64 is also preferable to r::Float64 = rs[t].
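For completeness, this is what the loop looks like with assertions instead of left-hand-side annotations (a sketch of the same vol_llh as above, not re-benchmarked):

vol_llh(T::Type, Q, rs) = begin
  n = length(rs)
  l = findfirst(!ismissing, rs)
  state = (; v = abs(rs[l]::Float64))  # assert, don't convert
  for t in max(l, 2):n
    rs[t] === missing && continue
    r = rs[t]::Float64                 # assertion narrows Union{Missing, Float64}
    state, _ = predict(T, Q, state, r)
  end
  state
end;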

I updated the real code and now it’s much faster, thanks!

AD makes it more complicated, since the code has to handle both Dual and Float64.
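One possible way to handle that (a hypothetical sketch, assuming ForwardDiff-style Dual parameters, not tested against the real model) is to derive the element type from the parameters instead of hard-coding Float64:

@inline predict(::Type{SimpleVol}, Q, prev, r) = begin
  T = typeof(Q.α)                      # Float64 for plain parameters, Dual under AD
  d = r - Q.μ
  y = sqrt(d*d + 1e-6)
  v = (Q.α*y + (1-Q.α)*prev.v)::T      # assert against the parameter type
  (; v), zero(T)
end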

P.S.

Another inconvenience is that you can't redefine structs without restarting Julia. NamedTuples work, but they don't look very nice: instead of

@inline predict(
  Q::Params{T}, prev::State{T}, r::Float64, r2::Float64
) where {T<:Real} = begin
  ...
  return ((; v, μ, s))::State{T}, llh::T
end

You have to use

@inline predict(
  Q::@NamedTuple{
    a::T, a_up_neg::T, b::T, p::T, m::T, le_mean::T, wr::T, sr::T,
    up_sharpness::T, soft_abs_eps::T
  },
  prev::@NamedTuple{v::T, μ::T, s::T},
  r::Float64, r2::Float64
) where {T<:Real} = begin
  ...

  return ((; v, μ, s))::@NamedTuple{v::T, μ::T, s::T}, llh::T
end

The skipmissing solution avoids any need for a hard-coded type.

Another strategy here is to manually union-split the tuple. In short, Julia's performance with Tuple{Union{Missing, Float64}, Float64} is significantly worse than with Union{Tuple{Missing, Float64}, Tuple{Float64, Float64}}.

Even if you know that the first element should never be missing, you can typically connive Julia into transforming the tuple-with-a-union into a union-of-tuples with:

if x === missing
    return (missing, y)
else
    return (x, y)
end

Yes, but not just a one-time conversion. If you put a left-hand side type annotation on a variable once, all assignments to that variable in the entire scope will perform conversion, not just the one you annotated. That’s what I meant by “all sorts of weird semantics”.
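A tiny example of that scoping behavior, just to illustrate that the conversion applies to every assignment of the variable in the function:

julia> function f()
         x::Float64 = 1   # declares x as Float64 for the whole function body
         x = 2            # this later assignment is also converted
         return x
       end;

julia> f()
2.0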

Final version: fast, and it allows a variable number of arguments as input::NTuple{N, Float64}, which makes it possible to use different algorithms with different input parameters.

@inline predict(
  ::Type{Vol1d},
  Q::@NamedTuple{
    a::T, a_up_neg::T, b::T, p::T, m::T, le_mean::T, wr::T, sr::T,
    up_sharpness::T, soft_abs_eps::T
  },
  prev::@NamedTuple{v::T, μ::T, s::T},
  input::NTuple{N, Float64},
  r2::Float64
) where {T<:Real, N} = begin
  ...
end