Fast fit of generalized Pareto distribution

I’ve implemented a (slightly adapted) Julia version of the generalized Pareto estimation method described in Zhang & Stephens (2009), DOI 10.1198/TECH.2009.08017.

Am I missing any obvious performance improvements? Is it correct that this is not a good use case for StaticArrays, as the size of the small arrays is not known at compile time?

using Statistics

function fit_gpd(x::Vector{T}) where T <: Real
    x = sort(x)
    # x_star is (roughly) the first quartile of the sorted sample
    x_star = x[floor(Int, length(x) / 4.0 + 0.5)]

    n = length(x)
    m::Int = 30 + floor(sqrt(n))  # number of candidate θ values

    # Grid of m candidate values θ_j
    θ = [1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3 * x_star) for j in 1:m]

    # Profile log-likelihood (per observation) at each θ_j
    lx(a, x) = let
        a = -a
        k = [mean(log1p.(b .* x)) for b in a]
        @. log(a / k) - k - 1
    end

    l_θ = n * lx(θ, x)
    # Normalized weights for averaging the θ_j
    w_θ = 1.0 ./ [sum(exp.(l_θ .- l_θ[j])) for j in 1:m]
    θ_hat = sum(θ .* w_θ)

    ξ = mean(log1p.(-θ_hat .* x))
    σ = -ξ / θ_hat
    # Weighted average of ξ and 0.5 with prior weight 10 (the "slight adaptation")
    ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)

    return σ, ξ
end

s = rand(10000)
fit_gpd(s)

Thank you for your help


Here is a cleaner version IMO that allocates far less, though the speed gains are modest. There are four main allocations happening in this version:

  1. x = sort(x)
  2. θ = ...
  3. l_θ = similar(θ)
  4. w_θ = similar(l_θ)

If you pre-allocate those outside the function and pass them in, you can make even more gains; a sketch of that idea follows the benchmarks below. You can also profile your function to see where it is spending most of the time and try to improve those areas.
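For getting started with profiling, the Profile standard library is enough, e.g.:

using Profile

Profile.clear()
@profile fit_gpd(s)            # profile a call (assumes s from the original post)
Profile.print(format = :flat)  # show which lines the time is spent on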

using LinearAlgebra, Statistics

function fit_gpd2(x::Vector{T}) where T <: Real
    x = sort(x)
    x_star = x[round(Int, length(x) / 4)]

    n = length(x)
    m = 30 + floor(Int, sqrt(n))

    θ = Float64[1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3 * x_star) for j in 1:m]
    l_θ = similar(θ)
    for i in 1:length(l_θ)
        b = θ[i]
        k = mean(xi -> log1p(-b * xi), x)
        l_θ[i] = n * (log(-b / k) - k - 1)
    end

    w_θ = similar(l_θ)
    for j in 1:length(w_θ)
        w_θ[j] = 1 / sum(l -> exp(l - l_θ[j]), l_θ)
    end

    θ_hat = dot(θ, w_θ)
    ξ = mean(xi -> log1p(-θ_hat * xi), x)
    σ = -ξ / θ_hat
    ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)

    return σ, ξ
end

using BenchmarkTools

s = rand(100000)
@btime fit_gpd($s)
#  343.120 ms (2821 allocations: 266.55 MiB)
#(0.9928484317360206, -0.991738176403473)

@btime fit_gpd2($s)
#  324.617 ms (351 allocations: 800.77 KiB)
#(0.9928484317360206, -0.991738176403473)
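To illustrate the pre-allocation point, here is a minimal sketch (the name and in-place signature are my own, hypothetical, not from the original post): the caller owns the sorted working copy and the three work buffers, so repeated fits allocate nothing.

using LinearAlgebra, Statistics

# Hypothetical buffer-passing variant of fit_gpd2: xs is a caller-owned
# working copy (sorted in place), and θ, l_θ, w_θ are caller-owned
# buffers of length 30 + floor(Int, sqrt(length(xs))).
function fit_gpd2_prealloc!(xs::Vector{Float64}, θ::Vector{Float64},
                            l_θ::Vector{Float64}, w_θ::Vector{Float64})
    sort!(xs)
    n = length(xs)
    x_star = xs[round(Int, n / 4)]
    m = length(θ)

    for j in 1:m
        θ[j] = 1 / xs[n] + (1 - sqrt(m / (j - 0.5))) / (3 * x_star)
    end
    for i in 1:m
        b = θ[i]
        k = mean(xi -> log1p(-b * xi), xs)
        l_θ[i] = n * (log(-b / k) - k - 1)
    end
    for j in 1:m
        w_θ[j] = 1 / sum(l -> exp(l - l_θ[j]), l_θ)
    end

    θ_hat = dot(θ, w_θ)
    ξ = mean(xi -> log1p(-θ_hat * xi), xs)
    σ = -ξ / θ_hat
    ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)
    return σ, ξ
end

# Allocate once, reuse across many calls:
m  = 30 + floor(Int, sqrt(length(s)))
xs = similar(s)
θb, lb, wb = (Vector{Float64}(undef, m) for _ in 1:3)
copyto!(xs, s)
fit_gpd2_prealloc!(xs, θb, lb, wb)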

If you find yourself writing x = sort(x), you should change it to an in-place sort instead:

sort!(x)

Not necessarily, you may not want the side effect of mutating the input outside the function.

I figured that maybe he doesn’t want to mutate the arguments. In that case, I would:

fit_gpd2(x::AbstractVector{<:Real}) = fit_gpd2!(copy(x))
function fit_gpd2!(x::AbstractVector{T}) where T <: Real
    sort!(x)
    x_star = ....

Yeah, ok. But seems worth pointing out, anyway, since he’s asking about optimizations.

In general, I think you are creating a lot of vectors, and I wonder if you can rewrite much of it in scalar form. It’s a bit hard to say exactly what to change, but do all of θ, l_θ, and w_θ have to be vectors? Can’t you turn the whole thing into a scalar loop?
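Here is a rough, unbenchmarked sketch of that idea (my own variant, not from the posts above): only l_θ really has to be stored, since every weight depends on all of its entries; θ_j can be recomputed on the fly and the weighted average accumulated as a scalar.

using Statistics

# Rough sketch only: a mostly-scalar rewrite of fit_gpd2. Only l_θ is
# materialized, because each weight w_j = 1 / Σ_i exp(l_i - l_j)
# needs all m entries of l_θ.
function fit_gpd_scalar(x::AbstractVector{<:Real})
    x = sort(x)
    n = length(x)
    x_star = x[round(Int, n / 4)]
    m = 30 + floor(Int, sqrt(n))

    # θ_j computed on demand instead of stored in a vector
    θj(j) = 1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3 * x_star)

    l_θ = map(1:m) do j
        b = θj(j)
        k = mean(xi -> log1p(-b * xi), x)
        n * (log(-b / k) - k - 1)
    end

    # θ_hat = Σ_j θ_j * w_j, accumulated without storing w_θ
    θ_hat = sum(j -> θj(j) / sum(l -> exp(l - l_θ[j]), l_θ), 1:m)

    ξ = mean(xi -> log1p(-θ_hat * xi), x)
    σ = -ξ / θ_hat
    ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)

    return σ, ξ
end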

Also, though this is a small optimization, you are better off using integers rather than floats everywhere. For example, replace

floor(Int, length(x) / 4.0 + 0.5)

with

div(length(x) + 2, 4)

and

ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)

with

ξ = (ξ * n + 5) / (n + 10)

etc.

Pretty much all your literals should be integers instead of floats: it’s (marginally) faster, preserves types better, and looks cleaner.
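Both rewrites are easy to sanity-check:

julia> all(floor(Int, n / 4.0 + 0.5) == div(n + 2, 4) for n in 1:10_000)
true

julia> let ξ = 0.3, n = 100
           ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0) ≈ (ξ * n + 5) / (n + 10)
       end
true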


If you really want fewer allocations and a bit more performance, you can manually “unroll” the broadcast operations into explicit loops. This can be faster because @inbounds skips the bounds checks (you assert that no index ever goes outside the vectors) and @simd lets the compiler vectorize the loops:

function fit_gpd3(x::Vector{T}) where T <: Real
    n      = length(x)
    x      = sort(x)
    x_star = x[div(length(x) + 2, 4)]
    m      = 30 + floor(Int, sqrt(n))

    θ   = Vector{Float64}(undef, m)
    l_θ = Vector{Float64}(undef, m)
    w_θ = Vector{Float64}(undef, m)

    @inbounds for j in 1:m
        θ[j] = 1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3x_star)
    end

    @inbounds for i in 1:m
        b = θ[i]
        k = 0.0

        @simd for j = 1:n
            k += log1p(-b * x[j])
        end

        k /= n
        l_θ[i] = n * (log(-b / k) - k - 1)
    end

    @inbounds for j in 1:m
        aux = 0.0

        @simd for i = 1:m
            aux += exp(l_θ[i] - l_θ[j])
        end

        w_θ[j] = 1 / aux
    end

    θ_hat = dot(θ, w_θ)

    ξ = 0.0

    @inbounds @simd for i = 1:n
        ξ += log1p(-θ_hat * x[i])
    end

    ξ /= n
    σ  = -ξ / θ_hat
    ξ  = (ξ * n + 5 ) / (n + 10)

    return σ, ξ
end

julia> @btime fit_gpd2($s)
  305.727 ms (351 allocations: 800.77 KiB)
(0.9913807459446422, -0.9902799795309473)

julia> @btime fit_gpd3($s)
  296.859 ms (5 allocations: 789.95 KiB)
(0.9913807459446389, -0.9902799795309439)

However, given the poor readability, I am not sure if it’s worth it.

Besides that, this uses a naive sequential summation, so you can run into accuracy problems in some cases compared to the sum and mean approach, which uses pairwise summation.
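For example (illustrative only; Float32 makes the effect easy to see):

# Naive left-to-right accumulation vs. Julia's pairwise sum
naive_sum(v) = (s = zero(eltype(v)); for x in v; s += x; end; s)

xs = fill(0.1f0, 10^7)
naive_sum(xs), sum(xs)   # the naive Float32 total drifts noticeably from 1.0f6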

Ah! I realized a better method that avoids the summation problem. Note that all you need is two auxiliary vectors, one of size n and one of size m, to store the results of the operations (log1p and exp) that happen inside the loops. This way you can gain performance and reduce the allocations while still using the sum and mean functions, which take care of the summation algorithm:

function fit_gpd4(x::Vector{T}) where T <: Real
    n      = length(x)
    x      = sort(x)
    x_star = x[div(length(x) + 2, 4)]
    m      = 30 + floor(Int, sqrt(n))

    θ    = Vector{Float64}(undef, m)
    l_θ  = Vector{Float64}(undef, m)
    w_θ  = Vector{Float64}(undef, m)
    aux1 = Vector{Float64}(undef, n)
    aux2 = Vector{Float64}(undef, m)

    @inbounds for j in 1:m
        θ[j] = 1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3x_star)
    end

    @inbounds for i in 1:m
        b = θ[i]
        @. aux1 = log1p(-b * x)
        k = mean(aux1)
        l_θ[i] = n * (log(-b / k) - k - 1)
    end

    @inbounds for j in 1:m
        @. aux2 = exp(l_θ - l_θ[j])
        w_θ[j] = 1 / sum(aux2)
    end

    θ_hat = dot(θ, w_θ)
    @. aux1 = log1p(-θ_hat * x)

    ξ = mean(aux1)
    σ = -ξ / θ_hat
    ξ = (ξ * n + 5 ) / (n + 10)

    return σ, ξ
end

julia> @btime fit_gpd4($s)
  296.737 ms (8 allocations: 1.54 MiB)
(0.9913807459446421, -0.9902799795309472)

Thank you all for your replies, this is really useful.
What I take away is:

  1. Use anonymous functions with sum and mean instead of allocating array comprehensions (or manual loops instead of broadcasting, for a slight extra speed-up).
  2. It’s okay to use integer literals (I had misunderstood the Performance Tips section there).