Fast fit of generalized Pareto distribution

I’ve implemented a (slightly adapted) Julia version of the generalized Pareto estimation method described in Zhang & Stephens (2009), DOI 10.1198/TECH.2009.08017.

Am I missing any obvious performance improvements? Is it correct that this is not a good use case for StaticArrays, as the size of the small arrays is not known at compile time?

using Statistics

function fit_gpd(x::Vector{T}) where T <: Real
    x = sort(x)
    # x_star is (roughly) the first quartile of the sorted sample
    x_star = x[floor(Int, length(x) / 4.0 + 0.5)]

    n = length(x)
    m::Int = 30 + floor(sqrt(n))  # number of candidate θ values

    # Grid of m candidate values θ_j
    θ = [1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3 * x_star) for j in 1:m]

    # Profile log-likelihood (per observation) at each θ_j
    lx(a, x) = let
        a = -a
        k = [mean(log1p.(b .* x)) for b in a]
        @. log(a / k) - k - 1
    end

    l_θ = n * lx(θ, x)
    # Normalized weights for averaging the θ_j
    w_θ = 1.0 ./ [sum(exp.(l_θ .- l_θ[j])) for j in 1:m]
    θ_hat = sum(θ .* w_θ)

    ξ = mean(log1p.(-θ_hat .* x))
    σ = -ξ / θ_hat
    # Weighted average of ξ and 0.5 with prior weight 10 (the "slight adaptation")
    ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)

    return σ, ξ
end

s = rand(10000)
fit_gpd(s)

Thank you for your help


Here is a cleaner version IMO that allocates far less, though the speed gains are modest. There are four main allocations happening in this version:

  1. x = sort(x)
  2. θ = ...
  3. l_θ = similar(θ)
  4. w_θ = similar(l_θ)

If you pre-allocate those outside the function and pass them in, you can make even more gains; a sketch of that idea follows the benchmarks below. You can also profile your function to see where it is spending most of the time and try to improve those areas.
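For getting started with profiling, the Profile standard library is enough, e.g.:

using Profile

Profile.clear()
@profile fit_gpd(s)            # profile a call (assumes s from the original post)
Profile.print(format = :flat)  # show which lines the time is spent on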

using LinearAlgebra, Statistics

function fit_gpd2(x::Vector{T}) where T <: Real
    x = sort(x)
    x_star = x[round(Int, length(x) / 4)]

    n = length(x)
    m = 30 + floor(Int, sqrt(n))

    θ = Float64[1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3 * x_star) for j in 1:m]
    l_θ = similar(θ)
    for i in 1:length(l_θ)
        b = θ[i]
        k = mean(xi -> log1p(-b * xi), x)
        l_θ[i] = n * (log(-b / k) - k - 1)
    end

    w_θ = similar(l_θ)
    for j in 1:length(w_θ)
        w_θ[j] = 1 / sum(l -> exp(l - l_θ[j]), l_θ)
    end

    θ_hat = dot(θ, w_θ)
    ξ = mean(xi -> log1p(-θ_hat * xi), x)
    σ = -ξ / θ_hat
    ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)

    return σ, ξ
end

using BenchmarkTools

s = rand(100000)
@btime fit_gpd($s)
#  343.120 ms (2821 allocations: 266.55 MiB)
#(0.9928484317360206, -0.991738176403473)

@btime fit_gpd2($s)
#  324.617 ms (351 allocations: 800.77 KiB)
#(0.9928484317360206, -0.991738176403473)
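To illustrate the pre-allocation point, here is a minimal sketch (the name and in-place signature are my own, hypothetical, not from the original post): the caller owns the sorted working copy and the three work buffers, so repeated fits allocate nothing.

using LinearAlgebra, Statistics

# Hypothetical buffer-passing variant of fit_gpd2: xs is a caller-owned
# working copy (sorted in place), and θ, l_θ, w_θ are caller-owned
# buffers of length 30 + floor(Int, sqrt(length(xs))).
function fit_gpd2_prealloc!(xs::Vector{Float64}, θ::Vector{Float64},
                            l_θ::Vector{Float64}, w_θ::Vector{Float64})
    sort!(xs)
    n = length(xs)
    x_star = xs[round(Int, n / 4)]
    m = length(θ)

    for j in 1:m
        θ[j] = 1 / xs[n] + (1 - sqrt(m / (j - 0.5))) / (3 * x_star)
    end
    for i in 1:m
        b = θ[i]
        k = mean(xi -> log1p(-b * xi), xs)
        l_θ[i] = n * (log(-b / k) - k - 1)
    end
    for j in 1:m
        w_θ[j] = 1 / sum(l -> exp(l - l_θ[j]), l_θ)
    end

    θ_hat = dot(θ, w_θ)
    ξ = mean(xi -> log1p(-θ_hat * xi), xs)
    σ = -ξ / θ_hat
    ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)
    return σ, ξ
end

# Allocate once, reuse across many calls:
m  = 30 + floor(Int, sqrt(length(s)))
xs = similar(s)
θb, lb, wb = (Vector{Float64}(undef, m) for _ in 1:3)
copyto!(xs, s)
fit_gpd2_prealloc!(xs, θb, lb, wb)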

If you find yourself writing x = sort(x), you should change it to an in-place sort instead:

sort!(x)

Not necessarily, you may not want the side effect of mutating the input outside the function.

I figured that maybe he doesn’t want to mutate the arguments. In that case, I would:

fit_gpd2(x::AbstractVector{<:Real}) = fit_gpd2!(copy(x))
function fit_gpd2!(x::AbstractVector{T}) where T <: Real
    sort!(x)
    x_star = ....

Yeah, ok. But seems worth pointing out, anyway, since he’s asking about optimizations.

In general, I think you are creating a lot of vectors, and I wonder if you can rewrite much of it in scalar form. It’s a bit hard to say exactly what to change, but do all of θ, l_θ, and w_θ have to be vectors? Can’t you turn the whole thing into a scalar loop?
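Here is a rough, unbenchmarked sketch of that idea (my own variant, not from the posts above): only l_θ really has to be stored, since every weight depends on all of its entries; θ_j can be recomputed on the fly and the weighted average accumulated as a scalar.

using Statistics

# Rough sketch only: a mostly-scalar rewrite of fit_gpd2. Only l_θ is
# materialized, because each weight w_j = 1 / Σ_i exp(l_i - l_j)
# needs all m entries of l_θ.
function fit_gpd_scalar(x::AbstractVector{<:Real})
    x = sort(x)
    n = length(x)
    x_star = x[round(Int, n / 4)]
    m = 30 + floor(Int, sqrt(n))

    # θ_j computed on demand instead of stored in a vector
    θj(j) = 1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3 * x_star)

    l_θ = map(1:m) do j
        b = θj(j)
        k = mean(xi -> log1p(-b * xi), x)
        n * (log(-b / k) - k - 1)
    end

    # θ_hat = Σ_j θ_j * w_j, accumulated without storing w_θ
    θ_hat = sum(j -> θj(j) / sum(l -> exp(l - l_θ[j]), l_θ), 1:m)

    ξ = mean(xi -> log1p(-θ_hat * xi), x)
    σ = -ξ / θ_hat
    ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)

    return σ, ξ
end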

Also, though this is a small optimization, you are better off using integers rather than floats everywhere. For example, replace

floor(Int, length(x) / 4.0 + 0.5)

with

div(length(x) + 2, 4)

and

ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)

with

ξ = (ξ * n + 5) / (n + 10)

etc.

Pretty much all your literals should be integers instead of floats: it’s (marginally) faster, preserves types better, and looks cleaner.
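Both rewrites are easy to sanity-check:

julia> all(floor(Int, n / 4.0 + 0.5) == div(n + 2, 4) for n in 1:10_000)
true

julia> let ξ = 0.3, n = 100
           ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0) ≈ (ξ * n + 5) / (n + 10)
       end
true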


If you really want fewer allocations and a bit more performance, you can manually “unroll” the broadcast operations into explicit loops. This can be faster because @inbounds skips the bounds checks (you assert that no index ever goes outside the vectors) and @simd lets the compiler vectorize the loops:

function fit_gpd3(x::Vector{T}) where T <: Real
    n      = length(x)
    x      = sort(x)
    x_star = x[div(length(x) + 2, 4)]
    m      = 30 + floor(Int, sqrt(n))

    θ   = Vector{Float64}(undef, m)
    l_θ = Vector{Float64}(undef, m)
    w_θ = Vector{Float64}(undef, m)

    @inbounds for j in 1:m
        θ[j] = 1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3x_star)
    end

    @inbounds for i in 1:m
        b = θ[i]
        k = 0.0

        @simd for j = 1:n
            k += log1p(-b * x[j])
        end

        k /= n
        l_θ[i] = n * (log(-b / k) - k - 1)
    end

    @inbounds for j in 1:m
        aux = 0.0

        @simd for i = 1:m
            aux += exp(l_θ[i] - l_θ[j])
        end

        w_θ[j] = 1 / aux
    end

    θ_hat = dot(θ, w_θ)

    ξ = 0.0

    @inbounds @simd for i = 1:n
        ξ += log1p(-θ_hat * x[i])
    end

    ξ /= n
    σ  = -ξ / θ_hat
    ξ  = (ξ * n + 5 ) / (n + 10)

    return σ, ξ
end

julia> @btime fit_gpd2($s)
  305.727 ms (351 allocations: 800.77 KiB)
(0.9913807459446422, -0.9902799795309473)

julia> @btime fit_gpd3($s)
  296.859 ms (5 allocations: 789.95 KiB)
(0.9913807459446389, -0.9902799795309439)

However, given the poor readability, I am not sure if it’s worth it.

Besides that, this uses a naive sequential summation, so you can run into accuracy problems in some cases compared to the sum and mean approach, which uses pairwise summation.
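For example (illustrative only; Float32 makes the effect easy to see):

# Naive left-to-right accumulation vs. Julia's pairwise sum
naive_sum(v) = (s = zero(eltype(v)); for x in v; s += x; end; s)

xs = fill(0.1f0, 10^7)
naive_sum(xs), sum(xs)   # the naive Float32 total drifts noticeably from 1.0f6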

Ah! I realized a better method that avoids the summation problem. Note that all you need is two auxiliary vectors, one of size n and one of size m, to store the results of the operations (log1p and exp) that happen inside the loops. This way you can gain performance and reduce the allocations while still using the sum and mean functions, which take care of the summation algorithm:

function fit_gpd4(x::Vector{T}) where T <: Real
    n      = length(x)
    x      = sort(x)
    x_star = x[div(length(x) + 2, 4)]
    m      = 30 + floor(Int, sqrt(n))

    θ    = Vector{Float64}(undef, m)
    l_θ  = Vector{Float64}(undef, m)
    w_θ  = Vector{Float64}(undef, m)
    aux1 = Vector{Float64}(undef, n)
    aux2 = Vector{Float64}(undef, m)

    @inbounds for j in 1:m
        θ[j] = 1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3x_star)
    end

    @inbounds for i in 1:m
        b = θ[i]
        @. aux1 = log1p(-b * x)
        k = mean(aux1)
        l_θ[i] = n * (log(-b / k) - k - 1)
    end

    @inbounds for j in 1:m
        @. aux2 = exp(l_θ - l_θ[j])
        w_θ[j] = 1 / sum(aux2)
    end

    θ_hat = dot(θ, w_θ)
    @. aux1 = log1p(-θ_hat * x)

    ξ = mean(aux1)
    σ = -ξ / θ_hat
    ξ = (ξ * n + 5 ) / (n + 10)

    return σ, ξ
end

julia> @btime fit_gpd4($s)
  296.737 ms (8 allocations: 1.54 MiB)
(0.9913807459446421, -0.9902799795309472)

Thank you all for your replies, this is really useful.
What I take away is:

  1. Use anonymous functions with sum and mean instead of allocating array comprehensions (or manual loops instead of broadcasting, for a slight extra speed-up).
  2. It’s okay to use integer literals (I had misunderstood the Performance Tips section there).