I’ve implemented a (slightly adapted) julia version of the Generalized Pareto estimation method described in this paper: DOI 10.1198/TECH.2009.08017

Am I missing any obvious performance improvements? Is it correct that this is not a good use case for StaticArrays, as the size of the small arrays is not known at compile time?

using Statistics
function fit_gpd(x::Vector{T}) where T <: Real
x = sort(x)
x_star = x[floor(Int, length(x) / 4.0 + 0.5)]
n = length(x)
m::Int = 30 + floor(sqrt(n))
θ = [1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3 * x_star) for j in 1:m]
lx(a, x) = let
a = -a
k = [mean(log1p.(b .* x)) for b in a]
@. log(a / k) - k - 1
end
l_θ = n * lx(θ, x)
w_θ = 1.0 ./ [sum(exp.(l_θ .- l_θ[j])) for j in 1:m]
θ_hat = sum(θ .* w_θ)
ξ = mean(log1p.(-θ_hat .* x))
σ = -ξ ./ θ_hat
ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)
return σ, ξ
end
s = rand(10000)
fit_gpd(s)

Here is a cleaner version IMO, that allocates far less. However, the speed gains are not much. There are 4 main allocations happening in this version:

x = sort(x)

θ = ...

l_θ = similar(θ)

w_θ = similar(l_θ)

If you pre-allocate those outside the function and pass them to the function, you can make even more gains. You can also profile your function and see where it is spending most of the time and try to improve those areas.

using LinearAlgebra, Statistics
function fit_gpd2(x::Vector{T}) where T <: Real
x = sort(x)
x_star = x[round(Int, length(x) / 4)]
n = length(x)
m = 30 + floor(sqrt(n))
θ = Float64[1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3 * x_star) for j in 1:m]
l_θ = similar(θ)
for i in 1:length(l_θ)
b = θ[i]
k = mean(x -> log1p(-b * x), x)
l_θ[i] = n * (log(-b / k) - k - 1)
end
w_θ = similar(l_θ)
for j in 1:length(w_θ)
w_θ[j] = 1 / sum(x -> exp(x - l_θ[j]), l_θ)
end
θ_hat = dot(θ, w_θ)
ξ = mean(x -> log1p(-θ_hat * x), x)
σ = -ξ / θ_hat
ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)
return σ, ξ
end
using BenchmarkTools
s = rand(100000)
@btime fit_gpd($s)
# 343.120 ms (2821 allocations: 266.55 MiB)
#(0.9928484317360206, -0.991738176403473)
@btime fit_gpd2($s)
# 324.617 ms (351 allocations: 800.77 KiB)
#(0.9928484317360206, -0.991738176403473)

In general, I think you are creating a lot of vectors, and I wonder if you can re-write much of it into scalar form. It’s a bit hard to say exactly what to change, but do all of θ, l_θ, w_θ, θ_hat have to be vectors? Can’t you turn the whole thing into a scalar loop?

Also, though this is a small optimization, you are better off using integers more, rather than floats everywhere. For example, replace

floor(Int, length(x) / 4.0 + 0.5)

with

div(length(x) + 2, 4)

and

ξ = ξ * n / (n + 10.0) + 10.0 * 0.5 / (n + 10.0)

with

ξ = (ξ * n + 5) / (n + 10)

etc.

Pretty much all your literals should be integers instead of floats, it’s (marginally) faster, preserves types better, and looks cleaner.

If you really want less allocation a little bit more performance, you should manually “unroll” the broadcast operations inside the loops. This is faster because you know you won’t access memory outside the size of the vectors:

function fit_gpd3(x::Vector{T}) where T <: Real
n = length(x)
x = sort(x)
x_star = x[div(length(x) + 2, 4)]
m = 30 + floor(Int, sqrt(n))
θ = Vector{Float64}(undef, m)
l_θ = Vector{Float64}(undef, m)
w_θ = Vector{Float64}(undef, m)
@inbounds for j in 1:m
θ[j] = 1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3x_star)
end
@inbounds for i in 1:m
b = θ[i]
k = 0.0
@simd for j = 1:n
k += log1p(-b * x[j])
end
k /= n
l_θ[i] = n * (log(-b / k) - k - 1)
end
@inbounds for j in 1:m
aux = 0.0
@simd for i = 1:m
aux += exp(l_θ[i] - l_θ[j])
end
w_θ[j] = 1 / aux
end
θ_hat = dot(θ, w_θ)
ξ = 0.0
@inbounds @simd for i = 1:n
ξ += log1p(-θ_hat * x[i])
end
ξ /= n
σ = -ξ / θ_hat
ξ = (ξ * n + 5 ) / (n + 10)
return σ, ξ
end

Ah! I realize a better method that will not have the sum problem. Look that what you need is two auxiliary vectors, one of size n and other of size m to store the operations (log1p and exp) that happen inside the loops. In this case, you can gain performance and reduce the allocations while still using the sum and mean functions that take care about the sum algorithm:

function fit_gpd4(x::Vector{T}) where T <: Real
n = length(x)
x = sort(x)
x_star = x[div(length(x) + 2, 4)]
m = 30 + floor(Int, sqrt(n))
θ = Vector{Float64}(undef, m)
l_θ = Vector{Float64}(undef, m)
w_θ = Vector{Float64}(undef, m)
aux1 = Vector{Float64}(undef, n)
aux2 = Vector{Float64}(undef, m)
@inbounds for j in 1:m
θ[j] = 1 / x[n] + (1 - sqrt(m / (j - 0.5))) / (3x_star)
end
@inbounds for i in 1:m
b = θ[i]
@. aux1 = log1p(-b * x)
k = mean(aux1)
l_θ[i] = n * (log(-b / k) - k - 1)
end
@inbounds for j in 1:m
@. aux2 = exp.(l_θ - l_θ[j])
w_θ[j] = 1 / sum(aux2)
end
θ_hat = dot(θ, w_θ)
@. aux1 = log1p(-θ_hat * x)
ξ = mean(aux1)
σ = -ξ / θ_hat
ξ = (ξ * n + 5 ) / (n + 10)
return σ, ξ
end