Optimizing Code to Fit a Generalized Pareto Distribution to Data

I’m trying to optimize the following very performance-critical piece of code:

using Memoization, Statistics, LinearAlgebra, Tullio, LoopVectorization

"""
    gpdfit(sample::AbstractVector, wip::Bool = true, min_grid_pts::Int = 30, sort_sample::Bool = true)

Given a sample, estimate the parameters \$ξ\$ and \$σ\$ of the
generalized Pareto distribution (GPD), assuming the location parameter
is 0. The fit uses a weak prior for \$ξ\$, which will stabilize estimates 
for very small sample sizes (and low effective sample sizes in the case of 
MCMC samples). The weakly informative prior is a Gaussian centered at 0.5. 


# Arguments
- `sample::AbstractVector`: A numeric vector. The sample from which to estimate the parameters.
- `wip::Bool = true`: Logical indicating whether to adjust \$ξ\$ based on a weakly
  informative Gaussian prior centered on 0.5. Defaults to true.
- `min_grid_pts::Int = 30`: The minimum number of grid points used in the fitting
  algorithm. The actual number used is `min_grid_pts + floor(sqrt(length(sample)))`.
- `sort_sample::Bool = true`: If `true`, the first step in the fitting
  algorithm is to sort the elements of `sample`. If `sample` is already
  sorted in ascending order then `sort_sample` can be set to `false` to
  skip the initial sorting step.


# Returns
- `ξ, σ`: The estimated parameters of the generalized Pareto distribution.

# Note
The parameter \$ξ\$ is the negative of \$k\$ in [zhangNewEfficientEstimation2009](@cite).
A slightly different quantile interpolation is used than in the paper.
"""
function gpdfit(sample::AbstractVector, wip::Bool = true, min_grid_pts::Int = 30, sort_sample::Bool = true)
  
  n = length(sample)

  # sample must be sorted, but we can skip if sample is already sorted
  if sort_sample
      sample = sort(sample)
  end

  
  prior = 3.0
  m = min_grid_pts + isqrt(n) # isqrt = floor sqrt
  n_0 = 10.0  # determines how strongly to nudge ξ towards .5
  quartile = sample[(n+2) ÷ 4] 


  # build pointwise estimates of k and θ by using each element of the sample.
  @turbo θHats = @. 1 / sample[n] + (1 - sqrt(m / (Base.OneTo(m) - .5))) / prior / quartile
  @tullio grad=false ξHats[x] := log1p(- θHats[x] * sample[y])  # y is absent on the left, so Tullio sums over it
  @turbo logLikelihood = @. n * log(-n * θHats / ξHats) - ξHats - n  # Calculate log-likelihood at each estimate
  @tullio grad=false weights[y] := exp(logLikelihood[x] - logLikelihood[y]) |> inv  # weights[y] = exp(ℓ[y]) / Σ exp(ℓ[x]), a softmax over the log-likelihoods

  θHat = weights ⋅ θHats  # Take the dot product of weights and pointwise estimates of θ to get the full estimate

  ξ = mean(log1p.(- θHat .* sample))
  σ = -ξ / θHat
  # Drag towards .5 to reduce variance for small n
  if wip
    ξ = (ξ * n + .5 * n_0) / (n + n_0) 
  end

  return ξ, σ

end

A representative example of the kind of input this might be used on is randexp(1000). Are there any further improvements that could be made?
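For reference, a typical call would look like this (randexp comes from the Random standard library):

using Random
sample = randexp(1000)  # heavy-tailed positive data, the size I care about
ξ, σ = gpdfit(sample)   # estimated shape and scale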

You can @turbo these special functions.
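For example, something like this should vectorize the log1p calls (a minimal sketch):

using LoopVectorization
x = rand(100); y = similar(x);
@turbo @. y = log1p(x)  # SIMD-vectorized special function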

Do you need to collect this? You can also use Base.OneTo(m)
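For comparison, the allocation difference is easy to check (using BenchmarkTools):

using BenchmarkTools
@btime collect(1:1000);   # materializes a 1000-element Vector{Int}
@btime Base.OneTo(1000);  # lazy range, zero allocations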

Is annotating them with @turbo enough or is something else necessary?

I’ve fixed this now; didn’t know about this function, thanks! At the same time I think this isn’t exactly going to be the biggest performance bottleneck – the vast majority of the time is spent in the @tullio expressions.

Totally lol, it just stands out when you skim the code… maybe there are more allocations like that which you don’t need, but I don’t know Tullio syntax.

Also, just getting the memory footprint down may help everything, depending on your array sizes. You want to fit as much as possible in L1 cache, then L2, then L3, as L1 is much faster than L3.

How about running these on a GPU with CUDA?

This:

ξ = mean(log1p.(- θHat .* sample))

could be

function calc_ξ(sample, θHat)
  ξ = zero(promote_type(typeof(θHat), eltype(sample)))
  @turbo for i in eachindex(sample)
      ξ += log1p(- θHat * sample[i])
  end
  ξ / length(sample)
end

Which gets rid of the temporaries.
E.g., try this

using LoopVectorization, Tullio, LinearAlgebra
function calc_ξ(sample, θHat)
  ξ = zero(promote_type(typeof(θHat), eltype(sample)))
  @turbo for i in eachindex(sample)
      ξ += log1p(- θHat * sample[i])
  end
  ξ / length(sample)
end
function gpdfit(sample::AbstractVector, wip::Bool = true, min_grid_pts::Int = 30, sort_sample::Bool = true)
  
  n = length(sample)

  # sample must be sorted, but we can skip if sample is already sorted
  if sort_sample
      sample = sort(sample)
  end

  
  prior = 3.0
  m = min_grid_pts + isqrt(n) # isqrt = floor sqrt
  n_0 = 10.0  # determines how strongly to nudge ξ towards .5
  quartile = sample[(n+2) ÷ 4] 


  # build pointwise estimates of k and θ by using each element of the sample.
  @turbo θHats = @. 1 / sample[n] + (1 - sqrt(m / ($(Base.OneTo(m)) - .5))) / prior / quartile
  @tullio grad=false ξHats[x] := log1p(- θHats[x] * sample[y])
  @turbo logLikelihood = @. n * log(-n * θHats / ξHats) - ξHats - n  # Calculate log-likelihood at each estimate
  @tullio grad=false weights[y] := exp(logLikelihood[x] - logLikelihood[y]) |> inv # Calculate weights from log-likelihood

  θHat = weights ⋅ θHats  # Take the dot product of weights and pointwise estimates of θ to get the full estimate

  ξ = calc_ξ(sample, θHat)

  σ = -ξ / θHat
  # Drag towards .5 to reduce variance for small n
  if wip
    ξ = (ξ * n + .5 * n_0) / (n + n_0) 
  end

  return ξ, σ

end

It’s a little (not much) faster.

That’d require a GPU with good Float64 performance, and relatively large data.

julia> sample = randexp(1000);

julia> @btime gpdfit($sample)
  150.359 μs (5 allocations: 10.19 KiB)
(-0.0026972525200332303, 0.9897913305073889)

julia> Threads.nthreads()
1

Perhaps threading would be the next step, although at the size of sample = randexp(1000) it’d probably require @tturbo or @batch (i.e. Polyester-based threading) to see good gains, instead of @tullio ... threads=....
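For example, the reduction in calc_ξ could be threaded just by swapping the macro (an untested sketch; @tturbo is the threaded counterpart of @turbo):

function calc_ξ_threaded(sample, θHat)
  ξ = zero(promote_type(typeof(θHat), eltype(sample)))
  @tturbo for i in eachindex(sample)  # threads the reduction across cores
      ξ += log1p(- θHat * sample[i])
  end
  ξ / length(sample)
end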

But this problem is likely to be limited by memory bandwidth, so a GPU’d probably help a lot.

For example:

julia> Threads.nthreads(), Sys.CPU_THREADS # 1 thread used, 18 cores total
(1, 36)

julia> a = rand(Float32, 1024, 1024, 1024);

julia> @benchmark @turbo @. $a = sin($a)
samples: 17; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2.95e8  - 2.951e8]  ██████████ 2
 (2.951e8 - 2.953e8]  ███████████████ 3
 (2.953e8 - 2.955e8]  ███████████████ 3
 (2.955e8 - 2.956e8]  ██████████ 2
 (2.956e8 - 2.958e8]  ██████████████████████████████ 6
 (2.958e8 - 2.96e8 ]  █████ 1

                  Counts

min: 294.969 ms (0.00% GC); mean: 295.510 ms (0.00% GC); median: 295.618 ms (0.00% GC); max: 295.986 ms (0.00% GC).

If using 18 cores could make this code 18x faster, it’d be competitive with all the timings reported here.
Instead, using 18 cores doesn’t even make it 3x faster, because the CPU chokes on memory bandwidth.

julia> using LoopVectorization

julia> a = rand(Float32, 1024, 1024, 1024);

julia> Threads.nthreads(), Sys.CPU_THREADS
(36, 36)

julia> @benchmark @turbo @. $a = sin($a) # uses a single thread
samples: 18; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2.913e8 - 2.915e8]  ████████████ 4
 (2.915e8 - 2.917e8]  ██████████████████████████████ 10
 (2.917e8 - 2.919e8]  ███ 1
 (2.919e8 - 2.921e8]  ██████ 2
 (2.921e8 - 2.923e8]   0
 (2.923e8 - 2.926e8]  ███ 1

                  Counts

min: 291.273 ms (0.00% GC); mean: 291.652 ms (0.00% GC); median: 291.605 ms (0.00% GC); max: 292.554 ms (0.00% GC).

julia> @benchmark @tturbo @. $a = sin($a) # uses 18 threads (up to 1 thread per core)
samples: 45; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (1.1302e8 - 1.1311e8]  ██████████ 5
 (1.1311e8 - 1.1319e8]  ██████████ 5
 (1.1319e8 - 1.1327e8]  ██████████████████████████████ 15
 (1.1327e8 - 1.1336e8]  ██████████████████████████ 13
 (1.1336e8 - 1.1344e8]  ████████ 4
 (1.1344e8 - 1.1353e8]  ████ 2
 (1.1353e8 - 1.1361e8]  ██ 1

                  Counts

min: 113.021 ms (0.00% GC); mean: 113.264 ms (0.00% GC); median: 113.261 ms (0.00% GC); max: 113.613 ms (0.00% GC).

So I wouldn’t be surprised if a GPU can help here, especially for larger problems.

This is great, thank you so much!

As for whether GPUs can help – the function is likely to be called a lot of times on multiple different data sets, each of which is an array in the range of ~100-1000 points. From the very little I know about GPUs, I think this means that GPUs would be useful at the next level up, but not within the body of the function itself. I think multithreading is the same – it would be better to run multiple threads, each of which calls this function on a different data set. Please correct me if I’m wrong – I know very little about GPUs/multithreading.

Yes, you should prefer the more granular threading.
In that case, I would also add threads=false to each instance of @tullio:

using LoopVectorization, Tullio, LinearAlgebra
function calc_ξ(sample, θHat)
  ξ = zero(promote_type(typeof(θHat), eltype(sample)))
  @turbo for i in eachindex(sample)
      ξ += log1p(- θHat * sample[i])
  end
  ξ / length(sample)
end
function gpdfit(sample::AbstractVector, wip::Bool = true, min_grid_pts::Int = 30, sort_sample::Bool = true)
  
  n = length(sample)

  # sample must be sorted, but we can skip if sample is already sorted
  if sort_sample
      sample = sort(sample)
  end

  
  prior = 3.0
  m = min_grid_pts + isqrt(n) # isqrt = floor sqrt
  n_0 = 10.0  # determines how strongly to nudge ξ towards .5
  quartile = sample[(n+2) ÷ 4] 


  # build pointwise estimates of k and θ by using each element of the sample.
  @turbo θHats = @. 1 / sample[n] + (1 - sqrt(m / ($(Base.OneTo(m)) - .5))) / prior / quartile
  @tullio grad=false ξHats[x] := log1p(- θHats[x] * sample[y])  threads=false
  @turbo logLikelihood = @. n * log(-n * θHats / ξHats) - ξHats - n  # Calculate log-likelihood at each estimate
  @tullio grad=false weights[y] := exp(logLikelihood[x] - logLikelihood[y]) |> inv threads=false # Calculate weights from log-likelihood

  θHat = weights ⋅ θHats  # Take the dot product of weights and pointwise estimates of θ to get the full estimate

  ξ = calc_ξ(sample, θHat)

  σ = -ξ / θHat
  # Drag towards .5 to reduce variance for small n
  if wip
    ξ = (ξ * n + .5 * n_0) / (n + n_0) 
  end

  return ξ, σ

end
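For the coarser level, a sketch of threading over whole data sets (gpdfit_all and datasets are illustrative names, and Float64 samples are assumed):

function gpdfit_all(datasets)
    results = Vector{Tuple{Float64,Float64}}(undef, length(datasets))
    Threads.@threads for i in eachindex(datasets)
        results[i] = gpdfit(datasets[i])  # each call stays single-threaded
    end
    results
end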

Sorry, what do you mean by “more granular”? To me that sounds like you’re saying it would be better to multithread everything as finely as possible, but then you suggested turning off multithreading for Tullio.

Just to note, you could also have used

seq = 1:m

no need to collect it to iterate over it. I don’t know if Base.OneTo has any advantage over that, does it @Raf?

Well, the help says this:

help?> Base.OneTo
  Base.OneTo(n)

  Define an AbstractUnitRange that behaves like 1:n, with the added distinction that the lower limit is guaranteed (by the type
  system) to be 1.

But I do not understand what that means exactly relative to 1:n (or in which situation that would bring some benefit).

By granular, I mean less fine. That is, thread over the data sets rather than inside the fit, and thus make sure these individual calls don’t use threads.

Probably no real benefit when used at the call site. If you’re using 1:n vs Base.OneTo(n), in both cases the 1 is known at compile time.
Technically different if you pass the 1:n across a function boundary so that the 1 is no longer known at compile time, but the difference will be so small that it’d be difficult to measure.
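Concretely, the difference is only in the type:

julia> typeof(1:5)
UnitRange{Int64}

julia> typeof(Base.OneTo(5))
Base.OneTo{Int64}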

Btw, Distributions.jl currently doesn’t fit a GPD.
It’d be great to add this: Distributions.jl/generalizedpareto.jl at master · JuliaStats/Distributions.jl · GitHub

You’re right, it’s an irrelevant distraction in this case.

The main point was removing collect and other allocations to reduce memory footprint.

Basically, make each thread do a lot of work to amortise the cost of threading.

Note, if you make the job size too large in memory footprint, threads may start competing for shared cache, and at some stage threading a smaller problem efficiently can scale better than larger batches. But you need to benchmark to know where that point is, and getting all memory use down will help to move it.

Feel free to use any of this code in Distributions.jl, just make sure to cite the original authors. I’m a bit too busy at the moment to add it myself, but copy/pasting it shouldn’t be much work.

Thanks! Do you know where I can read up on how to switch this code to use a GPU (or rather, how to run the whole function on a GPU), and roughly how much work would it be?