Optimizing Code to Fit a Generalized Pareto Distribution to Data

I’m trying to optimize the following very performance-critical piece of code:

using Memoization, Statistics, LinearAlgebra, Tullio, LoopVectorization

"""
    gpdfit(sample::AbstractVector, wip::Bool = true, min_grid_pts::Int = 30, sort_sample::Bool = true)

Given a sample, estimate the parameters \$ξ\$ and \$σ\$ of the
generalized Pareto distribution (GPD), assuming the location parameter
is 0. The fit uses a weak prior for \$ξ\$, which will stabilize estimates 
for very small sample sizes (and low effective sample sizes in the case of 
MCMC samples). The weakly informative prior is a Gaussian centered at 0.5. 


# Arguments
- `sample::AbstractVector`: A numeric vector. The sample from which to estimate the parameters.
- `wip::Bool = true`: Logical indicating whether to adjust \$ξ\$ based on a weakly
  informative Gaussian prior centered on 0.5. Defaults to true.
- `min_grid_pts::Int = 30`: The minimum number of grid points used in the fitting
  algorithm. The actual number used is `min_grid_pts + floor(sqrt(length(sample)))`.
- `sort_sample::Bool = true`: If `true`, the first step in the fitting
  algorithm is to sort the elements of `sample`. If `sample` is already
  sorted in ascending order then `sort_sample` can be set to `false` to
  skip the initial sorting step.


# Returns
- `ξ, σ`: The estimated parameters of the generalized Pareto distribution.

# Note
The parameter \$ξ\$ is the negative of \$k\$ in [zhangNewEfficientEstimation2009](@cite).
A slightly different quantile interpolation is used than in the paper.
"""
function gpdfit(sample::AbstractVector, wip::Bool = true, min_grid_pts::Int = 30, sort_sample::Bool = true)
  
  n = length(sample)

  # sample must be sorted, but we can skip if sample is already sorted
  if sort_sample
      sample = sort(sample)
  end

  
  prior = 3.0
  m = min_grid_pts + isqrt(n) # isqrt = floor sqrt
  n_0 = 10.0  # determines how strongly to nudge ξ towards .5
  quartile = sample[(n+2) ÷ 4] 


  # build pointwise estimates of k and θ by using each element of the sample.
  @turbo θHats = @. 1 / sample[n] + (1 - sqrt(m / (Base.OneTo(m) - .5))) / prior / quartile
  @tullio grad=false ξHats[x] := log1p(- θHats[x] * sample[y])  # y is absent on the left, so Tullio sums over it
  @turbo logLikelihood = @. n * log(-n * θHats / ξHats) - ξHats - n  # Calculate log-likelihood at each estimate
  @tullio grad=false weights[y] := exp(logLikelihood[x] - logLikelihood[y]) |> inv  # weights[y] = exp(ℓ[y]) / Σ exp(ℓ[x]), a softmax over the log-likelihoods

  θHat = weights ⋅ θHats  # Take the dot product of weights and pointwise estimates of θ to get the full estimate

  ξ = mean(log1p.(- θHat .* sample))
  σ = -ξ / θHat
  # Drag towards .5 to reduce variance for small n
  if wip
    ξ = (ξ * n + .5 * n_0) / (n + n_0) 
  end

  return ξ, σ

end

A representative example of the kind of input this might be used on is randexp(1000). Are there any further improvements that could be made?
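For reference, a typical call would look like this (randexp comes from the Random standard library):

using Random
sample = randexp(1000)  # heavy-tailed positive data, the size I care about
ξ, σ = gpdfit(sample)   # estimated shape and scale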

You can @turbo these special functions.
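For example, something like this should vectorize the log1p calls (a minimal sketch):

using LoopVectorization
x = rand(100); y = similar(x);
@turbo @. y = log1p(x)  # SIMD-vectorized special function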

Do you need to collect this? You can also use Base.OneTo(m)
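For comparison, the allocation difference is easy to check (using BenchmarkTools):

using BenchmarkTools
@btime collect(1:1000);   # materializes a 1000-element Vector{Int}
@btime Base.OneTo(1000);  # lazy range, zero allocations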

Is annotating them with @turbo enough or is something else necessary?

I’ve fixed this now; didn’t know about this function, thanks! At the same time I think this isn’t exactly going to be the biggest performance bottleneck – the vast majority of the time is spent in the @tullio expressions.

Totally lol, it just stands out when you skim the code… maybe there are more allocations like that which you don’t need, but I don’t know Tullio syntax.

Also, just getting the memory footprint down may help everything, depending on your array sizes. You want to fit as much as possible in L1 cache, then L2, then L3, as L1 is much faster than L3.

How about running these on a GPU with CUDA?

This:

ξ = mean(log1p.(- θHat .* sample))

could be

function calc_ξ(sample, θHat)
  ξ = zero(promote_type(typeof(θHat), eltype(sample)))
  @turbo for i in eachindex(sample)
      ξ += log1p(- θHat * sample[i])
  end
  ξ / length(sample)
end

Which gets rid of the temporaries.
E.g., try this

using LoopVectorization, Tullio, LinearAlgebra
function calc_ξ(sample, θHat)
  ξ = zero(promote_type(typeof(θHat), eltype(sample)))
  @turbo for i in eachindex(sample)
      ξ += log1p(- θHat * sample[i])
  end
  ξ / length(sample)
end
function gpdfit(sample::AbstractVector, wip::Bool = true, min_grid_pts::Int = 30, sort_sample::Bool = true)
  
  n = length(sample)

  # sample must be sorted, but we can skip if sample is already sorted
  if sort_sample
      sample = sort(sample)
  end

  
  prior = 3.0
  m = min_grid_pts + isqrt(n) # isqrt = floor sqrt
  n_0 = 10.0  # determines how strongly to nudge ξ towards .5
  quartile = sample[(n+2) ÷ 4] 


  # build pointwise estimates of k and θ by using each element of the sample.
  @turbo θHats = @. 1 / sample[n] + (1 - sqrt(m / ($(Base.OneTo(m)) - .5))) / prior / quartile
  @tullio grad=false ξHats[x] := log1p(- θHats[x] * sample[y])
  @turbo logLikelihood = @. n * log(-n * θHats / ξHats) - ξHats - n  # Calculate log-likelihood at each estimate
  @tullio grad=false weights[y] := exp(logLikelihood[x] - logLikelihood[y]) |> inv # Calculate weights from log-likelihood

  θHat = weights ⋅ θHats  # Take the dot product of weights and pointwise estimates of θ to get the full estimate

  ξ = calc_ξ(sample, θHat)

  σ = -ξ / θHat
  # Drag towards .5 to reduce variance for small n
  if wip
    ξ = (ξ * n + .5 * n_0) / (n + n_0) 
  end

  return ξ, σ

end

It’s a little (not much) faster.

That’d require a GPU with good Float64 performance, and relatively large data.

julia> sample = randexp(1000);

julia> @btime gpdfit($sample)
  150.359 μs (5 allocations: 10.19 KiB)
(-0.0026972525200332303, 0.9897913305073889)

julia> Threads.nthreads()
1

Perhaps threading would be the next step, although at the size of sample = randexp(1000) it’d probably require @tturbo or @batch (i.e. Polyester-based threading) to see good gains, instead of @tullio ... threads=....
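For example, the reduction in calc_ξ could be threaded just by swapping the macro (an untested sketch; @tturbo is the threaded counterpart of @turbo):

function calc_ξ_threaded(sample, θHat)
  ξ = zero(promote_type(typeof(θHat), eltype(sample)))
  @tturbo for i in eachindex(sample)  # threads the reduction across cores
      ξ += log1p(- θHat * sample[i])
  end
  ξ / length(sample)
end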

But this problem is likely to be limited by memory bandwidth, so a GPU’d probably help a lot.

For example:

julia> Threads.nthreads(), Sys.CPU_THREADS # 1 thread used, 18 cores total
(1, 36)

julia> a = rand(Float32, 1024, 1024, 1024);

julia> @benchmark @turbo @. $a = sin($a)
samples: 17; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2.95e8  - 2.951e8]  ██████████ 2
 (2.951e8 - 2.953e8]  ███████████████ 3
 (2.953e8 - 2.955e8]  ███████████████ 3
 (2.955e8 - 2.956e8]  ██████████ 2
 (2.956e8 - 2.958e8]  ██████████████████████████████ 6
 (2.958e8 - 2.96e8 ]  █████ 1

                  Counts

min: 294.969 ms (0.00% GC); mean: 295.510 ms (0.00% GC); median: 295.618 ms (0.00% GC); max: 295.986 ms (0.00% GC).

If using 18 cores could make this code 18x faster, it’d be competitive with all the timings reported here.
Instead, using 18 cores doesn’t even make it 3x faster, because the CPU chokes on memory bandwidth.

julia> using LoopVectorization

julia> a = rand(Float32, 1024, 1024, 1024);

julia> Threads.nthreads(), Sys.CPU_THREADS
(36, 36)

julia> @benchmark @turbo @. $a = sin($a) # uses a single thread
samples: 18; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2.913e8 - 2.915e8]  ████████████ 4
 (2.915e8 - 2.917e8]  ██████████████████████████████ 10
 (2.917e8 - 2.919e8]  ███ 1
 (2.919e8 - 2.921e8]  ██████ 2
 (2.921e8 - 2.923e8]   0
 (2.923e8 - 2.926e8]  ███ 1

                  Counts

min: 291.273 ms (0.00% GC); mean: 291.652 ms (0.00% GC); median: 291.605 ms (0.00% GC); max: 292.554 ms (0.00% GC).

julia> @benchmark @tturbo @. $a = sin($a) # uses 18 threads (up to 1 thread per core)
samples: 45; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (1.1302e8 - 1.1311e8]  ██████████ 5
 (1.1311e8 - 1.1319e8]  ██████████ 5
 (1.1319e8 - 1.1327e8]  ██████████████████████████████ 15
 (1.1327e8 - 1.1336e8]  ██████████████████████████ 13
 (1.1336e8 - 1.1344e8]  ████████ 4
 (1.1344e8 - 1.1353e8]  ████ 2
 (1.1353e8 - 1.1361e8]  ██ 1

                  Counts

min: 113.021 ms (0.00% GC); mean: 113.264 ms (0.00% GC); median: 113.261 ms (0.00% GC); max: 113.613 ms (0.00% GC).

So I wouldn’t be surprised if a GPU can help here, especially for larger problems.

This is great, thank you so much!

As for whether GPUs can help – the function is likely to be called a lot of times on multiple different data sets, each of which is an array in the range of ~100-1000 points. From the very little I know about GPUs, I think this means that GPUs would be useful at the next level up, but not within the body of the function itself. I think multithreading is the same – it would be better to run multiple threads, each of which calls this function on a different data set. Please correct me if I’m wrong – I know very little about GPUs/multithreading.

Yes, you should prefer the more granular threading.
In that case, I would also add threads=false to each instance of @tullio:

using LoopVectorization, Tullio, LinearAlgebra
function calc_ξ(sample, θHat)
  ξ = zero(promote_type(typeof(θHat), eltype(sample)))
  @turbo for i in eachindex(sample)
      ξ += log1p(- θHat * sample[i])
  end
  ξ / length(sample)
end
function gpdfit(sample::AbstractVector, wip::Bool = true, min_grid_pts::Int = 30, sort_sample::Bool = true)
  
  n = length(sample)

  # sample must be sorted, but we can skip if sample is already sorted
  if sort_sample
      sample = sort(sample)
  end

  
  prior = 3.0
  m = min_grid_pts + isqrt(n) # isqrt = floor sqrt
  n_0 = 10.0  # determines how strongly to nudge ξ towards .5
  quartile = sample[(n+2) ÷ 4] 


  # build pointwise estimates of k and θ by using each element of the sample.
  @turbo θHats = @. 1 / sample[n] + (1 - sqrt(m / ($(Base.OneTo(m)) - .5))) / prior / quartile
  @tullio grad=false ξHats[x] := log1p(- θHats[x] * sample[y])  threads=false
  @turbo logLikelihood = @. n * log(-n * θHats / ξHats) - ξHats - n  # Calculate log-likelihood at each estimate
  @tullio grad=false weights[y] := exp(logLikelihood[x] - logLikelihood[y]) |> inv threads=false # Calculate weights from log-likelihood

  θHat = weights ⋅ θHats  # Take the dot product of weights and pointwise estimates of θ to get the full estimate

  ξ = calc_ξ(sample, θHat)

  σ = -ξ / θHat
  # Drag towards .5 to reduce variance for small n
  if wip
    ξ = (ξ * n + .5 * n_0) / (n + n_0) 
  end

  return ξ, σ

end
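For the coarser level, a sketch of threading over whole data sets (gpdfit_all and datasets are illustrative names, and Float64 samples are assumed):

function gpdfit_all(datasets)
    results = Vector{Tuple{Float64,Float64}}(undef, length(datasets))
    Threads.@threads for i in eachindex(datasets)
        results[i] = gpdfit(datasets[i])  # each call stays single-threaded
    end
    results
end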

Sorry, what do you mean by “more granular”? To me that sounds like you’re saying it would be better to multithread everything as finely as possible, but then you suggested turning off multithreading for Tullio.

Just to note, you could also have used

seq = 1:m

no need to collect it to iterate over it. I don’t know if Base.OneTo has any advantage over that, does it @Raf?

Well, the help says this:

help?> Base.OneTo
  Base.OneTo(n)

  Define an AbstractUnitRange that behaves like 1:n, with the added distinction that the lower limit is guaranteed (by the type
  system) to be 1.

But I do not understand what that means exactly relative to 1:n (or in which situation that would bring some benefit).

By granular, I mean less fine. That is, thread over the data sets rather than inside the fit, and thus make sure these individual calls don’t use threads.

Probably no real benefit when used at the call site. If you’re using 1:n vs Base.OneTo(n), in both cases the 1 is known at compile time.
Technically different if you pass the 1:n across a function boundary so that the 1 is no longer known at compile time, but the difference will be so small that it’d be difficult to measure.
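Concretely, the difference is only in the type:

julia> typeof(1:5)
UnitRange{Int64}

julia> typeof(Base.OneTo(5))
Base.OneTo{Int64}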

Btw, Distributions.jl currently doesn’t fit a GPD.
It’d be great to add this: Distributions.jl/generalizedpareto.jl at master · JuliaStats/Distributions.jl · GitHub

You’re right, it’s an irrelevant distraction in this case.

The main point was removing collect and other allocations to reduce memory footprint.

Basically, make each thread do a lot of work to amortise the cost of threading.

Note, if you make the job size too large in memory footprint, threads may start competing for shared cache, and at some stage threading a smaller problem efficiently can scale better than larger batches. But you need to benchmark to know where that point is, and getting all memory use down will help to move it.

Feel free to use any of this code in Distributions.jl, just make sure to cite the original authors. I’m a bit too busy at the moment to add it myself, but copy/pasting it shouldn’t be much work.

Thanks! Do you know where I can read up on how to switch this code to use a GPU (or rather, how to run the whole function on a GPU), and roughly how much work would it be?