This:
ξ = mean(log1p.(- θHat .* sample))
could be
function calc_ξ(sample, θHat)
    ξ = zero(promote_type(typeof(θHat), eltype(sample)))
    @turbo for i in eachindex(sample)
        ξ += log1p(-θHat * sample[i])
    end
    ξ / length(sample)
end
Which gets rid of the temporaries.
E.g., try this
using LoopVectorization, Tullio, LinearAlgebra, Random, BenchmarkTools

function calc_ξ(sample, θHat)
    ξ = zero(promote_type(typeof(θHat), eltype(sample)))
    @turbo for i in eachindex(sample)
        ξ += log1p(-θHat * sample[i])
    end
    ξ / length(sample)
end
function gpdfit(sample::AbstractVector, wip::Bool = true, min_grid_pts::Int = 30, sort_sample::Bool = true)
    n = length(sample)
    # sample must be sorted, but we can skip if sample is already sorted
    if sort_sample
        sample = sort(sample)
    end
    prior = 3.0
    m = min_grid_pts + isqrt(n) # isqrt = floor sqrt
    n_0 = 10.0 # determines how strongly to nudge ξ towards .5
    quartile = sample[(n + 2) ÷ 4]
    # build pointwise estimates of ξ and θ by using each element of the sample.
    @turbo θHats = @. 1 / sample[n] + (1 - sqrt(m / ($(Base.OneTo(m)) - .5))) / prior / quartile
    @tullio grad=false ξHats[x] := log1p(-θHats[x] * sample[y])
    @turbo logLikelihood = @. n * log(-n * θHats / ξHats) - ξHats - n # Calculate log-likelihood at each estimate
    @tullio grad=false weights[y] := exp(logLikelihood[x] - logLikelihood[y]) |> inv # Calculate weights from log-likelihood
    θHat = weights ⋅ θHats # Take the dot product of weights and pointwise estimates of θ to get the full estimate
    ξ = calc_ξ(sample, θHat)
    σ = -ξ / θHat
    # Drag towards .5 to reduce variance for small n
    if wip
        ξ = (ξ * n + .5 * n_0) / (n + n_0)
    end
    return ξ, σ
end
It’s a little (not much) faster. Beating this with a GPU would require one with good Float64 performance, and relatively large data.
julia> sample = randexp(1000);
julia> @btime gpdfit($sample)
150.359 μs (5 allocations: 10.19 KiB)
(-0.0026972525200332303, 0.9897913305073889)
julia> Threads.nthreads()
1
Perhaps threading would be the next step, although at the size of sample = randexp(1000) it’d probably require @tturbo or @batch (i.e. Polyester-based threading) to see good gains, instead of @tullio ... threads=....
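For instance, the calc_ξ reduction above could be threaded just by swapping @turbo for @tturbo (a sketch; calc_ξ_threaded is my name for the variant, not from the original code):

```julia
using LoopVectorization

# Threaded variant of calc_ξ: @tturbo parallelizes the reduction
# across cores using LoopVectorization's lightweight threading,
# while still SIMD-vectorizing within each thread.
function calc_ξ_threaded(sample, θHat)
    ξ = zero(promote_type(typeof(θHat), eltype(sample)))
    @tturbo for i in eachindex(sample)
        ξ += log1p(-θHat * sample[i])
    end
    ξ / length(sample)
end
```

At n = 1000 the per-call work is small, so the low launch overhead of Polyester-style threads is what makes this worthwhile at all.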
But this problem is likely to be limited by memory bandwidth, so a GPU’d probably help a lot.
For example:
julia> Threads.nthreads(), Sys.CPU_THREADS # 1 thread used, 18 cores total
(1, 36)
julia> a = rand(Float32, 1024, 1024, 1024);
julia> @benchmark @turbo @. $a = sin($a)
samples: 17; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(2.95e8 - 2.951e8] ██████████ 2
(2.951e8 - 2.953e8] ███████████████ 3
(2.953e8 - 2.955e8] ███████████████ 3
(2.955e8 - 2.956e8] ██████████ 2
(2.956e8 - 2.958e8] ██████████████████████████████ 6
(2.958e8 - 2.96e8 ] █████ 1
Counts
min: 294.969 ms (0.00% GC); mean: 295.510 ms (0.00% GC); median: 295.618 ms (0.00% GC); max: 295.986 ms (0.00% GC).
If using 18 cores could make this code 18x faster, it’d be competitive with all the timings reported here.
Instead, using 18 cores doesn’t even make it 3x faster, because the CPU chokes on memory bandwidth.
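A back-of-envelope check (my arithmetic, not from the post) shows why: one in-place pass over the array reads 4 GiB and writes 4 GiB, so the 113 ms @tturbo time corresponds to roughly 76 GB/s of sustained memory traffic, which is in the territory of a multi-channel DDR4 system's practical bandwidth ceiling:

```julia
# Back-of-envelope memory-bandwidth estimate for the @tturbo benchmark:
# each pass reads the array once and writes it once.
nbytes = 1024^3 * sizeof(Float32)       # 4 GiB of Float32 data
traffic = 2 * nbytes                    # read + write per pass
time_s = 0.113                          # median @tturbo time in seconds
bandwidth_GBs = traffic / time_s / 1e9  # ≈ 76 GB/s sustained
```

Once the cores collectively hit that ceiling, adding more threads cannot make the loop faster.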
julia> using LoopVectorization
julia> a = rand(Float32, 1024, 1024, 1024);
julia> Threads.nthreads(), Sys.CPU_THREADS
(36, 36)
julia> @benchmark @turbo @. $a = sin($a) # uses a single thread
samples: 18; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(2.913e8 - 2.915e8] ████████████ 4
(2.915e8 - 2.917e8] ██████████████████████████████ 10
(2.917e8 - 2.919e8] ███ 1
(2.919e8 - 2.921e8] ██████ 2
(2.921e8 - 2.923e8] 0
(2.923e8 - 2.926e8] ███ 1
Counts
min: 291.273 ms (0.00% GC); mean: 291.652 ms (0.00% GC); median: 291.605 ms (0.00% GC); max: 292.554 ms (0.00% GC).
julia> @benchmark @tturbo @. $a = sin($a) # uses 18 threads (up to 1 thread per core)
samples: 45; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(1.1302e8 - 1.1311e8] ██████████ 5
(1.1311e8 - 1.1319e8] ██████████ 5
(1.1319e8 - 1.1327e8] ██████████████████████████████ 15
(1.1327e8 - 1.1336e8] ██████████████████████████ 13
(1.1336e8 - 1.1344e8] ████████ 4
(1.1344e8 - 1.1353e8] ████ 2
(1.1353e8 - 1.1361e8] ██ 1
Counts
min: 113.021 ms (0.00% GC); mean: 113.264 ms (0.00% GC); median: 113.261 ms (0.00% GC); max: 113.613 ms (0.00% GC).
So I wouldn’t be surprised if a GPU can help here, especially for larger problems.
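For example, with CUDA.jl (my sketch, not from the post; assumes a CUDA-capable device with enough memory), the same broadcast maps directly onto a single fused GPU kernel, where memory bandwidth is typically several times higher than on a CPU:

```julia
# Hypothetical GPU version of the broadcast benchmark using CUDA.jl.
using CUDA

a = CUDA.rand(Float32, 1024, 1024, 1024)  # allocate directly on the device
CUDA.@sync a .= sin.(a)                    # fused broadcast kernel; @sync waits for completion
```

The broadcast fuses into one kernel, so the GPU also does exactly one read and one write per element, but against device memory bandwidth instead of the host's.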