Help removing allocations for "enrichment score" function

I have a function that calculates an “enrichment score” for a set of values - basically, the skew of in-set vs out-of-set values. A naive implementation (it’s sort of a literal transcription of the method) looks like this:

function enrichment_score(setcors, notcors)
    srt = sortperm([setcors; notcors]; rev=true)
    ranks = invperm(srt)
    setranks = Set(ranks[1:length(setcors)])
    
    setscore =  1 / length(setcors)
    notscore = -1 / length(notcors)
    
    ys = cumsum(i ∈ setranks ? setscore : notscore for i in eachindex(ranks))
    
    lower, upper = extrema(ys)
    return abs(lower) > abs(upper) ? lower : upper
end

The idea is that we rank-order all of the values (largest first), then walk through them in that order: if a value is part of the in-set group, we add 1 / length(in-set), and if not, we subtract 1 / length(out-of-set). The score is whichever extreme the running sum reaches, the most negative or the most positive. So if all of the in-set values are at the top of the ranking (the largest values), the E.S. would be +1, and if they're all at the bottom (the smallest values), it would be -1. E.g.

julia> enrichment_score([3,4,1], 1:10)
-0.6

julia> enrichment_score([10,8,7], 1:10)
0.7

This works fine when I only need to run it a handful of times, but I’m trying out a permutation test kind of thing where I need to do it thousands of times, and the time is really starting to drag.

One option that works OK for the permutation test is to pre-sort once and just permute indexes (roughly sketched below), but I'm wondering if there's a more efficient way to handle especially the first three lines of the function, so that I'm not re-allocating these giant vectors (my inputs are hundreds of thousands of values).
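For concreteness, the pre-sort-and-permute-indexes version I have in mind is roughly the sketch below (untested; make_labels, es_from_labels, and null_scores! are just made-up names). Since the score only depends on which ranks are occupied by in-set values, you can sort once, keep a Bool label per rank, and then each permutation is just a shuffle! of that label vector with no new allocations:

using Random

function make_labels(setcors, notcors)
    srt = sortperm([setcors; notcors]; rev=true)
    return srt .<= length(setcors)    # true means this rank is occupied by an in-set value
end

function es_from_labels(labels, nset, nnot)
    setscore = 1 / nset
    notscore = -1 / nnot
    running = 0.0
    lo = hi = 0.0
    for inset in labels
        running += inset ? setscore : notscore
        lo = min(lo, running)    # track the extrema on the fly instead of
        hi = max(hi, running)    # materialising the whole cumsum vector
    end
    return abs(lo) > abs(hi) ? lo : hi
end

function null_scores!(scores, labels, nset, nnot)
    for i in eachindex(scores)
        shuffle!(labels)    # permute which ranks count as in-set
        scores[i] = es_from_labels(labels, nset, nnot)
    end
    return scores
end

# labels = make_labels(setcors, notcors)
# observed = es_from_labels(labels, length(setcors), length(notcors))
# null = null_scores!(Vector{Float64}(undef, 10_000), labels, length(setcors), length(notcors))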

You can implement quickselect to find p, the length(setcors)'th highest score, in O(N). Then you can traverse setcors and notcors and simply check whether each score is higher or lower than p.

Alternatively, concatenate setcors and notcors, sort in place, and simply read p off position length(setcors). This is O(N * log(N)), but it's simpler than implementing the above.

I.e. something like this:

function enrichment_score(
    positives::AbstractVector{T},
    negatives::AbstractVector{T},
    buffer::Vector{T}
) where {T <: Real}
    # reuse a caller-provided buffer so nothing new is allocated per call
    resize!(buffer, length(positives) + length(negatives))
    copyto!(buffer, 1, positives, firstindex(positives), length(positives))
    copyto!(buffer, length(positives) + 1, negatives, firstindex(negatives), length(negatives))
    sort!(buffer; rev=true)
    # the length(positives)'th highest value separates "top" from "bottom"
    pivot = buffer[length(positives)]
    p_weight = 0.5 / length(positives)
    n_weight = 0.5 / length(negatives)
    result = 0.0
    for i in positives
        result += (-1 + 2 * (i ≥ pivot)) * p_weight   # +p_weight if at or above the pivot, else -p_weight
    end
    for i in negatives
        result += (-1 + 2 * (i < pivot)) * n_weight   # +n_weight if strictly below the pivot, else -n_weight
    end
    result
end

You could take this further by also pre-allocating the scratch space that the sort itself uses.
I'm not 100% sure this is exactly equivalent to your code, but it's the same idea. I also haven't micro-optimised it, so treat it as a starting point.
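
As an aside, the O(N) selection route mentioned at the top doesn't need a hand-written quickselect: Base's partialsort! does an in-place selection on the buffer. A rough, unbenchmarked sketch (enrichment_score_select! is a placeholder name; everything else is the same as above, only the pivot step changes):

function enrichment_score_select!(
    positives::AbstractVector{T},
    negatives::AbstractVector{T},
    buffer::Vector{T}
) where {T <: Real}
    resize!(buffer, length(positives) + length(negatives))
    copyto!(buffer, 1, positives, firstindex(positives), length(positives))
    copyto!(buffer, length(positives) + 1, negatives, firstindex(negatives), length(negatives))
    # partialsort! only rearranges the buffer enough to put the k-th largest
    # element at position k, so this is expected O(N) rather than O(N log N)
    pivot = partialsort!(buffer, length(positives); rev=true)
    p_weight = 0.5 / length(positives)
    n_weight = 0.5 / length(negatives)
    result = 0.0
    for x in positives
        result += (x ≥ pivot ? p_weight : -p_weight)
    end
    for x in negatives
        result += (x < pivot ? n_weight : -n_weight)
    end
    return result
end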
