Help removing allocations for "enrichment score" function

I have a function that calculates an “enrichment score” for a set of values - basically, the skew of in-set vs out-of-set values. A naive implementation (it’s sort of a literal transcription of the method) looks like this:

function enrichment_score(setcors, notcors)
    srt = sortperm([setcors; notcors]; rev=true)
    ranks = invperm(srt)
    setranks = Set(ranks[1:length(setcors)])
    
    setscore =  1 / length(setcors)
    notscore = -1 / length(notcors)
    
    ys = cumsum(i ∈ setranks ? setscore : notscore for i in eachindex(ranks))
    
    lower, upper = extrema(ys)
    return abs(lower) > abs(upper) ? lower : upper
end

The idea is that we rank-order all of the values (largest first), then walk through them in that order: if a value is part of the in-set group, we add 1 / length(in-set), and if not, we subtract 1 / length(out-of-set). The score is whichever extreme the running sum reaches, the most negative or the most positive. So if all of the in-set values are at the top of the ranking (the largest values), the E.S. would be +1, and if they're all at the bottom (the smallest values), it would be -1. E.g.

julia> enrichment_score([3,4,1], 1:10)
-0.6

julia> enrichment_score([10,8,7], 1:10)
0.7

This works fine when I only need to run it a handful of times, but I’m trying out a permutation test kind of thing where I need to do it thousands of times, and the time is really starting to drag.

One option that works OK for the permutation test is to pre-sort once and just permute indexes (roughly sketched below), but I'm wondering if there's a more efficient way to handle especially the first three lines of the function, so that I'm not re-allocating these giant vectors (my inputs are hundreds of thousands of values).
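For concreteness, the pre-sort-and-permute-indexes version I have in mind is roughly the sketch below (untested; make_labels, es_from_labels, and null_scores! are just made-up names). Since the score only depends on which ranks are occupied by in-set values, you can sort once, keep a Bool label per rank, and then each permutation is just a shuffle! of that label vector with no new allocations:

using Random

function make_labels(setcors, notcors)
    srt = sortperm([setcors; notcors]; rev=true)
    return srt .<= length(setcors)    # true means this rank is occupied by an in-set value
end

function es_from_labels(labels, nset, nnot)
    setscore = 1 / nset
    notscore = -1 / nnot
    running = 0.0
    lo = hi = 0.0
    for inset in labels
        running += inset ? setscore : notscore
        lo = min(lo, running)    # track the extrema on the fly instead of
        hi = max(hi, running)    # materialising the whole cumsum vector
    end
    return abs(lo) > abs(hi) ? lo : hi
end

function null_scores!(scores, labels, nset, nnot)
    for i in eachindex(scores)
        shuffle!(labels)    # permute which ranks count as in-set
        scores[i] = es_from_labels(labels, nset, nnot)
    end
    return scores
end

# labels = make_labels(setcors, notcors)
# observed = es_from_labels(labels, length(setcors), length(notcors))
# null = null_scores!(Vector{Float64}(undef, 10_000), labels, length(setcors), length(notcors))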

You can implement quickselect to find p, the length(setcors)'th highest score, in O(N). Then you can traverse setcors and notcors and simply check whether each score is higher or lower than p.

Alternatively, concatenate setcors and notcors, sort in place, and simply read p off position length(setcors). This is O(N * log(N)), but it's simpler than implementing the above.

I.e. something like this:

function enrichment_score(
    positives::AbstractVector{T},
    negatives::AbstractVector{T},
    buffer::Vector{T}
) where {T <: Real}
    # reuse a caller-provided buffer so nothing new is allocated per call
    resize!(buffer, length(positives) + length(negatives))
    copyto!(buffer, 1, positives, firstindex(positives), length(positives))
    copyto!(buffer, length(positives) + 1, negatives, firstindex(negatives), length(negatives))
    sort!(buffer; rev=true)
    # the length(positives)'th highest value separates "top" from "bottom"
    pivot = buffer[length(positives)]
    p_weight = 0.5 / length(positives)
    n_weight = 0.5 / length(negatives)
    result = 0.0
    for i in positives
        result += (-1 + 2 * (i ≥ pivot)) * p_weight   # +p_weight if at or above the pivot, else -p_weight
    end
    for i in negatives
        result += (-1 + 2 * (i < pivot)) * n_weight   # +n_weight if strictly below the pivot, else -n_weight
    end
    result
end

You could take this further by also pre-allocating the scratch space that the sort itself uses.
I'm not 100% sure this is exactly equivalent to your code, but it's the same idea. I also haven't micro-optimised it, so treat it as a starting point.
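
As an aside, the O(N) selection route mentioned at the top doesn't need a hand-written quickselect: Base's partialsort! does an in-place selection on the buffer. A rough, unbenchmarked sketch (enrichment_score_select! is a placeholder name; everything else is the same as above, only the pivot step changes):

function enrichment_score_select!(
    positives::AbstractVector{T},
    negatives::AbstractVector{T},
    buffer::Vector{T}
) where {T <: Real}
    resize!(buffer, length(positives) + length(negatives))
    copyto!(buffer, 1, positives, firstindex(positives), length(positives))
    copyto!(buffer, length(positives) + 1, negatives, firstindex(negatives), length(negatives))
    # partialsort! only rearranges the buffer enough to put the k-th largest
    # element at position k, so this is expected O(N) rather than O(N log N)
    pivot = partialsort!(buffer, length(positives); rev=true)
    p_weight = 0.5 / length(positives)
    n_weight = 0.5 / length(negatives)
    result = 0.0
    for x in positives
        result += (x ≥ pivot ? p_weight : -p_weight)
    end
    for x in negatives
        result += (x < pivot ? n_weight : -n_weight)
    end
    return result
end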
