I have a function that calculates an “enrichment score” for a set of values - basically, the skew of in-set vs out-of-set values. A naive implementation (it’s sort of a literal transcription of the method) looks like this:
function enrichment_score(setcors, notcors)
    # rank all values together, highest first
    srt = sortperm([setcors; notcors]; rev=true)
    ranks = invperm(srt)
    # positions (in the sorted order) occupied by the in-set values
    setranks = Set(ranks[1:length(setcors)])
    setscore = 1 / length(setcors)
    notscore = -1 / length(notcors)
    # running sum: step up for in-set values, down for out-of-set values
    ys = cumsum(i ∈ setranks ? setscore : notscore for i in eachindex(ranks))
    lower, upper = extrema(ys)
    # return whichever extreme deviates furthest from zero
    return abs(lower) > abs(upper) ? lower : upper
end
The idea is that we rank-order all values and then walk through them in that order; if a value is part of the in-set group, we add 1 / length(in-set), and if not, we subtract 1 / length(out-of-set). So basically if all of the in-set values are at the high end of the values, the E.S. would be +1, and if they're all at the low end, it would be -1. E.g.
julia> enrichment_score([3,4,1], 1:10)
-0.6
julia> enrichment_score([10,8,7], 1:10)
0.7
This works fine when I only need to run it a handful of times, but I’m trying out a permutation test kind of thing where I need to do it thousands of times, and the time is really starting to drag.
One option that works OK for the permutation test is to pre-sort and just permute indexes, but I'm wondering if there's a more efficient way to accomplish especially the first three lines of the function, so that I'm not re-allocating these giant vectors (my inputs are hundreds of thousands of values).
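To make the pre-sort idea concrete, here's a rough sketch of the direction I'm thinking (names like score_from_mask and enrichment_score_presorted are just placeholders, and I haven't benchmarked any of this): since the score only depends on which sorted positions belong to the in-set group, the walk can be done over a reusable boolean mask, tracking the running sum and its extrema in one pass instead of allocating ranks and ys each time. For the permutation part I'm assuming a simple random relabeling of which positions count as in-set.

using Random

# one pass over a membership mask; mask[i] == true means the value at sorted
# position i (highest value first) belongs to the in-set group
function score_from_mask(mask, n_set, n_not)
    setscore = 1 / n_set
    notscore = -1 / n_not
    running = 0.0
    lower, upper = Inf, -Inf
    for inset in mask
        running += inset ? setscore : notscore
        lower = min(lower, running)
        upper = max(upper, running)
    end
    return abs(lower) > abs(upper) ? lower : upper
end

# observed score: build the mask once from the real rank-ordering
function enrichment_score_presorted(setcors, notcors)
    n_set, n_not = length(setcors), length(notcors)
    ranks = invperm(sortperm([setcors; notcors]; rev=true))
    mask = falses(n_set + n_not)
    for i in 1:n_set
        mask[ranks[i]] = true
    end
    return score_from_mask(mask, n_set, n_not)
end

# permutation test (assuming random relabeling): the sorted values never
# change, so each permutation just marks a random subset of positions as
# in-set and reuses the same buffers
function permutation_scores(n_set, n_not, nperm)
    n = n_set + n_not
    mask = falses(n)
    positions = collect(1:n)
    scores = Vector{Float64}(undef, nperm)
    for p in 1:nperm
        shuffle!(positions)
        fill!(mask, false)
        for i in 1:n_set
            mask[positions[i]] = true
        end
        scores[p] = score_from_mask(mask, n_set, n_not)
    end
    return scores
end

As far as I can tell, enrichment_score_presorted gives the same answer as the original function, since tracking lower/upper while summing is the same as taking extrema(cumsum(...)). Does that seem like a reasonable direction, or is there a smarter way to avoid the sortperm/invperm allocations entirely?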