Making multivariate Kolmogorov-Smirnov benchmarks

lrnv · January 24, 2022, 12:46pm

Hey,

I’m trying to produce empirical multivariate Kolmogorov-Smirnov benchs to later benchmark an estimator. I have the following code for the moment :

# The goal of this file is to compute kolmogorov-smirnov p-values of stuff. 
using ProgressMeter
function get_ks_dist(newD,baseD)
    N,d = size(newD)
    dist = 0
    for i in 1:N # index of the sample. 
        x_1 = 0
        x_2 = 0
        for k in 1:N
            x1k = true
            x2k = true
            for l in 1:d
                x1k &= baseD[k,l] < newD[i,l]
                x2k &= newD[k,l] < newD[i,l]
            end
            x_1 += x1k ? 1 : 0
            x_2 += x2k ? 1 : 0
        end
        dist = max(dist, x_1 - x_2, x_2 - x_1)
    end
    return dist/N
end
function make_ks_benchmark(M,get_sample)
    # D is a multivariate distribution. 
    base_sample = get_sample()
    dists = zeros(M)
    @showprogress for i in 1:M
        dists[i] = get_ks_dist(get_sample(),base_sample)
    end
    return sort(dists)
end


function get_ln_indep_sample()
    d = 10
    N = 10000
    reshape(exp.(randn(d*N)),(N,d))
end
bench = make_ks_benchmark(100,get_ln_indep_sample)

And it takes a lot of time. I’ve checked @code_warntype, and it seems like get_ks_dist is type-stable. I was wandering if passing the function as argument was the problem, or if it is just my computations that are crazy long. If so, what can be done ?

Thnaks for any idea

LaurentPlagne · January 24, 2022, 1:02pm

Iterating on consecutive elts seems to speed up the computation
Note that I have changed N to 2000 for convenience

function get_ks_dist(newD,baseD)
    d,N = size(newD)
    dist = 0
    @inbounds for i in 1:N # index of the sample. 
        x_1 = 0
        x_2 = 0
        for k in 1:N
            x1k = true
            x2k = true
            for l in 1:d
                x1k &= baseD[l,k] < newD[l,i]
                x2k &= newD[l,k] < newD[l,i]
            end
            x_1 += Int(x1k)
            x_2 += Int(x2k)
            # x_1 += x1k ? 1 : 0
            # x_2 += x2k ? 1 : 0
        end
        dist = max(dist, x_1 - x_2, x_2 - x_1)
    end
    return dist/N
end
function make_ks_benchmark(M,get_sample)
    # D is a multivariate distribution. 
    base_sample = get_sample()
    dists = zeros(M)
    @showprogress for i in 1:M
        dists[i] = get_ks_dist(get_sample(),base_sample)
    end
    return sort(dists)
end


function get_ln_indep_sample()
    d = 10
    N = 2000

    reshape(exp.(randn(d*N)),(d,N))
end
@time bench = make_ks_benchmark(100,get_ln_indep_sample)

Andrea_Pagnani · January 24, 2022, 1:04pm

In julia the “fast” index for matrices is the leftmost one. You should reshape the matrix such that the innermost loop is. Also an inbound could help

@inbounds for l in 1:d
         x1k &= baseD[l,k] < newD[l,i]
         x2k &= newD[l,k] < newD[l,i]
end

lrnv · January 24, 2022, 1:10pm

That’s indeed a lot better, twice as fast on my machine. Thanks.

Edit: I also threaded the main loop. Surprisingly, even for d=10, the distribution does look a lot like the univariate one from Wikipedia :

function get_ks_dist(newD,baseD)
    d,N = size(newD)
    dist = 0
    @inbounds for i in 1:N # index of the sample. 
        x_1 = 0
        x_2 = 0
        for k in 1:N
            x1k = true
            x2k = true
            for l in 1:d
                x1k &= (baseD[l,k] < newD[l,i])
                x2k &= (newD[l,k] < newD[l,i])
            end
            x_1 += x1k ? 1 : 0
            x_2 += x2k ? 1 : 0
        end
        dist = max(dist, x_1 - x_2, x_2 - x_1)
    end
    return dist/N
end
function make_ks_benchmark(M,get_sample)
    # D is a multivariate distribution. 
    base_sample = get_sample()
    dists = zeros(M)
    p = Progress(M)
    Threads.@threads for i in 1:M
        dists[i] = get_ks_dist(get_sample(),base_sample)
        next!(p)
    end
    return sort(dists)
end


function get_ln_indep_sample()
    d = 10
    N = 1000

    reshape(exp.(randn(d*N)),(d,N))
end
@time bench = make_ks_benchmark(10000,get_ln_indep_sample);
histogram(bench)

plot_2
To compare to the plot there : Kolmogorov–Smirnov test - Wikipedia
My guess is that it is because the variables are independent in this simple example, and something else would append with a dependence structure, although i’m not an expert.

But it is still a lot of computations ! I’d like to put N=10_000 instead of N=1000 so I still lack an order of magnitude. In fact, switching from N=1000 to N=10000 makes my runtime jump from half a minut to 50mins, which is to be expected since the loop is quadratic. So two orders of magnitudes…

LaurentPlagne · January 24, 2022, 2:52pm

Depending on the value of d it may be faster to find a branchless (and simd friendly) way to determine the values of x1k and x2k.
(kind of sorting network)

lrnv · January 24, 2022, 2:54pm

Hum… d goes from 1 to ~ 200 in my applications. I did not tried yet with d bigger than 20, but I’ll have to soon. Thanks for the tip, i’ll take a look.

LaurentPlagne · January 24, 2022, 2:56pm

Well if d is equal to 200 it is pretty sure that a shortcut solution will win: exit the loop before reaching the end

lrnv · January 24, 2022, 2:57pm

Which is why i used all() at the begining instead of this loop, as the all function does take these kind of shortcuts. For d=10 it did not improve things, suprisingly, but i might try again for bigger dimensions.

jstrube · June 26, 2022, 10:01pm

No Julia implementation, yet, but maybe this code might be useful to speed things up: GitHub - pnnl/DDKS: A high-dimensional Kolmogorov-Smirnov distance for comparing high dimensional distributions. There’s also a paper at [2106.13706] Accelerated Computation of a High Dimensional Kolmogorov-Smirnov Distance. Hope it helps.