# Making multivariate Kolmogorov-Smirnov benchmarks

Hey,

I’m trying to produce empirical multivariate Kolmogorov-Smirnov benchmarks to later benchmark an estimator. I have the following code so far:

```julia
# The goal of this file is to compute Kolmogorov-Smirnov p-values.
using ProgressMeter

function get_ks_dist(newD, baseD)
    N, d = size(newD)
    dist = 0
    for i in 1:N # index of the sample.
        x_1 = 0
        x_2 = 0
        for k in 1:N
            x1k = true
            x2k = true
            for l in 1:d
                x1k &= baseD[k,l] < newD[i,l]
                x2k &= newD[k,l] < newD[i,l]
            end
            x_1 += x1k ? 1 : 0
            x_2 += x2k ? 1 : 0
        end
        dist = max(dist, x_1 - x_2, x_2 - x_1)
    end
    return dist/N
end

function make_ks_benchmark(M, get_sample)
    # get_sample() draws one N×d sample from the multivariate distribution.
    base_sample = get_sample()
    dists = zeros(M)
    @showprogress for i in 1:M
        dists[i] = get_ks_dist(get_sample(), base_sample)
    end
    return sort(dists)
end

function get_ln_indep_sample()
    d = 10
    N = 10000
    reshape(exp.(randn(d*N)), (N,d))
end

bench = make_ks_benchmark(100, get_ln_indep_sample)
```

And it takes a lot of time. I’ve checked `@code_warntype`, and it seems like `get_ks_dist` is type-stable. I was wondering if passing the function as an argument was the problem, or if my computations are just crazy long. If so, what can be done?
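Regarding the runtime: passing a function as an argument is not a problem in Julia (methods are specialized on the function's concrete type), so the cost really is the computation itself. A back-of-the-envelope count of the work in the code above:

```julia
# Back-of-the-envelope cost of the benchmark above: M repetitions of an
# N×N double loop over samples, with d comparisons in the innermost loop.
M, N, d = 100, 10_000, 10
comparisons = M * N^2 * d   # ≈ 1e11 boolean comparisons
```

That is on the order of 10¹¹ comparisons, which explains the long runtime even with type-stable code.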

Thanks for any idea!

Iterating on consecutive elements seems to speed up the computation. Note that I have changed `N` to 2000 for convenience:

```julia
function get_ks_dist(newD, baseD)
    d, N = size(newD)
    dist = 0
    @inbounds for i in 1:N # index of the sample.
        x_1 = 0
        x_2 = 0
        for k in 1:N
            x1k = true
            x2k = true
            for l in 1:d
                x1k &= baseD[l,k] < newD[l,i]
                x2k &= newD[l,k] < newD[l,i]
            end
            x_1 += Int(x1k)
            x_2 += Int(x2k)
            # x_1 += x1k ? 1 : 0
            # x_2 += x2k ? 1 : 0
        end
        dist = max(dist, x_1 - x_2, x_2 - x_1)
    end
    return dist/N
end

function make_ks_benchmark(M, get_sample)
    # get_sample() draws one d×N sample from the multivariate distribution.
    base_sample = get_sample()
    dists = zeros(M)
    @showprogress for i in 1:M
        dists[i] = get_ks_dist(get_sample(), base_sample)
    end
    return sort(dists)
end

function get_ln_indep_sample()
    d = 10
    N = 2000
    reshape(exp.(randn(d*N)), (d,N))
end

@time bench = make_ks_benchmark(100, get_ln_indep_sample)
```

In Julia the “fast” index for matrices is the leftmost one. You should reshape the matrix so that the innermost loop runs over the first index. Adding `@inbounds` could also help:

```julia
@inbounds for l in 1:d
    x1k &= baseD[l,k] < newD[l,i]
    x2k &= newD[l,k] < newD[l,i]
end
```
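The layout point can be checked directly: Julia stores matrices column-major, so the elements of one column sit next to each other in memory. A small illustration:

```julia
# Julia matrices are column-major: walking down a column visits
# consecutive memory locations, so the innermost loop should run
# over the first (row) index.
A = [1 4; 2 5; 3 6]                   # 3×2 matrix
@assert vec(A) == [1, 2, 3, 4, 5, 6]  # memory order goes down each column
```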

That’s indeed a lot better, twice as fast on my machine. Thanks.

Edit: I also threaded the main loop. Surprisingly, even for `d=10`, the distribution looks a lot like the univariate one from Wikipedia:

```julia
using ProgressMeter, Plots

function get_ks_dist(newD, baseD)
    d, N = size(newD)
    dist = 0
    @inbounds for i in 1:N # index of the sample.
        x_1 = 0
        x_2 = 0
        for k in 1:N
            x1k = true
            x2k = true
            for l in 1:d
                x1k &= (baseD[l,k] < newD[l,i])
                x2k &= (newD[l,k] < newD[l,i])
            end
            x_1 += x1k ? 1 : 0
            x_2 += x2k ? 1 : 0
        end
        dist = max(dist, x_1 - x_2, x_2 - x_1)
    end
    return dist/N
end

function make_ks_benchmark(M, get_sample)
    # get_sample() draws one d×N sample from the multivariate distribution.
    base_sample = get_sample()
    dists = zeros(M)
    p = Progress(M)
    Threads.@threads for i in 1:M
        dists[i] = get_ks_dist(get_sample(), base_sample)
        next!(p)
    end
    return sort(dists)
end

function get_ln_indep_sample()
    d = 10
    N = 1000
    reshape(exp.(randn(d*N)), (d,N))
end

@time bench = make_ks_benchmark(10000, get_ln_indep_sample);
histogram(bench)
```

To compare to the plot there: Kolmogorov–Smirnov test - Wikipedia
My guess is that this is because the variables are independent in this simple example, and something else would happen with a dependence structure, although I’m not an expert.

But it is still a lot of computation! I’d like to use `N=10_000` instead of `N=1000`, so I still lack an order of magnitude in the sample size. In fact, switching from `N=1000` to `N=10000` makes my runtime jump from half a minute to 50 minutes, which is to be expected since the loop is quadratic in `N`. So two orders of magnitude in runtime…
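If a quantitative check against the univariate limit would help (rather than eyeballing the Wikipedia plot), the asymptotic Kolmogorov distribution can be evaluated directly; a sketch, using the standard series for the limiting CDF:

```julia
# Asymptotic CDF of sqrt(N)*D_N under the null (Kolmogorov distribution):
# K(x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 x^2)
function kolmogorov_cdf(x; terms = 100)
    x <= 0 && return 0.0
    1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * x^2) for k in 1:terms)
end
# Compare the empirical CDF of sqrt(N) .* bench against kolmogorov_cdf on a grid.
```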


Depending on the value of `d`, it may be faster to find a branchless (and SIMD-friendly) way to determine the values of `x1k` and `x2k` (a kind of sorting network).
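One branchless variant, as a sketch (the helper name `dominated_count` is mine, not from the thread): replace the short-circuiting booleans by an integer count of passing coordinates, so the inner loop has no data-dependent branches and can vectorize:

```julia
# Branchless inner loop: count how many of the d coordinate comparisons
# pass, then check whether all d passed. No data-dependent branches,
# so @simd can vectorize over l.
function dominated_count(v, X)
    d, N = size(X)
    c = 0
    @inbounds for k in 1:N
        m = 0
        @simd for l in 1:d
            m += Int(X[l, k] < v[l])
        end
        c += Int(m == d)   # true iff X[:,k] is strictly below v in every coordinate
    end
    return c
end
```

In `get_ks_dist`, `x_1` and `x_2` would then become `dominated_count(view(newD, :, i), baseD)` and `dominated_count(view(newD, :, i), newD)`.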


Hmm… `d` goes from 1 to ~200 in my applications. I have not tried with `d` bigger than 20 yet, but I’ll have to soon. Thanks for the tip, I’ll take a look.

Well, if `d` is equal to 200, a shortcut solution will almost surely win: exit the loop before reaching the end.

Which is why I used `all()` at the beginning instead of this loop, as the `all` function takes this kind of shortcut. For `d=10` it did not improve things, surprisingly, but I might try again in bigger dimensions.
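For reference, the short-circuiting version with `all` looks something like this (a sketch; the helper name `strictly_below` is mine). It exits at the first failing coordinate, which should pay off once `d` is large and most comparisons fail early:

```julia
# Short-circuit check: `all` stops at the first coordinate where the
# comparison fails, so for large d most inner iterations exit early.
strictly_below(X, v, k, d) = all(l -> X[l, k] < v[l], 1:d)

X = [1 3; 1 3]   # columns [1,1] and [3,3]
v = [2, 2]
@assert strictly_below(X, v, 1, 2)    # column 1 is below v in every coordinate
@assert !strictly_below(X, v, 2, 2)   # column 2 fails at the first coordinate
```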

No Julia implementation yet, but maybe this code might be useful to speed things up: GitHub - pnnl/DDKS: A high-dimensional Kolmogorov-Smirnov distance for comparing high dimensional distributions. There’s also a paper at [2106.13706] Accelerated Computation of a High Dimensional Kolmogorov-Smirnov Distance. Hope it helps.
