I’m trying to produce empirical multivariate Kolmogorov-Smirnov benchmarks to later benchmark an estimator. I have the following code for the moment:
# The goal of this file is to compute Kolmogorov-Smirnov distances between
# multivariate samples, to later derive p-values from their empirical distribution.
using ProgressMeter

function get_ks_dist(newD, baseD)
    N, d = size(newD)
    dist = 0
    for i in 1:N # evaluation point: newD[i, :]
        x_1 = 0 # number of baseD points componentwise below newD[i, :]
        x_2 = 0 # number of newD points componentwise below newD[i, :]
        for k in 1:N
            x1k = true
            x2k = true
            for l in 1:d
                x1k &= baseD[k, l] < newD[i, l]
                x2k &= newD[k, l] < newD[i, l]
            end
            x_1 += x1k ? 1 : 0
            x_2 += x2k ? 1 : 0
        end
        dist = max(dist, x_1 - x_2, x_2 - x_1)
    end
    return dist / N
end
function make_ks_benchmark(M, get_sample)
    # get_sample() draws one (N, d) sample from the multivariate distribution.
    base_sample = get_sample()
    dists = zeros(M)
    @showprogress for i in 1:M
        dists[i] = get_ks_dist(get_sample(), base_sample)
    end
    return sort(dists)
end
function get_ln_indep_sample()
    d = 10
    N = 10000
    reshape(exp.(randn(d * N)), (N, d))
end

bench = make_ks_benchmark(100, get_ln_indep_sample)
And it takes a lot of time. I’ve checked with @code_warntype, and get_ks_dist seems type-stable. I was wondering whether passing the function as an argument is the problem, or whether the computation is just inherently that long. If so, what can be done?
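One way to isolate the question (a sketch, assuming BenchmarkTools.jl is installed) is to time get_ks_dist directly on fixed inputs, outside the higher-order call:

using BenchmarkTools # assumption: BenchmarkTools.jl is available

A = get_ln_indep_sample()
B = get_ln_indep_sample()
# $-interpolation keeps global-variable overhead out of the timing. If this
# single call is already slow, the function argument is not the problem; the
# O(d * N^2) triple loop itself is.
@btime get_ks_dist($A, $B) # or a single @time call, since one run is already long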
Iterating over consecutive elements seems to speed up the computation: Julia arrays are column-major, so storing the samples as (d, N) and keeping the dimension index innermost walks through contiguous memory.
Note that I have changed N to 2000 for convenience.
function get_ks_dist(newD, baseD)
    d, N = size(newD) # samples are now stored as (d, N): one column per point.
    dist = 0
    @inbounds for i in 1:N # evaluation point: newD[:, i]
        x_1 = 0
        x_2 = 0
        for k in 1:N
            x1k = true
            x2k = true
            for l in 1:d # innermost loop now runs over contiguous memory.
                x1k &= baseD[l, k] < newD[l, i]
                x2k &= newD[l, k] < newD[l, i]
            end
            x_1 += Int(x1k)
            x_2 += Int(x2k)
            # x_1 += x1k ? 1 : 0
            # x_2 += x2k ? 1 : 0
        end
        dist = max(dist, x_1 - x_2, x_2 - x_1)
    end
    return dist / N
end
function make_ks_benchmark(M, get_sample)
    # get_sample() draws one (d, N) sample from the multivariate distribution.
    base_sample = get_sample()
    dists = zeros(M)
    @showprogress for i in 1:M
        dists[i] = get_ks_dist(get_sample(), base_sample)
    end
    return sort(dists)
end
function get_ln_indep_sample()
    d = 10
    N = 2000
    reshape(exp.(randn(d * N)), (d, N))
end

@time bench = make_ks_benchmark(100, get_ln_indep_sample)
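The layout effect is easy to see in isolation. A minimal illustration (a sketch with hypothetical function names, easiest to measure on matrices larger than the cache):

# Contiguous access: the inner index l is the first (fastest-varying) one.
function walk_cols(A)
    s = 0.0
    for k in axes(A, 2), l in axes(A, 1)
        s += A[l, k]
    end
    return s
end

# Strided access: the inner index k jumps by a whole column each step.
function walk_rows(A)
    s = 0.0
    for l in axes(A, 1), k in axes(A, 2)
        s += A[l, k]
    end
    return s
end

On a 10×2000 matrix everything mostly fits in cache, so part of the 2× above likely also comes from the added @inbounds and the Int(x1k) accumulation; the layout effect grows with N.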
That’s indeed a lot better, twice as fast on my machine. Thanks.
Edit: I also threaded the main loop. Surprisingly, even for d = 10, the distribution looks a lot like the univariate one from Wikipedia:
function get_ks_dist(newD, baseD)
    d, N = size(newD)
    dist = 0
    @inbounds for i in 1:N # evaluation point: newD[:, i]
        x_1 = 0
        x_2 = 0
        for k in 1:N
            x1k = true
            x2k = true
            for l in 1:d
                x1k &= (baseD[l, k] < newD[l, i])
                x2k &= (newD[l, k] < newD[l, i])
            end
            x_1 += x1k ? 1 : 0
            x_2 += x2k ? 1 : 0
        end
        dist = max(dist, x_1 - x_2, x_2 - x_1)
    end
    return dist / N
end
function make_ks_benchmark(M, get_sample)
    # get_sample() draws one (d, N) sample from the multivariate distribution.
    base_sample = get_sample()
    dists = zeros(M)
    p = Progress(M)
    Threads.@threads for i in 1:M
        dists[i] = get_ks_dist(get_sample(), base_sample)
        next!(p)
    end
    return sort(dists)
end
function get_ln_indep_sample()
    d = 10
    N = 1000
    reshape(exp.(randn(d * N)), (d, N))
end

@time bench = make_ks_benchmark(10000, get_ln_indep_sample);
histogram(bench) # requires a plotting package, e.g. `using Plots`
To compare with the plot here: Kolmogorov–Smirnov test - Wikipedia
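One usage note on the threaded version: Threads.@threads only runs in parallel if Julia was started with several threads, e.g. with julia -t auto. To check from within a session:

Threads.nthreads() # should be > 1, otherwise the @threads loop runs serially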
My guess is that this is because the variables are independent in this simple example, and something else would happen with a dependence structure, although I’m not an expert.
But it is still a lot of computation! I’d like to use N = 10_000 instead of N = 1000, so I still lack an order of magnitude in N. In fact, switching from N = 1000 to N = 10_000 makes my runtime jump from half a minute to 50 minutes, which is to be expected since the loop is quadratic in N. So two orders of magnitude in runtime…
Depending on the value of d, it may be faster to find a branchless (and SIMD-friendly) way to determine the values of x1k and x2k
(a kind of sorting network).
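Something along these lines, for instance (only a sketch: the name get_ks_dist_branchless and the @simd annotation are mine, untested on your workload). A comparison already yields 0 or 1 when converted to Int, so counting satisfied coordinates and testing the count against d removes the data-dependent branch from the inner loop:

function get_ks_dist_branchless(newD, baseD) # hypothetical variant
    d, N = size(newD)
    dist = 0
    @inbounds for i in 1:N
        x_1 = 0
        x_2 = 0
        for k in 1:N
            c1 = 0
            c2 = 0
            @simd for l in 1:d
                c1 += Int(baseD[l, k] < newD[l, i]) # no branch: 0/1 accumulation
                c2 += Int(newD[l, k] < newD[l, i])
            end
            x_1 += Int(c1 == d) # point k dominates iff all d comparisons held
            x_2 += Int(c2 == d)
        end
        dist = max(dist, abs(x_1 - x_2))
    end
    return dist / N
end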
Hmm… d goes from 1 to ~200 in my applications. I have not tried with d bigger than 20 yet, but I’ll have to soon. Thanks for the tip, I’ll take a look.
Which is why I used all() at the beginning instead of this loop, as the all function does take this kind of shortcut. For d = 10 it did not improve things, surprisingly, but I might try again for bigger dimensions.
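The all() version of the inner test was essentially this (a sketch, not my exact original code):

# all() short-circuits on the first false coordinate, replacing the `for l` loop:
x1k = all(l -> baseD[l, k] < newD[l, i], 1:d)
x2k = all(l -> newD[l, k] < newD[l, i], 1:d)

The short-circuit is the opposite trade-off from the branchless version: it can skip most of the d comparisons when an early coordinate fails, but the data-dependent exit prevents SIMD vectorization, which may be why it did not help at d = 10.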