Hi,
I need some help to find a coherent loss to encode the properties i need about my estimator.
I want to split a dataset in two, in the Fourrier sense (the infinite divisibility sense). More precisely, consider that we have a dataset of n independent observations x_1,....x_n from a positive real random variable X. My goal is to find rates a_1,...,a_n all in [0,1] such that, setting
(y_i, z_{i}) = (a_{i}x_{i}, (1-a_i)x_i)
for all i, we will have that y_1,...,y_n,z_1,...z_n are all independent and identically distributed observation. In particular, there must be independence between (y_1,...y_n) and (z_{1},...,z_{n}) (independence, not only zero correlation).
The issue is that I have no loss to express this. Do you know what could i use to check for that ? Here is a quick setup :
n = 10000
x = exp.(randn(n)) # lognormal data, positive. 
a = rand(n) # parameters to be optimized.
y,z = a .* x, (1 .- a) .* x
# How to measure the fact that y and z are i.i.d, 
# as a function of a ? (then to be minimized on a). 
Edit: I have a first shot through the wasserstein distance between (Y,Z) and (Z,shuffle(Y)), as follows:
using OptimalTransport
using Distances
using Tulip
using Random
function otloss(a,x,perm)
    y, z = a .* x, (1 .- a) .* x
    C = pairwise(SqEuclidean(), [y z], [z y[perm]]; dims=1);
    emd2(fill(1/n,n), fill(1/n,n), C, Tulip.Optimizer())
end
otloss(a,x,shuffle(1:n))
# so this loss is random as the shuffle for y must be random too ! 
# so we could integrate it over all possibile shuffles. 
# but that woul dbe very very long. 
# as i need to minimize it on a...
but this is really really slow (i reduced to n=100 for it to even return something) and i am not talking about optimizing on a, only computing the cost once. Maybe there is something smarter to do ?
Furthermore, there is a source of randomness in the shuffling of y, which should really be integrated out (I want ot minimize for every shuffle of y !), increasing again the runtimeā¦
I need something faster.
Edit: I have another expression of this loss, separating the marginal OT problems from the depndence structure. This avoid the problem of shuffling but increases largely the runtime and now provides two loss, which is not ideal either :
function otherloss(a,x)
    n = length(x)
    y, z = a .* x, (1 .- a) .* x
    L1 = wasserstein(DiscreteNonParametric(y,fill(1/n,n)),DiscreteNonParametric(z,fill(1/n,n)))
    ry = sortperm(y)/(n+1)
    rz = sortperm(z)/(n+1)
    rC = pairwise(SqEuclidean(), [ry rz], [repeat(1:n,inner=n) repeat(1:n,outer=n)]./(n+1); dims=1);
    L2 = emd2(fill(1/n,n),fill(1/(n^2),n^2),rC,Tulip.Optimizer())
    return (L1,L2)
end
otherloss(a,x)