What loss to enforce the `i.i.d`ness of a dataset?

lrnv · March 8, 2023, 12:11pm

Hi,

I need some help to find a coherent loss to encode the properties i need about my estimator.

I want to split a dataset in two, in the Fourrier sense (the infinite divisibility sense). More precisely, consider that we have a dataset of n independent observations x_1,....x_n from a positive real random variable X. My goal is to find rates a_1,...,a_n all in [0,1] such that, setting

(y_i, z_{i}) = (a_{i}x_{i}, (1-a_i)x_i)

for all i, we will have that y_1,...,y_n,z_1,...z_n are all independent and identically distributed observation. In particular, there must be independence between (y_1,...y_n) and (z_{1},...,z_{n}) (independence, not only zero correlation).

The issue is that I have no loss to express this. Do you know what could i use to check for that ? Here is a quick setup :

n = 10000
x = exp.(randn(n)) # lognormal data, positive. 
a = rand(n) # parameters to be optimized.
y,z = a .* x, (1 .- a) .* x

# How to measure the fact that y and z are i.i.d, 
# as a function of a ? (then to be minimized on a).

Edit: I have a first shot through the wasserstein distance between (Y,Z) and (Z,shuffle(Y)), as follows:


using OptimalTransport
using Distances
using Tulip
using Random
function otloss(a,x,perm)
    y, z = a .* x, (1 .- a) .* x
    C = pairwise(SqEuclidean(), [y z], [z y[perm]]; dims=1);
    emd2(fill(1/n,n), fill(1/n,n), C, Tulip.Optimizer())
end

otloss(a,x,shuffle(1:n))
# so this loss is random as the shuffle for y must be random too ! 
# so we could integrate it over all possibile shuffles. 
# but that woul dbe very very long. 
# as i need to minimize it on a...

but this is really really slow (i reduced to n=100 for it to even return something) and i am not talking about optimizing on a, only computing the cost once. Maybe there is something smarter to do ?

Furthermore, there is a source of randomness in the shuffling of y, which should really be integrated out (I want ot minimize for every shuffle of y !), increasing again the runtime…

I need something faster.

Edit: I have another expression of this loss, separating the marginal OT problems from the depndence structure. This avoid the problem of shuffling but increases largely the runtime and now provides two loss, which is not ideal either :

function otherloss(a,x)
    n = length(x)
    y, z = a .* x, (1 .- a) .* x
    L1 = wasserstein(DiscreteNonParametric(y,fill(1/n,n)),DiscreteNonParametric(z,fill(1/n,n)))
    ry = sortperm(y)/(n+1)
    rz = sortperm(z)/(n+1)
    rC = pairwise(SqEuclidean(), [ry rz], [repeat(1:n,inner=n) repeat(1:n,outer=n)]./(n+1); dims=1);
    L2 = emd2(fill(1/n,n),fill(1/(n^2),n^2),rC,Tulip.Optimizer())
    return (L1,L2)
end
otherloss(a,x)

Topic		Replies	Views
Optimize 0-1 loss in MIP Optimization (Mathematical)	5	541	May 2, 2020
Amortized Hierarchical Variational Model Probabilistic programming flux , turing , machine-learning	4	462	September 17, 2020
How to calculate loss in a given time budget? General Usage diffeq , sciml	6	529	October 18, 2020
Computing discrete Wasserstein (EMD) distance in Julia? Machine Learning question	16	4622	December 20, 2021
Cannot differentiate through a multivariate log normal probability density with Zygote Machine Learning flux , zygote	4	154	December 11, 2024

What loss to enforce the `i.i.d`ness of a dataset?

Related topics