Hi,
I am trying to implement a Group KFold function for my problem, where I need to split the dataset according to a group label, so that all observations sharing a label end up in the same fold.
To my knowledge there is no such function in MLDataUtils or MLBase, so I am trying to implement it manually, but it’s actually a mess because I can’t find any equivalent of the pandas or NumPy isin() functions…
(how do we actually grab all values in one array equal to those in another in Julia???)
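For what it's worth, the closest thing to isin() I could find in Base Julia seems to be broadcasting `in` over the first array. This is only a minimal sketch on made-up toy arrays (`a` and `b` are illustrative names, not part of my data):

```julia
a = [1, 2, 3, 4, 5, 2]
b = [2, 5]

# Element-wise membership mask, similar in spirit to numpy.isin(a, b).
# Ref(b) makes broadcasting treat b as a single collection instead of iterating over it.
mask = in.(a, Ref(b))

# Indices of the elements of a that occur in b (in(b) is the curried form of in).
idx = findall(in(b), a)

# The matching values themselves.
vals = a[mask]
```

If `b` is large, wrapping it in `Set(b)` should make each lookup cheaper.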
[edit] So far I have managed to write the following function:
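    using MLDataUtils  # provides kfolds and shuffleobs

    # X_ holds one observation per column; row idx_label contains the group label
    function gkfolds(X_, idx_label, k = 5)
        # split the unique labels (not the observations) into k folds
        dd = kfolds(shuffleobs(unique(X_[idx_label, :])), k = k)
        out = []
        for j = 1:k
            train_lab, _ = dd[j]  # only the training labels are needed
            train_idx = Int64[]
            valid_idx = Int64[]
            for i = 1:size(X_, 2)
                # a column goes to training if its label is among the training labels
                if X_[idx_label, i] in train_lab
                    push!(train_idx, i)
                else
                    push!(valid_idx, i)
                end
            end
            push!(out, (X_[:, train_idx], X_[:, valid_idx]))
        end
        return out
    end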
I use two nested loops: for each fold, I check whether the label of each column of X_ is among the training labels, collect the corresponding indices, and then build a final list containing the folds as (train, validation) tuples. It seems to work but is a bit long…
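For comparison, here is a shorter variant I sketched using only Base and the Random standard library instead of MLDataUtils. The name `group_kfold_indices` is made up for illustration, it assigns every k-th shuffled group to a fold (my own choice, not what kfolds does internally), and it returns column indices rather than the data itself:

```julia
using Random  # standard library, provides shuffle

# Hypothetical helper: returns k (train_idx, valid_idx) tuples of observation indices,
# such that all observations sharing a label fall on the same side of each split.
function group_kfold_indices(labels::AbstractVector, k::Integer = 5)
    groups = shuffle(unique(labels))
    k <= length(groups) || throw(ArgumentError("k exceeds the number of distinct groups"))
    folds = Tuple{Vector{Int}, Vector{Int}}[]
    for j in 1:k
        valid_groups = Set(groups[j:k:end])           # every k-th group validates fold j
        valid_mask = in.(labels, Ref(valid_groups))   # isin-style membership mask
        push!(folds, (findall(.!valid_mask), findall(valid_mask)))
    end
    return folds
end

# Toy data: 10 observations (columns) belonging to 5 groups.
labels = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
X = rand(3, length(labels))

folds = group_kfold_indices(labels, 5)
train_idx, valid_idx = folds[1]
X_train, X_valid = X[:, train_idx], X[:, valid_idx]
```

With my layout the labels would be passed as `X_[idx_label, :]`. Returning indices avoids copying the data for every fold, but I have not checked how this compares in speed to the loop version.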
Does anyone have a cleaner implementation of such a function, or any suggestions on how to do it?