# Group KFold?

Hi,

I am trying to implement a Group KFold function for my problem, where I need to split the dataset as a function of some label.

There is no function like that in MLDataUtils or MLbase to my knowledge, so I am trying to implement it manually but it’s actually a mess as I can’t find any equivalent of the Pandas or Numpy isin() functions…

(how do we actually grab all values in one array equal to those in another in Julia???)

So far I have managed to write the following function:

```julia
using MLDataUtils  # provides kfolds and shuffleobs

function gkfolds(X_, idx_label, k = 5)
    # Split the unique labels (stored in row idx_label of X_) into k folds
    dd = kfolds(shuffleobs(unique(X_[idx_label, :])), k = k)

    out = []

    for j = 1:k
        train_lab, valid_lab = dd[j]
        train_idx = Int64[]
        valid_idx = Int64[]

        # Assign each column (observation) to train or validation by its label
        for i = 1:size(X_, 2)
            if X_[idx_label, i] in train_lab
                push!(train_idx, i)
            else
                push!(valid_idx, i)
            end
        end

        push!(out, (X_[:, train_idx], X_[:, valid_idx]))
    end

    return out
end
```

I created two loops to check whether each column's label is in train_lab, collect the matching indices, and build a final list containing the folds as tuples. It seems to work but is a bit long…

Does anyone have a clean implementation of such a function, or a suggestion for how to do it?

I usually use `in` with `filter`:

```julia
vals = ["c", "b"]             # values of interest
arr = ["a", "b", "c", "d"]    # array with data
filter(x -> in(x, vals), arr) # select values of interest
```

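But you could also do `in.(arr, [vals])` to get a `BitVector` (an array of true/false values) and use it for filtering: `arr[in.(arr, [vals])]`. Notice that the second argument of `in.()` is `vals` wrapped in another array, so that broadcasting treats it as a single collection instead of pairing its elements with those of `arr`; `Ref(vals)` has the same effect.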
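As a small illustration of the broadcasting approach, here is a sketch using the same `vals` and `arr` as above. The names `mask`, `selected` and `positions` are only illustrative.

```julia
vals = ["c", "b"]            # values of interest
arr  = ["a", "b", "c", "d"]  # array with data

# Ref(vals) shields vals from broadcasting, so each element of arr
# is tested against the whole collection (same effect as [vals]).
mask = in.(arr, Ref(vals))

selected  = arr[mask]        # elements of arr that appear in vals
positions = findall(mask)    # indices of those elements in arr
```

`findall(mask)` is probably the closest thing to what `isin()` is typically used for when the goal is to recover indices rather than values.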

Once you know how to filter it should be easy to continue.
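For what it's worth, here is one way the filtering idea could be carried through to a group k-fold, using only Base and the `Random` standard library. This is a rough sketch, not a reference implementation: the function name `group_kfold_indices` and the round-robin assignment of groups to folds are my own assumptions, and it returns index vectors rather than the data itself.

```julia
using Random

# Hypothetical helper: returns k (train_idx, valid_idx) tuples of indices
# such that all observations sharing a label land on the same side.
function group_kfold_indices(labels::AbstractVector, k::Integer = 5;
                             rng::AbstractRNG = Random.default_rng())
    groups = shuffle(rng, unique(labels))
    k <= length(groups) || throw(ArgumentError("k exceeds the number of groups"))

    folds = Tuple{Vector{Int},Vector{Int}}[]
    for j in 1:k
        # Deal the shuffled groups round-robin: fold j validates on every k-th group
        valid_groups = Set(groups[j:k:end])
        is_valid = in.(labels, Ref(valid_groups))
        push!(folds, (findall(.!is_valid), findall(is_valid)))
    end
    return folds
end

# Made-up labels, one per observation
labels = ["a", "a", "b", "c", "c", "c", "d", "e"]
folds = group_kfold_indices(labels, 3)
```

With the layout from the question (observations as columns, labels in row `idx_label`), you would pass `X_[idx_label, :]` as `labels` and then take `X_[:, train_idx]` and `X_[:, valid_idx]` for each fold.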

But if you have heterogeneous data it might be nicer to work with DataFrames, which has a convenient `groupby` function to split a `DataFrame` into groups that you can iterate over and process as needed.
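A minimal sketch of what that could look like, assuming the DataFrames package is installed. The column names `group` and `value` and the data are made up for illustration.

```julia
using DataFrames  # third-party package, installed separately

# Made-up data: one row per observation, with a grouping column
df = DataFrame(group = ["a", "a", "b", "c", "c"],
               value = [1.0, 2.0, 3.0, 4.0, 5.0])

# groupby returns a GroupedDataFrame; iterating yields one SubDataFrame per group
for sub in groupby(df, :group)
    println(first(sub.group), " => ", nrow(sub), " rows, values ", sub.value)
end

# Rows belonging to a chosen set of groups, e.g. the training groups of one fold
train_groups = ["a", "c"]
train_df = filter(:group => in(train_groups), df)
```

Here `in(train_groups)` is the one-argument form of `in`, which returns a function testing membership in `train_groups`.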


Thanks, I will try to modify the code and see how it goes. The double loop is actually not too bad as my dataset is small (it takes no time to compute), but your suggestion may be nicer!

You don’t need to worry about loops in Julia (like you would in Python or R, for example). Loops in Julia are fast, as long as you follow the Performance Tips section of the Julia manual.

True! I spent a lot of time programming in Python lately and I guess I have this reflex right now of avoiding loops at all costs 