Group KFold?

Hi,

I am trying to implement a Group KFold function for my problem, where I need to split the dataset according to some group label.

There is no function like that in MLDataUtils or MLBase to my knowledge, so I am trying to implement it manually, but it’s actually a mess as I can’t find any equivalent of the Pandas or NumPy isin() functions…

(how do we actually grab all the values in one array that are equal to those in another array in Julia?)

[edit] I have now managed to write the following function:

using MLDataUtils   # for kfolds and shuffleobs

function gkfolds(X_, idx_label, k = 5)

    # k-fold split of the unique group labels found in row idx_label
    dd = kfolds(shuffleobs(unique(X_[idx_label, :])), k = k);

    out = []

    for j = 1:k
        train_lab, valid_lab = dd[j]
        train_idx = Int64[]
        valid_idx = Int64[]

        # assign each observation (column) to the training or validation set,
        # depending on whether its label belongs to the training fold
        for i = 1:size(X_, 2)
            if !isempty(findall(X_[idx_label, i] .== train_lab))
                push!(train_idx, i)
            else
                push!(valid_idx, i)
            end
        end

        push!(out, (X_[:, train_idx], X_[:, valid_idx]))

    end

    return out
end

I created two loops: one to check whether each observation’s label is found in train_lab, and one over the folds, collecting the indices into a final list containing the folds as tuples. It seems to work, but it is a bit long…

Does anyone have a clean implementation of such a function, or a suggestion on how to do it?

I usually use in with filter:

vals = ["c","b"] # values of interest
arr = ["a","b","c","d"] # array with data
filter(x -> in(x, vals), arr) # select values of interest

But you could also do in.(arr, [vals]) to get a BitArray (an array of true/false values) and use it for filtering: arr[in.(arr, [vals])]. Notice that the second argument to in.() is wrapped in another array, [vals], so that broadcasting treats vals as a single collection instead of iterating over its elements.
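To make the broadcasting variant concrete, this is what it produces for the vals and arr defined above:

mask = in.(arr, [vals])  # BitArray: [false, true, true, false]
arr[mask]                # ["b", "c"]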

Once you know how to filter it should be easy to continue.
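For instance, here is a rough sketch of how the inner loop of gkfolds could be replaced with such a mask (assuming, as in the original post, that X_ holds observations in columns and the group label in row idx_label; gkfolds2 is just an illustrative name):

using MLDataUtils

function gkfolds2(X_, idx_label, k = 5)
    labels = X_[idx_label, :]                           # group label of each observation (column)
    folds  = kfolds(shuffleobs(unique(labels)), k = k)  # k-fold split of the unique labels
    out = []
    for (train_lab, valid_lab) in folds
        mask = in.(labels, [train_lab])                 # true where the column's label is in the training fold
        push!(out, (X_[:, mask], X_[:, .!mask]))
    end
    return out
end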

But if you have heterogeneous data, it might be nicer to work with DataFrames, which has a convenient groupby function to split a DataFrame into groups that you can iterate over and process as needed.
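A minimal sketch of that approach, assuming a DataFrame df with a column :label holding the group labels (df, :label and x are just placeholder names):

using DataFrames

df = DataFrame(label = ["a", "a", "b", "c"], x = 1:4)

for g in groupby(df, :label)  # one SubDataFrame per distinct label
    println(g)
end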


Thanks, I will try to modify the code and see how it goes. The double loop is actually not too bad since my dataset is small (it does not take long to compute), but your suggestion may be nicer!

You don’t need to worry about loops in Julia (like you would in Python or R, for example). In Julia they are all fast (as long as you follow the performance tips from the documentation).

True! I have spent a lot of time programming in Python lately, and I guess I have this reflex of avoiding loops at all costs :)