Group KFold?

Hi,

I am trying to implement a Group KFold function for my problem, where I need to split the dataset according to some group label.

There is no function like that in MLDataUtils or MLBase to my knowledge, so I am trying to implement it manually, but it’s actually a mess as I can’t find any equivalent of the Pandas or NumPy isin() functions…

(how do we actually grab all the values in one array that are equal to those in another array in Julia?)

[edit] I have now managed to write the following function:

using MLDataUtils   # for kfolds and shuffleobs

function gkfolds(X_, idx_label, k = 5)

    # k-fold split of the unique group labels found in row idx_label
    dd = kfolds(shuffleobs(unique(X_[idx_label, :])), k = k);

    out = []

    for j = 1:k
        train_lab, valid_lab = dd[j]
        train_idx = Int64[]
        valid_idx = Int64[]

        # assign each observation (column) to the training or validation set,
        # depending on whether its label belongs to the training fold
        for i = 1:size(X_, 2)
            if !isempty(findall(X_[idx_label, i] .== train_lab))
                push!(train_idx, i)
            else
                push!(valid_idx, i)
            end
        end

        push!(out, (X_[:, train_idx], X_[:, valid_idx]))

    end

    return out
end

I created two loops: one to check whether each observation’s label is found in train_lab, and one over the folds, collecting the indices into a final list containing the folds as tuples. It seems to work, but it is a bit long…

Does anyone have a clean implementation of such a function, or a suggestion on how to do it?

I usually use in with filter:

vals = ["c","b"] # values of interest
arr = ["a","b","c","d"] # array with data
filter(x -> in(x, vals), arr) # select values of interest

But you could also do in.(arr, [vals]) to get a BitArray (an array of true/false values) and use it for filtering: arr[in.(arr, [vals])]. Notice that the second argument to in.() is wrapped in another array, [vals], so that broadcasting treats vals as a single collection instead of iterating over its elements.
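To make the broadcasting variant concrete, this is what it produces for the vals and arr defined above:

mask = in.(arr, [vals])  # BitArray: [false, true, true, false]
arr[mask]                # ["b", "c"]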

Once you know how to filter it should be easy to continue.
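For instance, here is a rough sketch of how the inner loop of gkfolds could be replaced with such a mask (assuming, as in the original post, that X_ holds observations in columns and the group label in row idx_label; gkfolds2 is just an illustrative name):

using MLDataUtils

function gkfolds2(X_, idx_label, k = 5)
    labels = X_[idx_label, :]                           # group label of each observation (column)
    folds  = kfolds(shuffleobs(unique(labels)), k = k)  # k-fold split of the unique labels
    out = []
    for (train_lab, valid_lab) in folds
        mask = in.(labels, [train_lab])                 # true where the column's label is in the training fold
        push!(out, (X_[:, mask], X_[:, .!mask]))
    end
    return out
end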

But if you have heterogeneous data, it might be nicer to work with DataFrames, which has a convenient groupby function to split a DataFrame into groups that you can iterate over and process as needed.
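A minimal sketch of that approach, assuming a DataFrame df with a column :label holding the group labels (df, :label and x are just placeholder names):

using DataFrames

df = DataFrame(label = ["a", "a", "b", "c"], x = 1:4)

for g in groupby(df, :label)  # one SubDataFrame per distinct label
    println(g)
end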


Thanks, I will try to modify the code and see how it goes. The double loop is actually not too bad since my dataset is small (it does not take long to compute), but your suggestion may be nicer!

You don’t need to worry about loops in Julia (like you would in Python or R, for example). In Julia they are all fast (as long as you follow the performance tips from the documentation).

True! I have spent a lot of time programming in Python lately, and I guess I have this reflex of avoiding loops at all costs :)