Simple tool for train test split


#1

Does anyone know a tool that lets me split a dataframe into a training set and a test set?


#2

The future of iteration and dataset partitioning is soon to be merged into MLDataUtils: https://github.com/JuliaML/MLDataUtils.jl/pull/16.

There isn’t native support for DataFrames but you would only need to define nobs/getobs for an AbstractDataFrame (or AbstractTable?) and all this functionality will be available.


#3

I secretly wished this question would not come up for at least another week, until I had found the time to write the docs and tutorials :slight_smile:

There will be DataFrame support, or at least a tutorial that shows how to use DataFrames, but I need some time to finish this up.

Until then I would advise simply doing it manually. I assume each row represents an observation? If so, you could use this little snippet:

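using DataFrames, Random  # nrow is from DataFrames; shuffle is in Random on Julia 0.7 and later

function partitionTrainTest(data, at = 0.7)
    n = nrow(data)
    idx = shuffle(1:n)                            # random permutation of the row indices
    train_idx = view(idx, 1:floor(Int, at*n))     # first `at` fraction of the shuffled rows
    test_idx = view(idx, (floor(Int, at*n)+1):n)  # the remaining rows
    data[train_idx,:], data[test_idx,:]
end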

using RDatasets
iris = dataset("datasets", "iris")
train,test = partitionTrainTest(iris, 0.7) # 70% train
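
If the split needs to be reproducible, the same idea can be sketched on plain row indices with an explicitly seeded random number generator. This is only a sketch: the helper name `split_indices` and its argument order are made up for illustration and are not part of any package, and it assumes Julia 0.7 or later, where `shuffle` and `MersenneTwister` live in the `Random` standard library.

```julia
using Random

# Hypothetical helper (an assumption for illustration, not a library function):
# shuffles the row indices 1:n with the given RNG and cuts them at fraction `at`.
function split_indices(rng::AbstractRNG, n::Integer, at = 0.7)
    idx = shuffle(rng, 1:n)       # random permutation of 1:n
    cut = floor(Int, at * n)      # number of training rows
    return idx[1:cut], idx[cut+1:end]
end

# Seeding the generator makes the split repeatable across runs.
train_idx, test_idx = split_indices(MersenneTwister(42), 150, 0.7)
```

The two index vectors are disjoint and together cover every row, so they can be used as `data[train_idx, :]` and `data[test_idx, :]`.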

#4

Given a dataframe df, to pull out e.g. 5% as a training set you could presumably just do:

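using Random  # randsubseq is in the Random standard library on Julia 0.7 and later

sample = randsubseq(1:size(df,1), 0.05)  # sorted row indices, each kept with probability 0.05
trainingset = df[sample, :]
# rows that are not in the sample; searchsorted works because `sample` is sorted
notsample = [i for i in 1:size(df,1) if isempty(searchsorted(sample, i))]
testset = df[notsample, :]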

That is, it uses the built-in randsubseq function to efficiently draw a random, ordered subset of the rows without repetition. Each row is included independently with probability 0.05, so the sample holds roughly 5% of the rows rather than exactly 5%.
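
As a rough illustration of the complement step, here is a sketch on plain indices that compares the `searchsorted` comprehension with `setdiff`. The value of `n` is a stand-in for `size(df, 1)` and is an assumption for the example; `using Random` is needed on Julia 0.7 and later.

```julia
using Random

n = 1000                        # stand-in for size(df, 1); assumed for illustration
rows = randsubseq(1:n, 0.05)    # sorted indices without repeats; length varies from run to run
rest = setdiff(1:n, rows)       # every index that is not in the sample, in order

# The comprehension from the post selects the same complement.
rest_by_search = [i for i in 1:n if isempty(searchsorted(rows, i))]
```

Both approaches give the same row indices; `setdiff` is just shorter to write.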


#5

Thanks so far; at least I have a solution now.