Does anyone know a tool that enables me to split a dataframe into a training set and a test set?
The future of iteration and dataset partitioning is soon to be merged into MLDataUtils: https://github.com/JuliaML/MLDataUtils.jl/pull/16.
There isn’t native support for DataFrames, but you would only need to define getobs for an AbstractDataFrame (or AbstractTable?) and all this functionality will be available.
I secretly wished this question would not come up for at least a week, until I had found the time to write the docs and tutorials.
There will be DataFrame support, or at least a tutorial that shows how to use DataFrames, but I need some time to finish this up.
Until then I would advise simply doing it manually. I assume each row represents an observation? If so, you could use this little snippet:
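using Random      # provides shuffle on Julia 0.7 and later
using DataFrames  # provides nrow

function partitionTrainTest(data, at = 0.7)
    n = nrow(data)
    idx = shuffle(1:n)                            # random permutation of the row indices
    train_idx = view(idx, 1:floor(Int, at*n))     # first `at` fraction of the shuffled rows
    test_idx = view(idx, (floor(Int, at*n)+1):n)  # remaining rows
    data[train_idx,:], data[test_idx,:]
end

using RDatasets
iris = dataset("datasets", "iris")
train,test = partitionTrainTest(iris, 0.7) # 70% train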
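If you want the split to be repeatable, the index logic can be pulled out into a small helper that takes an explicit random number generator. This is only a sketch: `split_indices` is a hypothetical name, not part of MLDataUtils or DataFrames, and the seed is arbitrary.

```julia
using Random

# Hypothetical helper: returns shuffled, disjoint row indices for a train/test split.
function split_indices(rng::AbstractRNG, n::Integer, at::Real = 0.7)
    idx = shuffle(rng, 1:n)      # random permutation of 1:n, driven by `rng`
    cut = floor(Int, at * n)     # number of rows that go to the training set
    return idx[1:cut], idx[cut+1:end]
end

# 150 is the number of rows in iris; the seed only makes the run repeatable.
train_idx, test_idx = split_indices(MersenneTwister(1), 150, 0.7)
```

The two index vectors can then be used as `data[train_idx, :]` and `data[test_idx, :]`, exactly as in the snippet above.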
Given a dataframe df, to pull out e.g. 5% as a training set you could presumably just do:
using Random  # randsubseq lives in the Random standard library on Julia 0.7 and later

sample = randsubseq(1:size(df,1), 0.05)
trainingset = df[sample, :]
notsample = [i for i in 1:size(df,1) if isempty(searchsorted(sample, i))]
testset = df[notsample, :]
i.e. using the built-in randsubseq function to efficiently get a random (ordered) subset sample of the rows, without repetition, where each row is included with probability 0.05.
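One thing to keep in mind with this approach: because every row is kept independently, the size of the sample is itself random and only averages out to 5% of the rows. A minimal sketch on plain indices, assuming a made-up row count `n` and an arbitrary seed, and using `setdiff` as an alternative way to build the complement:

```julia
using Random

rng = MersenneTwister(42)              # fixed seed, only so the run is repeatable
n = 1_000                              # hypothetical number of rows
sample = randsubseq(rng, 1:n, 0.05)    # each index kept independently with probability 0.05
notsample = setdiff(1:n, sample)       # every index that did not land in `sample`

# `sample` is sorted and has no repeats, but its length varies from run to run
# (about 0.05 * n on average) rather than being exactly 5% of the rows.
println(length(sample), " of ", n, " rows sampled")
```

If you need an exact number of rows in each set, shuffling the indices and cutting at a fixed position is the safer choice.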
Thanks so far; at least I have a solution now.