Simple tool for train test split


#1

Does anyone know a tool that enables me to split a dataframe into a training set and a test set?


#2

The future of iteration and dataset partitioning is soon to be merged into MLDataUtils: https://github.com/JuliaML/MLDataUtils.jl/pull/16.

There isn’t native support for DataFrames yet, but you would only need to define `nobs`/`getobs` for an `AbstractDataFrame` (or `AbstractTable`?), and all of this functionality would become available.
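A minimal sketch of what such definitions might look like. This is an assumption about the interface, not the merged implementation: in MLDataUtils these would extend the `nobs`/`getobs` generics it builds on (e.g. `import LearnBase: nobs, getobs`); they are shown here as plain functions so the sketch is self-contained.

```julia
using DataFrames

# Hypothetical sketch: treat each row of a dataframe as one observation.
nobs(df::AbstractDataFrame) = nrow(df)

# Return the observation(s) at the given row index or indices.
getobs(df::AbstractDataFrame, idx) = df[idx, :]
```

With definitions like these in place, generic partitioning functions that only rely on `nobs`/`getobs` could work on dataframes directly.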


#3

I secretly wish this question would not come up for at least a week until I found the time to write the docs and tutorials :slight_smile:

There will be DataFrame support, or at least a tutorial that shows how to use DataFrames, but I need some time to finish this up.

Until then I would advise simply doing it manually. I assume each row represents an observation? If so, you could use this little snippet:

using Random: shuffle
using DataFrames

function partitionTrainTest(data, at = 0.7)
    n = nrow(data)
    idx = shuffle(1:n)                            # random permutation of row indices
    train_idx = view(idx, 1:floor(Int, at*n))     # first `at` fraction for training
    test_idx = view(idx, (floor(Int, at*n)+1):n)  # the rest for testing
    data[train_idx, :], data[test_idx, :]
end

using RDatasets
iris = dataset("datasets", "iris")
train, test = partitionTrainTest(iris, 0.7) # 70% train, 30% test

#4

Given a dataframe `df`, to pull out e.g. 5% as a training set you could presumably just do:

using Random: randsubseq

sample = randsubseq(1:size(df, 1), 0.05)  # each row index kept with probability 0.05
trainingset = df[sample, :]
notsample = [i for i in 1:size(df, 1) if isempty(searchsorted(sample, i))]
testset = df[notsample, :]

i.e. using the built-in `randsubseq` function to efficiently draw a random (ordered) subset of the row indices, without repetition, where each row is included independently with probability 0.05.
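The complement step can also be written with `setdiff` instead of the comprehension with `searchsorted`; a minimal sketch, using a stand-in `n` for `size(df, 1)`:

```julia
using Random: randsubseq

n = 100                          # stand-in for size(df, 1)
sample = randsubseq(1:n, 0.05)   # sorted indices, no repetition, each kept with prob. 0.05

# The indices not drawn form the test set; setdiff keeps their original order here.
notsample = setdiff(1:n, sample)

# The two index sets partition 1:n exactly.
@assert sort(vcat(sample, notsample)) == collect(1:n)
```

The `searchsorted` version avoids materializing the complement set lookup and exploits the fact that `randsubseq` returns its result sorted; `setdiff` is shorter but does more work.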


#5

Thanks so far, at least I have a solution now.


#6

I used the MLDataUtils package and finally figured out how to split dataframes and feed them into a Random Forest.

Here is a sample from my notebook:

using MLDataUtils
using DecisionTree

# Convert the dataframe into plain arrays
# (:SpeciesEnumerator is assumed to be a previously created integer encoding of the species)
x = Matrix(iris[:, 1:4])
y = Vector{Int64}(iris[:, :SpeciesEnumerator])

# Shuffle the data so it's not in order when we split it up
# (MLDataUtils treats columns as observations, hence the transpose)
Xs, Ys = shuffleobs((transpose(x), y))

# Now split the data into a training set and a validation set
(X_train1, y_train1), (X_test1, y_test1) = splitobs((Xs, Ys); at = 0.67)

# Convert the split data (views) back into arrays
x_train = Array(transpose(X_train1))
y_train = Array(y_train1)
x_test = Array(transpose(X_test1))
y_test = Array(y_test1)

# Create a random forest model
model = RandomForestClassifier(n_subfeatures = 3, n_trees = 50,
                               partial_sampling = 0.7, max_depth = 4)

# Train the model on the training set
DecisionTree.fit!(model, x_train, y_train)

# Apply the trained model to the test feature set
prediction = convert(Array{Int64,1}, DecisionTree.predict(model, x_test))

#7

This is really helpful for a Julia novice like me.