Simple tool for train test split


#1

Does anyone know a tool that enables me to split a dataframe into a training set and a test set?


#2

The future of iteration and dataset partitioning is soon to be merged into MLDataUtils: https://github.com/JuliaML/MLDataUtils.jl/pull/16.

There isn’t native support for DataFrames but you would only need to define nobs/getobs for an AbstractDataFrame (or AbstractTable?) and all this functionality will be available.


#3

I secretly wished this question would not come up for at least another week, until I had found the time to write the docs and tutorials :slight_smile:

There will be DataFrame support, or at least a tutorial that shows how to use DataFrames, but I need some time to finish this up.

Until then I would advise simply doing it manually. I assume each row represents an observation? If so, you could use this little snippet:

using Random  # shuffle lives in the Random stdlib on Julia 1.0+

function partitionTrainTest(data, at = 0.7)
    n = nrow(data)
    idx = shuffle(1:n)
    train_idx = view(idx, 1:floor(Int, at*n))
    test_idx = view(idx, (floor(Int, at*n)+1):n)
    data[train_idx,:], data[test_idx,:]
end

using RDatasets
iris = dataset("datasets", "iris")
train,test = partitionTrainTest(iris, 0.7) # 70% train

#4

Given a dataframe df, to pull out e.g. 5% as a training set you could presumably just do:

using Random  # randsubseq lives in the Random stdlib on Julia 1.0+
sample = randsubseq(1:size(df,1), 0.05)
trainingset = df[sample, :]
notsample = [i for i in 1:size(df,1) if isempty(searchsorted(sample, i))]
testset = df[notsample, :]

i.e. using the built-in randsubseq function to efficiently get a random (ordered) subset sample of the rows, without repetition, where each row is included with probability 0.05.
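
As a minimal sketch of the same idea, the complement of the sample can also be built from a boolean mask instead of a per-row search. This assumes Julia 1.0+ (where randsubseq is in the Random stdlib); `n`, `sample_rows`, `in_sample` and `rest_rows` are illustrative names, with `n` standing in for size(df,1).

```julia
using Random

n = 1_000                             # stand-in for size(df, 1)
sample_rows = randsubseq(1:n, 0.05)   # sorted subset; each row kept with probability 0.05

# Mark the sampled rows in a boolean mask, then negate it to get the complement.
in_sample = falses(n)
in_sample[sample_rows] .= true
rest_rows = findall(.!in_sample)      # all row numbers that were not sampled
```

The two index vectors can then be used as `df[sample_rows, :]` and `df[rest_rows, :]`. Note that the sample size is random: it is about 5% of the rows on average, not exactly 5%.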


#5

Thanks so far, at least I have a solution now.


#6

I used the MLDataUtils package and finally figured out how to do this with dataframes into a Random Forest:

Here is my sample from my notebook:

using MLDataUtils
using DecisionTree  # provides RandomForestClassifier, fit! and predict

#convert the dataframes into arrays
x = convert(Array,iris[1:4])
y = Array{Int64}(iris[:SpeciesEnumerator])

# shuffle the data so it is not in order when we split it up
Xs, Ys = shuffleobs((transpose(x), y))

#now split the data into training sets and validation sets
(X_train1, y_train1), (X_test1, y_test1) = splitobs((Xs, Ys); at = 0.67)
    
# need to convert the split data back into arrays
x_train = Array(transpose(X_train1))
y_train = Array(y_train1)
x_test = Array(transpose(X_test1))
y_test = Array(y_test1)

# Create a random forest model
model = RandomForestClassifier(n_subfeatures = 3, n_trees = 50, partial_sampling=0.7, max_depth = 4)

# Train the model on the dataset 
DecisionTree.fit!(model, x_train, y_train)

# Apply the trained model to the test features data set 
prediction = convert(Array{Int64,1}, DecisionTree.predict(model, x_test))

#7

This is really helpful for a Julia novice like me.


#8

Plug and play snippet for train / test data sample split if your data is in the format of a multi-dimensional array.
@Evizero has a snippet above if your data is in DataFrame format.

"""
    partitionTrainTest(data, y_column, train_perc = 0.7)

Returns X_train, y_train, X_test, y_test given inputs data (multi-dimensional array),
train_perc (fraction of the data in the train sample) and y_column (column location of the y values).
"""
function partitionTrainTest(data, y_column, train_perc = 0.7)
    n = size(data,1)
    idx = shuffle(1:n)
    # use the train_perc argument; `at` is not defined in this function
    train_idx = view(idx, 1:floor(Int, train_perc*n))
    test_idx = view(idx, (floor(Int, train_perc*n)+1):n)
    data[train_idx, 1:end .!=y_column], data[train_idx, y_column],  data[test_idx, 1:end .!=y_column], data[test_idx, y_column]
end

#9

A 3-partition split is more useful than a 2-partition split if you ever want an evaluation of your model that is convincing to anyone else. A 2-partition split cannot convincingly show that your predictions are any good.
Furthermore, sometimes even 4 or more partitions are needed (typically seen with deep learning models). This is because fresh data is needed for an unbiased error estimate of the final model, after the learning algorithm and your hyperparameter choices, such as L2 regularization, have seemingly minimized the bias and variance errors. You cannot use the “test” partition of a 2-partition split, on which you chose the best L2 strength or polynomial degree or whatever, to evaluate your model honestly (without bias).

How would you make a 3 partition split: train/dev/test?

Question 2: Is the implementation going to work on big data? We don’t generally have the RAM to load an entire dataset once, never mind twice, so please use care to avoid making a duplicate memory allocation of original data (monolithic) as well as partitioned data (2, 3, or more partitions as I described).

Thank you!


#10

Frankly, I am worried that the function name chosen here bakes in the assumption that only 2 partitions (train, test) will be possible once that documentation is complete. Are you guys going to allow 3 partitions or more (preferably more)? What does the library author plan to name the function that supports slicing into more than 2 data partitions? (See my other post for more background on partitioning.)


#11

Check out MLDataUtils.jl, particularly splitobs. See this page for a 3-partition example.
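
For a rough, package-free picture of what a train/dev/test split involves, here is a minimal sketch using only the Random stdlib. It is not the MLDataUtils implementation: `partition_indices` is a hypothetical helper written for this illustration, and the 60/20/20 proportions and the toy matrix `X` are assumptions. Taking views of the rows means the partitions refer to the original array rather than duplicating it, which touches on the memory question raised above.

```julia
using Random

# Hypothetical helper (not from MLDataUtils): shuffle the row indices 1:n and
# cut them into length(fractions) + 1 consecutive groups. The last group
# receives whatever rows remain.
function partition_indices(n::Integer, fractions)
    idx = shuffle(1:n)
    stops = floor.(Int, cumsum(collect(fractions)) .* n)
    starts = [1; stops .+ 1]
    stops = [stops; n]
    return [idx[a:b] for (a, b) in zip(starts, stops)]
end

X = rand(100, 4)   # toy data: 100 observations (rows), 4 features
train_idx, dev_idx, test_idx = partition_indices(size(X, 1), (0.6, 0.2))

# view references the selected rows of X instead of copying them
X_train = view(X, train_idx, :)
X_dev   = view(X, dev_idx, :)
X_test  = view(X, test_idx, :)
```

Passing more fractions, e.g. `(0.5, 0.2, 0.2)`, gives four partitions in the same way. Only the small index vectors are allocated here; the data itself stays in `X`.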

The post you are replying to is almost 2 years old, and there’s been some progress since then :wink:


#12

Thank you for the reply and for reading the docs

I would mark your post as an answer to this ancient thread if I could