Simple tool for train test split

klezmen · November 21, 2016, 3:16pm

Does anyone know a tool that enables me to split a dataframe into a training set and a test set?

tbreloff · November 21, 2016, 4:18pm

The future of iteration and dataset partitioning is soon to be merged into MLDataUtils: see "Data Access Pattern in 0.5" by Evizero (Pull Request #16, JuliaML/MLDataUtils.jl on GitHub).

There isn’t native support for DataFrames but you would only need to define nobs/getobs for an AbstractDataFrame (or AbstractTable?) and all this functionality will be available.
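
A library-free sketch of the idea in this post: if a container can answer "how many observations?" and "give me the observations at these indices", generic splitting code can be written against those two functions alone. The `nobs` and `getobs` below are local stand-ins defined in the snippet itself, not the MLDataUtils functions, and `ToyTable` and `split_at` are hypothetical names used only for illustration.

```julia
using Random

# Hypothetical column-oriented table standing in for a DataFrame.
struct ToyTable
    columns::Dict{Symbol, Vector{Int}}
end

# Local stand-ins for the two functions a data container would need to provide.
nobs(t::ToyTable) = length(first(values(t.columns)))
getobs(t::ToyTable, idx) = ToyTable(Dict(name => col[idx] for (name, col) in t.columns))

# Hypothetical generic split, written only in terms of nobs/getobs.
function split_at(data, at = 0.7; rng = Random.default_rng())
    n = nobs(data)
    idx = shuffle(rng, 1:n)          # random permutation of the observation indices
    cut = floor(Int, at * n)
    return getobs(data, idx[1:cut]), getobs(data, idx[cut+1:end])
end

table = ToyTable(Dict(:x => collect(1:10), :y => collect(11:20)))
train, test = split_at(table, 0.7)
```

Because `split_at` never touches the table's internals, any other container that defines the same two functions could be passed to it unchanged.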

Evizero · November 21, 2016, 4:29pm

I secretly wished this question would not come up for at least another week, until I had found the time to write the docs and tutorials

There will be DataFrame support, or at least a tutorial that shows how to use DataFrames, but I need some time to finish this up.

Until then I would advise simply doing it manually. I assume each row represents an observation? If so, you could use this little snippet:

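using DataFrames, Random  # nrow and shuffle; on Julia 0.7+ shuffle lives in Random

function partitionTrainTest(data, at = 0.7)
    n = nrow(data)
    idx = shuffle(1:n)                      # random permutation of the row indices
    n_train = floor(Int, at * n)            # number of rows in the training set
    train_idx = view(idx, 1:n_train)
    test_idx = view(idx, (n_train + 1):n)
    data[train_idx, :], data[test_idx, :]
end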

using RDatasets
iris = dataset("datasets", "iris")
train,test = partitionTrainTest(iris, 0.7) # 70% train

stevengj · November 21, 2016, 4:32pm

Given a dataframe df, to pull out e.g. 5% as a training set you could presumably just do:

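using Random  # randsubseq lives in the Random stdlib on Julia 0.7 and later

sample = randsubseq(1:size(df,1), 0.05)
trainingset = df[sample, :]
# sample is sorted, so searchsorted can test membership efficiently
notsample = [i for i in 1:size(df,1) if isempty(searchsorted(sample, i))]
testset = df[notsample, :]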

i.e. using the built-in randsubseq function to efficiently get a random (ordered) subset sample of the rows, without repetition, where each row is included with probability 0.05.
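
A small stdlib-only sketch of what `randsubseq` returns, on plain indices rather than a dataframe. Because each index is kept independently with probability `p`, the subset size is only approximately `p * n`. `setdiff` is used here as one alternative way to get the complement; the values `n = 1000` and `p = 0.05` are arbitrary.

```julia
using Random

n = 1000
p = 0.05

# Each index in 1:n is kept independently with probability p; order is preserved.
sample_idx = randsubseq(1:n, p)

# Everything that was not sampled; together the two sets cover 1:n exactly once.
rest_idx = setdiff(1:n, sample_idx)

println("sampled ", length(sample_idx), " of ", n, " rows")
```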

klezmen · November 21, 2016, 10:45pm

Thanks so far; at least I have a solution now.

microgold · September 16, 2018, 5:03pm

I used the MLDataUtils package and finally figured out how to do this with dataframes into a Random Forest:

Here is my sample from my notebook:

using MLDataUtils, DecisionTree

# convert the dataframe columns into arrays
# (SpeciesEnumerator is an integer-coded species column, presumably added to iris beforehand)
x = Matrix(iris[:, 1:4])
y = Vector{Int64}(iris[:, :SpeciesEnumerator])

# shuffle the data so it's not in order when we split it up
# (MLDataUtils treats the last array dimension as the observations, hence the transpose)
Xs, Ys = shuffleobs((transpose(x), y))

# now split the data into training sets and validation sets
(X_train1, y_train1), (X_test1, y_test1) = splitobs((Xs, Ys); at = 0.67)

# need to convert the split data back into arrays
x_train = Array(transpose(X_train1))
y_train = Array(y_train1)
x_test = Array(transpose(X_test1))
y_test = Array(y_test1)

# Create a random forest model
model = RandomForestClassifier(n_subfeatures = 3, n_trees = 50, partial_sampling = 0.7, max_depth = 4)

# Train the model on the dataset
DecisionTree.fit!(model, x_train, y_train)

# Apply the trained model to the test features data set
prediction = convert(Array{Int64,1}, DecisionTree.predict(model, x_test))

Ajaychat3 · September 18, 2018, 1:20pm

This is really helpful for a Julia novice like me.

Mikkel-Holm · September 25, 2018, 10:04am

Plug and play snippet for train / test data sample split if your data is in the format of a multi-dimensional array.
@Evizero has a snippet above if your data is in DataFrame format.

using Random  # shuffle lives in the Random stdlib on Julia 0.7 and later

"""
    partitionTrainTest(data, y_column, train_perc = 0.7)

Return `X_train, y_train, X_test, y_test` given `data` (a multi-dimensional array with one
observation per row), `y_column` (the column location of the y values), and `train_perc`
(the fraction of rows placed in the train sample).
"""
function partitionTrainTest(data, y_column, train_perc = 0.7)
    n = size(data,1)
    idx = shuffle(1:n)
    train_idx = view(idx, 1:floor(Int, train_perc*n))
    test_idx = view(idx, (floor(Int, train_perc*n)+1):n)
    data[train_idx, 1:end .!=y_column], data[train_idx, y_column],  data[test_idx, 1:end .!=y_column], data[test_idx, y_column]
end
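
The column selection `1:end .!=y_column` in the last line of the function may look cryptic. A tiny sketch of how that mask behaves, on a made-up 2×3 matrix:

```julia
A = [1 2 3;
     4 5 6]
y_column = 2

# Inside the index, `end` is the number of columns, so this compares 1:3 with 2
# elementwise and yields the Boolean mask [true, false, true].
X = A[:, (1:end) .!= y_column]   # every column except column 2
y = A[:, y_column]               # only column 2
```

The parentheses around `1:end` are optional, since the range operator binds more tightly than the comparison; they are added here only for readability.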

Geoffrey_Anderson · October 1, 2018, 6:59pm

A 3-partition split is more useful than a 2-partition split if you ever want an evaluation of your model that is convincing to anyone else. A 2-partition split cannot convincingly show that your predictions are any good.
Furthermore, sometimes even 4 or more partitions are needed (typically with deep learning models). This is because fresh data is needed for an unbiased error estimate of the final model, after the learning algorithm and your hyperparameter choices (such as L2 regularization) have seemingly minimized the bias and variance errors. You cannot honestly (without bias) evaluate your model on the “test” partition of a 2-partition split if you used that same partition to choose the best L2 penalty, polynomial degree, or whatever.

How would you make a 3-partition split: train/dev/test?

Question 2: Is the implementation going to work on big data? We don’t generally have the RAM to load an entire dataset once, never mind twice, so please use care to avoid making a duplicate memory allocation of original data (monolithic) as well as partitioned data (2, 3, or more partitions as I described).

Thank you!

Geoffrey_Anderson · October 1, 2018, 7:02pm

Frankly, I am worried that, given the function name baked in here, only 2 partitions (train and test) are going to be possible when that documentation is complete. Are you guys going to allow 3 partitions or more (preferably more)? What does the library author plan to name the function that supports slicing into more than 2 data partitions? (See my other post for more background on partitioning.)

platawiec · October 1, 2018, 9:14pm

Check out MLDataUtils.jl, particularly splitobs. See this page for a 3-partition example.
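
For readers without the package at hand, here is a library-free sketch of a train/dev/test split. It is not the `splitobs` implementation; `three_way_split` and the 60/20/20 fractions are assumptions made for illustration. It shuffles the row indices once and returns views, so the rows of the original matrix are not copied.

```julia
using Random

# Hypothetical helper: split the rows of `data` into train/dev/test views.
function three_way_split(data::AbstractMatrix, fractions = (0.6, 0.2);
                         rng = Random.default_rng())
    n = size(data, 1)
    idx = shuffle(rng, 1:n)                      # one permutation shared by all parts
    n_train = floor(Int, fractions[1] * n)
    n_dev = floor(Int, fractions[2] * n)
    train = view(data, idx[1:n_train], :)
    dev = view(data, idx[n_train+1:n_train+n_dev], :)
    test = view(data, idx[n_train+n_dev+1:end], :)   # whatever is left over
    return train, dev, test
end

data = reshape(collect(1.0:40.0), 10, 4)   # 10 observations, 4 columns
train, dev, test = three_way_split(data)
```

Only the small index vectors are allocated; each returned view refers back to `data`. The trade-off is that views over shuffled rows are non-contiguous, so access can be slower than on a copied array.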

The post you are replying to is almost 2 years old, and there’s been some progress since then

Evizero · October 1, 2018, 9:51pm

Thank you for the reply and for reading the docs

I would mark your post as an answer to this ancient thread if I could

microgold · March 20, 2020, 1:09am

The third set was in a separate file, but I do see your point!
