Does anyone know a tool that enables me to split a dataframe into a training set and a test set?

# Simple tool for train test split

**tbreloff**#2

The future of iteration and dataset partitioning is soon to be merged into MLDataUtils: https://github.com/JuliaML/MLDataUtils.jl/pull/16.

There isn’t native support for DataFrames, but you would only need to define `nobs`/`getobs` for an AbstractDataFrame (or AbstractTable?) and all this functionality will be available.
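As a rough sketch of what those two definitions could look like (the exact module and method names depend on what the PR merges; `LearnBase` as the home of `nobs`/`getobs` is an assumption here):

```
# Hypothetical sketch: wiring an AbstractDataFrame into the observation API.
using DataFrames
import LearnBase: nobs, getobs

nobs(df::AbstractDataFrame) = nrow(df)           # number of observations = number of rows
getobs(df::AbstractDataFrame, idx) = df[idx, :]  # fetch observation(s) by row index
```

With those two methods in place, the iteration and partitioning machinery from the PR should work on DataFrames without further changes.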

**Evizero**#3

I secretly wish this question would not come up for at least a week, until I have found the time to write the docs and tutorials.

There will be DataFrame support, or at least a tutorial that shows how to use DataFrames, but I need some time to finish this up.

Until then I would advise simply doing it manually. I assume each row represents an observation? If so, you could use this little snippet:

```
using Random, DataFrames

function partitionTrainTest(data, at = 0.7)
    n = nrow(data)
    idx = shuffle(1:n)
    train_idx = view(idx, 1:floor(Int, at*n))
    test_idx = view(idx, (floor(Int, at*n)+1):n)
    data[train_idx, :], data[test_idx, :]
end

using RDatasets
iris = dataset("datasets", "iris")
train, test = partitionTrainTest(iris, 0.7) # 70% train
```
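If you need the same split on every run (e.g. for reproducible experiments), one option is a variant of the snippet above that takes an explicit RNG and passes it to `shuffle` (the `rng` argument is an addition, not part of the original snippet):

```
using Random, DataFrames

# Same idea as above, but with an explicit RNG so the split is reproducible.
function partitionTrainTest(rng::AbstractRNG, data, at = 0.7)
    n = nrow(data)
    idx = shuffle(rng, 1:n)
    train_idx = view(idx, 1:floor(Int, at*n))
    test_idx = view(idx, (floor(Int, at*n)+1):n)
    data[train_idx, :], data[test_idx, :]
end

df = DataFrame(x = 1:10)
train1, _ = partitionTrainTest(MersenneTwister(42), df, 0.7)
train2, _ = partitionTrainTest(MersenneTwister(42), df, 0.7)
# train1 == train2: the same seed gives the same split
```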

**stevengj**#4

Given a dataframe `df`, to pull out e.g. 5% as a training set you could presumably just do:
```
using Random

sample = randsubseq(1:size(df, 1), 0.05)   # ordered random subset of row indices
trainingset = df[sample, :]
notsample = [i for i in 1:size(df, 1) if isempty(searchsorted(sample, i))]
testset = df[notsample, :]
```

i.e. using the built-in `randsubseq` function to efficiently get a random (ordered) subset `sample` of the rows, without repetition, where each row is included with probability `0.05`.
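Because `randsubseq` returns the indices in sorted order, `searchsorted` can be used for the fast membership test above. A quick self-contained check on toy data confirms that the two index sets really partition the rows (the `DataFrame` here is just a stand-in for `df`):

```
using Random, DataFrames

df = DataFrame(x = 1:1_000)
sample = randsubseq(1:nrow(df), 0.05)   # ordered random subset, no repetitions
notsample = [i for i in 1:nrow(df) if isempty(searchsorted(sample, i))]

@assert issorted(sample)                                   # required for searchsorted
@assert length(sample) + length(notsample) == nrow(df)     # every row lands in one set
@assert isempty(intersect(sample, notsample))              # and only one set
```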

**microgold**#6

I used the MLDataUtils package and finally figured out how to get from a DataFrame into a random forest. Here is a sample from my notebook:

```
using MLDataUtils, DecisionTree

# convert the DataFrame columns into plain arrays
x = Matrix(iris[:, 1:4])
y = Vector{Int64}(iris[:, :SpeciesEnumerator])
# shuffle the data so it's not in order when we split it up
Xs, Ys = shuffleobs((transpose(x), y))
# now split the data into a training set and a validation set
(X_train1, y_train1), (X_test1, y_test1) = splitobs((Xs, Ys); at = 0.67)
# convert the split data back into arrays
x_train = Array(transpose(X_train1))
y_train = Array(y_train1)
x_test = Array(transpose(X_test1))
y_test = Array(y_test1)
# create a random forest model
model = RandomForestClassifier(n_subfeatures = 3, n_trees = 50, partial_sampling = 0.7, max_depth = 4)
# train the model on the training set
DecisionTree.fit!(model, x_train, y_train)
# apply the trained model to the test feature set
prediction = convert(Array{Int64,1}, DecisionTree.predict(model, x_test))
```
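To gauge how the model does on the held-out split, one simple follow-up is to compare the predictions against the true labels. The vectors below are hypothetical stand-ins to keep the example self-contained; in the notebook above you would use `prediction` and `y_test` instead:

```
using Statistics

# Hypothetical predictions vs. ground truth, just to illustrate the computation.
prediction = [1, 2, 2, 3, 1]
y_test     = [1, 2, 3, 3, 1]

accuracy = mean(prediction .== y_test)  # fraction classified correctly
# → 0.8
```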