[ANN] DataSplits.jl - data splitting for model selection

davide.crucitti · May 25, 2026, 10:46am

Hi all,

I’m pleased to announce DataSplits.jl, a new package for constructing train/test splits and cross-validation folds for model selection.

The motivation is simple: when benchmarking models, the split is often as important as the model. Random splits are not always appropriate, especially when observations are grouped, ordered in time, spatially or chemically structured, or otherwise dependent. In those cases, the splitting strategy should be explicit, reproducible, and easy to inspect.

DataSplits.jl provides a common interface for a range of splitting strategies through a single entry point, partition. It works with matrices, Tables, and MLUtils-compatible containers.

For example:

using DataSplits

# Diversity-based split: training set covers the feature space
res = partition(X, KennardStoneSplit(); train = 0.8, test = 0.2)
X_train, X_test = splitdata(res, X)

# Cover features and response jointly
res = partition(X, SPXYSplit(); target = y, train = 80, test = 20)

# Group-aware k-fold: no group appears in more than one fold
cvs = partition(X, GroupKFold(5); groups = patient_ids)

for (X_tr, X_te) in splitview(cvs, X)
    fit!(model, X_tr)
    evaluate(model, X_te)
end

# Train / validation / test in one call
res = partition(
    X,
    RandomSplit(),
    KennardStoneSplit();
    train = 70,
    validation = 10,
    test = 20,
)
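
For readers who have not met Kennard–Stone before, here is a minimal sketch of the textbook selection rule in plain Julia. It is not the package's implementation, and the function name kennard_stone_indices is hypothetical. It assumes observations are stored in rows and uses Euclidean distance: seed with the two most distant points, then repeatedly add the point that is farthest from its nearest already-selected neighbour.

```julia
using LinearAlgebra

# Illustrative Kennard–Stone selection (hypothetical helper, not DataSplits.jl API).
# Assumes observations are the rows of X. Returns the indices of the
# n_train selected rows, in the order they were picked.
function kennard_stone_indices(X::AbstractMatrix, n_train::Integer)
    n = size(X, 1)
    1 <= n_train <= n || throw(ArgumentError("n_train must be in 1:$n"))

    # Full pairwise Euclidean distance matrix between rows: O(n^2) memory.
    D = [norm(X[i, :] .- X[j, :]) for i in 1:n, j in 1:n]

    # Seed with the two most distant points.
    i, j = Tuple(argmax(D))
    selected = n_train == 1 ? [i] : [i, j]

    # min_dist[k] = distance from row k to its nearest selected row.
    min_dist = vec(minimum(D[:, selected]; dims = 2))

    while length(selected) < n_train
        min_dist[selected] .= -Inf      # never re-pick a selected row
        next = argmax(min_dist)         # farthest from the selected set
        push!(selected, next)
        min_dist .= min.(min_dist, D[:, next])
    end
    return selected
end

# Four points on a line at 0, 1, 10, and 5.
X_demo = [0.0 0.0; 1.0 0.0; 10.0 0.0; 5.0 0.0]
train_idx = kennard_stone_indices(X_demo, 3)
test_idx = setdiff(1:size(X_demo, 1), train_idx)
```

The two extremes (rows 1 and 3) are picked first, then the midpoint (row 4), leaving row 2 for the test set. The dense distance matrix in this sketch is the part that stops scaling on larger datasets.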

The package also integrates with MLJ.jl through rowpairs:

using MLJ

cvs = partition(X, StratifiedKFold(5); target = y)

mach = machine(model, X, y)
evaluate!(mach; resampling = rowpairs(cvs), measure = accuracy)

and with Flux.jl through trainview, validationview, and testview, which return lazy views:

using Flux

res = partition(X, KennardStoneSplit(); train = 0.8, test = 0.2)

loader = Flux.DataLoader(
    trainview(res, X, y);
    batchsize = 64,
    shuffle = true,
)
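
Both integrations above come down to sets of row indices: MLJ's evaluate! accepts a vector of (train_rows, test_rows) tuples as its resampling argument, and the Flux views select training rows. As a rough sketch of what group-aware row pairs look like, here is a plain-Julia version. The helper name group_kfold_rowpairs and its greedy balancing rule are assumptions for illustration, not the package's GroupKFold.

```julia
# Illustrative group-aware k-fold (hypothetical helper, not DataSplits.jl API).
# Whole groups are assigned to folds, largest groups first, each going to the
# fold that currently holds the fewest rows.
# Returns a vector of (train_rows, test_rows) tuples.
function group_kfold_rowpairs(groups::AbstractVector, k::Integer)
    ids = unique(groups)
    1 <= k <= length(ids) ||
        throw(ArgumentError("k must be between 1 and the number of distinct groups"))

    rows_by_group = Dict(g => findall(==(g), groups) for g in ids)
    order = sort(ids; by = g -> length(rows_by_group[g]), rev = true)

    folds = [Int[] for _ in 1:k]
    for g in order
        smallest = argmin(length.(folds))
        append!(folds[smallest], rows_by_group[g])
    end

    all_rows = collect(eachindex(groups))
    return [(setdiff(all_rows, f), sort(f)) for f in folds]
end

patient_demo = ["a", "a", "b", "c", "c", "c", "d"]
row_pairs = group_kfold_rowpairs(patient_demo, 2)
```

Every row lands in exactly one test fold, and no group is shared between the train and test side of any pair.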

Currently implemented strategies include:

random and stratified splits
k-fold, stratified k-fold, grouped k-fold, and repeated variants
time-aware strategies such as TimeSeriesSplit, BlockedCV, and PurgedKFold
nested cross-validation
distance- and diversity-based methods such as Kennard–Stone, SPXY, MDKS, and OptiSim
lazy variants of the distance-based methods for larger datasets
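
To make the time-aware idea concrete, here is a minimal sketch of an expanding-window split in plain Julia. It shows the general technique only. The function name expanding_window_splits and its parameters are hypothetical, and the package's TimeSeriesSplit may differ in its options and behaviour. Each split trains on everything up to a cut-off and tests on the block that immediately follows, so the model never sees the future.

```julia
# Illustrative expanding-window splits (hypothetical helper, not DataSplits.jl API).
# Assumes rows 1:n are already sorted by time.
# Returns a vector of (train_range, test_range) tuples.
function expanding_window_splits(n::Integer, n_splits::Integer, test_size::Integer)
    (n_splits >= 1 && test_size >= 1) ||
        throw(ArgumentError("n_splits and test_size must be positive"))
    first_train_end = n - n_splits * test_size
    first_train_end >= 1 ||
        throw(ArgumentError("not enough rows for the requested splits"))

    splits = Tuple{UnitRange{Int}, UnitRange{Int}}[]
    for s in 1:n_splits
        train_end = first_train_end + (s - 1) * test_size
        push!(splits, (1:train_end, (train_end + 1):(train_end + test_size)))
    end
    return splits
end

ts_splits = expanding_window_splits(10, 3, 2)
# ts_splits == [(1:4, 5:6), (1:6, 7:8), (1:8, 9:10)]
```

Purged or blocked variants would additionally drop rows around each boundary; that refinement is left out of this sketch.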

The package has just been submitted to the General registry, so it should become installable with:

] add DataSplits

There is still the usual registration waiting period before merging, so feedback on the package name is also welcome.

More importantly, I would appreciate feedback on the API before committing to a stable 1.0 interface. If anything feels awkward, inconsistent, or missing from your workflow, I would much rather hear about it now.

Docs: DataSplits.jl
GitHub: davide-grheco/DataSplits.jl
