Hi all,
I’m pleased to announce DataSplits.jl, a new package for constructing train/test splits and cross-validation folds for model selection.
The motivation is simple: when benchmarking models, the split is often as important as the model. Random splits are not always appropriate, especially when observations are grouped, ordered in time, spatially or chemically structured, or otherwise dependent. In those cases, the splitting strategy should be explicit, reproducible, and easy to inspect.
DataSplits.jl provides a common interface for a range of splitting strategies through a single entry point, partition. It works with matrices, Tables, and MLUtils-compatible containers.
For example:
using DataSplits
# Diversity-based split: training set covers the feature space
res = partition(X, KennardStoneSplit(); train = 0.8, test = 0.2)
X_train, X_test = splitdata(res, X)
# Cover features and response jointly
res = partition(X, SPXYSplit(); target = y, train = 80, test = 20)
# Group-aware k-fold: no group appears in more than one fold
cvs = partition(X, GroupKFold(5); groups = patient_ids)
for (X_tr, X_te) in splitview(cvs, X)
    fit!(model, X_tr)
    evaluate(model, X_te)
end
# Train / validation / test in one call
res = partition(
    X,
    RandomSplit(),
    KennardStoneSplit();
    train = 70,
    validation = 10,
    test = 20,
)
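To make the group-aware guarantee concrete, here is a minimal sketch in plain Julia of what "no group appears in more than one fold" means. It does not use DataSplits.jl and is not how the package implements GroupKFold: the helper group_kfold_indices and the round-robin assignment of groups to folds are assumptions made purely for illustration.

```julia
# Hypothetical helper for illustration only; not part of DataSplits.jl.
# Assigns whole groups to folds so that a group never straddles train and test.
function group_kfold_indices(groups::AbstractVector, k::Integer)
    unique_groups = unique(groups)
    k <= length(unique_groups) ||
        throw(ArgumentError("need at least k distinct groups"))
    # Round-robin: the i-th distinct group goes to fold mod1(i, k).
    fold_of = Dict(g => mod1(i, k) for (i, g) in enumerate(unique_groups))
    return map(1:k) do f
        test = findall(g -> fold_of[g] == f, groups)
        train = findall(g -> fold_of[g] != f, groups)
        (train, test)
    end
end

patient_ids = ["a", "a", "b", "c", "c", "d"]
folds = group_kfold_indices(patient_ids, 2)

# Check for leakage: train and test must share no patient.
for (train, test) in folds
    shared = intersect(patient_ids[train], patient_ids[test])
    println("test rows = ", test, ", shared groups = ", length(shared))
end
```

A plain random k-fold on the same six rows could easily put one row of patient "a" in training and the other in test, which is exactly the leakage a group-aware split avoids.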
The package also integrates with MLJ.jl through rowpairs:
using MLJ
cvs = partition(X, StratifiedKFold(5); target = y)
mach = machine(model, X, y)
evaluate!(mach; resampling = rowpairs(cvs), measure = accuracy)
and with Flux.jl through trainview, validationview, and testview, which return lazy views:
using Flux
res = partition(X, KennardStoneSplit(); train = 0.8, test = 0.2)
loader = Flux.DataLoader(
    trainview(res, X, y);
    batchsize = 64,
    shuffle = true,
)
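For anyone unsure what "lazy" buys you here, the difference between a view and a copy can be shown with Base Julia alone. This sketch assumes the package's views behave like ordinary Base views over the selected observations; the index vector train_idx is made up for the example.

```julia
# Illustration with Base Julia only; train_idx is a made-up set of training columns.
X = collect(reshape(1.0:12.0, 3, 4))   # 3 features × 4 observations
train_idx = [1, 3, 4]

X_train_lazy = view(X, :, train_idx)   # references X, allocates no new data
X_train_copy = X[:, train_idx]         # materializes a separate matrix

# A change to the parent array is visible through the view but not the copy.
X[1, 1] = -1.0
println(X_train_lazy[1, 1])
println(X_train_copy[1, 1])
```

With large datasets, the view avoids duplicating the training data in memory before it is batched.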
Currently implemented strategies include:
- random and stratified splits
- k-fold, stratified k-fold, grouped k-fold, and repeated variants
- time-aware strategies such as TimeSeriesSplit, BlockedCV, and PurgedKFold
- nested cross-validation
- distance- and diversity-based methods such as Kennard–Stone, SPXY, MDKS, and OptiSim
- lazy variants of the distance-based methods for larger datasets
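For readers who have not met the distance-based methods before, the Kennard–Stone idea can be sketched in a few lines of plain Julia. This is a minimal illustration, not the DataSplits.jl implementation: the function name kennard_stone_indices is hypothetical, and it assumes Euclidean distance, one observation per column, and at least two distinct observations.

```julia
using LinearAlgebra

# Hypothetical helper for illustration only; not part of DataSplits.jl.
# Returns (train, test) column indices chosen by the Kennard–Stone rule.
function kennard_stone_indices(X::AbstractMatrix, n_train::Integer)
    n = size(X, 2)
    2 <= n_train <= n || throw(ArgumentError("n_train must be between 2 and $n"))
    # Full pairwise distance matrix: O(n^2) memory, fine for small data.
    D = [norm(X[:, i] - X[:, j]) for i in 1:n, j in 1:n]
    # Seed with the two mutually most distant observations.
    seed = argmax(D)
    selected = [seed[1], seed[2]]
    # min_dist[i] is the distance from observation i to its nearest selected point.
    min_dist = min.(D[:, selected[1]], D[:, selected[2]])
    while length(selected) < n_train
        min_dist[selected] .= -Inf          # never pick a point twice
        next = argmax(min_dist)             # farthest from everything chosen so far
        push!(selected, next)
        min_dist = min.(min_dist, D[:, next])
    end
    return selected, setdiff(1:n, selected)
end

X = [0.0 1.0 2.0 10.0]                      # 1 feature × 4 observations
train, test = kennard_stone_indices(X, 3)
println("train = ", train, ", test = ", test)
```

The full distance matrix is what makes the naive version expensive on large datasets, which is the motivation for lazy variants of these methods.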
The package has just been submitted to the General registry, so it should become installable with:
] add DataSplits
There is still the usual registration waiting period before merging, so feedback on the package name is also welcome.
More importantly, I would appreciate feedback on the API before committing to a stable 1.0 interface. If anything feels awkward, inconsistent, or missing from your workflow, I would much rather hear about it now.
Docs: DataSplits.jl
GitHub: davide-grheco/DataSplits.jl