Python train_test_split vs Julia splitobs

JuliaCaesar · March 16, 2017, 2:24pm

Hi,

I am having difficulty switching from Python’s train_test_split() to Julia’s splitdata().
After many tests, I wonder whether these two commands are really equivalent.

Could someone explain how they differ?

rdeits · March 16, 2017, 3:52pm

Can you please be a little more specific? Which Python package provides the train_test_split() function? And which Julia package provides your splitdata()?

JuliaCaesar · March 16, 2017, 7:23pm

Python : from sklearn.model_selection import train_test_split
Julia : using MLDataUtils

Evizero · March 16, 2017, 11:02pm

Hi, I am one of the authors of MLDataUtils. Sorry that you are experiencing difficulties.

First, let me say that we don’t attempt to resemble the API of other frameworks since the language is just very different.

Aside from that, MLDataUtils is undergoing a huge overhaul which code-wise is done, but not yet documented and merged into master. For example, splitdata will be deprecated in favour of splitobs.

The closest thing to documentation for that is here: https://github.com/JuliaML/MLDataUtils.jl/blob/dev/src/accesspattern/datasubset.jl#L543-L604

JuliaCaesar · March 17, 2017, 12:56am

No need to be sorry, I’m a bad coder and a bad mathematician.
Thanks for this information, it’s useful for my work.
I’ll check that.
–EDIT: any idea when the merge will happen?

Evizero · March 17, 2017, 8:55am

The main blocking thing is me sitting down and finishing the documentation. So hopefully soon.

JuliaCaesar · April 5, 2017, 5:27pm

Hi,

I see no update for the MLDataUtils package, but it seems that the changes have been pushed into master. I don’t have the splitobs command.

Evizero · April 5, 2017, 5:43pm

I won’t tag a new version until I finish the move of the access pattern to https://github.com/JuliaML/MLDataPattern.jl, which will be a new backend for MLDataUtils. MLDataUtils will turn into a meta package.

(I won’t tag an intermediate version because I am still breaking master here and there)

JuliaCaesar · April 5, 2017, 5:51pm

Ok, so I’ll try Pkg.update("MLDataUtils") from time to time.
Thanks for your answer.

JuliaCaesar · May 4, 2017, 6:50am

Here’s a Python block:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt

rot = np.array([[0.94, -0.34], [0.34, 0.94]])
sca = np.array([[3.4, 0], [0, 2]])

np.random.seed(150)
c1d = (np.random.randn(100,2)).dot(sca).dot(rot)

c2d1 = np.random.randn(25,2)+[-10, 2]
c2d2 = np.random.randn(25,2)+[-7, -2]
c2d3 = np.random.randn(25,2)+[-2, -6]
c2d4 = np.random.randn(25,2)+[5, -7]

data = np.concatenate((c1d, c2d1, c2d2, c2d3, c2d4))

l1c = np.ones(100, dtype=int)
l2c = np.zeros(100, dtype=int)
labels = np.concatenate((l1c, l2c))

cm = np.array(['r','g'])
plt.scatter(data[:,0],data[:,1],c=cm[labels],s=50,edgecolors='none')
plt.show()

from sklearn.model_selection import train_test_split

X_train1, X_test1, y_train1, y_test1 = train_test_split(data, labels, test_size=0.33)
plt.scatter(X_train1[:,0],X_train1[:,1],c=cm[y_train1],s=50,edgecolors='none')
plt.scatter(X_test1[:,0],X_test1[:,1],c='none',s=50,edgecolors=cm[y_test1])
plt.show()

Here’s the wannabe Julia counterpart:

rot = [0.94 -0.34; 0.34 0.94]
sca = [3.4 0; 0 2]

srand(150)
c1d = randn(100,2) * sca * rot
c2d1 = randn(25,2) .+ [-10 2]
c2d2 = randn(25,2) .+ [-7 -2]
c2d3 = randn(25,2) .+ [-2 -6]
c2d4 = randn(25,2) .+ [5 -7]

data = cat(1, c1d, c2d1, c2d2, c2d3, c2d4)

l1c = ones(Int, 100)
l2c = zeros(Int, 100)
labels = cat(1, l1c, l2c)

using Plots
scatter(data[:,1], data[:,2], color = [:red :green], groups = labels, markersize=5, markerstrokewidth = 0)

using DataFrames, DiscriminantAnalysis;
using MLDataUtils

(X_train1, y_train1), (X_test1, y_test1) = splitobs((transpose(data), labels); at = 0.67)

scatter(X_train1[1,:],X_train1[2,:],color = [:blue, :red], markersize=5, markerstrokewidth = 0)
scatter(X_test1[1,:],X_test1[2,:],color = [:blue, :red], markersize=5, markerstrokewidth = 0)

I obviously don’t get the same result.

Looking at the split, I see that y_train1 and y_test1 don’t correspond between the two languages anyway: the Python ones are a “randomized” mix of 0 and 1.

–EDIT: Sorry for the text format, something is wrong with the site’s editor.
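
The mismatch described above can be reproduced without either library. Below is a minimal sketch in plain Python, assuming the same label layout as in the post (100 ones followed by 100 zeros). The helpers `static_split` and `shuffled_split` are hypothetical names made up for this illustration; they are not part of scikit-learn or MLDataUtils, and they only mimic "cut at a fixed position" versus "shuffle, then cut".

```python
import random

# Same label layout as in the post: 100 ones followed by 100 zeros.
labels = [1] * 100 + [0] * 100

def static_split(items, at):
    """Hypothetical helper: cut the sequence at a fixed position, no shuffling."""
    cut = round(at * len(items))
    return items[:cut], items[cut:]

def shuffled_split(items, at, seed=0):
    """Hypothetical helper: shuffle a copy first, then cut at the same position."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return static_split(shuffled, at)

y_train_static, y_test_static = static_split(labels, 0.67)
y_train_random, y_test_random = shuffled_split(labels, 0.67)

# Static split: the second part contains only the tail of the data, i.e. class 0.
print(len(y_train_static), len(y_test_static))  # 134 66
print(sorted(set(y_test_static)))               # [0]

# Shuffled split: same sizes, but the classes are mixed across both parts.
print(len(y_train_random), len(y_test_random))  # 134 66
print(sum(y_test_random), "ones in the shuffled test part")
```

With ordered labels, a static cut at 0.67 puts every class-1 point into the first part and leaves only class-0 points in the second, which would explain why the two plots look so different.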

Evizero · May 4, 2017, 9:20pm

Not sure what you are asking.

Are you asking why splitobs doesn’t perform random assignment? The function splitobs performs a static split.

Here are three currently available approaches. Maybe one of them suits your needs:

The first one does not care about the target vector. In fact, it doesn’t know whether the tuple even contains targets or other features. It just splits the data at the split point.

julia> set1, set2 = splitobs((1:6, [:a,:a,:a,:a,:b,:b]), at = 0.6)
(([1,2,3,4],Symbol[:a,:a,:a,:a]),([5,6],Symbol[:b,:b]))

The second one randomly assigns observations to set1 or set2. It also doesn’t care whether targets exist.

julia> set1, set2 = splitobs(shuffleobs((1:6, [:a,:a,:a,:a,:b,:b])), at = 0.6)
(([2,3,6,5],Symbol[:a,:a,:b,:b]),([4,1],Symbol[:a,:a]))

The third one takes the target distribution into account and tries to preserve it in each of the resulting sets.

julia> set1, set2 = stratifiedobs((1:6, [:a,:a,:a,:a,:b,:b]), p = 0.6)
(([5,4,2],Symbol[:b,:a,:a]),([1,3,6],Symbol[:a,:a,:b]))

All of this is documented in the MLDataPattern.jl documentation (MLDataPattern.jl 0.1).

There may be a more beginner-friendly convenience API sooner or later, but right now my focus is on functionality and a flexible low-level API.
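
For readers coming from the Python side, the three strategies above can be sketched in plain Python as follows. This is only an illustration of the ideas, not the MLDataPattern implementation: the function names `split_static`, `split_shuffled` and `split_stratified` are hypothetical, and per-class rounding in the stratified version is an assumption of this sketch. In recent scikit-learn versions, the rough counterparts would be train_test_split with shuffle=False, with the default shuffle=True, and with the stratify argument set to the labels.

```python
import random
from collections import defaultdict

def split_static(xs, ys, at):
    """Cut features and targets at the same fixed position (no shuffling)."""
    cut = round(at * len(xs))
    return (xs[:cut], ys[:cut]), (xs[cut:], ys[cut:])

def split_shuffled(xs, ys, at, rng):
    """Shuffle the observation indices jointly, then cut at a fixed position."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    return split_static([xs[i] for i in idx], [ys[i] for i in idx], at)

def split_stratified(xs, ys, p, rng):
    """Split each class separately so both parts keep roughly the class ratio."""
    by_class = defaultdict(list)
    for i, y in enumerate(ys):
        by_class[y].append(i)
    first, second = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(p * len(indices))  # assumption: round per class
        first += indices[:cut]
        second += indices[cut:]
    rng.shuffle(first)
    rng.shuffle(second)

    def pick(idx):
        return [xs[i] for i in idx], [ys[i] for i in idx]

    return pick(first), pick(second)

xs = [1, 2, 3, 4, 5, 6]
ys = ["a", "a", "a", "a", "b", "b"]
rng = random.Random(42)

static_parts = split_static(xs, ys, 0.6)
shuffled_parts = split_shuffled(xs, ys, 0.6, rng)
stratified_parts = split_stratified(xs, ys, 0.6, rng)

print(static_parts)      # (([1, 2, 3, 4], ['a', 'a', 'a', 'a']), ([5, 6], ['b', 'b']))
print(shuffled_parts)    # order depends on the seed
print(stratified_parts)  # each part holds two 'a' and one 'b'
```

The static version reproduces the first result shown above, and the stratified version gives two parts of three observations each, as in the stratifiedobs example, although the exact order depends on the random seed.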

JuliaCaesar · May 5, 2017, 10:49pm

I’m blindly converting a Python program into a Julia one.
I thought the split* functions in each language behaved the same.
Anyway, I’m on the learning curve, so I guess it’s good for me if things aren’t that simple.
I’ll look at your recommendations. Thank you for them.
