Python train_test_split vs Julia splitobs


#1

Hi,

I'm having difficulty switching from Python’s train_test_split() to Julia’s splitdata().
After many tests, I'm no longer sure these two functions do the same thing.

Could someone explain how they compare?


#2

Can you please be a little more specific? Which Python package provides the train_test_split() function? And which Julia package provides your splitdata()?


#3

Python : from sklearn.model_selection import train_test_split
Julia : using MLDataUtils


#4

Hi, I am one of the authors of MLDataUtils. Sorry that you are experiencing difficulties.

First, let me say that we don’t attempt to resemble the API of other frameworks since the language is just very different.

Aside from that, MLDataUtils is undergoing a huge overhaul which is done code-wise, but not yet documented and merged into master. For example, splitdata will be deprecated in favour of splitobs.

The closest thing to documentation for that is here: https://github.com/JuliaML/MLDataUtils.jl/blob/dev/src/accesspattern/datasubset.jl#L543-L604


#5

No need to be sorry, I’m a bad coder and a bad mathematician :slight_smile:.
Thanks for this information, it’s useful for my work.
I’ll check that.
–EDIT: any idea when the merge will happen?


#6

The main blocking thing is me sitting down and finishing the documentation. So hopefully soon.


#7

Hi,

I see no update for the MLDataUtils package, but it seems that the changes have been pushed into master. I don’t have the splitobs command.


#8

I won’t tag a new version until I finish the move of the access pattern to https://github.com/JuliaML/MLDataPattern.jl, which will be a new backend for MLDataUtils. MLDataUtils will turn into a meta package.

(I won’t tag an intermediate version because I am still breaking master here and there)


#9

Ok, so I’ll try Pkg.update("MLDataUtils") from time to time.
Thanks for your answer.


#10

Here’s a Python block:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt

rot = np.array([[0.94, -0.34], [0.34, 0.94]])
sca = np.array([[3.4, 0], [0, 2]])

np.random.seed(150)
c1d = (np.random.randn(100,2)).dot(sca).dot(rot)

c2d1 = np.random.randn(25,2)+[-10, 2]
c2d2 = np.random.randn(25,2)+[-7, -2]
c2d3 = np.random.randn(25,2)+[-2, -6]
c2d4 = np.random.randn(25,2)+[5, -7]

data = np.concatenate((c1d, c2d1, c2d2, c2d3, c2d4))

l1c = np.ones(100, dtype=int)
l2c = np.zeros(100, dtype=int)
labels = np.concatenate((l1c, l2c))

cm = np.array(['r','g'])
plt.scatter(data[:,0],data[:,1],c=cm[labels],s=50,edgecolors='none')
plt.show()

from sklearn.model_selection import train_test_split

X_train1, X_test1, y_train1, y_test1 = train_test_split(data, labels, test_size=0.33)
plt.scatter(X_train1[:,0],X_train1[:,1],c=cm[y_train1],s=50,edgecolors='none')
plt.scatter(X_test1[:,0],X_test1[:,1],c='none',s=50,edgecolors=cm[y_test1])
plt.show()

Here’s my attempt at a Julia counterpart:

rot = [0.94 -0.34; 0.34 0.94]
sca = [3.4 0; 0 2]

srand(150)
c1d = randn(100,2) * sca * rot
c2d1 = randn(25,2) .+ [-10 2]
c2d2 = randn(25,2) .+ [-7 -2]
c2d3 = randn(25,2) .+ [-2 -6]
c2d4 = randn(25,2) .+ [5 -7]

data = cat(1, c1d, c2d1, c2d2, c2d3, c2d4)

l1c = ones(Int, 100)
l2c = zeros(Int, 100)
labels = cat(1, l1c, l2c)

using Plots
scatter(data[:,1], data[:,2], color = [:red :green], groups = labels, markersize=5, markerstrokewidth = 0)

using DataFrames, DiscriminantAnalysis;
using MLDataUtils

(X_train1, y_train1), (X_test1, y_test1) = splitobs((transpose(data), labels); at = 0.67)

scatter(X_train1[1,:],X_train1[2,:],color = [:blue, :red], markersize=5, markerstrokewidth = 0)
scatter(X_test1[1,:],X_test1[2,:],color = [:blue, :red], markersize=5, markerstrokewidth = 0)


I obviously don’t get the same result.

Looking at the split, I see that y_train1 and y_test1 don’t correspond between the two languages: the Python ones contain a “randomized” mix of 0s and 1s.

–EDIT: Sorry for the text format, something is wrong with the site’s editor.


#11

Not sure what you are asking.

Are you asking why splitobs doesn’t perform random assignment? The function splitobs performs a static split: it cuts the data at the split point without shuffling.
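For comparison on the Python side, here is a minimal sketch (not from the original posts) of how scikit-learn's train_test_split behaves: it shuffles the observations by default, and passing shuffle=False gives a static split that keeps the original order. The toy data and variable names are made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): six observations with ordered labels
X = [1, 2, 3, 4, 5, 6]
y = ["a", "a", "a", "a", "b", "b"]

# Default behaviour: observations are shuffled before the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=0
)

# shuffle=False: static split, the first 4 go to train and the last 2 to test
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=2, shuffle=False
)

print("shuffled:", X_train, X_test)
print("static:  ", Xs_train, Xs_test)
```

With ordered labels like these, the static split puts every "b" into the test set, which is why a shuffled split and a static split can look so different.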

Here are three currently available approaches. Maybe one of them suits your needs:

  • The first one does not care about the target vector. In fact, it doesn’t know whether the tuple even contains targets or other features. It just splits the data at the split point.
julia> set1, set2 = splitobs((1:6, [:a,:a,:a,:a,:b,:b]), at = 0.6)
(([1,2,3,4],Symbol[:a,:a,:a,:a]),([5,6],Symbol[:b,:b]))
  • The second one randomly assigns observations to set1 or set2. It also doesn’t care whether targets exist.
julia> set1, set2 = splitobs(shuffleobs((1:6, [:a,:a,:a,:a,:b,:b])), at = 0.6)
(([2,3,6,5],Symbol[:a,:a,:b,:b]),([4,1],Symbol[:a,:a]))
  • The third one takes the target distribution into account and tries to preserve it in each of the resulting sets.
julia> set1, set2 = stratifiedobs((1:6, [:a,:a,:a,:a,:b,:b]), p = 0.6)
(([5,4,2],Symbol[:b,:a,:a]),([1,3,6],Symbol[:a,:a,:b]))
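
To make the difference between the three behaviours concrete for someone coming from Python, here is a plain-Python sketch. It is an illustration only, not how MLDataUtils implements them: the function names (split_static, split_shuffled, split_stratified) are hypothetical, and the rounding rule is an assumption chosen so that the set sizes match the outputs shown above. On the scikit-learn side, train_test_split has shuffle and stratify keyword arguments that select between comparable behaviours.

```python
import random
from collections import defaultdict


def split_static(xs, ys, at):
    """Cut at a fixed index; order is preserved and nothing is shuffled."""
    k = round(at * len(xs))  # assumption: rounding rule chosen for illustration
    return (xs[:k], ys[:k]), (xs[k:], ys[k:])


def split_shuffled(xs, ys, at, rng):
    """Shuffle observations (features and targets together), then cut."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    return split_static([xs[i] for i in idx], [ys[i] for i in idx], at)


def split_stratified(xs, ys, at, rng):
    """Split each label group separately so both sets keep the label ratio."""
    by_label = defaultdict(list)
    for i, label in enumerate(ys):
        by_label[label].append(i)
    first, second = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        k = round(at * len(indices))
        first += indices[:k]
        second += indices[k:]
    rng.shuffle(first)
    rng.shuffle(second)

    def pick(idx):
        return [xs[i] for i in idx], [ys[i] for i in idx]

    return pick(first), pick(second)


xs = [1, 2, 3, 4, 5, 6]
ys = ["a", "a", "a", "a", "b", "b"]
rng = random.Random(0)  # fixed seed so repeated runs give the same split

static = split_static(xs, ys, 0.6)
shuffled = split_shuffled(xs, ys, 0.6, rng)
stratified = split_stratified(xs, ys, 0.6, rng)

print(static)
print(shuffled)
print(stratified)
```

Only the stratified version guarantees that both sets contain a "b" here; the static one never does for this ordering, and the shuffled one does so only by chance.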

All of this is documented: http://mldatapatternjl.readthedocs.io/en/latest/index.html

There may be a more beginner-friendly convenience API sooner or later, but right now my focus is on functionality and a flexible low-level API.


#12

I’m blindly converting a Python program into a Julia one.
I thought the split* functions in each language behaved the same way.
Anyway, I’m on the learning curve so I guess that it’s good for me if things aren’t that simple.
I’ll look at your recommendations, I thank you for them.