Hi,
I'm having difficulty switching from Python's train_test_split() to Julia's splitdata().
After many tests, I wonder whether those commands are similar.
Could someone explain how they relate?
Can you please be a little more specific? Which Python package provides the train_test_split() function? And which Julia package provides your splitdata()?
Python: from sklearn.model_selection import train_test_split
Julia: using MLDataUtils
Hi, I am one of the authors of MLDataUtils. Sorry that you are experiencing difficulties.
First, let me say that we don't attempt to resemble the API of other frameworks, since the language is just very different.
Aside from that, MLDataUtils is undergoing a huge overhaul which, code-wise, is done, but not yet documented and merged into master. For example, splitdata will be deprecated in favour of splitobs.
The closest thing to a documentation for that is here: https://github.com/JuliaML/MLDataUtils.jl/blob/dev/src/accesspattern/datasubset.jl#L543-L604
No need to be sorry, I'm a bad coder and a bad mathematician.
Thanks for this information, it's useful for my work.
I’ll check that.
–EDIT: any idea about the timing of the merge?
The main blocking thing is me sitting down and finishing the documentation. So hopefully soon.
Hi,
I see no update for the MLDataUtils package, but it seems that the changes have been pushed to master. I don't have the splitobs command.
I won’t tag a new version until I finish the move of the access pattern to https://github.com/JuliaML/MLDataPattern.jl, which will be a new backend for MLDataUtils. MLDataUtils will turn into a meta package.
(I won’t tag an intermediate version because I am still breaking master here and there)
Ok, so I'll try Pkg.update("MLDataUtils") from time to time.
Thanks for your answer.
Here's a Python block:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

# Rotation and scaling matrices for the first cluster
rot = np.array([[0.94, -0.34], [0.34, 0.94]])
sca = np.array([[3.4, 0], [0, 2]])

np.random.seed(150)
# Class 1: one elongated, rotated Gaussian cluster (100 points)
c1d = (np.random.randn(100, 2)).dot(sca).dot(rot)
# Class 2: four small Gaussian clusters (25 points each)
c2d1 = np.random.randn(25, 2) + [-10, 2]
c2d2 = np.random.randn(25, 2) + [-7, -2]
c2d3 = np.random.randn(25, 2) + [-2, -6]
c2d4 = np.random.randn(25, 2) + [5, -7]
data = np.concatenate((c1d, c2d1, c2d2, c2d3, c2d4))

l1c = np.ones(100, dtype=int)
l2c = np.zeros(100, dtype=int)
labels = np.concatenate((l1c, l2c))

cm = np.array(['r', 'g'])  # label 0 -> red, label 1 -> green
plt.scatter(data[:, 0], data[:, 1], c=cm[labels], s=50, edgecolors='none')
plt.show()

from sklearn.model_selection import train_test_split
X_train1, X_test1, y_train1, y_test1 = train_test_split(data, labels, test_size=0.33)
# Training points filled, test points hollow
plt.scatter(X_train1[:, 0], X_train1[:, 1], c=cm[y_train1], s=50, edgecolors='none')
plt.scatter(X_test1[:, 0], X_test1[:, 1], c='none', s=50, edgecolors=cm[y_test1])
plt.show()
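(For context: sklearn's split here is a shuffled one. Mechanically it boils down to something like the following stdlib-only sketch, where split_shuffled is a made-up helper name, not sklearn's actual implementation.)

```python
import random

def split_shuffled(n, test_size, seed=None):
    """Sketch of what a shuffled train/test split boils down to:
    shuffle the observation indices, then slice them."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(round(n * test_size))
    return idx[n_test:], idx[:n_test]  # train indices, test indices

train_idx, test_idx = split_shuffled(200, 0.33, seed=150)
print(len(train_idx), len(test_idx))  # 134 66
```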
Here's the attempted Julia counterpart:
# Julia 0.5-era code (srand, cat(1, ...))
rot = [0.94 -0.34; 0.34 0.94]
sca = [3.4 0; 0 2]

srand(150)
# Class 1: one elongated, rotated Gaussian cluster (100 points)
c1d = randn(100, 2) * sca * rot
# Class 2: four small Gaussian clusters (25 points each)
c2d1 = randn(25, 2) .+ [-10 2]
c2d2 = randn(25, 2) .+ [-7 -2]
c2d3 = randn(25, 2) .+ [-2 -6]
c2d4 = randn(25, 2) .+ [5 -7]
data = cat(1, c1d, c2d1, c2d2, c2d3, c2d4)

l1c = ones(Int, 100)
l2c = zeros(Int, 100)
labels = cat(1, l1c, l2c)

using Plots
# `group` (not `groups`) is the Plots.jl keyword for grouping by label
scatter(data[:, 1], data[:, 2], color = [:red :green], group = labels, markersize = 5, markerstrokewidth = 0)

using DataFrames, DiscriminantAnalysis
using MLDataUtils
# splitobs treats the last dimension as the observations, hence the transpose
(X_train1, y_train1), (X_test1, y_test1) = splitobs((transpose(data), labels); at = 0.67)
scatter(X_train1[1, :], X_train1[2, :], group = y_train1, markersize = 5, markerstrokewidth = 0)
scatter(X_test1[1, :], X_test1[2, :], group = y_test1, markersize = 5, markerstrokewidth = 0)
I obviously don't get the same result.
Looking at the split, I see that y_train1 and y_test1 don't match between the two languages anyway; the Python ones are a shuffled mix of 0s and 1s.
–EDIT: Sorry for the text format, something is wrong with the site’s editor.
Not sure what you are asking.
Are you asking why splitobs doesn't perform random assignment? The function splitobs performs a static split.
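In plain terms, a static split just takes the leading fraction of the observations in their given order. A quick Python sketch of the idea (split_static is a hypothetical helper name, not part of any library):

```python
def split_static(xs, at):
    """Take the first `at` fraction as the first set and the rest as the
    second set, without any shuffling (the idea behind a static split)."""
    k = int(round(len(xs) * at))
    return xs[:k], xs[k:]

labels = ['a', 'a', 'a', 'a', 'b', 'b']
set1, set2 = split_static(labels, 0.6)
print(set1, set2)  # ['a', 'a', 'a', 'a'] ['b', 'b']
```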
Here are three currently available approaches. Maybe one of them suits your needs:
julia> set1, set2 = splitobs((1:6, [:a,:a,:a,:a,:b,:b]), at = 0.6)
(([1,2,3,4],Symbol[:a,:a,:a,:a]),([5,6],Symbol[:b,:b]))
Note that splitobs doesn't shuffle the observations into set1 or set2. It also doesn't care if there exist targets.
julia> set1, set2 = splitobs(shuffleobs((1:6, [:a,:a,:a,:a,:b,:b])), at = 0.6)
(([2,3,6,5],Symbol[:a,:a,:b,:b]),([4,1],Symbol[:a,:a]))
julia> set1, set2 = stratifiedobs((1:6, [:a,:a,:a,:a,:b,:b]), p = 0.6)
(([5,4,2],Symbol[:b,:a,:a]),([1,3,6],Symbol[:a,:a,:b]))
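For comparison, the idea behind the stratified variant can be sketched in stdlib Python: split each class separately, so both sets keep roughly the overall class proportions (stratified_split is a made-up helper name, not any library's API):

```python
import random
from collections import defaultdict

def stratified_split(ys, p, seed=0):
    """Sample a fraction p of the indices of each class for the first set,
    so class proportions are preserved in both resulting sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(ys):
        by_class[y].append(i)
    first = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = int(round(len(idxs) * p))
        first.extend(idxs[:k])
    second = [i for i in range(len(ys)) if i not in set(first)]
    return first, second

ys = ['a', 'a', 'a', 'a', 'b', 'b']
set1, set2 = stratified_split(ys, 0.6)
# set1 holds ~60% of each class: 2 of the 4 'a's and 1 of the 2 'b's
print(sorted(ys[i] for i in set1))  # ['a', 'a', 'b']
```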
All of this is documented in the MLDataPattern.jl documentation.
There may be a more beginner-friendly convenience API sooner or later, but right now my focus is on functionality and a flexible low-level API.
I'm blindly converting a Python program into a Julia one.
I thought the split* functions from each language behaved the same.
Anyway, I'm on the learning curve, so I guess it's good for me if things aren't that simple.
I'll look at your recommendations; thank you for them.