StratifiedKfold

Hello, I am new to Julia and I am trying to figure out how to use StratifiedKfold. I am using Julia 0.6 and so far only kfolds, like this:

Xdata=readdlm("data.txt")
Ytarget=readdlm("taget.txt")
folds = kfolds((Xdata,Ytarget),k=6)
(Xtrain1,Ytrain1),(Xtest1,Ytest1)=folds[1] # the first fold and so on…

My data is a numeric matrix whose rows belong to one of two patterns (classes), but there are no labels or numbers saying which pattern each row is. What I know is that lines 1 to 50 are class 1 and lines 51 to 90 are class 2.

But I don't understand how to use
julia> collect(StratifiedKfold([:a, :a, :a, :b, :b, :c, :c, :a, :b, :c], 3))

What are a, b and c? How do I fit this to my data? And where do I pass Xdata and Ytarget?

Thanks a lot!


I couldn’t really figure it out either; it does give a vector of random permutations, though.

using Random

x = length([:a, :a, :a, :b, :b, :c]) ## 6
rp = randperm(x) ## shuffled collect(1:6) vector

[Xdata[rp[1:4], :] Ytarget[rp[1:4]]] ## shuffled 4 of 6 data & target

Edit: ah, they would be the target classes / labels, and they don’t necessarily need to be symbols.


folds = collect(StratifiedKfold(Ytarget, 6))
(Xtrain1, Ytrain1) = (Xdata[folds[1], :], Ytarget[folds[1]])
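
For the 90-row case from the question (rows 1 to 50 are class 1, rows 51 to 90 are class 2), a minimal sketch could look like the following. It assumes that MLBase's StratifiedKfold yields the training row indices for each fold, so the test rows are taken as the complement. Xdata here is random stand-in data with an assumed 90 x 20 shape, not the real file, and the names train_idx and test_idx are only illustrative.

```julia
using MLBase

# Stand-in for the real data: 90 samples (rows) with 20 features (assumed shape).
Xdata = randn(90, 20)

# Build the label vector from the known layout: rows 1 to 50 are class 1, rows 51 to 90 are class 2.
Ytarget = vcat(fill(1, 50), fill(2, 40))

# Each element of `folds` is taken to be a vector of training row indices (assumption about MLBase).
folds = collect(StratifiedKfold(Ytarget, 6))

train_idx = folds[1]
test_idx = setdiff(1:length(Ytarget), train_idx)  # rows not used for training

# Index both arrays with the same row indices so samples and labels stay aligned.
Xtrain1, Ytrain1 = Xdata[train_idx, :], Ytarget[train_idx]
Xtest1, Ytest1 = Xdata[test_idx, :], Ytarget[test_idx]
```

So only the label vector goes into StratifiedKfold; the data matrix and the labels are both sliced afterwards with the returned indices.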


Is this in MLBase.jl or ScikitLearn.jl?

I’m pretty sure it’s MLBase, the docs have the same example


A bit unrelated but could I ask why you want to use Stratified KFold here? it seems you have pretty balanced data?

With Sklearn, I think this does the job:

f(i) = ifelse(i<51, 1, 2)
y = [f(i) for i in 1:90]
using ScikitLearn
folds = ScikitLearn.CrossValidation.StratifiedKFold(y, n_folds=10)

X = randn(90, 20) # say 20 features
fold_1 = X[folds[1][1], :]

Hi it is MLBase.jl

Hi!! Yes, it is almost balanced, but I have other cases where it is not. In any case, kfolds does not give me balanced folds. If I don't shuffle my data, kfolds gives me folds containing only class 1; if I shuffle, I sometimes get 30% class 2 and 70% class 1, and so on.
I have a question: I need both Xdata and Ytarget, because Xdata is my dataset and Ytarget holds the label for each sample. Can I pass two matrices to StratifiedKfold, as I do with kfolds? Thanks!!
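
To illustrate the unshuffled case, here is a small Base-only sketch. It does not call kfolds; it just cuts the 90 labels into 6 contiguous blocks (an assumed stand-in for an unshuffled split) and counts the classes in each block.

```julia
# Labels in file order: 50 rows of class 1 followed by 40 rows of class 2.
y = vcat(fill(1, 50), fill(2, 40))
k = 6
n = length(y)
step = div(n, k)  # 15 rows per block

# Contiguous, unshuffled blocks: 1:15, 16:30, ..., 76:90.
blocks = [((i - 1) * step + 1):(i * step) for i in 1:k]

for (i, b) in enumerate(blocks)
    n1 = count(x -> x == 1, y[b])
    n2 = count(x -> x == 2, y[b])
    println("block ", i, ": class 1 = ", n1, ", class 2 = ", n2)
end
```

Only the block covering rows 46 to 60 contains both classes; every other block is a single class, which is why a stratified split is useful here.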

I’ve just started learning Julia, so here’s my try; my answer may not be entirely accurate.
The symbols (:a, :b and :c) act as labels marking which class each row belongs to, so the label vector must have the same length as the data. The exact proportion of :a to :b does not matter for the call to work, but each label must occur at least k times. Like this:

using MLBase

# houses is an Array with one sample per row
n = size(houses, 1)
half = div(n, 2)             # integer split, also works for an odd number of rows
index_a = fill(:a, half)     # first half labelled :a
index_b = fill(:b, n - half) # remaining rows labelled :b
index = vcat(index_a, index_b)
rows = collect(StratifiedKfold(index, 10))

# pick data
row = rows[1]
houses[row,:]
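
To check that each fold keeps roughly the original class ratio, something like the sketch below can be used. It assumes Julia 1.x with MLBase installed and that StratifiedKfold yields training indices. The 50/40 label layout is taken from the question, and class_share is a hypothetical helper written for this example, not part of MLBase.

```julia
using MLBase

# Same 50/40 layout as in the question, with symbols as class labels.
labels = vcat(fill(:a, 50), fill(:b, 40))
rows = collect(StratifiedKfold(labels, 10))

# Hypothetical helper: fraction of the entries selected by `idx` that belong to class `c`.
class_share(labels, idx, c) = count(x -> x == c, labels[idx]) / length(idx)

for (i, train) in enumerate(rows)
    test = setdiff(1:length(labels), train)  # held-out rows for this fold
    println("fold ", i,
            ": share of :a in train = ", round(class_share(labels, train, :a), digits=2),
            ", in test = ", round(class_share(labels, test, :a), digits=2))
end
```

The overall share of :a is 50/90, so every fold should print values close to that if the split is stratified.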