Hello, I am new to Julia and I am trying to figure out how to use StratifiedKfold. I am using Julia 0.6 and for now I just use kfolds, like this:
folds = kfolds((Xdata,Ytarget),k=6)
(Xtrain1, Ytrain1), (Xtest1, Ytest1) = folds[1] # the first fold, and so on…
My data is a numeric matrix whose rows belong to one of two patterns (classes), but there is no label column or number that says which pattern a row belongs to. What I know is that rows 1 to 50 are class 1 and rows 51 to 90 are class 2.
But I don't understand how to use:
julia> collect(StratifiedKfold([:a, :a, :a, :b, :b, :c, :c, :a, :b, :c], 3))
What are a, b and c? How do I apply this to my data? And where do I pass Xdata and Ytarget?
Thanks a lot!
I couldn’t really figure it out either. It does give a vector of random permutations, though:
x = length([:a, :a, :a, :b, :b, :c]) ## 6
rp = randperm(x) ## shuffled collect(1:6) vector
[Xdata[rp[1:4], :] Ytarget[rp[1:4]]] ## shuffled 4 of 6 data & target
Edit: ah, they would be the target classes / labels, and they don’t necessarily need to be symbols:
folds = collect(StratifiedKfold(Ytarget, 6))
(Xtrain1, Ytrain1) = (Xdata[folds[1], :], Ytarget[folds[1]]) # folds[1] holds the training row indices of the first fold
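To connect the `:a, :b, :c` example to the data described in the question, here is a minimal sketch. It assumes MLBase.jl is the package in use, that each element yielded by `StratifiedKfold` is a vector of training row indices (as in the docs example), and that rows 1 to 50 are class 1 and rows 51 to 90 are class 2. `Xdata` is random stand-in data, and the names `train_idx` and `test_idx` are only for illustration.

```julia
using MLBase  # assumed: StratifiedKfold comes from MLBase.jl

# Stand-in data: 90 samples (rows) with 20 features each.
Xdata = randn(90, 20)

# Build the label vector from the known row ranges:
# rows 1-50 are class 1, rows 51-90 are class 2.
Ytarget = vcat(fill(1, 50), fill(2, 40))

k = 6
# One vector of training row indices per fold.
folds = collect(StratifiedKfold(Ytarget, k))

# First fold: the rows not used for training form the test set.
train_idx = folds[1]
test_idx = setdiff(1:length(Ytarget), train_idx)

# The same indices select matching rows from both arrays.
Xtrain1, Ytrain1 = Xdata[train_idx, :], Ytarget[train_idx]
Xtest1, Ytest1 = Xdata[test_idx, :], Ytarget[test_idx]
```

So the labels play the role of `:a`, `:b`, `:c`, and only the label vector goes into `StratifiedKfold`. The data matrix is indexed afterwards with whatever indices come back.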
Is this in MLBase.jl or ScikitLearn.jl?
I’m pretty sure it’s MLBase; the docs have the same example.
A bit unrelated, but could I ask why you want to use StratifiedKfold here? It seems you have pretty balanced data.
With Sklearn, I think this does the job:
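using ScikitLearn
f(i) = ifelse(i < 51, 1, 2)
y = [f(i) for i in 1:90] # labels: rows 1-50 are class 1, rows 51-90 are class 2
folds = ScikitLearn.CrossValidation.StratifiedKFold(y, n_folds=10)
X = randn(90, 20) # say 20 features
train_1, test_1 = first(folds) # each fold should be a (train, test) pair of row indices
fold_1 = X[train_1, :]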
Hi!! Yes, it is almost balanced, but I have other cases where it is not. Anyway, kfolds does not give me balanced folds. If I don't shuffle my data, kfolds gives me a fold with only class 1 data; if I shuffle, a fold sometimes has 30% class 2 and 70% class 1, and so on.
I have a question: I need both Xdata and Ytarget, because Xdata is my dataset and Ytarget holds the label for each sample. Can I pass two matrices to StratifiedKfold, as I do with kfolds? Thanks!!
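The imbalance described above comes from splitting rows without looking at their class. As a rough illustration of what stratification does (not MLBase's actual implementation), the sketch below deals out the rows of each class across the folds separately. `stratified_test_folds` is a hypothetical helper written for this example. The code targets Julia 1.x; on 0.6, `findall` would be `find` and `shuffle` lives in Base.

```julia
using Random  # Julia 0.7+; on 0.6 `shuffle` is in Base, so drop this line

# Hypothetical helper, not part of any package: assign every sample to one of
# k test folds so that each class is spread evenly across the folds.
function stratified_test_folds(y, k)
    fold_of = zeros(Int, length(y))
    for c in unique(y)
        idx = shuffle(findall(isequal(c), y))  # rows of class c, in random order
        for (j, i) in enumerate(idx)
            fold_of[i] = mod1(j, k)            # deal rows out round-robin
        end
    end
    return [findall(isequal(f), fold_of) for f in 1:k]
end

y = vcat(fill(1, 50), fill(2, 40))  # rows 1-50 class 1, rows 51-90 class 2
test_folds = stratified_test_folds(y, 6)

# Report the class counts in the test part of each fold.
for (f, test_idx) in enumerate(test_folds)
    n1 = count(isequal(1), y[test_idx])
    n2 = count(isequal(2), y[test_idx])
    println("fold $f: test has $n1 of class 1 and $n2 of class 2")
end
```

Only the label vector is needed to build the folds. The resulting index vectors can then be applied to both arrays, e.g. `Xdata[test_idx, :]` and `Ytarget[test_idx]`.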
I’ve just started learning Julia, so here’s my try; my answer may not be entirely accurate.
The symbols (:a, :b and :c) act as placeholders, but the number of symbols must equal the number of rows in the data. The proportion of :a to :b has little effect on the result, but the count of each symbol (:a or :b) must be greater than k. Like this:
# houses is an Array with one sample per row
n = size(houses, 1) # number of rows
index_row = collect(1:n)
index_a = [:a for i = 1:div(n, 2)]
index_b = [:b for i = 1:(n - div(n, 2))]
index = vcat(index_a, index_b)
rows = collect(StratifiedKfold(index, 10))
# pick data
row = rows[1] # row indices for the first fold