I have a big dataset (250k rows × 30 columns) on which I would like to train a neural network for a binary classification task (it should predict one of two possible classes). I built this model in Python using scikit-learn's MLPClassifier and got an accuracy of around 83%: not amazing, but it shows the approach somewhat works. I then tried to replicate this in Julia using Flux.jl. Here is my code:
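For context, the scikit-learn side looks roughly like this. This is only a sketch on synthetic stand-in data: the hidden-layer sizes, batch size, learning rate, and epoch count mirror the Julia code below, but my exact real run is not reproduced here.

```python
# Sketch of the scikit-learn baseline (hyperparameters assumed, synthetic data).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))          # stand-in for the 250k x 30 dataset
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary target, like "s" vs "b"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)
scaler = StandardScaler().fit(X_train)   # z-score standardization, as in Julia

clf = MLPClassifier(hidden_layer_sizes=(20, 10, 2), batch_size=200,
                    learning_rate_init=1e-3, max_iter=200, random_state=0)
clf.fit(scaler.transform(X_train), y_train)
acc = clf.score(scaler.transform(X_test), y_test)
print(acc)
```

MLPClassifier uses a single sigmoid output for two classes, which is one of the differences from the Flux version discussed below.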
using Flux, DataFrames, DataFramesMeta, CSV
using Chain: @chain
using StatsBase: standardize, ZScoreTransform
using MLDataUtils: splitobs, shuffleobs
using IterTools: ncycle
function build_model(input, layers, output; activation = relu)
    f = []
    in_layer = input
    for out_layer in layers
        push!(f, Dense(in_layer, out_layer, activation))
        in_layer = out_layer
    end
    push!(f, Dense(in_layer, output))
    push!(f, softmax)
    Chain(f...)
end
filename = raw"E:\Università\2020-2021\Applicazioni di Machine Learning\atlas_data.csv"
df, labels = @chain begin
    CSV.read(filename, DataFrame)
    @where(_, :KaggleSet .== "t")  # this is just to select a subset of the dataset
    select(_, Not([:Weight, :EventId, :KaggleSet, :KaggleWeight]))  # these are columns to ignore
    select(_, Not(:Label)), @chain _ begin
        select(_, :Label)
        Flux.onehotbatch(_.Label, ["s", "b"])  # "s" and "b" are the labels for the classes
    end
end
N_input = length(names(df))
N_output = size(labels, 1)
X = transpose(standardize(ZScoreTransform, Matrix(df)))
X_train, X_test = splitobs(shuffleobs(X), at = 0.7)
y_train, y_test = splitobs(shuffleobs(labels), at = 0.7)
model = build_model(N_input, [20, 10, 2], N_output)
loss(a, b) = Flux.Losses.mse(model(a), b)
ps = Flux.params(model)
opt = ADAM(1e-3, (0.9, 0.999))
batchsize = 200
n_epochs = 200
loader = Flux.Data.DataLoader(
    (X_train, y_train),
    batchsize = batchsize,
    shuffle = true
)
Flux.@epochs n_epochs begin
    Flux.train!(loss, ps, loader, opt)
    println(loss(X_train, y_train))
end
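For reference, my understanding is that `DataLoader((X_train, y_train); shuffle = true)` should draw shuffled minibatches while keeping each column of `X_train` paired with its label. In plain-Python terms (a hypothetical helper to illustrate the idea, not Flux's implementation), that means something like:

```python
import random

def minibatches(X, y, batchsize, seed=None):
    """Yield shuffled (X, y) minibatches, keeping each row paired with its label."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # one permutation applied to both arrays
    for start in range(0, len(idx), batchsize):
        batch = idx[start:start + batchsize]
        yield [X[i] for i in batch], [y[i] for i in batch]

# Toy data where row i has label i % 2, so pairing is easy to check.
X = [[i, i + 0.5] for i in range(10)]
y = [i % 2 for i in range(10)]
pairs_ok = all(
    yb[k] == xb[k][0] % 2
    for xb, yb in minibatches(X, y, 4, seed=0)
    for k in range(len(xb))
)
print(pairs_ok)  # True: labels still match their rows after shuffling
```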
I have used basically the same parameters; the only differences are the cost function (though I've also tried Flux's crossentropy, which should be the one sklearn uses in MLPClassifier) and the output layer (in Python I used a single output neuron, while in Julia I use two so that I can use onehotbatch, which should also be more correct). The rest is pretty much identical, and I've already tinkered with various models and parameters.
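To be explicit about what I mean by the two-neuron output: `onehotbatch(labels, ["s", "b"])` turns the label vector into a 2×N matrix with a single 1 per column. In plain-Python terms (a sketch of the encoding, not Flux's actual implementation):

```python
def onehotbatch(labels, classes):
    """Return a matrix whose j-th column is the one-hot vector for labels[j]:
    result[i][j] == 1 iff labels[j] == classes[i]."""
    return [[1 if lab == c else 0 for lab in labels] for c in classes]

mat = onehotbatch(["s", "b", "b", "s"], ["s", "b"])
print(mat)  # [[1, 0, 0, 1], [0, 1, 1, 0]]
```

Each column then matches the two softmax outputs of the network, one per class.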
Here is the problem: the loss (which I print every epoch) reaches a stable value almost immediately (within two or three epochs) and stays there. If I stop the program and call model(X_train), I notice that every datapoint is mapped to basically the same two output values: sometimes ~0.3 for one class and ~0.6 for the other, while other times (I believe changing the loss function causes this) one class has a value of ~1.0 and the other is basically 0.0 (again, for every datapoint, as if every single entry of my dataset belonged to a single class).
I know this may not be strictly a Julia-related question, but since I've tried the exact same dataset in Python and got an accuracy of ~83%, I suspect the problem is not the (theoretical) model but the way I have implemented it in Flux.
Note that the dataset manipulation is not the problem: I've ignored the exact same columns in Python, and the selected subset is the same. In fact, at the end of the manipulations the Julia dataframe has 250k rows, just like the Python one. The problem lies in what I did afterwards, in the model implementation.
Can you please help me? Thank you