Can't replicate neural network from Python's sklearn using Flux.jl

I have a big dataset (250k rows x 30 columns) with which I would like to train a neural network for a classification task (it should classify one of two possible classes). I’ve built this model using Python’s MLPClassifier and I got an accuracy of around 83%: not amazing but it shows that it somewhat works. I’ve then tried to replicate this in Julia using Flux.jl. Here is my code:

using Flux, DataFrames, DataFramesMeta, CSV
using Chain: @chain
using StatsBase: standardize, ZScoreTransform
using MLDataUtils: splitobs, shuffleobs
using IterTools: ncycle

function build_model(input, layers, output; activation = relu)
	f = []
	in_layer = input
	for out_layer in layers
		append!(f, [Dense(in_layer, out_layer, activation)])
		in_layer = out_layer
	end
	append!(f, [Dense(in_layer, output)])
	append!(f, [softmax])
	Chain(f...)
end

filename = raw"E:\Università\2020-2021\Applicazioni di Machine Learning\atlas_data.csv"

df, labels = @chain begin
	CSV.read(filename, DataFrame)
	@where(_, :KaggleSet .== "t") # this is just to select a subset of the dataset
	select(_, Not([:Weight, :EventId, :KaggleSet, :KaggleWeight])) # these are columns to ignore
	select(_, Not(:Label)), @chain _ begin
		select(_, :Label)
		Flux.onehotbatch(_.Label, ["s", "b"]) # "s" and "b" are the labels for the classes
	end
end

N_input  = length(names(df))
N_output = size(labels, 1)

X = transpose(standardize(ZScoreTransform, Matrix(df)))
X_train, X_test = splitobs(shuffleobs(X), at = 0.7)
y_train, y_test = splitobs(shuffleobs(labels), at = 0.7)

model = build_model(N, [20, 10, 2], N_output)

loss(a, b) = Flux.Losses.mse(model(a), b)
ps = Flux.params(model)
opt = ADAM(1e-3, (0.9, 0.999))

batchsize = 200
n_epochs = 200

loader = Flux.Data.DataLoader(
	(X_train, y_train),
	batchsize = batchsize,
	shuffle = true
)

Flux.@epochs n_epochs begin
	Flux.train!(loss, ps, loader, opt)
	println(loss(X_train, y_train))
end

I have used basically the same parameters, the only difference are the cost function (but I’ve also tried with Flux’s crossentropy which should be the one sklearn uses in MLPClassifier) and the output layer (on Python I’ve used only one neuron as output, but in Julia I’m using two so I can use onehotbatch, and it should also be more correct). The rest is pretty much identical, but I’ve already tinkered with various models and parameters.
Here is the problem: the loss function (which I print every epoch) gets to a stable value immediately (in two or three iterations) and remains like this. If I stop the program and call model(X_train) I’ve noticed that every datapoint is mapped to basically the same values of the two output neurons, which sometimes are ~0.3 for a class and ~0.6 for another, while other times (I believe changing loss function does this) one class as a value of 1.0 and the other is basically 0.0 (again, for every datapoint, as if every single entry of my dataset belonged to a single class).
I know this may not be strictly a Julia related question, but since I’ve tried with the same exact dataset on Python getting an accuracy on ~83%, I guess that the problem is not the (theoretical) model, but the way I have implemented it in Flux.
Note that the dataset manipulation is not the problem. I’ve ignored the same exact columns in Python, and the subset selected is the same. In fact, the dataframe in Julia has 250k rows just like the one in Python at the end of the mmanipulations. The problem relies on what I did after, in the model implementation.
Can you please help me? Thank you

I think you are separately shuffling your data and labels, applying a different order to each. Instead, just just shuffle once and then split:

sX, slabels = shuffleobs((X, labels))
X_train, X_test = splitobs(sX, at = 0.7)
y_train, y_test = splitobs(slabels, at = 0.7)
2 Likes

Hi, I am curious about this. Is the data public ?

Ah I see, thank you very much. Now that I think about it, if I set shuffle=true in DataLoader, it is not necessary to call shuffleobs, right?

Yes, here is the link: CERN Open Data Portal

Basically I have to distinguish a proper signal from the background

Just for fun I did try with a ML library I implemented:

# Data from here: https://www.kaggle.com/c/higgs-boson/

cd(@__DIR__)
using Pkg
Pkg.activate(".")
#Pkg.add(["CSV", "DataFrame","BetaML"]) # Only first time to install the dependent packages

using CSV, DataFrames, BetaML
data    = CSV.File("training.csv",delim=',') |> DataFrame
describe(data) # 250k rows, 33 cols


X    = Array{Float64,2}(data[:,2:31])
y    = integerEncoder(Array{String}(data[:,33]),factors=["s","b"]);
y_oh = oneHotEncoder(y)
D    = size(X,2)
nCl  = size(y_oh,2)


# With the specified parameters
(xtrain,xval), (ytrain,yval), (ytrain_oh,yval_oh) = partition([X,y,y_oh],[0.7,1-0.7],rng=copy(FIXEDRNG))

l1   = DenseLayer(D,20,f=relu,rng=copy(FIXEDRNG))
l2   = DenseLayer(20,10,f=relu,rng=copy(FIXEDRNG))
l3   = DenseLayer(10,2,f=relu,rng=copy(FIXEDRNG))
l4   = VectorFunctionLayer(nCl,f=softmax) ## Add a (parameterless) layer whose activation function (softMax in this case) is defined to all its nodes at once
mynn = buildNetwork([l1,l2,l3,l4],squaredCost,name="logistic regression boson")

res  = train!(mynn,xtrain,ytrain_oh,epochs=200,batchSize=200,optAlg=ADAM(),rng=copy(FIXEDRNG)) # same as optAlg=ADAM(η=t->1e-3,β₁=0.9, β₂=0.999)

(ŷtrain,ŷval)   = predict.(Ref(mynn),[xtrain,xval])
trainAccuracy   = accuracy(ŷtrain,ytrain,rng=copy(FIXEDRNG)) # 0.742
valAccuracy     = accuracy(ŷval,yval,rng=copy(FIXEDRNG))     # 0.742

# With scaling and different model/parameters...
xScaleFactors   = getScaleFactors(X)

(xtrain,xval), (ytrain,yval), (ytrain_oh,yval_oh) = partition([X,y,y_oh],[0.1,1-0.1],rng=copy(FIXEDRNG))
l1    = DenseLayer(D,35,f=relu,rng=copy(FIXEDRNG))
l3    = DenseLayer(35,2,f=relu,rng=copy(FIXEDRNG))
l4    = VectorFunctionLayer(nCl,f=softmax) # Add a (parameterless) layer whose activation function (softMax in this case) is defined to all its nodes at once
mynn2 = buildNetwork([l1,l3,l4],squaredCost,name="logistic regression boson")

res  = train!(mynn2,scale(xtrain,xScaleFactors),ytrain_oh,epochs=200,batchSize=64,optAlg=ADAM(),rng=copy(FIXEDRNG))

(ŷtrain,ŷval)   = predict.(Ref(mynn),[scale(xtrain,xScaleFactors),scale(xval,xScaleFactors)])
trainAccuracy   = accuracy(ŷtrain,ytrain,rng=copy(FIXEDRNG)) # 0.82
valAccuracy     = accuracy(ŷtrain,ytrain,rng=copy(FIXEDRNG)) # 0.82
1 Like

I’ve tried with my program using your parameters (the last ones, where you get 82% of accuracy), but I can’t get over 80%. I have ADAM with 1e-3, 0.9, 0.999 as alpha, beta1 and beta2; batchsize = 64; one single hidden layer with 35 neurons; both the hidden layer and the output have a relu as activation function, and I’ve added a final softmax; I’ve scaled the data using StatsBase’s ZScoreTransform (which I’m guessing is the same as getScaleFactors you’re using); at least 200 epochs. The only difference is that I’m using 30 columns of data, but tha’ts because 4 columns are not useful and 1 is the labels’ column.

1 Like

It would help immensely if you shared the hyperparams you used for the MLPClassifier as well. For example, I noticed they use a default L2 penalty coeffcient of 1e-4 whereas Flux applies none (unless you code it yourself).

Also, if you may enjoy GitHub - FluxML/MLJFlux.jl: An interface to the deep learning package Flux.jl from the MLJ.jl toolbox if you like the Sklearn interface. It’s a little more “batteries included” than base Flux for certain workflows.

It may just be due to the random initial weights… after all, we are giving just a single trial…
BetaML ADAM defaults are the same as the one you cited… also the X cols seem the same:

julia> describe(data[:,2:31])
30×7 DataFrame
 Row │ variable                     mean           min       median     max       nmissing  eltype   
     │ Symbol                       Float64        Real      Float64    Real      Int64     DataType 
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
   1 │ DER_mass_MMC                  -49.0231      -999.0     105.012   1192.03          0  Float64
   2 │ DER_mass_transverse_met_lep    49.2398         0.0      46.524    690.075         0  Float64
   3 │ DER_mass_vis                   81.182          6.329    73.752   1349.35          0  Float64
   4 │ DER_pt_h                       57.896          0.0      38.4675  2835.0           0  Float64
   5 │ DER_deltaeta_jet_jet         -708.421       -999.0    -999.0        8.503         0  Float64
   6 │ DER_mass_jet_jet             -601.237       -999.0    -999.0     4974.98          0  Float64
   7 │ DER_prodeta_jet_jet          -709.357       -999.0    -999.0       16.69          0  Float64
   8 │ DER_deltar_tau_lep              2.3731         0.208     2.4915     5.684         0  Float64
   9 │ DER_pt_tot                     18.9173         0.0      12.3155  2835.0           0  Float64
  10 │ DER_sum_pt                    158.432         46.104   120.665   1852.46          0  Float64
  11 │ DER_pt_ratio_lep_tau            1.43761        0.047     1.28      19.773         0  Float64
  12 │ DER_met_phi_centrality         -0.128305      -1.414    -0.356      1.414         0  Float64
  13 │ DER_lep_eta_centrality       -708.985       -999.0    -999.0        1.0           0  Float64
  14 │ PRI_tau_pt                     38.7074        20.0      31.804    764.408         0  Float64
  15 │ PRI_tau_eta                    -0.010973      -2.499    -0.023      2.497         0  Float64
  16 │ PRI_tau_phi                    -0.00817107    -3.142    -0.033      3.142         0  Float64
  17 │ PRI_lep_pt                     46.6602        26.0      40.516    560.271         0  Float64
  18 │ PRI_lep_eta                    -0.0195075     -2.505    -0.045      2.503         0  Float64
  19 │ PRI_lep_phi                     0.043543      -3.142     0.086      3.142         0  Float64
  20 │ PRI_met                        41.7172         0.109    34.802   2842.62          0  Float64
  21 │ PRI_met_phi                    -0.0101192     -3.142    -0.024      3.142         0  Float64
  22 │ PRI_met_sumet                 209.797         13.678   179.739   2003.98          0  Float64
  23 │ PRI_jet_num                     0.979176       0         1.0        3             0  Int64
  24 │ PRI_jet_leading_pt           -348.33        -999.0      38.96    1120.57          0  Float64
  25 │ PRI_jet_leading_eta          -399.254       -999.0      -1.872      4.499         0  Float64
  26 │ PRI_jet_leading_phi          -399.26        -999.0      -2.093      3.141         0  Float64
  27 │ PRI_jet_subleading_pt        -692.381       -999.0    -999.0      721.456         0  Float64
  28 │ PRI_jet_subleading_eta       -709.122       -999.0    -999.0        4.5           0  Float64
  29 │ PRI_jet_subleading_phi       -709.119       -999.0    -999.0        3.142         0  Float64
  30 │ PRI_jet_all_pt                 73.0646        -0.0      40.5125  1633.43          0  Float64

(I will be unable to reply for a week… )

Edit: I notice now that in my code I compute the train accuracy twice, I am not computing the second time the validation accuracy. . sorry, I can’t correct the script now…

1 Like

I’m trying to build the example provided in the Github page but, descpite having added MLJ and MLJFlux and having called using MLJ, MLJFlux, I keep getting a not defined error, as if the functions used are not part of these packages. For example, I don’t know how to use Short in NeuralNetworkClassifier, it keeps saying it’s not defined. This is true also for NNlib, but I’ve added Flux. in front and it works. I don’t know why the example doesn’t do this.

Ok, I was able to build a NeuralNetworkClassifier. Everything should be correct and using the same parameters as the code in my original post. Is it possible that using fit! and evaluate! of MLJFlux is a lot slower than using Flux directly? My code above run to completion in about 10 minutes, even the first time I run the program, while evaluating or fitting in MLJFlux is taking a lot longer (the dataset is the same).

Sorry, by NeuralNetworkClassifier I meant the sklearn model (which has somewhere in the neighbourhood of 20 hyperparams).

Any overhead MLJFlux adds should be negligible, so I think there may be something else at play.