Flux, categorical arrays, roc curves, confusion matrices

Dear all,

I’m trying Flux to build a model that predicts the occurrence of an event from a data set containing both continuous (Float64) and categorical (Int64) arrays. I’m learning from the tutorials and the Talk Julia videos, which are all very useful. I have a couple of questions; some have been discussed in old threads, but I would guess that things have changed in the meantime.

  1. I wonder how to best deal with categorical arrays in Flux (Int64 columns in a DataFrame in my case). As far as I understand, before loading the data, these arrays should be one-hot encoded. Should one use other packages like MLJ.jl (and its OneHotEncoder) for this, or is there a built-in function that can be used to encode categorical arrays? I have seen `onehot`, but I’m not sure it can be used to encode a vector with Int64 entries (at least I’ve not managed to).

  2. Is there a way to automatically partition a data set between training and testing (e.g. `partition` in MLJ)?

  3. Is there a direct way to create ROC curves and confusion matrices to assess the performance of a model at the testing phase using Flux? I’ve seen several packages (MLJ, ROC, ROCcurves) that provide such functionality, but I can’t seem to find it in the Flux documentation.

I’m not very knowledgeable in ML, and I apologise if my questions are inaccurate or if they simply do not apply.

Cheers and thanks!

  1. Regarding the one-hot encoding, you may want to use `Flux.onehotbatch` (and it does work with Int types):

     julia> Flux.onehotbatch(reshape(rand(0:3, 100), (50,2)), 0:3)
     4×50×2 OneHotArray(::Matrix{UInt32}) with eltype Bool:
     [:, :, 1] =
      0  1  0  0  1  0  0  0  0  0  0  0  0  0  1  0  0  0  0  …  1  0  0  0  1  0  0  1  0  1  0  1  1  0  0  1  1  1  0
      0  0  0  1  0  1  0  0  0  1  0  0  1  0  0  1  0  1  1     0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
      0  0  1  0  0  0  1  1  0  0  1  1  0  1  0  0  1  0  0     0  1  1  0  0  0  0  0  1  0  1  0  0  1  1  0  0  0  0
      1  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0
     [:, :, 2] =
      0  0  0  0  1  0  0  1  0  1  1  0  0  0  0  0  0  1  0  …  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
      0  0  0  0  0  1  1  0  0  0  0  0  0  0  0  1  0  0  0     0  0  0  1  1  0  0  0  0  0  0  0  0  1  1  0  0  0  0
      0  0  1  1  0  0  0  0  1  0  0  1  0  1  1  0  0  0  0     1  0  0  0  0  1  0  0  0  0  0  0  1  0  0  1  1  0  1
      1  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  1     0  1  0  0  0  0  1  1  1  1  1  1  0  0  0  0  0  1  0
  2. You may refer to `splitobs` from MLUtils.jl (JuliaML/MLUtils.jl: utilities and abstractions for machine learning tasks, on GitHub).
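For reference, a minimal sketch of how `splitobs` (together with `shuffleobs`) could be used on a feature matrix and label vector; the data here is random and just for illustration:

```julia
using MLUtils

X = rand(Float32, 4, 100)   # 4 features × 100 observations (observations along the last dim)
y = rand(0:1, 100)          # binary labels

# shuffle the observations, then split 80% / 20% into train and test sets
(X_train, y_train), (X_test, y_test) = splitobs(shuffleobs((X, y)); at=0.8)

size(X_train)  # (4, 80)
```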

  3. AFAIK, there are no built-in functions for that in Flux.jl.

I am also new to the Julia ML community, if I get something wrong, please point it out! Really appreciate that.


thanks a lot!

I now separate the categorical array (Int) from the continuous data (Float).
I convert it to a one-hot batch, then concatenate the batch with the continuous data and provide the result to the data loader. That seems to work!
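For anyone landing here later, a rough sketch of that workflow (variable names and sizes are made up; converting the one-hot block to Float32 keeps the concatenated matrix in a single concrete type):

```julia
using Flux

X_cont = rand(Float32, 3, 100)             # 3 continuous features × 100 samples
x_cat  = rand(1:4, 100)                    # one categorical feature with 4 levels
X_cat  = Flux.onehotbatch(x_cat, 1:4)      # 4×100 one-hot matrix
y      = Flux.onehotbatch(rand(0:1, 100), 0:1)

# stack continuous and one-hot features along the feature dimension
X_train = vcat(X_cont, Float32.(X_cat))    # 7×100 Matrix{Float32}
loader  = Flux.DataLoader((X_train, y); batchsize=16, shuffle=true)
```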

Also, thanks a lot for pointing out splitobs from MLUtils, it’s very useful. Too bad it’s not directly accessible via Flux (like DataLoader).

We don’t define any performance metrics in Flux, so these packages you’ve listed are the way to go.

Flux actually imports MLUtils (you can confirm for yourself that Flux.DataLoader === MLUtils.DataLoader), so adding MLUtils to your own environment should have zero impact on latency or the number of libraries loaded.


It is nice that my suggestion works. :grinning:

Too bad it’s not directly accessible via Flux (like DataLoader).

From my experience with Julia, you should not expect all-in-one packages like Python’s pandas or torch, and I guess there is a reason for it: the loading time for Flux is already high, so it is reasonable to keep functionality beyond the core ML framework in separate packages. But they do have a page in the docs demonstrating how to handle data for training using MLUtils.

PS: JuliaHub can be helpful for finding packages.


I rewrote a TF tutorial about speech recognition (classification of simple voice commands using a CNN on FFT features) in Julia / Flux / Pluto / Makie.

I think it could help. I had to compute the confusion matrix, and I plotted it using a heatmap.

Pluto Notebook HTML


Pluto Notebook code
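For readers who just want the counting part without the notebook: a confusion matrix is only a few lines of plain Julia (this is a generic sketch, not code from the notebook):

```julia
# counts[i, j] = number of samples whose true class is classes[i]
# and whose predicted class is classes[j]
function confusion_matrix(y_true, y_pred, classes)
    index = Dict(c => i for (i, c) in enumerate(classes))
    counts = zeros(Int, length(classes), length(classes))
    for (t, p) in zip(y_true, y_pred)
        counts[index[t], index[p]] += 1
    end
    return counts
end

confusion_matrix([1, 1, 2, 2], [1, 2, 2, 2], 1:2)  # [1 1; 0 2]
```

The resulting matrix can then be passed straight to a heatmap for plotting.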



@Frankiewaang: in fact, I’ve just realised that what I do is incorrect. By concatenating the categorical part of the data (OneHotMatrix(::Vector{UInt32})) with the continuous part (Matrix{Float64}), the whole data set is promoted to Matrix{Float64}, which I guess is incorrect (?).
So what would be the correct way to define training data X_train such that it contains information from both the continuous and categorical data?

For example, I now have training data X_cont (Matrix{Float64}), X_cat (OneHotMatrix(::Vector{UInt32})) and the truth y_train (OneHotMatrix(::Vector{UInt32})). What is the correct way to “merge” X_cont and X_cat so that they are treated as a single training data set X_train, which can then be provided to the data loader?
loader = Flux.DataLoader((X_train, y_train), batchsize=64, shuffle=true);

@ToucheSir: thanks a lot for the clarification. I was initially just confused that one can access the loader (which comes from MLUtils as I understand) with Flux.DataLoader but not Flux.splitobs. Now, I simply load MLUtils without overhead which is very nice :smiley:

@cirocavani: many thanks for the link. I will definitely study it in detail!

It really depends on your model. Some tree models like XGBoost or LightGBM may require you to specify the categorical variables, but for a neural network, just concatenating them and feeding them into your model could be enough (the neurons can learn that from the data, I suppose).
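As a sketch of that “just concatenate” approach, here is a toy dense network consuming the stacked features (all names and sizes here are invented for illustration, using Flux’s `Dense(in => out)` syntax):

```julia
using Flux

X_cont = rand(Float32, 3, 100)                   # continuous part
X_cat  = Flux.onehotbatch(rand(1:4, 100), 1:4)   # categorical part, one-hot
X      = vcat(X_cont, Float32.(X_cat))           # 3 + 4 = 7 input features

model = Chain(Dense(7 => 16, relu), Dense(16 => 2), softmax)
size(model(X))  # (2, 100): class probabilities for each of the 100 samples
```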

I am not an expert, but if you could share the backbone of your model, I may have some thoughts.

Ok, I see. I just found it weird to first one-hot encode an array and then concatenate it with a Float array: the one-hot encoded data then gets promoted to Float, which in my mind is a bit like turning a sparse matrix into a dense one. I thought there would be a different way of doing this, but maybe it’s just my intuition that is wrong.
I keep looking for an example that uses both floating-point and categorical types as X data for Flux. If anyone knows of one, I would be delighted :smiley:
Thanks again for the advice!

@Gravlax It seems to me that all the functionality you are asking for is available from MLJ and demonstrated in this one tutorial. The tutorial demonstrates the use of EvoTrees.jl (a pure Julia gradient tree boosting implementation like XGBoost and LightGBM) but MLJ has interfaces for almost two hundred models, including neural networks (via MLJFlux.jl). Given your mixed data types, it sounds like you have structured data and my suggestion would be to try EvoTrees.jl first. Neural networks are the models of choice for image and video, sometimes good for audio, natural language processing and other “sequence” data, but generally not good for structured data, and more difficult to train than many other models.

If you still want to try a neural network through the MLJ interface, then this MLJFlux tutorial may be helpful.


just to pinpoint two ideas (more generally speaking).

If your categorical feature has many values, you may try to use Embeddings. This way you can create a dense vector representing something relevant in your dataset. With one-hot encoding of N categories and a first dense layer of M neurons, you will have N × M weights (your input is sparse, but your network learns a “connection” from each bit to each neuron). Using an Embedding, each category index is mapped to a dense vector of dimension K; considering N >> K, your network will be “smaller” (those N × K weights will be learned). This is usually used with text (word2vec) and extended to “latent spaces” or other more complex “encoding” architectures. One-hot is still the way to go in most cases.
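In Flux, this idea is available as the `Embedding` layer; a minimal sketch with made-up sizes (N = 1000 categories mapped to K = 8 dimensions):

```julia
using Flux

vocab_size, embed_dim = 1000, 8
emb = Flux.Embedding(vocab_size => embed_dim)   # learnable embed_dim × vocab_size lookup table

x = rand(1:vocab_size, 32)                      # a batch of 32 category indices
size(emb(x))                                    # (8, 32): one dense vector per index
```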

The other thing: for more complex architectures, there is an option to use “multi-stream” networks, where you can split your X into different inputs (x1, x2, …, xn) and combine network streams to get an output and a loss (multiple outputs/losses are also possible). In this case, you can create layers to “featurize” each relevant part of your input and then concatenate them into a combined output, followed by more layers to create the final output. I think this is mostly used to combine multi-modal inputs, like audio / image / text, in a single network trained end to end.
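Flux supports this pattern through the `Parallel` layer, which routes a tuple of inputs to separate branches and combines their outputs; a small sketch with invented sizes:

```julia
using Flux

# two input streams: x1 (3 continuous features) and x2 (4-level one-hot)
model = Chain(
    Parallel(vcat,              # run both branches, concatenate the results
        Dense(3 => 8, relu),    # featurize the continuous stream
        Dense(4 => 8, relu)),   # featurize the categorical stream
    Dense(16 => 2),
    softmax)

x1 = rand(Float32, 3, 10)
x2 = Float32.(Flux.onehotbatch(rand(1:4, 10), 1:4))
size(model((x1, x2)))  # (2, 10)
```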

I don’t have examples to link, but Flux seems to support both ideas.



I hope these help you consider different approaches (and explore the field “deeply”), but as @ablaom said, MLJ with a gradient boosting algorithm should be a better solution for most tabular datasets.


thanks again, it is very insightful! I have tried to reproduce your example but came across this issue:

NNR = @load  NeuralNetworkRegressor
ArgumentError: Ambiguous model type name. Use pkg=... .
The model NeuralNetworkRegressor is provided by these packages:
 ["MLJFlux", "BetaML"].

so I’ve tried:

NNR = @load  MLJFlux.NeuralNetworkRegressor
ArgumentError: There is no model named "MLJFlux.NeuralNetworkRegressor" in the registry. 
 Run `models()` to view all registered models, or `models(needle)` to restrict search to models with string `needle` in their name or documentation. 

So then I’ve checked if it exists:

  builder = Linear(
        σ = NNlib.relu),...

I’m not sure what I’m doing wrong, but I have an issue loading the regressor. I’ve loaded the packages in the same order as in your example.

Following the suggestion provided in the error, if you do

NNR = @load NeuralNetworkRegressor pkg = MLJFlux

that should load the regressor properly. Hope that helps!


Perfect, it does the job. I’ve never met this error message before, nor the syntax :blush:


The @load syntax is MLJ specific. For more on loading code for models, see here.