Help me build a Flux.jl model and learn basic ML concepts

I am learning the basics of ML/Flux.jl and I’m trying to build a simple model to get started. I grabbed some data from the U.S. Census Bureau and went ahead and created a gist on GitHub that contains a small cut (3,000 rows) of the data for your convenience.

The data consist of annual earnings (labels) for more than 1,000,000 individuals, as well as the number of hours they work, their occupation code, industry code, race code, age, and years of schooling (features). For now, I’m trying to build a simple model that predicts earnings based on age, years of schooling, and occupation.

Here’s what I have so far:

using Flux
using Queryverse

data = load("") |> DataFrame

# create OCCP dummy variables
for c in unique(data.OCCP)
    data[!, Symbol(c)] = ifelse.(data.OCCP .== c, 1, 0)
end

labels = data.WAGP

# for now, just use age, years of schooling (cols 6 and 7) and then dummy OCCP vars (cols 10:end)
features = permutedims(Matrix(hcat(data[:, [6, 7]], data[:, 10:end])))

labels_norm = Flux.normalise(labels)
features_norm = Flux.normalise(features, dims=2)

# Split into training and test sets, 2/3 for training, 1/3 for test.
train_indices = [1:3:length(labels) ; 2:3:length(labels)]

x_train = features_norm[:, train_indices]
y_train = permutedims(labels_norm[train_indices])

x_test = features_norm[:, 3:3:length(labels)]
y_test = permutedims(labels_norm[3:3:length(labels)])

model = Chain(Dense(size(x_train)[1], 32, relu), Dense(32, 32, relu), Dense(32, 1, σ))
loss(x, y) = Flux.mse(model(x), y)
optimiser = Descent(0.5)

# Train model over 110 epochs.
data_iterator = Iterators.repeated((x_train, y_train), 110)

Flux.train!(loss, params(model), data_iterator, optimiser)
test_results = model(x_test)

The specific questions I have right now are:

  1. Is there a better way to deal with large numbers of categorical variables, as in the case of the occupation codes? There are hundreds of occupation and industry codes so this obviously results in the number of features in the model becoming massive, very quickly.

  2. Should I normalize the labels or just the features?

  3. I’m confused about what my output layer should be here. I assume I only want one output since what I want to do is predict a single variable (the person’s income), but will a sigmoid activation function work for this? I was thinking that, since I’ve normalized the labels, this would work.

  4. How do I “un-normalize” the results that I get when passing x_test to the model, given that I’ve utilized Flux’s built-in normalise function?

In my opinion, neural nets in general are not the best way to start getting acquainted with ML. Flux.jl is a nice package, but one needs to be familiar with the complexities associated with neural networks in general. I can recommend this as perhaps one of the best ways to gently approach numerical analysis and learning in Julia. The Stanford courses on ML are also pretty good. As for your questions:

  1. It depends: you may drop some variables, use an algorithm that selects them for you, or feed them all into the model and ‘hope’ (through a good architecture) that it optimizes quickly and generalizes well; there is no universal recipe
  2. For classification labels it will not make a difference; in regression it can, depending on how the normalization is done. For features it again depends on the data and the model
  3. I am not an expert, but for classification softmax is a popular output function; for a single-variable regression like yours, a linear (identity) output is the usual choice, since a sigmoid confines predictions to (0, 1)
  4. Find out exactly what the normalization does and apply the reverse operation; if the data were standardized with the mean and standard deviation, multiply by the standard deviation and add the mean back
    MLDataPattern.jl has a nice example for testing the generalization of models (train & test of various bits of the data)
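To expand a bit on point 1: instead of building dummy columns by hand, Flux has a `onehotbatch` helper, and for hundreds of codes an embedding layer (available as `Flux.Embedding` in newer Flux versions) keeps the feature count manageable. A minimal sketch — the occupation codes and the embedding size of 8 here are made up for illustration:

```julia
using Flux

# Hypothetical occupation codes for five people.
occp = [10, 20, 10, 30, 20]
codes = unique(occp)                      # the label set: [10, 20, 30]

# One-hot encoding: a length(codes) × 5 matrix of 0/1 columns.
onehot_features = Flux.onehotbatch(occp, codes)

# With hundreds of codes, an embedding maps each code to a small
# dense vector instead (8 dimensions here, chosen arbitrarily),
# so the feature count no longer grows with the number of codes.
embed = Flux.Embedding(length(codes) => 8)
code_indices = [findfirst(==(c), codes) for c in occp]
dense_features = embed(code_indices)      # an 8 × 5 matrix
```

The embedding weights are trained along with the rest of the model, so similar occupations can end up with similar vectors.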
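And on point 4: `Flux.normalise` does not return the mean and standard deviation it used, so if you want to invert the transform, it is easiest to standardize manually and keep the statistics around. A sketch assuming plain (mean, std) standardization — note that `Flux.normalise` uses a slightly different std convention internally, so to round-trip exactly you should do both directions yourself, as here (the label values are made up):

```julia
using Statistics

labels = [52_000.0, 31_000.0, 88_000.0, 45_500.0]

# Standardize manually so μ and σ are available later.
μ, σ = mean(labels), std(labels)
labels_norm = (labels .- μ) ./ σ

# "Un-normalize" model outputs by reversing the two steps.
predictions_norm = labels_norm            # stand-in for model(x_test) output
predictions = predictions_norm .* σ .+ μ  # back on the original dollar scale
```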