I am learning the basics of ML and Flux.jl, and I’m trying to build a simple model to get started. I grabbed some data from the U.S. Census Bureau and created a gist on GitHub that contains a small cut (3,000 rows) of the data for your convenience.
The data consist of annual earnings (the labels) for more than 1,000,000 individuals, along with the number of hours they work, their occupation code, industry code, race code, age, and years of schooling (the features). For now, I’m trying to build a simple model that predicts earnings from age, years of schooling, and occupation.
Here’s what I have so far:
using Flux
using Queryverse
data = load("https://gist.githubusercontent.com/mthelm85/4ac5155462a9d801730eb18470d57904/raw/91e092ccf99237474df019bf4bf0930d6b62b113/wage_data.csv") |> DataFrame
# create OCCP dummy variables
for c in unique(data.OCCP)
    data[!, Symbol(c)] = ifelse.(data.OCCP .== c, 1, 0)
end
labels = data.WAGP
# for now, just use age, years of schooling (cols 6 and 7) and then dummy OCCP vars (cols 10:end)
features = permutedims(Matrix(hcat(data[:, [6, 7]], data[:, 10:end])))
labels_norm = Flux.normalise(labels)
features_norm = Flux.normalise(features, dims=2)
# Split into training and test sets, 2/3 for training, 1/3 for test.
train_indices = [1:3:length(labels) ; 2:3:length(labels)]
x_train = features_norm[:, train_indices]
y_train = permutedims(labels_norm[train_indices])
x_test = features_norm[:, 3:3:length(labels)]
y_test = permutedims(labels_norm[3:3:length(labels)])
model = Chain(Dense(size(x_train)[1], 32, relu), Dense(32, 32, relu), Dense(32, 1, σ))
loss(x, y) = Flux.mse(model(x), y)
optimiser = Descent(0.5)
# Train model over 110 epochs.
data_iterator = Iterators.repeated((x_train, y_train), 110)
Flux.train!(loss, params(model), data_iterator, optimiser)
test_results = model(x_test)
The specific questions I have right now are:
- Is there a better way to deal with large numbers of categorical variables, as in the case of the occupation codes? There are hundreds of occupation and industry codes so this obviously results in the number of features in the model becoming massive, very quickly.
- Should I normalize the labels or just the features?
- I’m confused about what my output layer should be here. I assume I only want one output since what I want to do is predict a single variable (the person’s income), but will a sigmoid activation function work for this? I was thinking that, since I’ve normalized the labels, this would work.
- How do I “un-normalize” the results that I get when passing x_test to the model, given that I’ve utilized Flux’s built-in normalise function?
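To make the last question concrete, here is a minimal sketch of what I imagine "un-normalizing" would involve. It assumes that normalising just means subtracting the mean and dividing by the standard deviation; I have not confirmed that this matches exactly what Flux's normalise does in every version. The labels vector is a made-up stand-in for data.WAGP, and standardise / unstandardise are hypothetical helper names, not Flux functions:

```julia
using Statistics

# Made-up stand-in for data.WAGP (annual earnings); not real data.
labels = [25_000.0, 40_000.0, 58_000.0, 120_000.0]

# Keep the statistics around so the transformation can be inverted later.
label_mean = mean(labels)
label_std = std(labels; corrected=false)  # assumption: population standard deviation

# Hypothetical helpers: forward and inverse transformation.
standardise(y) = (y .- label_mean) ./ label_std
unstandardise(z) = z .* label_std .+ label_mean

labels_norm = standardise(labels)
restored = unstandardise(labels_norm)  # back on the original earnings scale
```

If that assumption holds, the same stored mean and standard deviation could presumably be applied to model(x_test) to get predictions back in dollars. The sketch also shows that labels standardised this way include negative values, which a sigmoid output, bounded to (0, 1), cannot produce.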