As the title says, I wanted to play around with rank-hot encoding for ordinal data (sometimes also called thermometer encoding).
It’s like one-hot, but instead of a single 1 in the position associated with a particular ordinal rank, all rank indicators up to and including the observed rank are coded as 1’s, while all rank indicators above the observed rank are coded as 0’s.
To give an example, the vector
[1, 3, 2, 5]
should lead to the matrix (assuming observations are row-wise):
1 0 0 0 0
1 1 1 0 0
1 1 0 0 0
1 1 1 1 1
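For what it's worth, the example above can be reproduced on plain integer ranks with a two-dimensional comprehension. This is only a minimal sketch: the names `codes`, `n_levels`, and `encoded` are made up for illustration, and it assumes the ranks are already integers in 1..5.

```julia
# Integer ranks from the example above (assumed to lie in 1:n_levels)
codes = [1, 3, 2, 5]
n_levels = 5

# Rows are observations, columns are rank indicators:
# entry (row, k) is 1 when k is at or below that observation's rank
encoded = [Int(k <= c) for c in codes, k in 1:n_levels]
```

The first iterator in the comprehension (`c in codes`) runs along the rows, so each observation ends up as one row of the result.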
I took a stab at making a function that uses CategoricalArrays; it works, but I thought I’d post and see if anyone has suggestions for a better approach.
using CategoricalArrays

v = ["Group A", "Group A", "Group B", "Group C", "Group C", "Group D"]
cv = categorical(v, ordered=true) # Convert string array to an ordered categorical array

function rankhot(x)
    # Get number of levels of ordinal variable
    # (counts every level, even one that is not observed in x)
    n_levels = length(levels(x))
    # Number of ordinal observations in vector x
    n = length(x)
    # Create an N x L matrix (observations x number of levels)
    # to store results
    y = Matrix{Float64}(undef, n, n_levels)
    for i in 1:n_levels
        for j in 1:n
            # Check that the current level number is less than or
            # equal to the level number of the current observation
            # If it is, assign 1, else 0
            y[j, i] = (i <= levelcode(x[j])) ? 1.0 : 0.0
        end
    end
    return y
end

rankhot(cv)
This works fine, but I was curious whether anyone had suggestions for improvements: for example, something that uses (and outputs) a DataFrame (which I plan to use more in the future), or maybe a simpler function (map or comprehensions?).
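To make the DataFrame idea concrete, here is a rough sketch of the direction I have in mind. It is untested beyond small toy inputs, the name `rankhot_df` is just a placeholder, and it assumes the input is a categorical vector with no missing values and that DataFrames is installed.

```julia
using CategoricalArrays, DataFrames

# Hypothetical helper: rank-hot encode a categorical vector into a DataFrame
# with one column per level, named after that level.
# Assumes x contains no missing values.
function rankhot_df(x)
    lv = levels(x)            # all levels, in order
    codes = levelcode.(x)     # integer rank of each observation
    # Rows are observations, columns are levels
    mat = [Float64(k <= c) for c in codes, k in eachindex(lv)]
    return DataFrame(mat, string.(lv))
end

toy = categorical(["Group A", "Group C", "Group B"], ordered=true)
df = rankhot_df(toy)
```

The comprehension replaces the nested loops, and the level labels become the column names, which seems easier to read than a bare matrix.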
Frankly, I had initially planned to take one of the one-hot encoding implementations out there and adapt it, but I found most of their code a bit difficult to follow (though I admit I didn’t spend a huge amount of time trying). Maybe later.