Advice on rank-hot (thermometer) encoding of categorical variables

opera_malenky · September 9, 2022, 5:54pm

As the title says, I’ve been wanting to play around with rank-hot encoding for ordinal data (sometimes also called thermometer encoding).

It’s like one-hot, but instead of a single 1 in the position associated with a particular ordinal rank, all rank indicators up to and including the observed rank are coded as 1’s, while all rank indicators above the observed rank are coded as 0’s.

To give an example, the vector

[1, 3, 2, 5]

should lead to the matrix (assuming observations are row-wise):

 1  0  0  0  0
 1  1  1  0  0
 1  1  0  0  0
 1  1  1  1  1

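For anyone who wants to check the definition quickly, here is a minimal sketch using a broadcast comparison. The name `thermometer` is made up for illustration, and it assumes the ranks are already integers in `1:nlevels`:

```julia
# Hypothetical helper; assumes `ranks` holds integers in 1:nlevels.
# Comparing an n-vector against a 1×nlevels row broadcasts to an n×nlevels matrix.
thermometer(ranks, nlevels) = Int.(ranks .>= reshape(1:nlevels, 1, :))

m = thermometer([1, 3, 2, 5], 5)
```

Each row of `m` is one observation, and column `i` is 1 whenever the observed rank is at least `i`.
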
I took a stab at making a function that uses CategoricalArrays; it works, but I thought I’d post and see if anyone has suggestions for a better approach.

using CategoricalArrays

v = ["Group A", "Group A", "Group B", "Group C", "Group C", "Group D"]
cv = categorical(v)  # Convert string array to categorical array


function rankhot(x)

    # Get number of levels of ordinal variable
    max_level = maximum(levelcode.(x)) 

    # Number of ordinal observations in vector x
    n = length(x) 

    # Create an N x L matrix (observations x number of levels)
    # to store results
    y = Matrix{Float64}(undef, n, max_level) 

    for i in 1:max_level
        for j in 1:n

            # Check that the current level number is less than or 
            # equal to the level number of the current observation
            # If it is, assign 1, else 0
            y[j, i] = (i <= levelcode(x[j])) ? 1. : 0.
        end
    end

    return y

end

rankhot(cv)

This works fine, but I was curious whether anyone had suggestions for improvements: for example, something that uses (and outputs) a DataFrame (which I plan to use more in the future), or maybe a simpler function (map or comprehensions?).
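
One possible shape for the DataFrame variant, as a rough sketch only: `rankhot_df` is a hypothetical name, and it assumes DataFrames.jl is installed, that the vector has no missing values, and that the order of `levels(x)` is the intended ordinal order.

```julia
using CategoricalArrays, DataFrames

# Hypothetical helper: one indicator column per level, named after the level.
# Assumes x has no missing values and its level order is the ordinal order.
function rankhot_df(x)
    codes = levelcode.(x)   # integer rank of each observation
    # Column i is 1 where the observed rank is at least i.
    cols = [string(l) => Int.(codes .>= i) for (i, l) in enumerate(levels(x))]
    return DataFrame(cols)
end

v = ["Group A", "Group A", "Group B", "Group C", "Group C", "Group D"]
cv = categorical(v; ordered=true)
df = rankhot_df(cv)
```

Using `levels(x)` rather than the largest observed level code means a level that happens not to appear in the data still gets its own column.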

Frankly, I had initially planned to take one of the one-hot encoding implementations out there and adapt it, but I found most of their code a bit difficult to follow (though I admit I didn’t spend a huge amount of time trying). Maybe later.

jar1 · September 9, 2022, 6:31pm

using Compat

julia> rankhot(k, n) = [Int(i≤k) for i in 1:n]

julia> stack(rankhot.([1,2,3,4,5,4,3,2,1], 5); dims=1)
9×5 Matrix{Int64}:
 1  0  0  0  0
 1  1  0  0  0
 1  1  1  0  0
 1  1  1  1  0
 1  1  1  1  1
 1  1  1  1  0
 1  1  1  0  0
 1  1  0  0  0
 1  0  0  0  0

opera_malenky · September 9, 2022, 7:08pm

Nice! Thanks.

I had also attempted something using comprehensions (since it seemed like it should be an elegant approach) that looked very similar:

rankhot2(x, nlev=length(levels(x))) = reshape([i <= levelcode(x[j]) for i=1:nlev for j=1:length(x)], length(x), nlev)
rankhot2(cv)

Which is very slightly slower than yours.

The original code in the OP actually is rather fast by comparison to both, but since this would likely be a one-time operation on the data before doing any analyses, the speed is not really all that relevant. And it’s nice to see other ways to do this.

jar1 · September 9, 2022, 7:38pm


julia> XS = [1,2,3,4,5,4,3,2,1]

julia> N = maximum(XS);

julia> [Int(i≤x) for x in XS, i in 1:N]

or



julia> map(Base.splat(≥), Iterators.product(XS, 1:N))
9×5 Matrix{Bool}:
 1  0  0  0  0
 1  1  0  0  0
 1  1  1  0  0
 1  1  1  1  0
 1  1  1  1  1
 1  1  1  1  0
 1  1  1  0  0
 1  1  0  0  0
 1  0  0  0  0
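
For anyone curious about the speed comparison mentioned earlier in the thread, here is a rough way to time the approaches, sketched under some assumptions: integer codes stand in for `levelcode.(x)`, the input size is arbitrary, and the function names are invented for illustration. `@elapsed` gives only a crude single-run number; BenchmarkTools would be more reliable.

```julia
# Integer ranks stand in for levelcode.(x); the size is an arbitrary assumption.
nlevels = 5
codes = rand(1:nlevels, 100_000)

# Explicit loops filling a preallocated matrix, in the style of the original post.
function rankhot_loop(codes, nlevels)
    y = Matrix{Float64}(undef, length(codes), nlevels)
    for i in 1:nlevels, j in eachindex(codes)
        y[j, i] = i <= codes[j] ? 1.0 : 0.0
    end
    return y
end

# Two-dimensional comprehension, in the style of the reply above.
rankhot_comp(codes, nlevels) = [Int(i <= c) for c in codes, i in 1:nlevels]

# Call each once first so compilation is not included in the timing.
rankhot_loop(codes, nlevels); rankhot_comp(codes, nlevels)

t_loop = @elapsed rankhot_loop(codes, nlevels)
t_comp = @elapsed rankhot_comp(codes, nlevels)
println("loop: ", t_loop, " s, comprehension: ", t_comp, " s")
```

Timings will vary by machine and Julia version, so treat any single run as indicative at best.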
