Advice on rank-hot (thermometer) encoding of categorical variables

As the title says, I've been experimenting with rank-hot encoding (sometimes also called thermometer encoding) for ordinal data.

It’s like one-hot encoding, but instead of a single 1 in the position for the observed rank, every rank indicator up to and including the observed rank is coded as 1, and every rank indicator above the observed rank is coded as 0.

To give an example, the vector

[1, 3, 2, 5]

should lead to the matrix (assuming observations are row-wise):

1 0 0 0 0
1 1 1 0 0 
1 1 0 0 0 
1 1 1 1 1
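
For reference, the rule described above can be sketched in plain Julia with a broadcasted comparison. This is only a minimal illustration that assumes the ranks are already integers from 1 to the number of ranks; the names `ranks`, `nranks`, and `encoded` are made up for the example.

```julia
# Observed ordinal ranks, one per observation (the example vector above)
ranks = [1, 3, 2, 5]

# Number of possible ranks; assumed here to be the largest observed rank
nranks = maximum(ranks)

# Compare each rank (rows) against every rank position (columns):
# entry (j, i) is 1 when position i is at or below the rank of observation j
encoded = Int.(ranks .>= (1:nranks)')

display(encoded)
```

Each row of `encoded` corresponds to one observation, so the result should match the matrix shown above.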

I took a stab at making a function that uses CategoricalArrays; it works, but I thought I’d post and see if anyone has suggestions for a better approach.

using CategoricalArrays

v = ["Group A", "Group A", "Group B", "Group C", "Group C", "Group D"]
cv = categorical(v)  # Convert string array to categorical array


function rankhot(x)

    # Highest level code observed in x; this equals the number of levels
    # only if the top level actually occurs in the data
    max_level = maximum(levelcode.(x))

    # Number of ordinal observations in vector x
    n = length(x) 

    # Create an N x L matrix (observations x number of levels)
    # to store results
    y = Matrix{Float64}(undef, n, max_level) 

    for i in 1:max_level
        for j in 1:n

            # Check that the current level number is less than or 
            # equal to the level number of the current observation
            # If it is, assign 1, else 0
            y[j, i] = (i <= levelcode(x[j])) ? 1. : 0.
        end
    end

    return y

end

rankhot(cv)

This works fine, but I’m curious whether anyone has suggestions for improvement: for example, a version that takes and returns a DataFrame (which I plan to use more in the future), or a simpler function built on map or a comprehension.
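
One possible sketch of the DataFrame-returning variant, assuming DataFrames.jl is installed alongside CategoricalArrays. The helper name `rankhot_df` is hypothetical, and naming each column after its level is just one choice among several. It also assumes there are no missing values in the input.

```julia
using CategoricalArrays, DataFrames

# Hypothetical helper: returns one Bool column per level, named after the level.
# Assumes x contains no missing values.
function rankhot_df(x::CategoricalArray)
    lv = levels(x)                      # all levels, in order
    codes = levelcode.(x)               # integer code of each observation
    mat = codes .>= (1:length(lv))'     # observations x levels, true up to the observed level
    return DataFrame(mat, string.(lv))  # column names taken from the level labels
end

v = ["Group A", "Group A", "Group B", "Group C", "Group C", "Group D"]
cv = categorical(v; ordered=true)       # mark the variable as ordinal

df = rankhot_df(cv)
```

Using `levels(x)` rather than the largest observed code means a level that never occurs in the data still gets its own column.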

Frankly, I had initially planned to take one of the one-hot encoding implementations out there and adapt it, but I found most of their code a bit hard to follow (though I admit I didn’t spend a huge amount of time trying). :slight_smile: Maybe later.

using Compat  # provides `stack` on Julia versions before 1.9

julia> rankhot(k, n) = [Int(i≤k) for i in 1:n]

julia> stack(rankhot.([1,2,3,4,5,4,3,2,1], 5); dims=1)
9×5 Matrix{Int64}:
 1  0  0  0  0
 1  1  0  0  0
 1  1  1  0  0
 1  1  1  1  0
 1  1  1  1  1
 1  1  1  1  0
 1  1  1  0  0
 1  1  0  0  0
 1  0  0  0  0

Nice! Thanks.

I had also attempted something using comprehensions (since it seemed like it should be an elegant approach) that looked very similar:

rankhot2(x) = reshape([i <= levelcode(x[j]) for i=1:length(levels(x)) for j=1:length(x)],
                      length(x), length(levels(x)))  # number of levels taken from x instead of a hard-coded 5
rankhot2(cv)

It is very slightly slower than yours.

The loop-based code in my original post is actually rather fast compared with both, but since this would likely be a one-time operation on the data before any analysis, speed is not really relevant here. And it’s nice to see other ways to do this.
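
For anyone who wants to check the timings themselves, here is a rough sketch that uses only Base. Integer codes stand in for `levelcode.(x)`, the function names `rankhot_loop` and `rankhot_comp` are made up for this comparison, and `@elapsed` on a single run is noisy, so BenchmarkTools' `@btime` would give more reliable numbers.

```julia
# Integer rank codes stand in for levelcode.(x); all names here are illustrative.
codes = rand(1:5, 10_000)
nlev = 5

# Explicit loop, filling a preallocated matrix column by column
function rankhot_loop(codes, nlev)
    y = Matrix{Float64}(undef, length(codes), nlev)
    for i in 1:nlev, j in eachindex(codes)
        y[j, i] = i <= codes[j] ? 1.0 : 0.0
    end
    return y
end

# Two-dimensional comprehension: rows are observations, columns are levels
rankhot_comp(codes, nlev) = [Int(i <= c) for c in codes, i in 1:nlev]

# Call each function once first so compilation time is not measured
rankhot_loop(codes, nlev)
rankhot_comp(codes, nlev)

t_loop = @elapsed rankhot_loop(codes, nlev)
t_comp = @elapsed rankhot_comp(codes, nlev)
println("loop: ", t_loop, " s, comprehension: ", t_comp, " s")
```

Both functions produce the same encoding, so any difference is purely in how the matrix is built.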


julia> XS = [1,2,3,4,5,4,3,2,1]

julia> N = maximum(XS);

julia> [Int(i≤x) for x in XS, i in 1:N]

or

julia> map(Base.splat(≥), Iterators.product(XS, 1:N))
9×5 Matrix{Bool}:
 1  0  0  0  0
 1  1  0  0  0
 1  1  1  0  0
 1  1  1  1  0
 1  1  1  1  1
 1  1  1  1  0
 1  1  1  0  0
 1  1  0  0  0
 1  0  0  0  0