Learning Julia: Writing a onehot encoder

Hello fellow Julia coders. I’m relatively new to Julia, and one of the first things I like to do with a new language is translate an existing script into it. I was working on a simple machine learning project with the Kaggle Titanic dataset to predict survival and decided I needed a way to one-hot encode the String (and similar String-like) columns. I found a number of options in other packages but felt this was a moderately advanced challenge to try to code myself.

I’d be interested in feedback and thoughts on how to improve this function and hope you find it interesting as you continue to learn in the Julia world!

function onehotenc(df, dropnums = false, droplast = false)

    # Near the top of the function ensure that a DataFrame is passed into the function with @assert
    @assert typeof(df) == DataFrame "onehotenc expecting a DataFrame"

    # If the dropnums parameter is set to true then call the keepstrings!() function
    # Pull out the string or categorical variables to be onehot-encoded
    df = keepstrings(df)
    #println("keepstrings")
    
    # For each column to be onehot-encoded
    # Set up a counter for the attributes that are onehot-encoded
    ct = 0

    # Set aside a DataFrame for building the onehot-encoding DataFrame
    onehot = DataFrame()

    # Loop to develop onehot-encoding
    for col in eachcol(df)
        # Increment the counter for our first attribute
        ct += 1
        #println("col = $col")
        
        # Save the length of the original set of values
        len = length(col)
        #println("len = $len")

        # Save the unique values (or Set) and how many there are for our initial zeros matrix
        vals = unique(col)
        lenvals = length(vals)
        #println("vals = $vals, and lenvals = $lenvals")

        # Save a DataFrame of a Matrix of zeros of the length of original rows (rows), and length of unique values (cols)
        zero = DataFrame(zeros(len, lenvals))
        #println("Created the zero dataframe")

        # Rename the new zeros DF columns with the unique values of the attribute being onehot-encoded NEW FUNCTION?
        # If droplast = true then don't convert the last attribute to a onehot column
        z_ct = 0
        for name in names(zero)
            z_ct += 1
            rename!(zero, name => Symbol(vals[z_ct]))
        end # for
        #println("renamezeros")

        # Change the value from 0 to 1 in the zero DataFrame based on whether the unique value matches the original DF
        for col_zero in 1:size(zero, 2) 
            zero[!, col_zero] = [col[i] == vals[col_zero] for i in eachindex(col)]
            #println("col = $col, col_zero = $col_zero")
        end

        # Logic to get the master zeros DataFrame loaded
        if ct == 1
            #println("Copying zero to onehot")
            onehot = copy(zero) 
        else
            #println("hcat zero to onehot")
            onehot = hcat(onehot, zero)
        end
    end
    
    return onehot
    
end

First, welcome.

Several suggestions:

  • First, the code is not quoted properly: the opening function line and the final end are outside the code block and render as normal text. Check the ```julia fence.

  • Do not use comparisons with typeof. In this case it is better to write df::DataFrame in the function signature.

  • The zero DataFrame should hold Bool values; in your case it holds Float64, because zeros(len, lenvals) defaults to Float64.

  • zero is a very bad variable name: it shadows the Base function zero.

  • Your comment says keepstrings is called or not depending on dropnums, but it is always called. Its definition is also not included.

  • Please do not use println like that. If you want to show information, use the Logging standard library: https://docs.julialang.org/en/v1/stdlib/Logging/index.html. Also, if you want to observe the value of each variable, you can use the debugger, or use IJulia to code in a more interactive way.

  • Why do you create the DataFrame first and rename the columns afterwards? You should build a Dict with the right names and then create the DataFrame from it.

  • Use enumerate instead of incrementing z_ct by hand.

  • The hcat at the end seems strange to me.

I cannot help you in more detail, because the code does not run for me: it gives an error on the line that assigns values to zero.
IMHO, the source code could be a lot clearer and more concise.
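
To make a few of the points above concrete, here is a minimal sketch of the type annotation, logging, and enumerate suggestions. The helper name onehot_column and the sample data are made up for illustration, and it works on a single vector rather than a DataFrame so that it runs without any packages. The same annotation pattern should carry over to df::DataFrame.

```julia
using Logging

# Hypothetical helper: one-hot encode a single column.
# The ::AbstractVector annotation replaces an @assert on typeof(col);
# calling it with anything else raises a MethodError.
function onehot_column(col::AbstractVector)
    vals = unique(col)

    # @debug messages are hidden by default, so they can stay in the code
    # instead of being commented out like println calls.
    @debug "encoding column" nrows=length(col) nunique=length(vals)

    # One all-false Bool vector per unique value, keyed by the value's name
    encoded = Dict(Symbol(v) => zeros(Bool, length(col)) for v in vals)

    # enumerate yields (index, value) pairs, so no manual counter is needed
    for (i, v) in enumerate(col)
        encoded[Symbol(v)][i] = true
    end
    return encoded
end

embarked = ["S", "C", "S", "Q"]

# Temporarily lower the log level to see the @debug message on stderr
encoded = with_logger(ConsoleLogger(stderr, Logging.Debug)) do
    onehot_column(embarked)
end
```

Without the with_logger wrapper the function returns the same Dict and prints nothing.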

Really appreciate the insights.

Now I’ve got more opportunity to learn new things: updating quotes, skipping comparisons with typeof(), appropriately Typing a new DataFrame, cleaning up variable names, using logs, starting with a Dict instead of a DataFrame and then converting, enumerating instead of counters.

Quick follow-up on logs: I’m using Jupyter notebooks right now as a carryover from Python work. I’ve had a hell of a time getting other IDEs such as VSCode to work. Jupyter doesn’t have a variables or environment “window” like many of these IDEs. Thoughts on a better or more popular Julia coding environment?

Variable names are a given in terms of improvement; like I said this is a first shot at creating a useful function in a language I’ve been working with for two weeks.

Thanks again for the suggestions.

Your loop up to the hcat part (which does not work for me) could be rewritten as:

for col in eachcol(df)
    len = length(col)
    # Save the unique values and build one all-false Bool column per value
    vals = unique(col)
    dict = Dict(Symbol(val) => zeros(Bool, len) for val in vals)
    result = DataFrame(dict)

    # Set the entry to true where the row's value matches the column name
    for (i, v) in enumerate(col)
        result[i, Symbol(v)] = true
    end
    ....

As you can see it is a lot simpler.
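
For completeness, here is one way the whole function might look when built around that loop. Treat it as a sketch under assumptions, not a drop-in replacement: the name onehot_sketch, the "column_value" naming scheme, and the inline filter that stands in for the missing keepstrings helper are all my own choices, and the dropnums and droplast options are left out.

```julia
using DataFrames

# Sketch only. Assumptions not taken from the original post:
#   - the function name `onehot_sketch`
#   - new columns are named "<column>_<value>" so that two source columns
#     sharing a value do not collide in hcat
#   - non-string columns are skipped inline, since `keepstrings` was not posted
function onehot_sketch(df::DataFrame)
    encoded = DataFrame[]
    for (name, col) in pairs(eachcol(df))
        # Columns with missing values have eltype Union{Missing, String}
        # and are skipped here as well.
        eltype(col) <: AbstractString || continue

        vals = unique(col)
        dict = Dict(Symbol(name, "_", val) => zeros(Bool, length(col)) for val in vals)
        result = DataFrame(dict)

        for (i, v) in enumerate(col)
            result[i, Symbol(name, "_", v)] = true
        end
        push!(encoded, result)
    end
    # Join the per-column results side by side in a single hcat call
    return isempty(encoded) ? DataFrame() : hcat(encoded...)
end

df = DataFrame(Sex = ["male", "female", "female"],
               Embarked = ["S", "C", "S"],
               Age = [22, 38, 26])
onehot = onehot_sketch(df)
```

Collecting the pieces in a vector and calling hcat once at the end avoids the special case for the first column that the original ct == 1 branch handled.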

Have you tried Juno? It is very stable and works well. I do not understand what you mean by the “window”: Jupyter is a well-known web environment for interactive programming in Julia/Python/R, while Juno is an IDE.

I believe @Sam_Johnson was talking about how Jupyter notebooks don’t have a variable inspector (without extensions). Juno has the Workspace tab by default, so should solve that problem.
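
As a stopgap in a plain Jupyter notebook, the varinfo function from the InteractiveUtils standard library lists the variables currently defined. It is not a live panel like Juno’s Workspace, just a table you re-run on demand. The variable names below are made up for illustration.

```julia
using InteractiveUtils  # already loaded in the REPL and in IJulia

# Hypothetical variables, standing in for whatever the notebook has defined
survived = [0, 1, 1]
fares = [7.25, 71.28, 7.92]

# varinfo returns a Markdown table of names, sizes, and type summaries
# for the given module (Main by default). As the last expression in a
# notebook cell it is rendered as a table.
info = varinfo()
```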
