CSV.read: why do String columns show up as PooledArrays?

I’m opening a simple CSV file with CSV.read and no keyword arguments. The returned DataFrame looks fine, and describe(df) gives me the correct element types, which are all Int64 or String.

When I try to onehot-encode the String attributes the function fails because the String attributes are not Strings inside the function. When I examine the attribute with typeof(df.prev_campaign_outcome) I get:

PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}

How has Julia made this decision? Why does describe() report the eltype as String when the column has internally been converted to a PooledArray? How can I force Julia to leave well enough alone?

Why does it seem Julia is trying too hard?

First of all, it’s not Julia, it’s CSV.jl.

Second, I’m not sure what you mean by describe() returning String while the column has been “internally converted” to a PooledArray.

Indeed, PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}} should have an eltype of String which you should be able to confirm. It ought to work just like any other AbstractVector{String}. Could you be more specific about the problem it’s causing?

As to why CSV.jl constructs columns like this: it is common for tables to have string-valued columns where the number of distinct values is much smaller than the number of rows. In such cases it’s much more efficient to use a PooledArray, which stores each distinct value once in a pool and represents every row by a small integer reference into that pool (a UInt32 in your case).
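To make that concrete, here is a small sketch that builds a PooledArray by hand with PooledArrays.jl. The column values are made up for illustration and are not taken from your file. It shows that the pooled column still has String elements, and that collect gives you a plain Vector{String} if some function really insists on one.

```julia
using PooledArrays

# Hypothetical values standing in for df.prev_campaign_outcome
col = PooledArray(["success", "failure", "success", "unknown", "failure"])

# The element type is still String, so generic vector code keeps working
eltype(col) == String            # true
col isa AbstractVector{String}   # true

# Only the distinct strings are stored; each row is an integer reference
length(col.pool)                 # 3 distinct values
length(col.refs)                 # 5 rows

# A plain Vector{String} copy of the pooled column
plain = collect(col)
```

If you would rather not get pooled columns at all, CSV.jl has a pool keyword. As far as I know, passing pool=false to CSV.read turns pooling off, but check the CSV.jl documentation for the version you have installed.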


Here’s the onehot-encoder function I wrote. It’s now failing by saying the return value “onehot” is not defined.

function onehotenc(df::DataFrame)

    # Loop to develop onehot-encoding
    for col in eachcol(df)
        
        # Can we determine if the column is a String and only encode it if it is?
        if col isa String
            
            # How long is the current column we're going to onehot-encode?
            len = length(col)

            # Save the unique values (or Set) and how many there are for our initial zeros matrix
            vals = unique(col)

            # Set up the Dict key-value pairs to save the Boolean operation results
            # For each val in vals, make that val a key (:key)
            # and fill that column with Boolean falses to begin with
            dict = Dict(Symbol(val) => falses(len) for val in vals)

            # Once the Dict is populated convert it to a DataFrame
            onehot = DataFrame(dict)

            # Change the value from 0 to 1 in the zero DataFrame based on whether the unique value matches the original DF
            for (i, v) in enumerate(col)
                # At the row and col of the current enumerated col value, set the value to true
                # EX: In the onehot DF at row 1, column :A set equal to true
                # EX: In the onehot DF at row 2, column :B set equal to true
                onehot[i, Symbol(v)] = true
            end
        end
    end
    
    return onehot
    
end

You’ll want to change if col isa String to if eltype(col) == String, or even better, if eltype(col) <: AbstractString. Note that eltype returns a type, so it has to be compared with == or <: rather than isa.

The type of col will never be String because it is a vector of Strings, so you need to test the element type of that vector.
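Here is a quick sketch of the difference between those checks, using made-up values. The last two lines are an extra assumption on my part: if your column can contain missing, its eltype is a Union, and nonmissingtype from Base strips the Missing part before the comparison.

```julia
col = ["success", "failure", "success"]

col isa String                   # false: col is a Vector{String}
eltype(col) isa String           # false: eltype(col) is the type String, not a string value
eltype(col) == String            # true
eltype(col) <: AbstractString    # true, and also covers other string types

# A column with missing values has eltype Union{Missing, String}
col_m = ["success", missing, "failure"]
eltype(col_m) <: AbstractString                   # false
nonmissingtype(eltype(col_m)) <: AbstractString   # true
```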


Thanks, David. I blundered that comparison.

Any thoughts on why the function returns “UndefVarError: onehot not defined”?

Take a look at the scoping part of the documentation. A for loop introduces its own local scope, so a variable that is first assigned inside the loop, like onehot here, is not visible once the loop has ended.
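A stripped-down sketch of the same situation, with hypothetical function names that have nothing to do with your data. The first version fails the same way your function does; the second declares the variable before the loop so the loop body assigns to the function-level variable.

```julia
# Broken: sq is first assigned inside the loop, so it is local to the loop body
function last_square_broken(xs)
    for x in xs
        sq = x^2
    end
    return sq   # throws UndefVarError when called
end

# Fixed: sq exists in the function scope before the loop starts
function last_square(xs)
    sq = nothing
    for x in xs
        sq = x^2   # assigns to the outer sq
    end
    return sq
end

last_square([1, 2, 3])   # 9
```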

Even if that worked with our scope rules, it’d still be a bug. You’re generating a new dictionary for each and every column of strings, clobbering the last one you created. It’d probably make sense to just specify which column you want to one-hot encode as a second argument to the function.
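Something along these lines is what I have in mind. It is only a sketch: the colname argument and the example table are made up, it assumes the column has no missing values, and it builds each indicator column with a broadcast comparison instead of the Dict approach in your version.

```julia
using DataFrames

# One-hot encode a single column, chosen by the caller
function onehotenc(df::DataFrame, colname::Symbol)
    col = df[!, colname]
    eltype(col) <: AbstractString ||
        throw(ArgumentError("column $colname does not hold strings"))

    # Created before the loop, so it is still visible at the return
    onehot = DataFrame()
    for val in unique(col)
        # One Boolean column per distinct value
        onehot[!, Symbol(val)] = col .== val
    end
    return onehot
end

# Hypothetical data for illustration
df = DataFrame(outcome = ["success", "failure", "success"], age = [31, 45, 27])
encoded = onehotenc(df, :outcome)
```

Here encoded has one row per row of df and the two Boolean columns success and failure. If you want the result next to the original columns, hcat(df, encoded) should do it.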

You may also be interested in the fact that there are one hot encoders already written:

https://fluxml.ai/Flux.jl/stable/data/onehot/
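For example, a minimal sketch with Flux's onehotbatch, again using made-up values. Note that the layout differs from a DataFrame-style encoding: you get one row per label and one column per observation.

```julia
using Flux

# Hypothetical column values
col = ["success", "failure", "success"]
labels = unique(col)

# Rows correspond to labels, columns to observations
m = Flux.onehotbatch(col, labels)
size(m)   # (2, 3)
```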
