CSV.read: why do String columns show up as PooledArrays?

Sam_Johnson · October 30, 2019, 3:39pm

I’m opening a simple .CSV file with CSV.read with no parameter modifiers. The returned DataFrame is fine and describe(df) gives me the correct element types, which are all Int64 or Strings.

When I try to onehot-encode the String attributes the function fails because the String attributes are not Strings inside the function. When I examine the attribute with typeof(df.prev_campaign_outcome) I get:

PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}

How has Julia made this decision? Why does describe() report the eltype as String when the attribute has internally been converted to a PooledArray? How can I force Julia to leave well enough alone?

Why does it seem Julia is trying too hard?

ExpandingMan · October 30, 2019, 3:56pm

First of all, it’s not Julia, it’s CSV.jl.

Second, I’m not sure what you mean by describe() reporting String while the column has “internally” become a PooledArray.

Indeed, PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}} should have an eltype of String which you should be able to confirm. It ought to work just like any other AbstractVector{String}. Could you be more specific about the problem it’s causing?

As to why CSV.jl constructs columns like this, it is common for tables to have string-valued columns with a number of possible values much smaller than the number of rows. In such cases, it’s much more efficient to use a PooledArray, which stores each distinct value once in a pool and keeps only a small integer reference per row, sort of like a simple dict.
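To make the point about eltype concrete, here is a minimal sketch using PooledArrays.jl directly. The data is made up to stand in for df.prev_campaign_outcome, and pool and refs are internal fields of the array, shown only to illustrate the layout.

```julia
using PooledArrays  # the package that provides the column type shown above

# Hypothetical data standing in for df.prev_campaign_outcome
outcomes = PooledArray(["success", "failure", "success", "unknown", "failure"])

eltype(outcomes)                      # String
outcomes isa AbstractVector{String}   # true

# Internal layout: distinct values are stored once, rows hold integer references
outcomes.pool   # the distinct strings
outcomes.refs   # one unsigned integer per row, pointing into the pool

# If a plain Vector{String} is really needed, copy the data out
plain = collect(outcomes)
```

As for leaving well enough alone: CSV.jl has a pool keyword argument, and passing pool=false to the read call should turn pooling off, though it is worth checking the docs for the CSV.jl version you have installed.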

Sam_Johnson · October 30, 2019, 4:22pm

Here’s the onehot-encoder function I wrote. It’s now failing by saying the return value “onehot” is not defined.

function onehotenc(df::DataFrame)

    # Loop to develop onehot-encoding
    for col in eachcol(df)
        
        # Can we determine if the column is a String and only encode it if it is?
        if col isa String
            
            # How long is the current column we're going to onehot-encode?
            len = length(col)

            # Save the unique values (or Set) and how many there are for our initial zeros matrix
            vals = unique(col)

            # Set up the Dict key-value pairs to save the Boolean operation results
            # For each val in vals, make that val a key (:key)
            # and fill that column with Boolean falses to begin with
            dict = Dict(Symbol(val) => falses(len) for val in vals)

            # Once the Dict is populated convert it to a DataFrame
            onehot = DataFrame(dict)

            # Change the value from 0 to 1 in the zero DataFrame based on whether the unique value matches the original DF
            for (i, v) in enumerate(col)
                # At the row and col of the current enumerated col value, set the value to true
                # EX: In the onehot DF at row 1, column :A set equal to true
                # EX: In the onehot DF at row 2, column :B set equal to true
                onehot[i, Symbol(v)] = true
            end
        end
    end
    
    return onehot
    
end

davidanthoff · October 30, 2019, 4:35pm

You’ll want to change if col isa String to if eltype(col) == String, or even better, if eltype(col) <: AbstractString.

The type of col will never be String because it is a vector of Strings, so you need to test the element type of that vector.
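A small sketch of the difference, using a made-up vector in place of a DataFrame column. Note that eltype returns a type rather than a string value, so it has to be compared with == or <: and not with isa String.

```julia
# Hypothetical column contents; a PooledArray of strings behaves the same way
col = ["success", "failure", "success"]

# The container itself is a Vector{String}, never a String
container_is_string = col isa String            # false

# The element type is what needs testing
elements_are_string = eltype(col) == String     # true
elements_are_abstract = eltype(col) <: AbstractString   # true, also matches other string types

# An integer column is skipped by the same test
ints = [1, 2, 3]
ints_are_strings = eltype(ints) <: AbstractString        # false
```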

Sam_Johnson · October 30, 2019, 4:48pm

Thanks, David. I blundered that comparison.

Any thoughts on why the function returns “UndefVarError: onehot not defined”?

davidanthoff · October 30, 2019, 4:58pm

Take a look at the scoping part of the documentation. You are defining the onehot variable inside a loop, and then it won’t be visible outside of the loop.
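Here is a minimal sketch of that scoping rule, with hypothetical function names that have nothing to do with DataFrames. A variable first assigned inside a for loop is local to the loop body, so it has to be introduced in the enclosing function scope to survive the loop.

```julia
# Broken: `result` is first assigned inside the loop, so it is local to the loop body
function last_square_broken(xs)
    for x in xs
        result = x^2
    end
    return result   # throws UndefVarError when called
end

# Fixed: `result` is introduced in the function scope before the loop
function last_square_fixed(xs)
    result = nothing
    for x in xs
        result = x^2   # assigns to the existing function-level variable
    end
    return result
end

last_square_fixed([1, 2, 3])   # 9
```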

mbauman · October 30, 2019, 5:15pm

Even if that worked with our scope rules, it’d still be a bug. You’re generating a new dictionary for each and every column of strings, clobbering the last one you created. It’d probably make sense to just specify which column you want to one-hot encode as a second argument to the function.
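One way that suggestion could look is sketched below. This is an assumption-laden rewrite rather than a drop-in fix: the col argument and the example data are hypothetical, and the result is created before the loop so it is visible at the return.

```julia
using DataFrames

# Sketch: one-hot encode a single, explicitly named column of strings
function onehotenc(df::DataFrame, col::Symbol)
    column = df[!, col]
    eltype(column) <: AbstractString ||
        throw(ArgumentError("column $col does not hold strings"))

    # Created in function scope, so it is still defined after the loop
    onehot = DataFrame()
    for val in unique(column)
        # One Boolean column per distinct value, true where the row matches
        onehot[!, Symbol(val)] = column .== val
    end
    return onehot
end

# Hypothetical data for illustration
df = DataFrame(prev_campaign_outcome = ["success", "failure", "success"],
               age = [30, 41, 25])
encoded = onehotenc(df, :prev_campaign_outcome)
```

The result has one Boolean column per distinct value, in order of first appearance, and can be joined back onto the original table with hcat if needed.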

You may also be interested in the fact that there are one hot encoders already written:

https://fluxml.ai/Flux.jl/stable/data/onehot/
