Dummy Encoding (One-hot encoding) from PooledDataArray

Saran_S · June 9, 2017, 2:12am

I would like to know how to convert a PooledDataArray into 0/1 columns, similar to scikit-learn's OneHotEncoder in Python. Below is the data frame I am working with, in which Country and Purchased are PooledDataArrays.

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1   │ "France"  │ 44  │ 72000  │ "No"      │
│ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
│ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
│ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
│ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
│ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
│ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │
│ 8   │ "France"  │ 48  │ 79000  │ "Yes"     │
│ 9   │ "Germany" │ 50  │ 83000  │ "No"      │
│ 10  │ "France"  │ 37  │ 67000  │ "Yes"     │

> pool!(data1csv, [:Country, :Purchased])

Kindly let me know how to go about converting a pooled data array into dummy-encoded columns.

Thank You

jkbest2 · June 9, 2017, 4:34am

I’m not familiar with OneHotEncoder, but a ModelMatrix from the DataFrames package is probably what you’re looking for. There is some documentation here:

https://juliastats.github.io/DataFrames.jl/stable/man/formulas/

Saran_S · June 9, 2017, 5:23am

Thank you for the response. I went through the documentation and was able to create a model matrix with dummy encoding as below:

>mm = ModelMatrix(ModelFrame(@formula(Purchased ~ Age + Salary + Country), data1csv, contrasts = Dict(:Purchased => DummyCoding(), :Country => DummyCoding())))

A ModelFrame object is just a simple wrapper around a DataFrame. For modeling purposes, one generally wants to construct a ModelMatrix, which constructs a Matrix{Float64} that can be used directly to fit a statistical model:

The documentation does say that it can be used to fit a statistical model, but I am not sure how to even access the model matrix so that I can apply a normalization function to it. I searched the web but could not find any further documentation on this. Please point me in the right direction if you are aware of any.

Thank You.

Saran_S · June 9, 2017, 6:45am

I was able to figure out how to get the matrix from the ModelMatrix. I just had to use the m field of the object:
> mm.m

10×5 Array{Float64,2}:
 1.0  44.0  72000.0  0.0  0.0
 1.0  27.0  48000.0  0.0  1.0
 1.0  30.0  54000.0  1.0  0.0
 1.0  38.0  61000.0  0.0  1.0
 1.0  40.0  63777.0  1.0  0.0
 1.0  35.0  58000.0  0.0  0.0
 1.0  38.0  52000.0  0.0  1.0
 1.0  48.0  79000.0  0.0  0.0
 1.0  50.0  83000.0  1.0  0.0
 1.0  37.0  67000.0  0.0  0.0

After I apply a normalization function to Age and Salary, how do I get back to the original frame? Since the Country feature is dummy encoded, how do I associate an observation with its respective country (France, Germany, Spain)?
Is there a way to do it?

mwsohn · June 9, 2017, 7:27am

The simplest way may be to write a function to unpool a PooledDataArray. The following unpool function returns a DataArray, which you can add to your original data frame: df[:unpooled] = unpool(df, :pooled).

function unpool(df::DataFrame, varname::Symbol)
    # Only pooled columns have the .pool and .refs fields used below
    if !isa(df[varname], PooledDataArray)
        error(varname, " is not a PooledDataArray")
    end

    pool = df[varname].pool    # the distinct levels
    refs = df[varname].refs    # per-row index into pool; 0 marks NA
    da = DataArray(eltype(pool), size(df, 1))
    for i = 1:size(df, 1)
        da[i] = refs[i] == 0 ? NA : pool[refs[i]]
    end
    return da
end
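
For anyone unfamiliar with the pooled layout, here is a minimal sketch of the same idea in plain Julia, without DataArrays. The names `pool`, `refs`, and `unpool_refs` are made up for illustration, `missing` stands in for `NA`, and the convention that a reference of 0 marks a missing entry is assumed from the function above.

```julia
# Illustrative stand-ins for the two fields of a pooled array.
pool = ["France", "Germany", "Spain"]   # the distinct levels
refs = [1, 3, 2, 0, 1]                  # index into pool; 0 = missing (assumption)

# Hypothetical helper: expand the references back into plain values.
unpool_refs(pool, refs) = [r == 0 ? missing : pool[r] for r in refs]

unpooled = unpool_refs(pool, refs)
```

Each entry of `unpooled` is either a level taken from `pool` or `missing`, which is roughly what the loop in unpool does row by row.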

nalimilan · June 9, 2017, 9:47am

It doesn’t seem you should use model matrices, as IIUC you want a data frame result rather than a matrix. A loop should be enough:

for c in unique(df[:Country])
    df[Symbol(c)] = df[:Country] .== c
end
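
As a rough, self-contained sketch of the same loop on a plain vector (no DataFrames; the `onehot` dictionary and the `decoded` vector are illustrative names, not part of any package), this also shows one way to map each row back to its country from the indicator columns:

```julia
country = ["France", "Spain", "Germany", "Spain", "Germany"]

# One indicator vector per distinct value, keyed by the value as a Symbol.
# The broadcasted comparison `.==` returns a BitVector.
onehot = Dict{Symbol,BitVector}()
for c in unique(country)
    onehot[Symbol(c)] = country .== c
end

# Recover the label for each row: pick the key whose indicator is true there.
decoded = [String(first(k for (k, v) in onehot if v[i])) for i in eachindex(country)]
```

Because exactly one indicator is true in every row, `decoded` reproduces the original `country` vector, so the association between an observation and its country is not lost.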

Saran_S · June 9, 2017, 3:12pm

@mwsohn thank you. will try it out

Saran_S · June 9, 2017, 3:19pm

@nalimilan thank you for the response. Your solution works perfectly. But if I need to convert true to 1 and false to 0, is there an efficient way to go about it, as opposed to creating a new Float64/Int64 column and deleting the Bool column?
I tried the following. As the new columns are Bool, I am unable to assign 1 for true and 0 for false.

> for c in unique(df[:Country])
     df[Symbol(c)] = df[:Country] .== c
   
     for i in 1:size(df[Symbol(c)], 1)
       if df[i, Symbol(c)]
         df[i, Symbol(c)] = 1.0
       else
         df[i, Symbol(c)] = 0.0
       end
     end
end

If there is any alternative kindly suggest me.

Thank You.

Saran_S · June 9, 2017, 4:30pm

@nalimilan Thank you very much. I believe I was able to work out one way to achieve the desired result:

for c in unique(df[:Country])
    #df[Symbol(c)] = df[:Country] .== c
    df[Symbol(c)] = ones(Float64, size(df,1))
    for i in 1:size(df[:Country],1)
      if c == df[i,:Country]
        df[i,Symbol(c)] = 1.0
      else
        df[i,Symbol(c)] = 0.0
      end
    end
end

Kindly let me know if there is any better way to do it. Thank you.

nalimilan · June 9, 2017, 5:00pm

Just do this:

for c in unique(df[:Country])
    df[Symbol(c)] = UInt.(df[:Country] .== c)
end

or

for c in unique(df[:Country])
    df[Symbol(c)] = ifelse.(df[:Country] .== c, 1, 0)
end

The dotted calls fuse into a single loop, so no temporary vector is created for the intermediate Bool result of the comparison. But you can also keep the column as Bool, as in many operations it will behave as expected: false * 2 == 0.
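
A small sketch of the three options on a plain vector, in case it helps to compare the element types. The variable names are only for illustration.

```julia
country = ["France", "Spain", "Germany", "Spain"]

as_uint = UInt.(country .== "Spain")           # Vector{UInt}
as_int  = ifelse.(country .== "Spain", 1, 0)   # Vector{Int}
as_bool = country .== "Spain"                  # BitVector, elements are Bool

# Bool values take part in arithmetic as 0 and 1.
total  = sum(as_bool)    # number of matching rows
scaled = as_bool .* 2    # integer vector, 2 where the row matches
```

All three vectors compare equal element by element; they differ only in element type, so keeping Bool is usually fine unless a later step requires a specific numeric type.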

Saran_S · June 9, 2017, 5:10pm

@nalimilan Thank You very very much. esp for the below tip.
