Dummy Encoding(One hot encoding) from PooledDataArray

question

#1

I would like to know how to convert Pooled Data array into 0/1 columns similar to Sklearn OneHotEncoder in python. Following is the data frame which i am working with in which Country and Purchased are pooled data array.

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1   │ "France"  │ 44  │ 72000  │ "No"      │
│ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
│ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
│ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
│ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
│ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
│ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │
│ 8   │ "France"  │ 48  │ 79000  │ "Yes"     │
│ 9   │ "Germany" │ 50  │ 83000  │ "No"      │
│ 10  │ "France"  │ 37  │ 67000  │ "Yes"     │

> pool!(data1csv, [:Country, :Purchased])

Kindly let me know how to go about converting pooled data array into dummy encoded columns

Thank You


#2

I’m not familiar with OneHotEncoder, but a ModelMatrix from the DataFrames package is probably what you’re looking for. There is some documentation here:

https://juliastats.github.io/DataFrames.jl/stable/man/formulas/


#3

Thank you for the response. I did go through the documentation and was able to create it with dummy encoding as below

>mm = ModelMatrix(ModelFrame(@formula(Purchased ~ Age + Salary + Country), data1csv, contrasts = Dict(:Purchased => DummyCoding(), :Country => DummyCoding())))

A ModelFrame object is just a simple wrapper around a DataFrame. For modeling purposes, one generally wants to construct a ModelMatrix, which constructs a Matrix{Float64} that can be used directly to fit a statistical model:

Document does say that it can be used to fit statistical model. But i am not sure how to even access the modelmatrix so that i can apply normalization function to it. Tried to search the web for information but unable to find the further document related to this? Please point me in the right direction if you are aware of it?

Thank You.


#4

I was able to figure out how to get the matrix from model matrix. I just had to use m property from the object to get the matrix.
> mm.m

10×5 Array{Float64,2}:
 1.0  44.0  72000.0  0.0  0.0
 1.0  27.0  48000.0  0.0  1.0
 1.0  30.0  54000.0  1.0  0.0
 1.0  38.0  61000.0  0.0  1.0
 1.0  40.0  63777.0  1.0  0.0
 1.0  35.0  58000.0  0.0  0.0
 1.0  38.0  52000.0  0.0  1.0
 1.0  48.0  79000.0  0.0  0.0
 1.0  50.0  83000.0  1.0  0.0
 1.0  37.0  67000.0  0.0  0.0

After i apply normalization function to Age and Salary. How do i get back the original fame. As Country Feature is Dummy Encoded. How to associate a observation to respective country(France, Germany, Spain).
Is there a way to do it?


#5

The simplest way may be to write a function to unpool a PooledDataArray. The following unpool function returns a DataArray, which you can populate to your original dataframe: df[:unpooled] = unpool(df,:pooled).

function unpool(df::DataFrame,varname::Symbol)
    if isa(df[varname],PooledDataArray) == false
        error(varname," is not a PooledDataArray")
    end

    da = DataArray(eltype(df[varname].pool),size(df,1))
    pool = df[varname].pool
    refs = df[varname].refs
    for i = 1:size(df,1)
        da[i] = refs[i] == 0 ? NA : pool[refs[i]]
    end
    return da
end

#6

It doesn’t seem you should used model matrices, as IIUC you want a data frame result rather than a matrix. A loop should be enough:

for c in unique(df[:Country])
    df[Symbol(c)] = df[:Country] .== c
end

#7

@mwsohn thank you. will try it out


#8

@nalimilan thank you for the response. You solution works perfectly. But if i need to convert true to 1 false to 0. Is there a efficient way to go about it as opposed to creating a new Float64/Int64 Column and deleting the Bool Column.
I tried the following. As the new are columns Bool, I am unable to assign True to 1 and False to 0.

> for c in unique(df[:Country])
     df[Symbol(c)] = df[:Country] .== c
   
     for i in 1:size(df[Symbol(c)], 1)
       if df[i, Symbol(c)]
         df[i, Symbol(c)] = 1.0
       else
         df[i, Symbol(c)] = 0.0
       end
     end
end

If there is any alternative kindly suggest me.

Thank You.


#9

@nalimilan Thank You very much. I believe i was able to work out one the way to achieve the desired result

for c in unique(df[:Country])
    #df[Symbol(c)] = df[:Country] .== c
    df[Symbol(c)] = ones(Float64, size(df,1))
    for i in 1:size(df[:Country],1)
      if c == df[i,:Country]
        df[i,Symbol(c)] = 1.0
      else
        df[i,Symbol(c)] = 0.0
      end
    end
end

Kindly let me know if there is any better way to do it. Thank you.


#10

Just do this:

for c in unique(df[:Country])
    df[Symbol(c)] = UInt.(df[:Country] .== c)
end

or

for c in unique(df[:Country])
    df[Symbol(c)] = ifelse.(df[:Country] .== c, 1, 0)
end

The dot vectorized syntax ensures that no temporary vector will be created. But you can also keep the column as Bool as in many operations it will behave as expected: false * 2 == 0.


#11

@nalimilan Thank You very very much. esp for the below tip.