I would like to know how to convert Pooled Data array into 0/1 columns similar to Sklearn OneHotEncoder in python. Following is the data frame which i am working with in which Country and Purchased are pooled data array.
10×4 DataFrames.DataFrame
│ Row │ Country │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1 │ "France" │ 44 │ 72000 │ "No" │
│ 2 │ "Spain" │ 27 │ 48000 │ "Yes" │
│ 3 │ "Germany" │ 30 │ 54000 │ "No" │
│ 4 │ "Spain" │ 38 │ 61000 │ "No" │
│ 5 │ "Germany" │ 40 │ NA │ "Yes" │
│ 6 │ "France" │ 35 │ 58000 │ "Yes" │
│ 7 │ "Spain" │ NA │ 52000 │ "No" │
│ 8 │ "France" │ 48 │ 79000 │ "Yes" │
│ 9 │ "Germany" │ 50 │ 83000 │ "No" │
│ 10 │ "France" │ 37 │ 67000 │ "Yes" │
> pool!(data1csv, [:Country, :Purchased])
Kindly let me know how to go about converting pooled data array into dummy encoded columns
Thank You
I’m not familiar with OneHotEncoder
, but a ModelMatrix
from the DataFrames
package is probably what you’re looking for. There is some documentation here:
https://juliastats.github.io/DataFrames.jl/stable/man/formulas/
1 Like
Thank you for the response. I did go through the documentation and was able to create it with dummy encoding as below
>mm = ModelMatrix(ModelFrame(@formula(Purchased ~ Age + Salary + Country), data1csv, contrasts = Dict(:Purchased => DummyCoding(), :Country => DummyCoding())))
A ModelFrame object is just a simple wrapper around a DataFrame. For modeling purposes, one generally wants to construct a ModelMatrix, which constructs a Matrix{Float64} that can be used directly to fit a statistical model:
Document does say that it can be used to fit statistical model. But i am not sure how to even access the modelmatrix so that i can apply normalization function to it. Tried to search the web for information but unable to find the further document related to this? Please point me in the right direction if you are aware of it?
Thank You.
I was able to figure out how to get the matrix from model matrix. I just had to use m property from the object to get the matrix.
> mm.m
10×5 Array{Float64,2}:
1.0 44.0 72000.0 0.0 0.0
1.0 27.0 48000.0 0.0 1.0
1.0 30.0 54000.0 1.0 0.0
1.0 38.0 61000.0 0.0 1.0
1.0 40.0 63777.0 1.0 0.0
1.0 35.0 58000.0 0.0 0.0
1.0 38.0 52000.0 0.0 1.0
1.0 48.0 79000.0 0.0 0.0
1.0 50.0 83000.0 1.0 0.0
1.0 37.0 67000.0 0.0 0.0
After i apply normalization function to Age and Salary. How do i get back the original fame. As Country Feature is Dummy Encoded. How to associate a observation to respective country(France, Germany, Spain).
Is there a way to do it?
The simplest way may be to write a function to unpool
a PooledDataArray. The following unpool
function returns a DataArray, which you can populate to your original dataframe: df[:unpooled] = unpool(df,:pooled).
function unpool(df::DataFrame,varname::Symbol)
if isa(df[varname],PooledDataArray) == false
error(varname," is not a PooledDataArray")
end
da = DataArray(eltype(df[varname].pool),size(df,1))
pool = df[varname].pool
refs = df[varname].refs
for i = 1:size(df,1)
da[i] = refs[i] == 0 ? NA : pool[refs[i]]
end
return da
end
It doesn’t seem you should used model matrices, as IIUC you want a data frame result rather than a matrix. A loop should be enough:
for c in unique(df[:Country])
df[Symbol(c)] = df[:Country] .== c
end
1 Like
@mwsohn thank you. will try it out
@nalimilan thank you for the response. You solution works perfectly. But if i need to convert true to 1 false to 0. Is there a efficient way to go about it as opposed to creating a new Float64/Int64 Column and deleting the Bool Column.
I tried the following. As the new are columns Bool, I am unable to assign True to 1 and False to 0.
> for c in unique(df[:Country])
df[Symbol(c)] = df[:Country] .== c
for i in 1:size(df[Symbol(c)], 1)
if df[i, Symbol(c)]
df[i, Symbol(c)] = 1.0
else
df[i, Symbol(c)] = 0.0
end
end
end
If there is any alternative kindly suggest me.
Thank You.
@nalimilan Thank You very much. I believe i was able to work out one the way to achieve the desired result
for c in unique(df[:Country])
#df[Symbol(c)] = df[:Country] .== c
df[Symbol(c)] = ones(Float64, size(df,1))
for i in 1:size(df[:Country],1)
if c == df[i,:Country]
df[i,Symbol(c)] = 1.0
else
df[i,Symbol(c)] = 0.0
end
end
end
Kindly let me know if there is any better way to do it. Thank you.
Just do this:
for c in unique(df[:Country])
df[Symbol(c)] = UInt.(df[:Country] .== c)
end
or
for c in unique(df[:Country])
df[Symbol(c)] = ifelse.(df[:Country] .== c, 1, 0)
end
The dot vectorized syntax ensures that no temporary vector will be created. But you can also keep the column as Bool
as in many operations it will behave as expected: false * 2 == 0
.
2 Likes
@nalimilan Thank You very very much. esp for the below tip.