Discretizing/Binning a Continuous Variable in a DataFrame

Hello There,

I am wondering what the best way is to add a new categorical/string column to an existing DataFrame, based on a column of continuous values.

using DataFrames
df = DataFrame(Types = ["SUV", "SUV", "SUV", "SUV", "sedan", "sedan"], models = ["Q3", "Q5", "Kluger", "Land Cruiser", "Corolla", "F40"], acceleration = [11, 8, 8, 19, 5.5, 3.3])

6×3 DataFrame
 Row │ Types   models        acceleration
     │ String  String        Float64
─────┼────────────────────────────────────
   1 │ SUV     Q3                    11.0
   2 │ SUV     Q5                     8.0
   3 │ SUV     Kluger                 8.0
   4 │ SUV     Land Cruiser          19.0
   5 │ sedan   Corolla                5.5
   6 │ sedan   F40                    3.3

Want:
Acceleration less than 6: “Fast”
Acceleration greater than 10: “Slow”
Otherwise: “Normal”

Sorry, I know this is very basic; in Python I would have used .apply with a custom function. I searched Google for 20 minutes but couldn’t find anything.

Thanks!

CategoricalArrays.cut: see “Using CategoricalArrays” in the CategoricalArrays documentation.
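As a rough sketch of how `cut` might be applied to the example above (the break points and labels are my reading of the question, not something stated in this reply):

```julia
using DataFrames
using CategoricalArrays

df = DataFrame(acceleration = [11, 8, 8, 19, 5.5, 3.3])

# Assumed break points: below 6 => "Fast", 6 up to (not including) 10 => "Normal",
# 10 and above => "Slow". One label per interval, i.e. length(breaks) - 1 labels.
df.speed = cut(df.acceleration, [-Inf, 6, 10, Inf];
               labels = ["Fast", "Normal", "Slow"])
```

Note that `cut` builds left-closed intervals, so a value of exactly 10 would land in "Slow" here, whereas the rule in the question (strictly greater than 10) would call it "Normal". None of the sample values sit on a boundary, so the two agree for this data. The result is an ordered categorical column rather than a plain `String` column.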


Hi @Nelson_Chow

I would typically use map:

julia> c = map(x -> x<6 ? "Fast" : (x>10 ? "Slow" : "Normal"), 
               df[:,"acceleration"])
julia> insertcols!(df,ncol(df)+1,:speed=>c)
6×4 DataFrame
 Row │ Types   models        acceleration  speed  
     │ String  String        Float64       String 
─────┼────────────────────────────────────────────
   1 │ SUV     Q3                    11.0  Slow
   2 │ SUV     Q5                     8.0  Normal
   3 │ SUV     Kluger                 8.0  Normal
   4 │ SUV     Land Cruiser          19.0  Slow
   5 │ sedan   Corolla                5.5  Fast
   6 │ sedan   F40                    3.3  Fast
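For anyone coming from the pandas `.apply` pattern mentioned in the question, the same mapping can probably also be written with a named function and `transform!` with `ByRow`. This is only a sketch; `speed_label` is a hypothetical helper name introduced for illustration.

```julia
using DataFrames

df = DataFrame(acceleration = [11, 8, 8, 19, 5.5, 3.3])

# Hypothetical helper: the same thresholds as the anonymous function above.
speed_label(x) = x < 6 ? "Fast" : (x > 10 ? "Slow" : "Normal")

# Apply the function to each row of :acceleration and store the result in :speed.
transform!(df, :acceleration => ByRow(speed_label) => :speed)
```

This gives a plain `String` column, and keeping the rule in a named function makes it easy to reuse or test on its own.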

Also, pre-formatted text is very helpful when sharing code :slight_smile: