CategoricalArrays Binning

The cut function from CategoricalArrays returns a CategoricalArray whose elements are CategoricalValue objects with labels like [0, 1). However, I don’t understand how to access the numeric bounds of the bin a value was cut into. For example, to get the midpoint of a bin I would take the mean of its two bounds, but since the labels are essentially strings, I can’t do that. Is there a way to do this that I don’t know about?

You cannot by default, unless you encode the bounds yourself using a custom formatter (the labels argument of cut). Maybe for your use case you want to use a histogram instead?

I don’t know how Histogram would work, because I need to create the bins, and then get the mean and standard deviation of another variable in each of the bins.

If you do not know how to use a histogram, the simplest thing is to do the following (shown by example):

julia> using DataFrames, CategoricalArrays, Statistics

julia> df = DataFrame(rand(10, 2), [:ref, :other])
10×2 DataFrame
 Row │ ref       other
     │ Float64   Float64
─────┼─────────────────────
   1 │ 0.153328  0.464341
   2 │ 0.794552  0.687084
   3 │ 0.860548  0.66624
   4 │ 0.252829  0.199308
   5 │ 0.709457  0.981467
   6 │ 0.2814    0.355272
   7 │ 0.819591  0.413013
   8 │ 0.575109  0.169053
   9 │ 0.551803  0.0971433
  10 │ 0.540336  0.64679

julia> df.bin = cut(df.ref, 3);

julia> df
10×3 DataFrame
 Row │ ref       other      bin
     │ Float64   Float64    Cat…
─────┼────────────────────────────────────────────────────────
   1 │ 0.153328  0.464341   Q1: [0.15332816708897712, 0.5403…
   2 │ 0.794552  0.687084   Q3: [0.7094574019643611, 0.86054…
   3 │ 0.860548  0.66624    Q3: [0.7094574019643611, 0.86054…
   4 │ 0.252829  0.199308   Q1: [0.15332816708897712, 0.5403…
   5 │ 0.709457  0.981467   Q3: [0.7094574019643611, 0.86054…
   6 │ 0.2814    0.355272   Q1: [0.15332816708897712, 0.5403…
   7 │ 0.819591  0.413013   Q3: [0.7094574019643611, 0.86054…
   8 │ 0.575109  0.169053   Q2: [0.5403359070225858, 0.70945…
   9 │ 0.551803  0.0971433  Q2: [0.5403359070225858, 0.70945…
  10 │ 0.540336  0.64679    Q2: [0.5403359070225858, 0.70945…

julia> combine(groupby(df, :bin), :ref .=> [minimum, maximum, mean], :other .=> [mean, std])
3×6 DataFrame
 Row │ bin                                ref_minimum  ref_maximum  ref_mean  other_mean  other_std
     │ Cat…                               Float64      Float64      Float64   Float64     Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Q1: [0.15332816708897712, 0.5403…     0.153328     0.2814    0.229186    0.339641   0.133206
   2 │ Q2: [0.5403359070225858, 0.70945…     0.540336     0.575109  0.555749    0.304329   0.298752
   3 │ Q3: [0.7094574019643611, 0.86054…     0.709457     0.860548  0.796037    0.686951   0.23253
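
If you need the bin midpoints as numbers, one possible workaround is to choose the break points yourself and look the midpoint up by level code, so the labels never have to be parsed. This is only a sketch under assumptions: it uses explicit breaks instead of the quantile-based cut(df.ref, 3) above, and the names x, other, edges, mids, bin_mid and bin_means are made up for illustration.

```julia
using CategoricalArrays, Statistics

# Illustrative data (hypothetical values, not taken from the example above)
x     = [0.05, 0.30, 0.45, 0.60, 0.95]
other = [1.0, 2.0, 4.0, 3.0, 5.0]

# Explicit break points: bins are [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1)
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
bins  = cut(x, edges)

# Midpoint of every bin, computed from the numeric edges rather than the labels
mids = (edges[1:end-1] .+ edges[2:end]) ./ 2

# levelcode gives the integer position of each value's level, i.e. its bin index
codes   = levelcode.(bins)
bin_mid = mids[codes]

# Mean of another variable within each non-empty bin, in bin order
bin_means = [mean(other[codes .== k]) for k in sort(unique(codes))]

println(bin_mid)
println(bin_means)
```

This relies on cut creating one level per interval in the order of the breaks, so that level code k refers to the interval between edges[k] and edges[k+1]. With a quantile-based cut you would first need the break points themselves (for example from quantile) to apply the same idea.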