Equivalent to Pandas "cut" in Julia DataFrames?

Pretty self-explanatory.

I have two arrays, x and y. I would like to bin the values of x into linearly spaced bins and sum the y values whose corresponding x falls into the same bin.

I haven’t found an analogous function in the DataFrames.jl functions glossary.

CategoricalArrays.cut:

https://categoricalarrays.juliadata.org/stable/apiindex/#CategoricalArrays.cut-Tuple{AbstractArray,%20AbstractVector}
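For reference, a minimal sketch of how cut behaves on a plain vector, assuming a reasonably recent CategoricalArrays.jl. The data and breaks here are made up for illustration. Each interval is closed on the left and open on the right, and values outside the breaks raise an error unless extend=true is passed.

```julia
using CategoricalArrays

# Illustrative data, not taken from the question
x = [0.5, 1.5, 2.5, 3.5]

# Two bins, [0, 2) and [2, 4): closed on the left, open on the right
binned = cut(x, 0.0:2.0:4.0)

# levelcode returns the integer index of the bin each value fell into
codes = levelcode.(binned)
```

The result is a CategoricalArray, so it can be used directly as a grouping column.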

On a more general note, the philosophy of DataFrames is quite different from pandas. Because of Julia’s composability, DataFrames only implements functionality that is directly relevant to a DataFrame, as opposed to functions like cut that work on any old vector. Other functionality comes from the relevant packages: CSV reading is not DataFrames.read_csv but CSV.read from the CSV package, computing a rolling mean is RollingFunctions.rollmean, imputing missing values by carrying the last nonmissing value forward is Impute.locf, and so on.
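To make the division of labour concrete, here is a small sketch assuming CSV.jl, DataFrames.jl and CategoricalArrays.jl are installed. The inline text csv_text and the column names bin and y_sum are hypothetical, chosen only for this example.

```julia
using CSV, DataFrames, CategoricalArrays

# Hypothetical inline data standing in for a CSV file on disk
csv_text = """
x,y
1.0,10.0
2.5,20.0
3.5,30.0
"""

# Reading comes from CSV.jl; DataFrame is passed as the sink type
df = CSV.read(IOBuffer(csv_text), DataFrame)

# Binning comes from CategoricalArrays.jl and operates on the plain column vector
df.bin = cut(df.x, [0.0, 2.0, 4.0])

# DataFrames.jl supplies the table operations: grouping and aggregation
sums = combine(groupby(df, :bin), :y => sum => :y_sum)
```

Each package handles one step, and the pieces fit together because they all work on ordinary vectors and tables.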

This takes a bit of getting used to when coming from pandas but tends to have the upside that the specialised packages are much more powerful than some bolted-on functionality that’s not actually core to the DataFrames package.

Thanks a ton for taking the time to reply. I will try to solve my problem with CategoricalArrays.jl and will post the solution once I have it.

Thanks also for the general comment! In my attempts to write software with as few dependencies as possible, I often neglect to look at existing packages, which is why I did not know about CategoricalArrays.jl, for example…

using DataFrames
using CategoricalArrays

x = [1.2, 2.5, 3.1, 4.8, 5.2, 6.3, 7.7, 2.0, 3.8, 4.2, 5.7, 6.1, 7.4, 8.9]
y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 15.0, 25.0, 35.0, 45.0, 55.0, 65.0, 75.0]

df = DataFrame(x=x, y=y)

# Define bin size and bin edges
bin_size = 2.0
min_x, max_x = extrema(df.x)
bins = min_x:bin_size:max_x

# Create a new column with the bin labels; extend=true adds a final bin edge at max_x so that every x is covered
df.bin_labels = cut(df.x, bins, extend=true)

# Group by the bin labels and calculate the sum of y-coordinates
grouped_df = combine(groupby(df, :bin_labels), :y => sum, renamecols=false)

The above yields the expected result:

4×2 DataFrame
 Row │ bin_labels  y
     │ Cat…        Float64
─────┼─────────────────────
   1 │ [1.2, 3.2)     75.0
   2 │ [3.2, 5.2)    100.0
   3 │ [5.2, 7.2)    210.0
   4 │ [7.2, 8.9]    210.0

A follow-up question: how can I now plot the binned x column against the binned y?
I guess one has to choose a representative value for each bin of x, e.g. the middle of the bin, which would be (1.2 + 3.2)/2 = 2.2 for the first point, and so on.

Is there a programmatic way of doing this that I’ve missed in my 30-minute introduction to CategoricalArrays.jl?
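One possible approach, sketched here rather than taken from the CategoricalArrays.jl documentation: since the bin edges are known, compute the midpoints from the edges and look them up via levelcode. The names edges, midpoints and bin_mid are introduced only for this example, and the sketch assumes no bin is left empty by cut.

```julia
using DataFrames, CategoricalArrays

x = [1.2, 2.5, 3.1, 4.8, 5.2, 6.3, 7.7, 2.0, 3.8, 4.2, 5.7, 6.1, 7.4, 8.9]
y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 15.0, 25.0, 35.0, 45.0, 55.0, 65.0, 75.0]
df = DataFrame(x=x, y=y)

bin_size = 2.0
min_x, max_x = extrema(df.x)
bins = min_x:bin_size:max_x

df.bin_labels = cut(df.x, bins, extend=true)
grouped_df = combine(groupby(df, :bin_labels), :y => sum, renamecols=false)

# Edges as used by cut: with extend=true, max_x is appended when it lies beyond the last edge
edges = last(bins) < max_x ? [bins; max_x] : collect(bins)

# Midpoint of each bin: average of its left and right edge (about 2.2, 4.2, 6.2, 8.05 here)
midpoints = (edges[1:end-1] .+ edges[2:end]) ./ 2

# levelcode gives the index of each bin label, which selects the matching midpoint
grouped_df.bin_mid = midpoints[levelcode.(grouped_df.bin_labels)]
```

Indexing by levelcode rather than by row position keeps the midpoints aligned with the labels even if some bins are missing from grouped_df.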

Naturally, one can do the following:

using Plots
scatter(df.x, df.y)
scatter!(bins, grouped_df.y)

This plots the original data together with the binned sums, with each sum placed at the left edge of its bin.
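If the sums should sit at the centre of each bin instead, a minimal sketch could look like the following. It assumes Plots.jl is installed; the edges and sums are copied by hand from the grouped output above, and the axis labels are arbitrary.

```julia
using Plots

# Bin edges and summed y values, copied from the grouped output above
edges = [1.2, 3.2, 5.2, 7.2, 8.9]
sums = [75.0, 100.0, 210.0, 210.0]

# Midpoint of each bin: average of its left and right edge
midpoints = (edges[1:end-1] .+ edges[2:end]) ./ 2

# One marker per bin, placed at the bin centre
scatter(midpoints, sums, label="binned sum of y", xlabel="x", ylabel="y")
```

Note that the last bin is narrower than the others, so its midpoint is not simply the left edge plus bin_size/2.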