Create a GroupedDataFrame by the relations of rows rather than the values of the rows in a column, e.g. `groupby` consecutive dates?

phantom · March 24, 2023, 8:13pm

Thanks! Sorry, is the following a correct understanding of when type declarations improve performance?

If a function takes a column of a DataFrame as an argument then type declarations do not improve performance.
If a function takes an entire DataFrame as an argument then type declarations on each column will improve performance.

The docs on argument type declarations note that they generally do not enhance performance. But you pointed out in a previous post about iterating efficiently over the columns of a DataFrame that:

Is there an equivalent of eachindex() for DataFrames?

The point is that in order to be efficient you must pass a column to a separate function. Then inside this function all will be fast.

The reason is that DataFrame object is not type stable, so for example even:
for col in eachcol(df)
    for v in col
        # ... your code
    end
end
will be slow, because Julia does not know the element type of col at compilation time.

and also that

Is there an equivalent of eachindex() for DataFrames?

If all columns have the same type then what is enough is:
for col in eachcol(df)
    for v in col::Vector{String} # assuming this is the type of column
        # ... your code
    end
end
of course converting to a Matrix or to Tables.columntable also will work in this case.
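
To make the function-barrier idea from the first quote concrete, here is a minimal sketch of how I understand it. The names `sum_lengths` and `total_length` and the example `df` are made up for illustration and do not come from the quoted posts; the only DataFrames.jl calls used are `DataFrame` and `eachcol`.

```julia
using DataFrames

# Kernel behind the barrier: Julia compiles a specialized method for the
# concrete type of `col`, so the inner loop over `v` is type stable.
function sum_lengths(col)
    total = 0
    for v in col
        total += length(v)
    end
    return total
end

# Outer function: the concrete type of `col` is not known at compile time
# here, so each call to `sum_lengths` is dispatched dynamically, but only
# once per column rather than once per element.
function total_length(df)
    total = 0
    for col in eachcol(df)
        total += sum_lengths(col)
    end
    return total
end

# Hypothetical example data: two String columns.
df = DataFrame(a = ["x", "yy"], b = ["zzz", "w"])
total_length(df)
```

Note that `sum_lengths` carries no type annotation on `col`: the specialization comes from the value passed in at the call, not from a declaration.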

My background is so minimal that I wasn’t sure whether the type declaration here improves performance because the loop is entirely outside of a function, or because the entire DataFrame is being passed as an argument into a function.

Assuming the latter, why does a function barrier work for a column of a DataFrame but not for the entire DataFrame? Wouldn’t the type stability issue affect both the DataFrame object and its columns?
