How to compute a "cumulative" in a dataframe (without a for loop)

pdeffebach · September 10, 2021, 7:45pm

No, it is definitely not a problem with Julia’s array implementation.

Can you post an MWE? It’s not clear to me when it works and when it doesn’t. Is the issue when you add groups?

SubTer · September 10, 2021, 8:09pm

That’s been my issue: I can’t replicate this on a random df created within Julia.
The only exception is the silly example below, and I more or less get why it fails: I am asking it to pass a window size of 2 to each vector created by grouping on “i”, but since I made i = 1:100, every group has only one row.

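using DataFrames, DataFramesMeta, RollingFunctions

bigDf = DataFrame(i = 1:100, Growth = rand(100), Categories = rand(["a", "b", "c", "d", "f", "g"], 100))

# Fails as expected: every :i group has a single row, fewer than the window of 2
TimedDfTest = @linq bigDf |>
              groupby(:i) |>
              transform(Trailing12 = running(prod, :Growth .+ 1, 2) .- 1)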
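
To see why this toy example trips the window check, one can count the rows in each group. This is a minimal sketch using only DataFrames.jl; the name `sizes` is introduced here for illustration and is not part of the original code.

```julia
using DataFrames

bigDf = DataFrame(i = 1:100, Growth = rand(100),
                  Categories = rand(["a", "b", "c", "d", "f", "g"], 100))

# Count the rows in each group; the result has a column named :nrow.
# Grouping by the unique key :i gives 100 groups of one row each.
sizes = combine(groupby(bigDf, :i), nrow)

# Largest group size: a window span above this cannot fit in any group.
largest = maximum(sizes.nrow)
```

Here `largest` is 1, so a window span of 2 is already too wide for every group.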

In my actual data, the df runs through code very similar to the above and works fine. Then I add more data to the df, simply by including more IDs that the code groups by. NO CODE CHANGE. And then it starts throwing “Bad window span” errors, even though the vectors being passed are certainly not smaller than the window size. That has left me very puzzled.
(I know it’s not the new ID added to the df that causes the issue: I made a df of just that ID plus a bunch of others, so a very small df row-wise, and the identical code cranks through it without “Bad window span” errors.)
Thus I’m starting to wonder if it’s something odd with Julia arrays themselves.

I end up with the error message below, which I haven’t been able to make sense of:

Stacktrace:
 [1] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\splitapplycombine.jl:601

Stacktrace:
 [1] wait
   @ .\task.jl:317 [inlined]
 [2] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\splitapplycombine.jl:597

    nested task error: 
        Bad window span (12) for length 10.

    Stacktrace:
      [1] nrolled
        @ ~\.julia\packages\RollingFunctions\4Jh9c\src\support.jl:20 [inlined]
      [2] running(fun::Function, data::Vector{Float64}, windowspan::Int64)
        @ RollingFunctions ~\.julia\packages\RollingFunctions\4Jh9c\src\run\running.jl:8
      [3] (::var"#13#15")(261::SubArray{Union{Missing, Float64}, 1, Vector{Union{Missing, Float64}}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false})
        @ Main ~\.julia\packages\DataFramesMeta\mHJrB\src\parsing.jl:200
      [4] do_call(f::var"#13#15", idx::Vector{Int64}, starts::Vector{Int64}, ends::Vector{Int64}, gd::GroupedDataFrame{DataFrame}, incols::Tuple{Vector{Union{Missing, Float64}}}, i::Int64)
        @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\callprocessing.jl:94
      [5] _combine_tables_with_first!(first::NamedTuple{(:x1,), Tuple{Vector{Float64}}}, outcols::Tuple{Vector{Float64}}, idx::Vector{Int64}, rowstart::Int64, colstart::Int64, f::Function, gd::GroupedDataFrame{DataFrame}, incols::Tuple{Vector{Union{Missing, Float64}}}, colnames::Tuple{Symbol}, firstmulticol::DataFrames.FirstSingleCol)
        @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\complextransforms.jl:376
      [6] _combine_with_first(::Base.RefValue{Any}, ::Base.RefValue{Any}, gd::GroupedDataFrame{DataFrame}, ::Base.RefValue{Any}, firstmulticol::Bool, idx_agg::Vector{Int64})
        @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\complextransforms.jl:69
      [7] _combine_process_pair_symbol(optional_i::Bool, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Vector{Int64}}, out_col_name::Symbol, firstmulticol::Bool, ::Base.RefValue{Any}, wfun::Base.RefValue{Any}, wincols::Base.RefValue{Any})
        @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\splitapplycombine.jl:357
      [8] _combine_process_pair(::Base.RefValue{Any}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Vector{Int64}})
        @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\splitapplycombine.jl:498
      [9] macro expansion
        @ ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\splitapplycombine.jl:589 [inlined]
     [10] (::DataFrames.var"#614#620"{GroupedDataFrame{DataFrame}, Bool, Bool, DataFrame, Dict{Symbol, Tuple{Bool, Int64}}, Vector{DataFrames.TransformationResult}, Base.RefValue{Vector{Int64}}, Bool, Pair{Int64, Pair{var"#13#15", Symbol}}})()
        @ DataFrames .\threadingconstructs.jl:169

pdeffebach · September 10, 2021, 8:11pm

The error is a lot scarier than it needs to be because DataFrames does some multi-threading during the combine call. This makes for weird errors but is very unlikely to be the source of your issue.

Can you do

@chain df begin 
    groupby(group_var)
    combine(nrow)
    describe
end

and post the results?

SubTer · September 10, 2021, 8:27pm

Hmmm, interesting… it does look like something sneaks in that has a row count of 10… well, at least the arrays aren’t crazy… I am.

df that works:

2×7 DataFrame
 Row │ variable     mean        min     median      max     nmissing  eltype
     │ Symbol       Float64     Signed  Float64     Signed  Int64     Type
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │ xxxxxID       6.29382e5  600339   6.33712e5  634550         0  Union{Missing, Int32}
   2 │ nrow         58.9706         50  60.0            60         0  Int64

df that fails:

2×7 DataFrame
 Row │ variable     mean        min     median    max     nmissing  eltype
     │ Symbol       Float64     Signed  Float64   Signed  Int64     Type
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │ xxxxxID       6.29566e5  600339  633750.0  634628         0  Union{Missing, Int32}
   2 │ nrow         58.539          10      60.0      60         0  Int64

I guess that leads to the next question: can you use @chain with a count to mass-remove any group whose row count is below the window span used in the transform, all in the same linq call?
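
Before removing anything, it can help to see exactly which groups fall short of the window. The following is a rough sketch with plain DataFrames.jl calls; the column name `id`, the toy data, and the names `counts` and `short` are assumptions made for illustration, and `window = 12` is taken from the error message.

```julia
using DataFrames

window = 12  # window span reported in the error message

# Toy data (an assumption): one group with 60 rows, one with only 10.
df = DataFrame(id = vcat(fill(600339, 60), fill(634628, 10)),
               Growth = rand(70))

# One row per group, with the group's row count in column :n.
counts = combine(groupby(df, :id), nrow => :n)

# Groups that cannot hold a single full window.
short = subset(counts, :n => ByRow(<(window)))
```

With this toy data, `short` contains the single group that has 10 rows.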

pdeffebach · September 11, 2021, 1:39am

yeah!

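@chain df begin
    groupby(:g)
    # :t holds the number of rows in each row's group
    @transform :t = length(:g)
    # keep rows from groups that can hold a full window (12 in the error above)
    @subset :t .>= 12
end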
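
The same filtering can also be written without macros. This is a sketch, assuming DataFrames.jl 1.x, of `filter` applied to a `GroupedDataFrame`; the toy data and the names `window` and `kept` are introduced here for illustration.

```julia
using DataFrames

window = 12  # window span reported in the error message

# Toy data (an assumption): groups 1 and 3 are long enough, group 2 has 10 rows.
df = DataFrame(g = vcat(fill(1, 12), fill(2, 10), fill(3, 15)), x = 1:37)

# Keep only groups with at least `window` rows; `ungroup = true` returns
# an ordinary DataFrame instead of a GroupedDataFrame.
kept = filter(sub -> nrow(sub) >= window, groupby(df, :g); ungroup = true)

# Group keys that survived the filter.
remaining = unique(kept.g)
```

Returning an ungrouped DataFrame and calling `groupby` again before the rolling `transform` is the safer order, since `transform` rejects a `GroupedDataFrame` from which groups have been dropped.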
