How to compute a "cumulative" in a dataframe (without a for loop)

No, it is definitely not a problem with Julia’s array implementation.

Can you post an MWE? It’s not clear to me when it works and when it doesn’t. Is the issue when you add groups?

That’s been my issue: I can’t replicate this on a random df created within Julia.
The exception is the silly example below, and I kind of get why that one fails: I am asking it to pass a window size of 2 to a new vector created by grouping on “i”, but since I made i = 1:100, each group only has one row, so each of those vectors has length 1.

using DataFrames, DataFramesMeta, RollingFunctions

bigDf = DataFrame(i = 1:100, Growth = rand(100), Categories = rand(["a", "b", "c", "d", "f", "g"], 100))

TimedDfTest = @linq bigDf |>
              groupby(:i) |>
              transform(Trailing12 = running(prod, (:Growth .+ 1), 2) .- 1)

In my actual data, the df runs through code very similar to the above and works fine. Then I add more data to the df, simply by including more IDs for the code to group by. NO CODE CHANGE. And then it starts throwing bad window span errors, even though it’s certainly not being passed vectors smaller than the window size. Which has left me very puzzled.
(I know it’s not the new ID added to the df that causes the issue, as I’ve just made a df of that ID and a bunch of others, basically a very small df in terms of rows, and the identical code cranks through it without bad window span errors.)
Thus I’m starting to wonder if it’s something odd with Julia arrays themselves.

I end up with the error message below, which I haven’t been able to make sense of:

Stacktrace:
 [1] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\splitapplycombine.jl:601

Stacktrace:
 [1] wait
   @ .\task.jl:317 [inlined]
 [2] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\splitapplycombine.jl:597

    nested task error: 
        Bad window span (12) for length 10.

    Stacktrace:
      [1] nrolled
        @ ~\.julia\packages\RollingFunctions\4Jh9c\src\support.jl:20 [inlined]
      [2] running(fun::Function, data::Vector{Float64}, windowspan::Int64)
        @ RollingFunctions ~\.julia\packages\RollingFunctions\4Jh9c\src\run\running.jl:8
      [3] (::var"#13#15")(261::SubArray{Union{Missing, Float64}, 1, Vector{Union{Missing, Float64}}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false})
        @ Main ~\.julia\packages\DataFramesMeta\mHJrB\src\parsing.jl:200
      [4] do_call(f::var"#13#15", idx::Vector{Int64}, starts::Vector{Int64}, ends::Vector{Int64}, gd::GroupedDataFrame{DataFrame}, incols::Tuple{Vector{Union{Missing, Float64}}}, i::Int64)
        @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\callprocessing.jl:94
      [5] _combine_tables_with_first!(first::NamedTuple{(:x1,), Tuple{Vector{Float64}}}, outcols::Tuple{Vector{Float64}}, idx::Vector{Int64}, rowstart::Int64, colstart::Int64, f::Function, gd::GroupedDataFrame{DataFrame}, incols::Tuple{Vector{Union{Missing, Float64}}}, colnames::Tuple{Symbol}, firstmulticol::DataFrames.FirstSingleCol)
        @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\complextransforms.jl:376
      [6] _combine_with_first(::Base.RefValue{Any}, ::Base.RefValue{Any}, gd::GroupedDataFrame{DataFrame}, ::Base.RefValue{Any}, firstmulticol::Bool, idx_agg::Vector{Int64})
        @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\complextransforms.jl:69
      [7] _combine_process_pair_symbol(optional_i::Bool, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Vector{Int64}}, out_col_name::Symbol, firstmulticol::Bool, ::Base.RefValue{Any}, wfun::Base.RefValue{Any}, wincols::Base.RefValue{Any})
        @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\splitapplycombine.jl:357
      [8] _combine_process_pair(::Base.RefValue{Any}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Vector{Int64}})
        @ DataFrames ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\splitapplycombine.jl:498
      [9] macro expansion
        @ ~\.julia\packages\DataFrames\vQokV\src\groupeddataframe\splitapplycombine.jl:589 [inlined]
     [10] (::DataFrames.var"#614#620"{GroupedDataFrame{DataFrame}, Bool, Bool, DataFrame, Dict{Symbol, Tuple{Bool, Int64}}, Vector{DataFrames.TransformationResult}, Base.RefValue{Vector{Int64}}, Bool, Pair{Int64, Pair{var"#13#15", Symbol}}})()
        @ DataFrames .\threadingconstructs.jl:169

The error is a lot scarier than it needs to be because DataFrames does some multi-threading during the combine call. This makes for weird errors but is very unlikely to be the source of your issue.

Can you do

@chain df begin 
    groupby(group_var)
    combine(nrow)
    describe
end

and post the results?

Hmmm, interesting… it does look like something sneaks in that has a row count of 10… well, at least the arrays aren’t crazy… I am.

df that works:

2×7 DataFrame
 Row │ variable     mean        min     median      max     nmissing  eltype
     │ Symbol       Float64     Signed  Float64     Signed  Int64     Type
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │ xxxxxID       6.29382e5  600339   6.33712e5  634550         0  Union{Missing, Int32}
   2 │ nrow         58.9706         50  60.0            60         0  Int64

df that fails:

2×7 DataFrame
 Row │ variable     mean        min     median    max     nmissing  eltype
     │ Symbol       Float64     Signed  Float64   Signed  Int64     Type
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │ xxxxxID       6.29566e5  600339  633750.0  634628         0  Union{Missing, Int32}
   2 │ nrow         58.539          10      60.0      60         0  Int64
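
To see which IDs are behind that minimum of 10, something along these lines should list them. This is only a sketch: the toy data and the column name `g` are made up for illustration, and `window = 12` is taken from the error message above.

```julia
using DataFrames

window = 12  # window span reported in the error message

# Toy data, made up for illustration: group 1 has 60 rows, group 2 has 10
df = DataFrame(g = [fill(1, 60); fill(2, 10)], Growth = rand(70))

# One row per group with its row count (the count column is named :nrow)
counts = combine(groupby(df, :g), nrow)

# Keep only the groups that are too short for the window
too_short = subset(counts, :nrow => ByRow(<(window)))
```

Here `too_short` holds one row per offending group, so the `g` column tells you which IDs to look at.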

Guess that leads to the next question: can you use @chain count functions to mass-remove any group whose row count is below the window span used in the transform function, all in the same linq call?

yeah!

@chain df begin 
    groupby(:g)
    @transform :t = length(:g)
    @subset :t .> 10 
end
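
A slightly fuller sketch of the same idea, assuming the window span is 12 as in the error message, so the cutoff has to be `>= 12` rather than `> 10`. The toy data and the names `window`, `g` and `kept` are made up for illustration. Note that `@transform` on a grouped df gives back a plain DataFrame, so you need to group again before the rolling transform.

```julia
using DataFrames, DataFramesMeta

window = 12  # window span from the error message

# Toy data, made up for illustration: group 1 has 60 rows, group 2 has 10
df = DataFrame(g = [fill(1, 60); fill(2, 10)], Growth = rand(70))

kept = @chain df begin
    groupby(:g)
    @transform :t = length(:g)   # number of rows in each group
    @subset :t .>= window        # drop groups shorter than the window
    groupby(:g)                  # regroup before any per-group rolling step
end
```

`kept` is a GroupedDataFrame containing only the groups long enough for the window, so the rolling `@transform` from earlier in the thread can be chained on after the last `groupby`.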