Nested Task error: Bad window span - what?

Hi folks,
I am trying to add a column to a df I have already populated. On a smaller batch it works just fine; example code below:

Timeddf = @linq dfData |>
          groupby(:ID) |>
          transform(Trailing12 = running(prod, (:Growth .+1), 12).-1) 

I'm using DataFrames and RollingFunctions. When dfData is about 2,000 rows there are no issues: the code executes and does what is expected. But when I add all of the IDs, dfData grows to about 100k rows and the code above throws an error:

nested task error: 
        Bad window span (12) for length 10.

It seems to originate in the splitapplycombine.jl part of the DataFrames package, but I can't quite figure out why. Is it some sort of memory limit I am hitting with how DataFrames can be used in Julia?

Thanks!

Edit: the issue seems to happen only once the df gets over 8,000 rows or so; that is when the error starts showing up…

Could you post some runnable code so people can reproduce this on their own machines? Code to make some fake data in dfData and to include all the modules you’re using would be ideal.

I'm struggling to reproduce the error with a random-data df…

I'm using the DataFrames, Query, and RollingFunctions packages:
Query for the linq macro that reads the df, DataFrames to store the result as a df, and RollingFunctions for the “running” function.

Unfortunately the code below works just fine, so I'm starting to think this isn't a pure size issue…

bigDf = DataFrame(i = 1:1000000, Growth = rand(1000000), Categories = rand(["a", "b", "c", "d", "f", "g"], 1000000))

TimedDfTest = @linq bigDf |>
         groupby(:Categories) |>
         transform(Trailing12 = running(prod, (:Growth .+1), 12).-1)

Yet when I use the actual data in the same way, it fails whenever the df grows over ~8000 rows and spits out that error message about Bad window span.

I think I know what's going on…

Timeddf = @linq dfData |>
          groupby(:ID) |>
          transform(Trailing12 = running(prod, (:Growth .+1), 12).-1) 

The problem is the 12 in there: some ID categories have only 10 rows or fewer, so running fails as soon as it is asked for a window of 12 on a group that has only 10 rows. That also explains why the error depends on which IDs are included rather than on the total size of the df.
Hmmm… is there a simple way to tell it to ignore those groups? Alternatively, I guess I can filter the original df to make sure they don't occur.
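A minimal sketch of the filtering route, assuming DataFrames.jl 1.x. The toy data, the `window` variable, and the names `counts`, `too_short`, and `long_enough` are made up for illustration and are not from the original data:

```julia
using DataFrames

# Toy data: group "a" has 12 rows, group "b" has only 10.
df = DataFrame(ID = [fill("a", 12); fill("b", 10)], Growth = fill(0.01, 22))
window = 12

# Count rows per ID and list the groups that are too short for the window.
counts = combine(groupby(df, :ID), nrow => :n)
too_short = filter(:n => <(window), counts)

# Keep only the groups with at least `window` rows, returned as a plain DataFrame.
long_enough = filter(sdf -> nrow(sdf) >= window, groupby(df, :ID); ungroup = true)
```

Running the rolling transform on `long_enough` instead of the full df should avoid the Bad window span error, at the cost of dropping the short IDs from the result.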

Try it this way (I don't know whether it is simple or efficient).

The scheme is the following:


using ShiftedArrays, DataFrames,RollingFunctions, Query


bigDf = DataFrame(i = 1:100, Growth = rand(100), Categories = rand(["a", "b", "c", "d", "f", "g"], 100))

TimedDfTest = @linq bigDf |>
         groupby(:Categories) |>
         transform(Trailing = running(prod, (:Growth .+1), 3).-1)


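# Trailing product over `sh` elements: cumulative product divided by the same product lagged by `sh` (1 where the lag runs off the start).
runshp(arr, sh) = cumprod(arr) ./ ShiftedArray(cumprod(arr), sh, default=1)

# Same column as above, without RollingFunctions; short groups do not raise an error.
withOnlySA = transform(groupby(bigDf, :Categories), :Growth => (x -> runshp(x .+ 1, 3) .- 1) => :Trailing)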

Note that:


julia> withOnlySA.Trailing ≈   TimedDfTest.Trailing
true

julia> runshp([1,2,3,4],5)
4-element Vector{Float64}:
  1.0
  2.0
  6.0
 24.0
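Why the cumulative-product trick works can be checked with plain Julia and no packages. This is only a sketch; `direct` and `ratio` are hypothetical helper names, not part of any library:

```julia
# Direct trailing product: multiply the last `w` elements, with a shorter window at the start.
direct(x, w) = [prod(x[max(1, i - w + 1):i]) for i in 1:length(x)]

# Cumulative-product ratio: P[i] / P[i - w], treating P[k] as 1 when k < 1.
function ratio(x, w)
    P = cumprod(x)
    return [i > w ? P[i] / P[i - w] : P[i] / 1 for i in 1:length(x)]
end

x = [1, 2, 3, 4]
ratio(x, 2) == direct(x, 2)   # both give 1, 2, 6, 12
ratio(x, 5)                   # window longer than the data: just the cumulative product
```

One caveat with the ratio form: if any factor is exactly zero (a growth of -1), every later cumulative product is zero too and the division yields NaN, whereas the direct product stays well defined.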

You could just say

Trailing12 = begin 
    # Groups shorter than the window get all missing instead of an error
    if length(:Growth) < 12 
        fill(missing, length(:Growth))
    else 
        running(prod, ...)
    end
end
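A self-contained sketch of the same guard, assuming DataFrames.jl 1.x and replacing `running` with a hand-written loop so that it runs without RollingFunctions. `trailing_growth` is a hypothetical helper, and the toy data is invented:

```julia
using DataFrames

# Hypothetical helper: trailing compounded growth over `w` rows (shorter window
# at the start of a group), or all `missing` when the group is shorter than `w`.
function trailing_growth(g::AbstractVector, w::Integer)
    out = Vector{Union{Missing, Float64}}(missing, length(g))
    length(g) < w && return out
    for i in 1:length(g)
        lo = max(1, i - w + 1)
        out[i] = prod(x -> 1 + x, view(g, lo:i)) - 1
    end
    return out
end

# Toy data: group "a" has 12 rows, group "b" has only 10.
df = DataFrame(ID = [fill("a", 12); fill("b", 10)], Growth = fill(0.01, 22))

result = transform(groupby(df, :ID),
                   :Growth => (g -> trailing_growth(g, 12)) => :Trailing12)
```

Here the rows of the short group come back as missing rather than being dropped, which keeps the output the same length as the input df.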

For bigDf, this version is better:


runshp1(arr, sh) = prod.(getindex.([arr], (:).(ShiftedArray(1:length(arr), sh - 1, default=1), 1:length(arr))))

withOnlySA = transform(groupby(bigDf, :Categories), :Growth => (x -> runshp1(x .+ 1, 3) .- 1) => :Trailing)