Nested Task error: Bad window span - what?

SubTer · September 9, 2021, 9:57pm

Hi folks,
I am trying to add a column to a DataFrame I have already populated. On a smaller batch it works just fine; example code below:

Timeddf = @linq dfData |>
          groupby(:ID) |>
          transform(Trailing12 = running(prod, (:Growth .+1), 12).-1)

I'm using DataFrames and RollingFunctions. When dfData is about 2000 rows there are no issues: the code executes and does what is expected. But when I add all of the IDs, dfData grows to about 100k rows, and the above code throws an error:

nested task error: 
        Bad window span (12) for length 10.

This seems to originate in the splitapplycombine.jl part of the DataFrames package, but I can't quite figure out why. Is it some sort of memory limit I am hitting with how DataFrames can be used in Julia?

Thanks!

edit: issue seems to happen only when df gets over 8000 rows or so, and then the error starts showing up…

c42f · September 10, 2021, 1:16am

Could you post some runnable code so people can reproduce this on their own machines? Code to make some fake data in dfData and to include all the modules you’re using would be ideal.

SubTer · September 10, 2021, 2:15am

struggling to reproduce the error with random data df…

I'm using the DataFrames, Query, and RollingFunctions packages: Query for the @linq macro there to read the df, DataFrames to actually store the query result as a df, and RollingFunctions for the “running” function.

The code below works just fine, unfortunately, so I'm starting to think this isn't purely a size issue…

bigDf = DataFrame(i = 1:1000000, Growth = rand(1000000), Categories = rand(["a", "b", "c", "d", "f", "g"], 1000000))

TimedDfTest = @linq bigDf |>
         groupby(:Categories) |>
         transform(Trailing12 = running(prod, (:Growth .+1), 12).-1)

Yet when I use the actual data in a similar way, it fails whenever the df grows over ~8000 rows and spits out that error message about Bad window span.

SubTer · September 10, 2021, 2:32am

Think I know what's going on…

Timeddf = @linq dfData |>
          groupby(:ID) |>
          transform(Trailing12 = running(prod, (:Growth .+1), 12).-1)

The window span there is 12, but some ID categories have only up to 10 rows, so it breaks the moment it tries to apply a 12-row window to a group where only 10 rows exist.
Hmmm… is there a simple way to tell it to ignore those? Alternatively, I guess I can filter the original df to ensure those don't occur.
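
The filtering idea could look roughly like the sketch below. It assumes a DataFrames version where filter accepts a GroupedDataFrame; the toy data and the names window, longEnough and dfFiltered are made up for illustration and are not from the original post.

```julia
using DataFrames

# Hypothetical toy data: ID "a" has 15 rows, ID "b" has only 10.
dfData = DataFrame(ID = vcat(fill("a", 15), fill("b", 10)),
                   Growth = fill(0.01, 25))

window = 12

# Keep only the groups that have at least `window` rows.
longEnough = filter(g -> nrow(g) >= window, groupby(dfData, :ID))

# Collapse the remaining groups back into a plain DataFrame.
dfFiltered = DataFrame(longEnough)
```

After this step every remaining ID has at least 12 rows, so a 12-row window should no longer hit a group that is too short.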

rocco_sprmnt21 · September 13, 2021, 7:26pm

Try it this way (I don't know whether it is simple or efficient).

the scheme used here


using ShiftedArrays, DataFrames, RollingFunctions, Query


bigDf = DataFrame(i = 1:100, Growth = rand(100), Categories = rand(["a", "b", "c", "d", "f", "g"], 100))

TimedDfTest = @linq bigDf |>
         groupby(:Categories) |>
         transform(Trailing = running(prod, (:Growth .+1), 3).-1)


# trailing product over the last `sh` elements, as a ratio of cumulative products
runshp(arr, sh) = cumprod(arr) ./ ShiftedArray(cumprod(arr), sh, default=1)

withOnlySA = transform(groupby(bigDf, :Categories), :Growth => (x -> runshp(x .+ 1, 3) .- 1) => :Trailing)

Note that


julia> withOnlySA.Trailing ≈   TimedDfTest.Trailing
true

julia> runshp([1,2,3,4],5)
4-element Vector{Float64}:
  1.0
  2.0
  6.0
 24.0

pdeffebach · September 13, 2021, 7:33pm

You could just say

Trailing12 = begin 
    # groups with fewer rows than the window cannot be handled by running, so fill them with missing
    if length(:Growth) < 12 
        fill(missing, length(:Growth))
    else 
        running(prod, ...)
    end
end
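
Written out with plain DataFrames syntax, a guard of this kind might look like the sketch below. It assumes the running(fn, data, windowspan) method used earlier in the thread is available in the installed RollingFunctions version; the toy data and the helper name trailing_growth are hypothetical. Groups with fewer rows than the window get missing, all others get the trailing compounded growth.

```julia
using DataFrames, RollingFunctions

# Hypothetical toy data: ID "a" has 15 rows, ID "b" has only 10.
dfData = DataFrame(ID = vcat(fill("a", 15), fill("b", 10)),
                   Growth = fill(0.01, 25))

# Returns `missing` for every row of a group shorter than `window`;
# otherwise the trailing compounded growth computed by `running`.
function trailing_growth(growth, window)
    T = Vector{Union{Missing, Float64}}
    if length(growth) < window
        return T(missing, length(growth))
    else
        return convert(T, running(prod, growth .+ 1, window) .- 1)
    end
end

Timeddf = transform(groupby(dfData, :ID),
                    :Growth => (g -> trailing_growth(g, 12)) => :Trailing12)
```

Both branches return the same element type so the resulting column does not depend on how DataFrames combines differently typed results across groups.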

rocco_sprmnt21 · September 13, 2021, 8:25pm

For bigDf, this is better:


runshp1(arr,sh)=prod.(getindex.([arr],(:).(ShiftedArray(1:length(arr),sh-1,default=1),1:length(arr))))

withOnlySA=transform(groupby(bigDf, :Categories),:Growth => (x -> runshp1(x.+1, 3).-1) => :Trailing)
