Transforming a daily DataFrame with missing values into a DataFrame with end-of-month values

askvorts · November 20, 2024, 8:38pm

I am trying to transform a daily DataFrame (df) into a monthly DataFrame (monthly_df) with end-of-month values. The following code works:

monthly_df = df[findall(date -> date == lastdayofmonth(date), df.date), :]

However, there are missing values and I would like to get last non-missing value of each month. The following code,

monthly_df = combine( groupby(df, month.(df.date)), :date => last => :date, [Symbol(col) => (x -> last(skipmissing(x))) => Symbol(col) for col in names(df, Number)]... )

errors with,

ERROR: ArgumentError: invalid index: #542 of type var"#542#543"

I am sure I am complicating something very simple… Any help would be much appreciated!

pdeffebach · November 20, 2024, 9:12pm

You have two problems

names(df, Number) isn’t going to pick up columns of type Union{Float, Missing}. So you need to have a different strategy. Maybe Not(...) or something along those lines.
The main issue is that skipmissing(x) does not work with last. skipmissing returns a “lazy” iterator. It just loops through non-missing values, but doesn’t check where those non-missing values are. The solution is to use collect

julia> df = @chain begin
           Iterators.product(1:12, 1:30)
           collect
           vec
           DataFrame(month = first.(_), day = last.(_))
           @rtransform begin
               :val1 = rand() < 0.1 ? missing : rand()
               :val2 = rand() < 0.1 ? missing : rand()
           end
       end;

julia> out = @chain df begin
           @groupby :month
           @aside cols = names(df, r"^val")
           combine(_, cols .=> (t -> last(collect(skipmissing(t)))) .=> cols)
       end
12×3 DataFrame
 Row │ month  val1      val2       
     │ Int64  Float64   Float64    
─────┼─────────────────────────────
   1 │     1  0.213515  0.0442133
   2 │     2  0.646935  0.00186216
   3 │     3  0.77759   0.35766
   4 │     4  0.963361  0.455094
   5 │     5  0.292968  0.948964
   6 │     6  0.852407  0.438937
   7 │     7  0.486393  0.415348
   8 │     8  0.545606  0.56915
   9 │     9  0.619751  0.853727
  10 │    10  0.309468  0.316852
  11 │    11  0.385376  0.23983
  12 │    12  0.940781  0.908539

askvorts · November 21, 2024, 1:12am

Thank you so much for pointing me in the right direction!
In the original DataFrame (df) all columns are as you wrote Union{Float64, Missing} type, except the first column (date) that is DateTime type. So I did the following,

out = @chain df begin
    @rtransform :year = year(:date)
    @rtransform :month = month(:date)
    groupby([:year, :month])
    @aside cols = names(df)[2:end] 
    combine(_, cols .=> (t -> last(collect(skipmissing(t)))) .=> cols)
end

(I made these changes without much knowledge, as I’ve never used the DataFramesMeta package before). However now I am getting the following error,

ERROR: BoundsError: attempt to access 0-element Vector{Float64} at index [0]
Stacktrace:
[1] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:755
[2] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:87
[3] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:52
[4] _combine_prepare
@ ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:26     [inlined]
[5] combine(gd::GroupedDataFrame{…}, args::Union{…}; keepkeys::Bool, ungroup::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:857
[6] top-level scope
@ ~/Papers/ZPaper/julia/etl.jl:156 
caused by: TaskFailedException
Stacktrace:
[1] wait(t::Task)
@ Base ./task.jl:370
[2] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:752
[3] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:87
[4] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)

(truncated)

nilshg · November 21, 2024, 9:26am

That just means you have months with all missing values:

julia> (last ∘ collect ∘ skipmissing)([missing, missing])
ERROR: BoundsError: attempt to access 0-element Vector{Union{}} at index [0]

so you’ll have to thkink about what to do with those. If your aggregation logic becomes too complicated, consider writing a separate function for clarity and applying that, e.g.

function get_last_nonmissing(x)
    if all(ismissing, x)
        return "All missing"
    else
        return last(collect(skipmissing(x)))
    end
end

and then you can do combine(..., cols .=> get_last_nonmissing .=> cols)

pdeffebach · November 21, 2024, 2:50pm

Ugh. These kinds of issues with missing make is very hard to do data cleaning in Julia with messy, real-world data sets.

Fortunately, all the pieces for handling missing values are there, but you have to know about and keep track of a dizzying array of functions to work with missing values effectively.

Missings.jl has the emptymissing function which wraps a function taking in an iterator and returns missing if that iterator is missing.

julia> df = @chain begin
           Iterators.product(1:12, 1:30)
           collect
           vec
           DataFrame(month = first.(_), day = last.(_))
           @rtransform begin
               :val1 = rand() < 0.1 ? missing : rand()
               :val2 = rand() < 0.1 ? missing : rand()
               :val3 = missing
           end
       end;

julia> out = @chain df begin
           @groupby :month
           @aside cols = names(df, r"^val")
           combine(_, cols .=> (t -> emptymissing(last)(collect(skipmissing(t)))) .=> cols)
       end
12×4 DataFrame
 Row │ month  val1       val2       val3    
     │ Int64  Float64    Float64    Missing 
─────┼──────────────────────────────────────
   1 │     1  0.294201   0.0236851  missing 
   2 │     2  0.786612   0.187981   missing 
   3 │     3  0.389113   0.259168   missing 
   4 │     4  0.290554   0.755325   missing 
   5 │     5  0.811688   0.406319   missing 
   6 │     6  0.0819031  0.475326   missing 
   7 │     7  0.826316   0.0263804  missing 
   8 │     8  0.0819761  0.589041   missing 
   9 │     9  0.0136988  0.0561459  missing 
  10 │    10  0.117192   0.0990837  missing 
  11 │    11  0.790855   0.0800563  missing 
  12 │    12  0.244128   0.937939   missing

rocco_sprmnt21 · November 21, 2024, 3:19pm

without any claim to be a valid answer to the OP’s request, but to revive an old post about the effectiveness of the groupby function implementation in the DataFrames package, if d is the vector of dates (not necessarily ordered) one could do the following to have the searched values in the order from January to December:

mmax(n::Int, m::Missing)=n
mmax(n::Missing, m::Int)=m
mmax(n,m) = max(n,m)
maxs=Vector{Union{Missing,Int}}(undef,12)
#maxs=fill(0,12)
for i in eachindex(d)
    maxs[month(d[i])]=mmax(maxs[month(d[i])],day(d[i]))
end
maxs

askvorts · November 21, 2024, 6:00pm

Thank you very much. Both functions:

(t -> emptymissing(last)(collect(skipmissing(t))))

and

function get_last_nonmissing(x)
    if all(ismissing, x)
        return missing
    else
        return last(collect(skipmissing(x)))
    end
end

work like a charm!

Indeed, dealing with missing in Julia is not too pleasant. Although I understand the concept behind the behaviour of missing, in my opinion, it is too strict and becomes unpractical in some occasions.

rocco_sprmnt21 · November 22, 2024, 9:23am

I think everyone agrees that managing “missing” is always a delicate matter and to be seen on a case by case basis.

For example, if there is no month in my data that is missing all its days, the following simple scheme does the trick (*).

combine(groupby(dfymd, [:y,:m]), :d => last)

Providing an extra row and column that informs me that there are missing days

dt=unique(rand(Date(2021, 1, 1):Day(1):Date(2024, 12, 31),900))

dtf=[d in dt ? d : missing for d in    Date(2021, 1, 1):Day(1):Date(2024, 12, 31)]
dff=DataFrame(;dtf)
tr(d)=[year(d), month(d),day(d)]
tr(d::Missing)=[missing,missing,missing]
dfymd=transform(dff, :dtf => (ByRow(tr)=>[:y,:m,:d]))

combine(groupby(dfymd, [:y,:m]), :d => last)

I put it in the following form so that it is better readable

julia> unstack(combine(groupby(dfymd, [:y,:m]), :d => last => :d),:y,:d,allowmissing=true)
13×6 DataFrame
 Row │ m        2021     2022     2023     2024     missing 
     │ Int64?   Int64?   Int64?   Int64?   Int64?   Int64?
─────┼──────────────────────────────────────────────────────
   1 │       1       29       26       29       31  missing
   2 │       2       25       28       24       28  missing
   3 │       3       31       29       31       31  missing
   4 │       4       27       30       30       29  missing
   5 │       5       30       31       27       30  missing
   6 │       6       28       30       30       30  missing
   7 │       7       27       21       31       29  missing
   8 │       8       30       29       31       30  missing
   9 │       9       26       28       30       30  missing
  10 │      10       30       30       26       31  missing
  11 │      11       30       28       30       30  missing
  12 │      12       30       28       30       30  missing
  13 │ missing  missing  missing  missing  missing  missing

If there is a month with all the days missing, things change, but not much it seems.

dtfam=copy(dtf)
dtfam[1:31].=missing

dffam=DataFrame(;dtfam)

dfymd=transform(dffam, :dtfam => (ByRow(tr)=>[:y,:m,:d]))

unstack(combine(groupby(dfymd, [:y,:m]), :d => last => :d),:y,:d,allowmissing=true)


julia> unstack(combine(groupby(dfymd, [:y,:m]), :d => last => :d),:y,:d,allowmissing=true)
13×6 DataFrame
 Row │ m        2021     2022     2023     2024     missing 
     │ Int64?   Int64?   Int64?   Int64?   Int64?   Int64?
─────┼──────────────────────────────────────────────────────
   1 │       2       28       27       27       26  missing
   2 │       3       31       31       31       31  missing
   3 │       4       30       25       27       29  missing
   4 │       5       31       28       30       30  missing
   5 │       6       28       27       28       30  missing
   6 │       7       29       24       31       22  missing
   7 │       8       31       31       31       30  missing
   8 │       9       30       29       26       29  missing
   9 │      10       30       30       29       30  missing
  10 │      11       29       27       29       29  missing
  11 │      12       31       22       31       31  missing
  12 │       1  missing       23       31       31  missing
  13 │ missing  missing  missing  missing  missing  missing

So I wonder what your real situation is.

PS
I also wonder why in this case the month of January 2021 (with all missing) was put at the end of the group

(*) The reason is that the groupby function implicitly filters out missing data, assuming that if there is a missing day, the month and year of that day are also missing.

pdeffebach · November 22, 2024, 1:48pm

That’s not what OP asked for, though.

OP wanted the last non-missing value, hence the use of skipmissing.

So this digression doesn’t help the OP solve their original problem.

rocco_sprmnt21 · November 22, 2024, 2:08pm

My digression basically asks the OP to specify what exactly the problem is: he had no ambition to answer but was asking questions.
However, if I interpret correctly what the OP did with the function
get_last_notmissing(),

perhaps the expressions I used are close enough to the OP’s expectations. At least according to isequal()

Summary

julia> using Dates, DataFrames

julia> function get_last_notmissing(x)
           if all(ismissing, x)
               return missing
           else
               return last(collect(skipmissing(x)))
           end
       end
get_last_notmissing (generic function with 1 method)

julia> dt=unique(rand(Date(2021, 1, 1):Day(1):Date(2024, 12, 31),900))     
678-element Vector{Date}:
 2022-03-14
 2021-08-05
 2024-11-19
 ⋮
 2022-12-19
 2023-07-05
 2022-03-03

julia> dtf=[d in dt ? d : missing for d in    Date(2021, 1, 1):Day(1):Date(2024, 12, 31)];

julia> dff=DataFrame(;dtf);

julia> tr(d)=[year(d), month(d),day(d)]
tr (generic function with 1 method)

julia> tr(d::Missing)=[missing,missing,missing]
tr (generic function with 2 methods)

julia> dfymd=transform(dff, :dtf => (ByRow(tr)=>[:y,:m,:d]))
1461×4 DataFrame
  Row │ dtf         y        m        d       
      │ Date?       Int64?   Int64?   Int64?
──────┼───────────────────────────────────────
    1 │ 2021-01-01     2021        1        1
    2 │ missing     missing  missing  missing
    3 │ missing     missing  missing  missing
    4 │ missing     missing  missing  missing
    5 │ 2021-01-05     2021        1        5
    6 │ missing     missing  missing  missing
    7 │ missing     missing  missing  missing
    8 │ missing     missing  missing  missing
   
  ⋮   │     ⋮          ⋮        ⋮        ⋮
 
 1455 │ missing     missing  missing  missing
 1456 │ missing     missing  missing  missing
 1457 │ missing     missing  missing  missing
 1458 │ missing     missing  missing  missing
 1459 │ 2024-12-29     2024       12       29
 1460 │ missing     missing  missing  missing
 1461 │ 2024-12-31     2024       12       31
                             1429 rows omitted

julia> a=combine(groupby(dfymd, [:y,:m]), :d => last => :d) ;

julia> b=combine(groupby(dfymd, [:y,:m]), :d => get_last_notmissing => :d);

julia> isequal(a,b)
true

julia> a
49×3 DataFrame
 Row │ y        m        d       
     │ Int64?   Int64?   Int64?
─────┼───────────────────────────
   1 │    2021        1       31
   2 │    2021        2       25
   
  ⋮  │    ⋮        ⋮        ⋮
  
  47 │    2024       11       28
  48 │    2024       12       31
  49 │ missing  missing  missing
                  17 rows omitted

pdeffebach · November 22, 2024, 4:03pm

There is no indication in OP’s problem that they are missing days, just that they are missing values. And even if they were missing days, there’s no indication they were missing months, too.

rocco_sprmnt21 · November 22, 2024, 4:15pm

Probably you are rigth

in that case one can do


lnm(v)=reduce((x,y)->coalesce(y,x), v)


df=DataFrame(dt=Date(2021,1,1):Date(2024,12,31))

df.v=[rand()>0.9 ? rand() : missing for _ in 1:length(df.dt)]

dfymd=transform(df, :dt => (ByRow(d->[year(d), month(d),day(d)])=>[:y,:m,:d]))


julia> unstack(combine(groupby(dfymd, [:y,:m]), :v => lnm=>:lv ),:y,:lv,allowmissing=true)
12×5 DataFrame
 Row │ m      2021        2022             2023              2024          
     │ Int64  Float64?    Float64?         Float64?          Float64?      
─────┼─────────────────────────────────────────────────────────────────    
   1 │     1  0.0450962         0.744343         0.397001    0.586697      
   2 │     2  0.407692          0.0420504        0.578325    0.707726      
   3 │     3  0.0732233   missing                0.302875    0.0196249     
   4 │     4  0.603087          0.780901         0.00713252  0.823499      
   5 │     5  0.0223508         0.66128          0.935348    0.539294      
   6 │     6  0.992866          0.183029         0.0583805   0.625163      
   7 │     7  0.343401          0.584346         0.143872    0.0593296     
   8 │     8  0.00121432        0.127658   missing           0.376164      
   9 │     9  0.643115          0.424879         0.912506    0.850037      
  10 │    10  0.366782          0.382763         0.749203    0.428814      
  11 │    11  0.380521          0.769725         0.778575    0.969888      
  12 │    12  0.380221          0.873964         0.919393    0.770854

Topic		Replies	Views
Delete missing values after the last non missing value in each id New to Julia dataframes	7	558	September 1, 2022
How can I skip missing values of a DF without deleating them? New to Julia dataframes	4	698	November 11, 2021
Summarizing grouped DataFrame where a group is entirely missing Data dataframes , missing-values , splitapplycombine	22	1263	December 18, 2022
Dropmissing(!) is undefined for GroupedDataFrames Data dataframes	15	593	April 6, 2022
Iterate over all numeric columns in DataFrames Data	21	4854	February 11, 2018

Transforming a daily DataFrame with missing values into a DataFrame with end-of-month values

Related topics