Transforming a daily DataFrame with missing values into a DataFrame with end-of-month values

I am trying to transform a daily DataFrame (df) into a monthly DataFrame (monthly_df) with end-of-month values. The following code works:

monthly_df = df[findall(date -> date == lastdayofmonth(date), df.date), :]

However, there are missing values and I would like to get last non-missing value of each month. The following code,

monthly_df = combine( groupby(df, month.(df.date)), :date => last => :date, [Symbol(col) => (x -> last(skipmissing(x))) => Symbol(col) for col in names(df, Number)]... )

errors with,

ERROR: ArgumentError: invalid index: #542 of type var"#542#543"

I am sure I am complicating something very simple… Any help would be much appreciated!

You have two problems

  1. names(df, Number) isn’t going to pick up columns of type Union{Float, Missing}. So you need to have a different strategy. Maybe Not(...) or something along those lines.
  2. The main issue is that skipmissing(x) does not work with last. skipmissing returns a “lazy” iterator. It just loops through non-missing values, but doesn’t check where those non-missing values are. The solution is to use collect
julia> df = @chain begin
           Iterators.product(1:12, 1:30)
           collect
           vec
           DataFrame(month = first.(_), day = last.(_))
           @rtransform begin
               :val1 = rand() < 0.1 ? missing : rand()
               :val2 = rand() < 0.1 ? missing : rand()
           end
       end;

julia> out = @chain df begin
           @groupby :month
           @aside cols = names(df, r"^val")
           combine(_, cols .=> (t -> last(collect(skipmissing(t)))) .=> cols)
       end
12×3 DataFrame
 Row │ month  val1      val2       
     │ Int64  Float64   Float64    
─────┼─────────────────────────────
   1 │     1  0.213515  0.0442133
   2 │     2  0.646935  0.00186216
   3 │     3  0.77759   0.35766
   4 │     4  0.963361  0.455094
   5 │     5  0.292968  0.948964
   6 │     6  0.852407  0.438937
   7 │     7  0.486393  0.415348
   8 │     8  0.545606  0.56915
   9 │     9  0.619751  0.853727
  10 │    10  0.309468  0.316852
  11 │    11  0.385376  0.23983
  12 │    12  0.940781  0.908539

Thank you so much for pointing me in the right direction!
In the original DataFrame (df) all columns are as you wrote Union{Float64, Missing} type, except the first column (date) that is DateTime type. So I did the following,

out = @chain df begin
    @rtransform :year = year(:date)
    @rtransform :month = month(:date)
    groupby([:year, :month])
    @aside cols = names(df)[2:end] 
    combine(_, cols .=> (t -> last(collect(skipmissing(t)))) .=> cols)
end

(I made these changes without much knowledge, as I’ve never used the DataFramesMeta package before). However now I am getting the following error,

ERROR: BoundsError: attempt to access 0-element Vector{Float64} at index [0]
Stacktrace:
[1] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:755
[2] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:87
[3] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:52
[4] _combine_prepare
@ ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:26     [inlined]
[5] combine(gd::GroupedDataFrame{…}, args::Union{…}; keepkeys::Bool, ungroup::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:857
[6] top-level scope
@ ~/Papers/ZPaper/julia/etl.jl:156 
caused by: TaskFailedException
Stacktrace:
[1] wait(t::Task)
@ Base ./task.jl:370
[2] _combine(gd::GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:752
[3] _combine_prepare_norm(gd::GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:87
[4] _combine_prepare(gd::GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool) 

(truncated)

That just means you have months with all missing values:

julia> (last ∘ collect ∘ skipmissing)([missing, missing])
ERROR: BoundsError: attempt to access 0-element Vector{Union{}} at index [0]

so you’ll have to thkink about what to do with those. If your aggregation logic becomes too complicated, consider writing a separate function for clarity and applying that, e.g.

function get_last_nonmissing(x)
    if all(ismissing, x)
        return "All missing"
    else
        return last(collect(skipmissing(x)))
    end
end

and then you can do combine(..., cols .=> get_last_nonmissing .=> cols)