Transforming a daily DataFrame with missing values into a DataFrame with end-of-month values

I am trying to transform a daily DataFrame (df) into a monthly DataFrame (monthly_df) with end-of-month values. The following code works:

monthly_df = df[findall(date -> date == lastdayofmonth(date), df.date), :]

However, there are missing values and I would like to get last non-missing value of each month. The following code,

monthly_df = combine( groupby(df, month.(df.date)), :date => last => :date, [Symbol(col) => (x -> last(skipmissing(x))) => Symbol(col) for col in names(df, Number)]... )

errors with,

ERROR: ArgumentError: invalid index: #542 of type var"#542#543"

I am sure I am complicating something very simpleโ€ฆ Any help would be much appreciated!

You have two problems

  1. names(df, Number) isnโ€™t going to pick up columns of type Union{Float, Missing}. So you need to have a different strategy. Maybe Not(...) or something along those lines.
  2. The main issue is that skipmissing(x) does not work with last. skipmissing returns a โ€œlazyโ€ iterator. It just loops through non-missing values, but doesnโ€™t check where those non-missing values are. The solution is to use collect
julia> df = @chain begin
           Iterators.product(1:12, 1:30)
           collect
           vec
           DataFrame(month = first.(_), day = last.(_))
           @rtransform begin
               :val1 = rand() < 0.1 ? missing : rand()
               :val2 = rand() < 0.1 ? missing : rand()
           end
       end;

julia> out = @chain df begin
           @groupby :month
           @aside cols = names(df, r"^val")
           combine(_, cols .=> (t -> last(collect(skipmissing(t)))) .=> cols)
       end
12ร—3 DataFrame
 Row โ”‚ month  val1      val2       
     โ”‚ Int64  Float64   Float64    
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚     1  0.213515  0.0442133
   2 โ”‚     2  0.646935  0.00186216
   3 โ”‚     3  0.77759   0.35766
   4 โ”‚     4  0.963361  0.455094
   5 โ”‚     5  0.292968  0.948964
   6 โ”‚     6  0.852407  0.438937
   7 โ”‚     7  0.486393  0.415348
   8 โ”‚     8  0.545606  0.56915
   9 โ”‚     9  0.619751  0.853727
  10 โ”‚    10  0.309468  0.316852
  11 โ”‚    11  0.385376  0.23983
  12 โ”‚    12  0.940781  0.908539
1 Like

Thank you so much for pointing me in the right direction!
In the original DataFrame (df) all columns are as you wrote Union{Float64, Missing} type, except the first column (date) that is DateTime type. So I did the following,

out = @chain df begin
    @rtransform :year = year(:date)
    @rtransform :month = month(:date)
    groupby([:year, :month])
    @aside cols = names(df)[2:end] 
    combine(_, cols .=> (t -> last(collect(skipmissing(t)))) .=> cols)
end

(I made these changes without much knowledge, as Iโ€™ve never used the DataFramesMeta package before). However now I am getting the following error,

ERROR: BoundsError: attempt to access 0-element Vector{Float64} at index [0]
Stacktrace:
[1] _combine(gd::GroupedDataFrame{โ€ฆ}, cs_norm::Vector{โ€ฆ}, optional_transform::Vector{โ€ฆ}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:755
[2] _combine_prepare_norm(gd::GroupedDataFrame{โ€ฆ}, cs_vec::Vector{โ€ฆ}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:87
[3] _combine_prepare(gd::GroupedDataFrame{โ€ฆ}, ::Base.RefValue{โ€ฆ}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:52
[4] _combine_prepare
@ ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:26     [inlined]
[5] combine(gd::GroupedDataFrame{โ€ฆ}, args::Union{โ€ฆ}; keepkeys::Bool, ungroup::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:857
[6] top-level scope
@ ~/Papers/ZPaper/julia/etl.jl:156 
caused by: TaskFailedException
Stacktrace:
[1] wait(t::Task)
@ Base ./task.jl:370
[2] _combine(gd::GroupedDataFrame{โ€ฆ}, cs_norm::Vector{โ€ฆ}, optional_transform::Vector{โ€ฆ}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:752
[3] _combine_prepare_norm(gd::GroupedDataFrame{โ€ฆ}, cs_vec::Vector{โ€ฆ}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/groupeddataframe/splitapplycombine.jl:87
[4] _combine_prepare(gd::GroupedDataFrame{โ€ฆ}, ::Base.RefValue{โ€ฆ}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool) 

(truncated)

That just means you have months with all missing values:

julia> (last โˆ˜ collect โˆ˜ skipmissing)([missing, missing])
ERROR: BoundsError: attempt to access 0-element Vector{Union{}} at index [0]

so youโ€™ll have to thkink about what to do with those. If your aggregation logic becomes too complicated, consider writing a separate function for clarity and applying that, e.g.

function get_last_nonmissing(x)
    if all(ismissing, x)
        return "All missing"
    else
        return last(collect(skipmissing(x)))
    end
end

and then you can do combine(..., cols .=> get_last_nonmissing .=> cols)

1 Like

Ugh. These kinds of issues with missing make is very hard to do data cleaning in Julia with messy, real-world data sets.

Fortunately, all the pieces for handling missing values are there, but you have to know about and keep track of a dizzying array of functions to work with missing values effectively.

Missings.jl has the emptymissing function which wraps a function taking in an iterator and returns missing if that iterator is missing.

julia> df = @chain begin
           Iterators.product(1:12, 1:30)
           collect
           vec
           DataFrame(month = first.(_), day = last.(_))
           @rtransform begin
               :val1 = rand() < 0.1 ? missing : rand()
               :val2 = rand() < 0.1 ? missing : rand()
               :val3 = missing
           end
       end;

julia> out = @chain df begin
           @groupby :month
           @aside cols = names(df, r"^val")
           combine(_, cols .=> (t -> emptymissing(last)(collect(skipmissing(t)))) .=> cols)
       end
12ร—4 DataFrame
 Row โ”‚ month  val1       val2       val3    
     โ”‚ Int64  Float64    Float64    Missing 
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚     1  0.294201   0.0236851  missing 
   2 โ”‚     2  0.786612   0.187981   missing 
   3 โ”‚     3  0.389113   0.259168   missing 
   4 โ”‚     4  0.290554   0.755325   missing 
   5 โ”‚     5  0.811688   0.406319   missing 
   6 โ”‚     6  0.0819031  0.475326   missing 
   7 โ”‚     7  0.826316   0.0263804  missing 
   8 โ”‚     8  0.0819761  0.589041   missing 
   9 โ”‚     9  0.0136988  0.0561459  missing 
  10 โ”‚    10  0.117192   0.0990837  missing 
  11 โ”‚    11  0.790855   0.0800563  missing 
  12 โ”‚    12  0.244128   0.937939   missing 
1 Like

without any claim to be a valid answer to the OPโ€™s request, but to revive an old post about the effectiveness of the groupby function implementation in the DataFrames package, if d is the vector of dates (not necessarily ordered) one could do the following to have the searched values โ€‹โ€‹in the order from January to December:

mmax(n::Int, m::Missing)=n
mmax(n::Missing, m::Int)=m
mmax(n,m) = max(n,m)
maxs=Vector{Union{Missing,Int}}(undef,12)
#maxs=fill(0,12)
for i in eachindex(d)
    maxs[month(d[i])]=mmax(maxs[month(d[i])],day(d[i]))
end
maxs

Thank you very much. Both functions:

(t -> emptymissing(last)(collect(skipmissing(t))))

and

function get_last_nonmissing(x)
    if all(ismissing, x)
        return missing
    else
        return last(collect(skipmissing(x)))
    end
end

work like a charm!

Indeed, dealing with missing in Julia is not too pleasant. Although I understand the concept behind the behaviour of missing, in my opinion, it is too strict and becomes unpractical in some occasions.

2 Likes

I think everyone agrees that managing โ€œmissingโ€ is always a delicate matter and to be seen on a case by case basis.

For example, if there is no month in my data that is missing all its days, the following simple scheme does the trick (*).

combine(groupby(dfymd, [:y,:m]), :d => last)

Providing an extra row and column that informs me that there are missing days

dt=unique(rand(Date(2021, 1, 1):Day(1):Date(2024, 12, 31),900))

dtf=[d in dt ? d : missing for d in    Date(2021, 1, 1):Day(1):Date(2024, 12, 31)]
dff=DataFrame(;dtf)
tr(d)=[year(d), month(d),day(d)]
tr(d::Missing)=[missing,missing,missing]
dfymd=transform(dff, :dtf => (ByRow(tr)=>[:y,:m,:d]))

combine(groupby(dfymd, [:y,:m]), :d => last)

I put it in the following form so that it is better readable

julia> unstack(combine(groupby(dfymd, [:y,:m]), :d => last => :d),:y,:d,allowmissing=true)
13ร—6 DataFrame
 Row โ”‚ m        2021     2022     2023     2024     missing 
     โ”‚ Int64?   Int64?   Int64?   Int64?   Int64?   Int64?
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚       1       29       26       29       31  missing
   2 โ”‚       2       25       28       24       28  missing
   3 โ”‚       3       31       29       31       31  missing
   4 โ”‚       4       27       30       30       29  missing
   5 โ”‚       5       30       31       27       30  missing
   6 โ”‚       6       28       30       30       30  missing
   7 โ”‚       7       27       21       31       29  missing
   8 โ”‚       8       30       29       31       30  missing
   9 โ”‚       9       26       28       30       30  missing
  10 โ”‚      10       30       30       26       31  missing
  11 โ”‚      11       30       28       30       30  missing
  12 โ”‚      12       30       28       30       30  missing
  13 โ”‚ missing  missing  missing  missing  missing  missing

If there is a month with all the days missing, things change, but not much it seems.

dtfam=copy(dtf)
dtfam[1:31].=missing

dffam=DataFrame(;dtfam)

dfymd=transform(dffam, :dtfam => (ByRow(tr)=>[:y,:m,:d]))

unstack(combine(groupby(dfymd, [:y,:m]), :d => last => :d),:y,:d,allowmissing=true)

julia> unstack(combine(groupby(dfymd, [:y,:m]), :d => last => :d),:y,:d,allowmissing=true)
13ร—6 DataFrame
 Row โ”‚ m        2021     2022     2023     2024     missing 
     โ”‚ Int64?   Int64?   Int64?   Int64?   Int64?   Int64?
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚       2       28       27       27       26  missing
   2 โ”‚       3       31       31       31       31  missing
   3 โ”‚       4       30       25       27       29  missing
   4 โ”‚       5       31       28       30       30  missing
   5 โ”‚       6       28       27       28       30  missing
   6 โ”‚       7       29       24       31       22  missing
   7 โ”‚       8       31       31       31       30  missing
   8 โ”‚       9       30       29       26       29  missing
   9 โ”‚      10       30       30       29       30  missing
  10 โ”‚      11       29       27       29       29  missing
  11 โ”‚      12       31       22       31       31  missing
  12 โ”‚       1  missing       23       31       31  missing
  13 โ”‚ missing  missing  missing  missing  missing  missing

So I wonder what your real situation is.

PS
I also wonder why in this case the month of January 2021 (with all missing) was put at the end of the group

(*) The reason is that the groupby function implicitly filters out missing data, assuming that if there is a missing day, the month and year of that day are also missing.

Thatโ€™s not what OP asked for, though.

OP wanted the last non-missing value, hence the use of skipmissing.

So this digression doesnโ€™t help the OP solve their original problem.

My digression basically asks the OP to specify what exactly the problem is: he had no ambition to answer but was asking questions.
However, if I interpret correctly what the OP did with the function
get_last_notmissing(),

perhaps the expressions I used are close enough to the OPโ€™s expectations. At least according to isequal()

Summary
julia> using Dates, DataFrames

julia> function get_last_notmissing(x)
           if all(ismissing, x)
               return missing
           else
               return last(collect(skipmissing(x)))
           end
       end
get_last_notmissing (generic function with 1 method)

julia> dt=unique(rand(Date(2021, 1, 1):Day(1):Date(2024, 12, 31),900))     
678-element Vector{Date}:
 2022-03-14
 2021-08-05
 2024-11-19
 โ‹ฎ
 2022-12-19
 2023-07-05
 2022-03-03

julia> dtf=[d in dt ? d : missing for d in    Date(2021, 1, 1):Day(1):Date(2024, 12, 31)];

julia> dff=DataFrame(;dtf);

julia> tr(d)=[year(d), month(d),day(d)]
tr (generic function with 1 method)

julia> tr(d::Missing)=[missing,missing,missing]
tr (generic function with 2 methods)

julia> dfymd=transform(dff, :dtf => (ByRow(tr)=>[:y,:m,:d]))
1461ร—4 DataFrame
  Row โ”‚ dtf         y        m        d       
      โ”‚ Date?       Int64?   Int64?   Int64?
โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
    1 โ”‚ 2021-01-01     2021        1        1
    2 โ”‚ missing     missing  missing  missing
    3 โ”‚ missing     missing  missing  missing
    4 โ”‚ missing     missing  missing  missing
    5 โ”‚ 2021-01-05     2021        1        5
    6 โ”‚ missing     missing  missing  missing
    7 โ”‚ missing     missing  missing  missing
    8 โ”‚ missing     missing  missing  missing
   
  โ‹ฎ   โ”‚     โ‹ฎ          โ‹ฎ        โ‹ฎ        โ‹ฎ
 
 1455 โ”‚ missing     missing  missing  missing
 1456 โ”‚ missing     missing  missing  missing
 1457 โ”‚ missing     missing  missing  missing
 1458 โ”‚ missing     missing  missing  missing
 1459 โ”‚ 2024-12-29     2024       12       29
 1460 โ”‚ missing     missing  missing  missing
 1461 โ”‚ 2024-12-31     2024       12       31
                             1429 rows omitted

julia> a=combine(groupby(dfymd, [:y,:m]), :d => last => :d) ;

julia> b=combine(groupby(dfymd, [:y,:m]), :d => get_last_notmissing => :d);

julia> isequal(a,b)
true

julia> a
49ร—3 DataFrame
 Row โ”‚ y        m        d       
     โ”‚ Int64?   Int64?   Int64?
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚    2021        1       31
   2 โ”‚    2021        2       25
   
  โ‹ฎ  โ”‚    โ‹ฎ        โ‹ฎ        โ‹ฎ
  
  47 โ”‚    2024       11       28
  48 โ”‚    2024       12       31
  49 โ”‚ missing  missing  missing
                  17 rows omitted

There is no indication in OPโ€™s problem that they are missing days, just that they are missing values. And even if they were missing days, thereโ€™s no indication they were missing months, too.

Probably you are rigth

in that case one can do


lnm(v)=reduce((x,y)->coalesce(y,x), v)


df=DataFrame(dt=Date(2021,1,1):Date(2024,12,31))

df.v=[rand()>0.9 ? rand() : missing for _ in 1:length(df.dt)]

dfymd=transform(df, :dt => (ByRow(d->[year(d), month(d),day(d)])=>[:y,:m,:d]))


julia> unstack(combine(groupby(dfymd, [:y,:m]), :v => lnm=>:lv ),:y,:lv,allowmissing=true)
12ร—5 DataFrame
 Row โ”‚ m      2021        2022             2023              2024          
     โ”‚ Int64  Float64?    Float64?         Float64?          Float64?      
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€    
   1 โ”‚     1  0.0450962         0.744343         0.397001    0.586697      
   2 โ”‚     2  0.407692          0.0420504        0.578325    0.707726      
   3 โ”‚     3  0.0732233   missing                0.302875    0.0196249     
   4 โ”‚     4  0.603087          0.780901         0.00713252  0.823499      
   5 โ”‚     5  0.0223508         0.66128          0.935348    0.539294      
   6 โ”‚     6  0.992866          0.183029         0.0583805   0.625163      
   7 โ”‚     7  0.343401          0.584346         0.143872    0.0593296     
   8 โ”‚     8  0.00121432        0.127658   missing           0.376164      
   9 โ”‚     9  0.643115          0.424879         0.912506    0.850037      
  10 โ”‚    10  0.366782          0.382763         0.749203    0.428814      
  11 โ”‚    11  0.380521          0.769725         0.778575    0.969888      
  12 โ”‚    12  0.380221          0.873964         0.919393    0.770854