Grouping consecutive sick days

The following is a “real problem” in the sense that it was posted in a forum with reference to a real situation.
Therefore the starting data are not very tidy.
I have a solution (or almost: the handling of holidays is in fact still missing, but I am looking for a convenient way to add it, since doing it by hand annoys me), and I would like to see others and, above all, improve mine where possible.
Almost certainly, when reading the source CSV file, one could do better in determining the types of the columns, considering that, for example, the format of numbers and dates is not the local one (“,” instead of “.” for numbers), so I need to do some conversion, etc.
I wonder which packages are best suited in this regard?
I would also like to know the most convenient way to add a set of custom holiday dates and/or locally predefined holidays, to exclude them from working days.
For the resolution, I used some auxiliary functions and DataFrames.
But can it also be solved in other ways?

Here is the input file:
[Can’t upload files directly here?]

What one wants to achieve, for each employee, is to group and sum all the sickness absences taken on consecutive days, considering as consecutive even days that are separated only by holidays and weekends.

This is the expected result

66×6 DataFrame
 Row │ Employee Number  ResID(T)          Department  firstandlast                  tot_days_cons  n     
     │ Int64            String            String      Any                           Float64        Int64 
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │           10003  John Smith        HR          ("2019-01-11", "2019-01-16")           4.0       4 
   2 │           10004  Joe Bloggs        HR          ("2019-02-04", "2019-02-05")           2.0       2 
   3 │           10004  Joe Bloggs        HR          2019-04-01                             1.0       1 
   4 │           10004  Joe Bloggs        HR          2019-10-29                             1.0       1 
   5 │           10006  Janice Evans      HR          2019-07-02                             0.1       1 
   6 │           10006  Janice Evans      HR          2019-07-18                             0.27      1 
   7 │           10006  Janice Evans      HR          ("2019-08-28", "2019-08-29")           0.47      2 
   8 │           10006  Janice Evans      HR          2019-10-03                             0.17      1 
   9 │           10006  Janice Evans      HR          2019-10-17                             0.27      1 
  10 │           10008  Peter Burroughs   HR          2019-04-17                             0.47      1 
  11 │           10008  Peter Burroughs   HR          2019-04-30                             0.27      1 
  12 │           10008  Peter Burroughs   HR          2019-05-07                             0.13      1 
  13 │           10008  Peter Burroughs   HR          2019-06-05                             0.07      1 
  14 │           10008  Peter Burroughs   HR          ("2019-06-12", "2019-06-13")           0.22      2
  15 │           10008  Peter Burroughs   HR          2019-07-02                             0.13      1
  16 │           10008  Peter Burroughs   HR          2019-09-03                             0.33      1
  17 │           10008  Peter Burroughs   HR          2019-10-01                             0.07      1
  18 │           10008  Peter Burroughs   HR          ("2019-11-29", "2019-12-03")           1.0       3
  ⋮  │        ⋮                ⋮              ⋮                    ⋮                      ⋮          ⋮
  49 │           10022  Alice Barnard     Sales       2019-06-03                             0.53      1
  50 │           10022  Alice Barnard     Sales       2019-06-13                             1.0       1
  51 │           10022  Alice Barnard     Sales       2019-07-16                             0.3       1
  52 │           10022  Alice Barnard     Sales       2019-07-25                             0.23      1
  53 │           10022  Alice Barnard     Sales       2019-07-29                             0.4       1
  54 │           10022  Alice Barnard     Sales       2019-08-01                             0.4       1
  55 │           10022  Alice Barnard     Sales       2019-08-09                             0.77      1
  56 │           10022  Alice Barnard     Sales       ("2019-08-20", "2019-08-21")           0.97      2
  57 │           10022  Alice Barnard     Sales       2019-10-01                             0.4       1
  58 │           10022  Alice Barnard     Sales       2019-10-16                             0.23      1
  59 │           10022  Alice Barnard     Sales       2019-10-18                             0.53      1
  60 │           10022  Alice Barnard     Sales       ("2019-11-04", "2019-11-11")           6.0       6
  61 │           10022  Alice Barnard     Sales       ("2019-11-12", "2019-11-28")          13.0      13
  62 │           10022  Alice Barnard     Sales       2019-11-29                             1.0       1
  63 │           10022  Alice Barnard     Sales       2019-12-05                             0.7       1
  64 │           10024  Daisy Meadows     Sales       2019-09-05                             0.5       1
  65 │           10024  Daisy Meadows     Sales       ("2019-09-18", "2019-09-18")           1.0       2
  66 │           10024  Daisy Meadows     Sales       2019-12-19                             1.0       1

This is the solution that I have found, and which I hope to improve:

using BusinessDays, Dates, CSV, DataFrames

df = CSV.read("C:\\Users\\sprmn\\...\\grouping consecutive sick days.csv", DataFrame)
df.Date = Date.(df.Date, Dates.DateFormat("dd/mm/yyyy"))       # dates come as dd/mm/yyyy strings
df.Days = parse.(Float64, replace.(df.Days, "," => "."))       # decimal comma -> decimal point
sort!(df, ["Employee Number", "Date"])
gdf = groupby(df, [1, 2, 3])       # group by employee number, name, department

cal = BusinessDays.USSettlement()

# label each date with the first date of its block: a date that is exactly
# one business day after the previous one (weekends/holidays in between
# don't count) joins the previous date's group
function consbdays(d)
    gid = similar(d)
    gid[1] = d[1]
    for i in 2:length(d)
        if bdays(cal, d[i-1], d[i]) == Day(1)
            gid[i] = gid[i-1]     # consecutive: keep the same group label
        else
            gid[i] = d[i]         # gap: this date starts a new group
        end
    end
    return gid
end

tdf = transform(gdf, :Date => consbdays => :DateGr)
tgdf = groupby(tdf, ["Employee Number", "ResID(T)", "DateGr", "Department"])

# a block of one day is reported as that day; longer blocks as a
# (first, last) pair of date strings
function extrem(x)
    try only(x)
    catch e
        (string(first(x)), string(last(x)))
    end
end

c1 = combine(tgdf, :Date => extrem => :firstandlast,
             :Days => sum => :tot_days_cons, nrow => :n)[:, Not(:DateGr)]

There is a nice technique I learned from @bkamins here: Creating User-Defined Grouped DataFrames. It might work well for your particular use case. In my linked post, basically, you group by creating a “dummy” column that encodes the grouping logic you want. This might be a more concise approach than your solution. Hope it helps!
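In miniature, the dummy-column idea looks like this (a sketch with made-up data):

using DataFrames

df = DataFrame(id = [1, 2, 3, 10, 11, 20])
# “dummy” grouping column: start a new group wherever the gap to the previous row exceeds 1
df.grp = cumsum([true; diff(df.id) .> 1])
combine(groupby(df, :grp), :id => extrema => :range, nrow => :n)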

Thanks for the reference.
I know the technique (on some occasions bkamins suggested it to me as an alternative to my request to introduce a parameter in the groupby function, such as the comparison criteria available in some functions of the SplitApplyCombine.jl package).
In fact, I used this technique for my solution.

It’s mostly a difference in syntax, but I added grouping by derived columns to @groupby in DataFrameMacros.jl. I find it nicer to specify a special grouping column directly in the grouping call, rather than adding it first and then grouping by it. But it does the same in the background anyway, so it’s not like it’s faster or anything.
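For example (a sketch with made-up data):

using DataFrames, DataFrameMacros

df = DataFrame(age = [25, 50, 31, 64])
gdf = @groupby df :isold = :age > 40   # create the derived column and group by it in one call
combine(gdf, nrow => :n)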


This is something I have, coincidentally, had to do twice in the past year. Here’s how I would do it.


julia> using CSV, DataFrames, DataFramesMeta, Dates, ShiftedArrays

julia> df = @chain "grouping consecutive sick days.csv" begin
           CSV.read(DataFrame)
           @rtransform begin
               :Date = Date(:Date, dateformat"dd/mm/yyyy")
               :Days = parse(Float64, replace(:Days, "," => "."))
           end
           @orderby $"Employee Number" :Date
           groupby(["Employee Number", "ResID(T)", "Department"])
           @transform :diff_days = :Date .- lag(:Date)
           @rtransform :diff_days = ismissing(:diff_days) ? 1 : :diff_days.value
           @rtransform :new_day = :diff_days != 1
           groupby(["Employee Number", "ResID(T)", "Department"])
           @transform :sick_block_num = cumsum(:new_day)
           # group by the sick-block id as well, so start/end are per block
           groupby(["Employee Number", "ResID(T)", "Department", "sick_block_num"])
           @transform begin
               :absence_start = minimum(:Date)
               :absence_end = maximum(:Date)
           end
       end

Some notes on what we could do to make things easier.

  1. Some sort of interactive mode with @chain where I can see the data frame after each transformation as I type. I know I can just add df to things and make them mutable, but then I have this global state that changes, which is annoying. This would be a pretty major undertaking, I think. You would have to have a new REPL mode that hooks into the parsing infrastructure of Chain.jl.

  2. We need a @groupby command, but I’m still hesitant to add transformations to it, since that’s not what DataFrames.jl does. But some sort of var-arg @groupby would be nice. Writing this code now, the behavior I really want is something like

@groupby df :a :b :c begin 
    @transform ...
    @transform
end

Where @transform inside that block uses ungroup = false, so that the grouping is preserved across transformations.
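With plain DataFrames.jl that corresponds to something like this (a sketch with made-up data):

using DataFrames

df = DataFrame(a = [1, 1, 2], x = [10, 20, 30])
gdf = groupby(df, :a)
gdf = transform(gdf, :x => cumsum => :cx; ungroup = false)   # still a GroupedDataFrame
combine(gdf, :cx => last => :total)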

  3. The last command, making the start and end days, should be handled by the new @astable macro-flag, which enables the creation of multiple columns in a single transform block. It’s in master but I haven’t made a release yet since I’ve had trouble with the documentation. But I should make a release today.

  4. ShiftedArrays.jl needs a diff function which is the same size as the original array. It looks like I’ve had this thought before. I should really make a PR.


From what I see, there doesn’t seem to be any more convenient shortcut for reading numbers in a local format than yours (I was imagining some option like locale = “IT” where everything would work by itself, without using format and/or replace).
I point out, even if it is not a big difference, that in my case even days like 01/10/2021 and 04/10/2021 (written in the format dd/mm/yyyy) are consecutive.

On your point 4., would something like this be fine, or is it not generic/robust enough?

using ShiftedArrays

diffnew(v; default=missing) = v .- ShiftedArray(v, 1; default=default)
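Usage would look like:

diffnew([3, 5, 10])             # [missing, 2, 5], same length as the input
diffnew([3, 5, 10]; default=0)  # [3, 2, 5]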

Indeed, here is my PR.

Shouldn’t rows 60 and 61 be combined? Also, shouldn’t row 7 be two separate rows, since they are not full-day leaves?

You are right about rows 60 and 61.
This derives from the fact that the bdays function gives the following result, and I don’t know why (maybe some holiday on the American calendar not known to HR people; we need to investigate …)

julia> bdays(cal, Date(2019,11,11),Date(2019,11,12))
0 days

About row 7, I don’t know how to interpret it … the HR people who are interested in these things should be asked 🙂

PS:
Maybe I understand the problem. I used the US calendar from the BusinessDays.jl library.
From Google, it seems that 11/11 is a holiday in the USA (Veterans Day).
Surely in the country of the person who posted this issue, 11/11/2019 was not a public holiday.
Therefore …
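This can be checked directly with isbday:

julia> isbday(BusinessDays.USSettlement(), Date(2019, 11, 11))   # Veterans Day
false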

Also, you need to account for weekends. Presumably there is something that maps dates to workdays, numbered 1, 2, ... 261.
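Something like the following could provide that mapping with BusinessDays.jl (a sketch; workdaynum is a made-up helper, and bdayscount counts the business days in [dt0, dt1)):

using BusinessDays, Dates

cal = BusinessDays.EuroZone()
# ordinal of a business day within its year: business days before it, plus one
workdaynum(d::Date) = bdayscount(cal, Date(year(d), 1, 1), d) + 1

workdaynum(Date(2019, 1, 2))   # -> 1, the first business day of 2019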

Yes.
Changing the holiday calendar, the script works better:

# cal = BusinessDays.USSettlement()
cal = BusinessDays.EuroZone()

julia> bdays(cal, Date(2021,11,11),Date(2021,11,12))
1 day

julia> combine(tgdf, :Date=>(extrem)=>:firstandlast,
         :Days=>sum=>:tot_days_cons, nrow=>:n)[:,Not(:DateGr)]
64×6 DataFrame
 Row │ Employee Number  ResID(T)          Department  firstandlast                  tot_days_cons  n     
     │ Int64            String31          String7     Any                           Float64        Int64 
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │           10003  John Smith        HR          ("2019-01-11", "2019-01-16")           4.0       4
   2 │           10004  Joe Bloggs        HR          ("2019-02-04", "2019-02-05")           2.0       2
   3 │           10004  Joe Bloggs        HR          2019-04-01                             1.0       1
   4 │           10004  Joe Bloggs        HR          2019-10-29                             1.0       1
   5 │           10006  Janice Evans      HR          2019-07-02                             0.1       1
   6 │           10006  Janice Evans      HR          2019-07-18                             0.27      1
   7 │           10006  Janice Evans      HR          ("2019-08-28", "2019-08-29")           0.47      2
   8 │           10006  Janice Evans      HR          2019-10-03                             0.17      1
   9 │           10006  Janice Evans      HR          2019-10-17                             0.27      1
  10 │           10008  Peter Burroughs   HR          2019-04-17                             0.47      1
  11 │           10008  Peter Burroughs   HR          2019-04-30                             0.27      1
  12 │           10008  Peter Burroughs   HR          2019-05-07                             0.13      1
  13 │           10008  Peter Burroughs   HR          2019-06-05                             0.07      1
  14 │           10008  Peter Burroughs   HR          ("2019-06-12", "2019-06-13")           0.22      2
  15 │           10008  Peter Burroughs   HR          2019-07-02                             0.13      1
  16 │           10008  Peter Burroughs   HR          2019-09-03                             0.33      1
  17 │           10008  Peter Burroughs   HR          2019-10-01                             0.07      1
  18 │           10008  Peter Burroughs   HR          ("2019-11-29", "2019-12-03")           1.0       3
  ⋮  │        ⋮                ⋮              ⋮                    ⋮                      ⋮          ⋮
  47 │           10022  Alice Barnard     Sales       2019-03-14                             0.2       1
  48 │           10022  Alice Barnard     Sales       ("2019-04-01", "2019-04-02")           2.0       2
  49 │           10022  Alice Barnard     Sales       2019-06-03                             0.53      1
  50 │           10022  Alice Barnard     Sales       2019-06-13                             1.0       1
  51 │           10022  Alice Barnard     Sales       2019-07-16                             0.3       1
  52 │           10022  Alice Barnard     Sales       2019-07-25                             0.23      1
  53 │           10022  Alice Barnard     Sales       2019-07-29                             0.4       1
  54 │           10022  Alice Barnard     Sales       2019-08-01                             0.4       1
  55 │           10022  Alice Barnard     Sales       2019-08-09                             0.77      1
  56 │           10022  Alice Barnard     Sales       ("2019-08-20", "2019-08-21")           0.97      2
  57 │           10022  Alice Barnard     Sales       2019-10-01                             0.4       1
  58 │           10022  Alice Barnard     Sales       2019-10-16                             0.23      1
  59 │           10022  Alice Barnard     Sales       2019-10-18                             0.53      1
  60 │           10022  Alice Barnard     Sales       ("2019-11-04", "2019-11-29")          20.0      20
  61 │           10022  Alice Barnard     Sales       2019-12-05                             0.7       1
  62 │           10024  Daisy Meadows     Sales       2019-09-05                             0.5       1
  63 │           10024  Daisy Meadows     Sales       ("2019-09-18", "2019-09-18")           1.0       2
  64 │           10024  Daisy Meadows     Sales       2019-12-19                             1.0       1
                                                                                          28 rows omitted
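If no predefined calendar matches the local holidays, BusinessDays.jl also allows defining a custom calendar by subtyping HolidayCalendar and implementing isholiday (a sketch; the type name and the extra dates are made up):

using BusinessDays, Dates

struct LocalCalendar <: BusinessDays.HolidayCalendar end

const EXTRA_HOLIDAYS = Set([Date(2019, 8, 15), Date(2019, 11, 1)])

# local holidays on top of the EuroZone ones
BusinessDays.isholiday(::LocalCalendar, d::Date) =
    d in EXTRA_HOLIDAYS || BusinessDays.isholiday(BusinessDays.EuroZone(), d)

cal = LocalCalendar()
bdays(cal, Date(2019, 8, 14), Date(2019, 8, 16))   # -> 1 day (Aug 15 is skipped)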

BTW, you are aware that the firstandlast column is very odd and will not be performant, right? Having a vector of type Any means working with it will be slow. And having heterogeneous entry types might cause headaches for people trying to use your data. Why not make two columns?
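For instance, something like this (a sketch reusing tgdf from above; absence_start and absence_end are made-up names):

combine(tgdf,
    :Date => (d -> (absence_start = first(d), absence_end = last(d))) => AsTable,
    :Days => sum => :tot_days_cons,
    nrow => :n)[:, Not(:DateGr)]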

Well … yes and no. I thought that a vector of type Any is not the best for performance, and therefore I had thought of changing the extrem function in the following way (but I was not sure whether this actually solved the problem, and I have not worried much yet about these performance aspects, which I would still like to start learning about).

# the old version, for comparison
function extremOLD(x)
    try only(x)
    catch e
        (string(first(x)),string(last(x)))
    end
end

julia> function extrem(x)
           try (string(only(x)),)
           catch e
               (string(first(x)),string(last(x)))
           end
       end
extrem (generic function with 1 method)

julia> c1=combine(tgdf, :Date=>(extrem)=>
                 :firstandlast,:Days=>sum=>:tot_days_cons, nrow=>:n)[:,Not(:DateGr)][1:10,:]
10×6 DataFrame
 Row │ Employee Number  ResID(T)         Department  firstandlast                  tot_days_cons  n     
     │ Int64            String31         String7     Tuple…                        Float64        Int64 
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
   1 │           10003  John Smith       HR          ("2019-01-11", "2019-01-16")           4.0       4
   2 │           10004  Joe Bloggs       HR          ("2019-02-04", "2019-02-05")           2.0       2
   3 │           10004  Joe Bloggs       HR          ("2019-04-01",)                        1.0       1
   4 │           10004  Joe Bloggs       HR          ("2019-10-29",)                        1.0       1
   5 │           10006  Janice Evans     HR          ("2019-07-02",)                        0.1       1
   6 │           10006  Janice Evans     HR          ("2019-07-18",)                        0.27      1
   7 │           10006  Janice Evans     HR          ("2019-08-28", "2019-08-29")           0.47      2
   8 │           10006  Janice Evans     HR          ("2019-10-03",)                        0.17      1
   9 │           10006  Janice Evans     HR          ("2019-10-17",)                        0.27      1
  10 │           10008  Peter Burroughs  HR          ("2019-04-17",)                        0.47      1

Regarding dividing the start/end pair into two columns, I had only followed a requirement of the applicant and did not think about preparing the ground for further processing.

Now I’m more curious though. How are you getting the data back to the applicant? You can’t write a tuple of dates to CSV, so they will have difficulty reading it in.

This was the required result, plus some auxiliary columns. So I thought that putting the information about the beginning and the end of the absence in some sort of list might be what was needed.
Perhaps it would be more consistent with the request to concatenate the two strings.
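For example (a sketch; extremstr is a made-up name):

# one CSV-friendly string per absence block, instead of a tuple
extremstr(x) = length(x) == 1 ? string(only(x)) : string(first(x), " - ", last(x))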