The following is a “real problem” in the sense that it has been posted in a forum with reference to a real situation.
Therefore the starting data are not pretty settled.
I have (or almost … missing, in fact, the addition of the holidays. But I am looking forward to a comfortable solution, it annoys me to do it by hand) a solution, but I would like to see others and above all improve it where possible.
Almost certainly in reading the source csv file, one do better in determining the type of the columns. Considering that, for example, the format of numbers and dates is not the local one (“,” instead of “.” for n umbers). I need to do some conversion, etc.
I wonder which packages are best suited in this regard?
I would like to know what is the most convenient way to add a set of custom holiday dates and / or predefined holidays locally, to exclude them from working days?
For the resolution, I used some auxiliary functions and DataFrames.
But can it also be solved using other ways?
here is the input file
[Can’t upload files directly here?]
What one want to achieve , for each employee, is to group and add up all the sickness absences made on consecutive days, considering consecutive even the days that are separated only by holidays and weekends.
This is the expected result
66×6 DataFrame
Row │ Employee Number ResID(T) Department firstandlast tot_days_cons n
│ Int64 String String Any Float64 Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 10003 John Smith HR ("2019-01-11", "2019-01-16") 4.0 4
2 │ 10004 Joe Bloggs HR ("2019-02-04", "2019-02-05") 2.0 2
3 │ 10004 Joe Bloggs HR 2019-04-01 1.0 1
4 │ 10004 Joe Bloggs HR 2019-10-29 1.0 1
5 │ 10006 Janice Evans HR 2019-07-02 0.1 1
6 │ 10006 Janice Evans HR 2019-07-18 0.27 1
7 │ 10006 Janice Evans HR ("2019-08-28", "2019-08-29") 0.47 2
8 │ 10006 Janice Evans HR 2019-10-03 0.17 1
9 │ 10006 Janice Evans HR 2019-10-17 0.27 1
10 │ 10008 Peter Burroughs HR 2019-04-17 0.47 1
11 │ 10008 Peter Burroughs HR 2019-04-30 0.27 1
12 │ 10008 Peter Burroughs HR 2019-05-07 0.13 1
13 │ 10008 Peter Burroughs HR 2019-06-05 0.07 1
14 │ 10008 Peter Burroughs HR ("2019-06-12", "2019-06-13") 0.22 2
15 │ 10008 Peter Burroughs HR 2019-07-02 0.13 1
16 │ 10008 Peter Burroughs HR 2019-09-03 0.33 1
17 │ 10008 Peter Burroughs HR 2019-10-01 0.07 1
18 │ 10008 Peter Burroughs HR ("2019-11-29", "2019-12-03") 1.0 3
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
49 │ 10022 Alice Barnard Sales 2019-06-03 0.53 1
50 │ 10022 Alice Barnard Sales 2019-06-13 1.0 1
51 │ 10022 Alice Barnard Sales 2019-07-16 0.3 1
52 │ 10022 Alice Barnard Sales 2019-07-25 0.23 1
53 │ 10022 Alice Barnard Sales 2019-07-29 0.4 1
54 │ 10022 Alice Barnard Sales 2019-08-01 0.4 1
55 │ 10022 Alice Barnard Sales 2019-08-09 0.77 1
56 │ 10022 Alice Barnard Sales ("2019-08-20", "2019-08-21") 0.97 2
57 │ 10022 Alice Barnard Sales 2019-10-01 0.4 1
58 │ 10022 Alice Barnard Sales 2019-10-16 0.23 1
59 │ 10022 Alice Barnard Sales 2019-10-18 0.53 1
60 │ 10022 Alice Barnard Sales ("2019-11-04", "2019-11-11") 6.0 6
61 │ 10022 Alice Barnard Sales ("2019-11-12", "2019-11-28") 13.0 13
62 │ 10022 Alice Barnard Sales 2019-11-29 1.0 1
63 │ 10022 Alice Barnard Sales 2019-12-05 0.7 1
64 │ 10024 Daisy Meadows Sales 2019-09-05 0.5 1
65 │ 10024 Daisy Meadows Sales ("2019-09-18", "2019-09-18") 1.0 2
66 │ 10024 Daisy Meadows Sales 2019-12-19 1.0 1
This is the solution that I have found and that I expect to improve
using BusinessDays, Dates, CSV, DataFrames
df=CSV.read("C:\\Users\\sprmn\\...\\grouping consecutive sick days.csv", DataFrame)
df.Date=Date.(df.Date, Dates.DateFormat("dd/mm/yyyy"))
df.Days= parse.(Float64,replace.(df.Days, ","=>"."))
sort!(df,["Employee Number", "Date"])
gdf=groupby(df,[1,2,3])
cal=BusinessDays.USSettlement()
function consbdays(d)
gid=similar(d)
gid[1]=d[1]
for i in 2:length(d)
if bdays(cal, d[i-1],d[i])==Day(1)
gid[i]=gid[i-1]
else
gid[i]=d[i]
end
end
return gid
end
tdf=transform(gdf, :Date=>consbdays=>:DateGr)
tgdf=groupby(tdf,["Employee Number","ResID(T)","DateGr","Department"])
function extrem(x)
try only(x)
catch e
(string(first(x)),string(last(x)))
end
end
c1=combine(tgdf, :Date=>(extrem)=>
:firstandlast,:Days=>sum=>:tot_days_cons, nrow=>:n)[:,Not(:DateGr)]