Hello,
For one of the codes I was working on, I had to count the number of unique weeks in a time series. I ended up using the Dates.week()
function and found that the result can be incorrect when the data spans more than one calendar year:
julia> dates = range(Date(2008, 8, 1), Date(2009, 1, 1), step=Day(1))
Date("2008-08-01"):Day(1):Date("2009-01-01")
julia> unique([(year(x), week(x)) for x in dates])
24-element Vector{Tuple{Int64, Int64}}:
(2008, 31)
(2008, 32)
(2008, 33)
(2008, 34)
(2008, 35)
(2008, 36)
(2008, 37)
(2008, 38)
(2008, 39)
(2008, 40)
(2008, 41)
(2008, 42)
(2008, 43)
(2008, 44)
(2008, 45)
(2008, 46)
(2008, 47)
(2008, 48)
(2008, 49)
(2008, 50)
(2008, 51)
(2008, 52)
(2008, 1) <-- produced by Date(2008, 12, 29); week 1 of 2008 is not in the data
(2009, 1)
julia> length(unique([(year(x), week(x)) for x in dates]))
24
Here, (2008, 1) is not correct because week 1 of 2008 is not present in dates. It comes from Date(2008, 12, 29), which falls in ISO week 1 of 2009 but still has calendar year 2008. Date(2008, 1, 1), a date that is not even in the data, also maps to (2008, 1) as a (year, week) tuple. So if one counts the unique weeks using the (year(x), week(x)) method, the result is incorrect: the last days of December 2008 share their week number with the first days of January 2009 but carry a different year, so that one week is counted twice. The correct result would be 23.
This is, of course, because week() returns the ISO week number, but it is an easy mistake to make when using the week() function. I feel this is a use case for having a count(::AbstractVector{Date}, ::Type{Week}) method in the Dates module.
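For comparison, here is a minimal sketch of two ways to get 23 for the example above using only existing Dates functions. The first identifies each week by its Monday via firstdayofweek(). The second pairs week() with the ISO week-numbering year; isoyear is a hypothetical helper written for this post, not something Dates provides, and it relies on the ISO rule that a week belongs to the year containing its Thursday.

```julia
using Dates

dates = Date(2008, 8, 1):Day(1):Date(2009, 1, 1)

# Option 1: identify each week by the Monday that starts it,
# so no (year, week) pairing is needed at all.
weeks_by_monday = unique(firstdayofweek.(dates))

# Option 2: pair week() with the ISO week-numbering year.
# `isoyear` is a hypothetical helper (not part of Dates): the ISO year of a
# date is the calendar year of the Thursday in the same Monday-to-Sunday week.
isoyear(d::Date) = year(firstdayofweek(d) + Day(3))
weeks_by_iso = unique([(isoyear(d), week(d)) for d in dates])

println(length(weeks_by_monday))  # 23
println(length(weeks_by_iso))     # 23
```

Both are O(n) in time and memory, so they are only a correctness check, not a replacement for a dedicated method.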
I am counting weeks that start on Monday and end on Sunday. So a series that starts on a Sunday and ends on the Monday eight days later (9 days in total) contains 3 weeks: the starting Sunday alone, one full Monday-to-Sunday week, and the final Monday alone.
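To make the 9-day case concrete, here is a small check using arbitrarily chosen dates (2024-01-07 is a Sunday). It groups each day by the Monday of its week with firstdayofweek(), which is just one way to illustrate the definition.

```julia
using Dates

# 2024-01-07 is a Sunday; 2024-01-15 is the Monday eight days later.
nine_days = Date(2024, 1, 7):Day(1):Date(2024, 1, 15)

@assert dayofweek(first(nine_days)) == 7   # Sunday
@assert dayofweek(last(nine_days)) == 1    # Monday
@assert length(nine_days) == 9

# Each distinct Monday corresponds to one Monday-to-Sunday week touched by the data.
mondays = unique(firstdayofweek.(nine_days))
println(length(mondays))  # 3 (weeks starting 2024-01-01, 2024-01-08, 2024-01-15)
```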
I have written sample methods for date vectors and ranges that handle unsorted, irregular, and missing values:
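using Dates
import Base: count

# Count the Monday-to-Sunday weeks covered by a daily range (assumes step == Day(1)).
function count(dates::AbstractRange{Date}, ::Type{Week})
    isempty(dates) && return 0
    firstmonday = findfirst(x -> dayofweek(x) == 1, dates)
    lastsunday = findlast(x -> dayofweek(x) == 7, dates)
    # Without a Monday or without a Sunday, the range lies inside a single week.
    (firstmonday === nothing || lastsunday === nothing) && return 1
    numweeks::Int = 0
    numweeks += firstmonday == 1 ? 0 : 1                # partial week at the start
    numweeks += lastsunday == lastindex(dates) ? 0 : 1  # partial week at the end
    # Full weeks between the first Monday and the last Sunday.
    numweeks += div((dates[lastsunday] - dates[firstmonday] + Day(1)).value, 7)
    numweeks
end

# Vectors: only the earliest and latest dates matter.
function count(dates::AbstractVector{Date}, ::Type{Week}, sorted::Bool)
    mindate = sorted ? first(dates) : minimum(dates)
    maxdate = sorted ? last(dates) : maximum(dates)
    count(range(mindate, maxdate, step=Day(1)), Week)
end

# Missing values: drop them, then treat the remaining dates as unsorted.
function count(dates::T, ::Type{Week}) where {T<:Base.SkipMissing{Vector{Union{Missing, Date}}}}
    count(collect(dates), Week, false)
end

function count(dates::T, ::Type{Week}) where {T<:AbstractVector{Union{Missing, Date}}}
    count(skipmissing(dates), Week)
end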
Performance, measured with @btime from BenchmarkTools (sample comes from StatsBase):
For sorted data:
julia> dates = range(Date(1900, 1, 1), Date(2024, 12, 1), step=Day(1))
Date("1900-01-01"):Day(1):Date("2024-12-01")
julia> @btime count(dates, Week)
182.734 ns (1 allocation: 16 bytes)
6518
julia> dd = collect(dates); @btime count(dd, Week, true)
191.824 ns (1 allocation: 16 bytes)
6518
For unsorted data:
julia> dates_unsorted = sample(dates, length(dates), replace=true);
julia> @btime count(dates_unsorted, Week, false)
46.837 μs (1 allocation: 16 bytes)
6518
The algorithm is O(1) for sorted data and O(n) for unsorted data, because finding the earliest and latest dates requires one pass over the data. That is the best we can do.
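As a cross-check on the complexity argument, the same count can be sketched in closed form: once the earliest and latest dates are known, the number of weeks is the distance between their Mondays divided by 7, plus one. weeks_spanned is a hypothetical name used only for this illustration, not a proposed API, and it assumes non-empty input (extrema throws on an empty collection).

```julia
using Dates

# Hypothetical helper (not part of Dates): number of Monday-to-Sunday weeks
# between the earliest and latest date, inclusive. Like the methods above,
# it counts every week in that span, including weeks with no data points.
function weeks_spanned(dates)
    lo, hi = extrema(dates)   # single pass; works on unsorted input
    div(Dates.value(firstdayofweek(hi) - firstdayofweek(lo)), 7) + 1
end

println(weeks_spanned(Date(2008, 8, 1):Day(1):Date(2009, 1, 1)))    # 23
println(weeks_spanned(Date(1900, 1, 1):Day(1):Date(2024, 12, 1)))   # 6518
```

Because it only needs extrema, the same function also accepts an unsorted Vector{Date} or skipmissing(v) without extra methods. The cost is still dominated by the O(n) scan for the minimum and maximum.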
Are the ease of use and the performance good enough for this to be included in the Dates module?
This count() method can be extended to counting months, quarters, or years as well. Here, I am making a case to introduce