# Counting weeks in a timeseries

Hello,

For some code I was working on, I had to count the number of unique weeks in a timeseries. I ended up using the `Dates.week()` function and found that the result can be incorrect when the data crosses a year boundary:

```julia
julia> dates = range(Date(2008, 8, 1), Date(2009, 1, 1), step=Day(1))
Date("2008-08-01"):Day(1):Date("2009-01-01")

julia> unique([(year(x), week(x)) for x in dates])
24-element Vector{Tuple{Int64, Int64}}:
(2008, 31)
(2008, 32)
(2008, 33)
(2008, 34)
(2008, 35)
(2008, 36)
(2008, 37)
(2008, 38)
(2008, 39)
(2008, 40)
(2008, 41)
(2008, 42)
(2008, 43)
(2008, 44)
(2008, 45)
(2008, 46)
(2008, 47)
(2008, 48)
(2008, 49)
(2008, 50)
(2008, 51)
(2008, 52)
(2008, 1)   <-- week not present in data: Date(2008-12-29)
(2009, 1)

julia> length(unique([(year(x), week(x)) for x in dates]))
24
```

Here, `(2008, 1)` is not correct because week 1 of 2008 is not present in `dates`. The tuple comes from `Date(2008, 12, 29)`, which belongs to ISO week 1 of 2009; `Date(2008, 1, 1)`, a date that is not even in the data, maps to the same `(2008, 1)` tuple. Counting unique weeks with `(year(x), week(x))` therefore gives an incorrect result, because the last days of December 2008 fall in the same week as the first days of January 2009 but produce a different tuple. The correct result would be `23`.

This is, of course, because `week()` returns the ISO week number, but it is easy to make this mistake when using `week()`. I feel this is a use case for a `count(::AbstractVector{Date}, ::Type{Week})` method in the `Dates` module.
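One way to keep the `(year, week)` approach is to pair the ISO week number with the ISO week-numbering year, which is the calendar year of the Thursday in the same week. The sketch below assumes a hypothetical helper `isoyear` (it is not part of `Dates`) and reuses the date range from the example above.

```julia
using Dates

# Hypothetical helper, not part of Dates: the ISO week-numbering year is the
# calendar year of the Thursday that falls in the same Monday-to-Sunday week.
isoyear(d::Date) = year(d + Day(4 - dayofweek(d)))

dates = range(Date(2008, 8, 1), Date(2009, 1, 1), step=Day(1))

# Pairing week() with the calendar year splits the last week into two tuples ...
naive = unique((year(x), week(x)) for x in dates)
# ... while pairing it with the ISO year keeps each week as a single tuple.
iso = unique((isoyear(x), week(x)) for x in dates)

println(length(naive))  # 24
println(length(iso))    # 23
```

This only repairs the tuple-based count; it still walks every date, so it does not replace a dedicated counting method.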

I am counting weeks that start on Monday and end on Sunday. So, a series starting on a Sunday and ending on the Monday eight days later (9 days in total) will contain 3 weeks.

I have written sample methods for a date vector and a date range that take care of unsorted, irregular, and missing values:

```julia
using Dates

# Count the Monday-to-Sunday weeks touched by a daily date range.
# Assumes the range contains at least one Monday and one Sunday.
function count(dates::AbstractRange{Date}, ::Type{Week})
    numweeks::Int = 0
    firstmonday = findfirst(x -> dayofweek(x) == 1, dates)
    lastsunday = findlast(x -> dayofweek(x) == 7, dates)
    numweeks += firstmonday == 1 ? 0 : 1                # partial week at the start
    numweeks += lastsunday == lastindex(dates) ? 0 : 1  # partial week at the end
    numweeks += div((dates[lastsunday] - dates[firstmonday] + Day(1)).value, 7)
    numweeks
end

# Vector of dates: only the earliest and latest dates matter.
function count(dates::AbstractVector{Date}, ::Type{Week}, sorted::Bool)
    mindate = sorted ? first(dates) : minimum(dates)
    maxdate = sorted ? last(dates) : maximum(dates)
    count(range(mindate, maxdate, step=Day(1)), Week)
end

# Missing values are dropped; the remaining dates are treated as unsorted.
function count(dates::T, ::Type{Week}) where {T<:Base.SkipMissing{Vector{Union{Missing, Date}}}}
    count(collect(dates), Week, false)
end

function count(dates::T, ::Type{Week}) where {T<:AbstractVector{Union{Missing, Date}}}
    count(skipmissing(dates), Week)
end
```

### Performance

For sorted data:

```julia
julia> dates = range(Date(1900, 1, 1), Date(2024, 12, 1), step=Day(1))
Date("1900-01-01"):Day(1):Date("2024-12-01")

julia> @btime count(dates, Week)
182.734 ns (1 allocation: 16 bytes)
6518

julia> dd = collect(dates); @btime count(dd, Week, true)
191.824 ns (1 allocation: 16 bytes)
6518
```

For unsorted data:

```julia
julia> dates_unsorted = sample(dates, length(dates), replace=true);

julia> @btime count(dates_unsorted, Week, false)
46.837 μs (1 allocation: 16 bytes)
6518
```

The algorithm is O(1) for sorted data and O(n) for unsorted data, which is the best we can do because finding the minimum and maximum requires a full pass over the data.
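To illustrate where those costs come from, here is a minimal sketch of an alternative closed-form count (not the methods posted above). The name `weeks_spanned` is hypothetical; the sketch relies on `firstdayofweek` from `Dates`, which returns the Monday of a date's week. Once the end points are known the arithmetic is constant time, and the only O(n) step for unsorted data is the single `extrema` pass.

```julia
using Dates

# Hypothetical helper: number of Monday-to-Sunday weeks touched by the
# closed interval [lo, hi], computed from the Mondays of the two end weeks.
function weeks_spanned(lo::Date, hi::Date)
    hi < lo && return 0
    Dates.value(firstdayofweek(hi) - firstdayofweek(lo)) ÷ 7 + 1
end

# Unsorted vector: one O(n) pass for the end points, then O(1) arithmetic.
weeks_spanned(dates::AbstractVector{Date}) = weeks_spanned(extrema(dates)...)

println(weeks_spanned(Date(2021, 10, 31), Date(2021, 11, 8)))  # Sunday to Monday, 9 days: 3
println(weeks_spanned([Date(2009, 1, 1), Date(2008, 8, 1)]))   # 23
```

Because it never searches for a Monday or Sunday inside the data, this formulation also works for ranges shorter than a week.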

Are the ease of use and the performance good enough for this to be included in the `Dates` module?

This `count()` method can be extended to counting months, quarters, or years as well. Here, I am making a case to introduce


For the date range case, why not count the weeks as follows:

```julia
julia> divrem(length(dates), 7)
(56, 5)

julia> @btime divrem(length($dates), 7);
6.000 ns (0 allocations: 0 bytes)
```

A series that starts on a Sunday and ends on the Monday eight days later should return `3` as the result:

```julia
julia> dt = Date(2021, 10, 31)
julia> dates = range(dt, length=9, step=Day(1)); # Sunday -> Monday
julia> divrem(length(dates), 7)
(1, 2)
```

If we round up, that is, add one to the quotient because the remainder is non-zero, we get `2`, which is not what we want.

Sorry, I misunderstood the problem. Thanks for your explanation.

For a simple solution,

```julia
length(unique(firstdayofweek(x) for x in dates))
```

works, but the performance is atrocious compared to your proposal.

This returns 3 for `dates = range(Date(2022, 8, 21), Date(2022, 8, 29), step=Day(1))` (i.e. Sunday to Monday).

It also returns 58 for the original date range (`range(Date(2022, 12, 1), Date(2024, 1, 1), step=Day(1))`), but your proposed `count` function also returns 58 for that. (And I don’t understand the explanation that concludes "The correct result would be `57`").

In my opinion,

• the performance benefits compared to the naive solution,
• the fact that it’s easy to get this wrong, and
• the easy interface it provides

are all good reasons to have this in a library.

Other than Dates, TimeSeries.jl and TSx.jl are also potentially good places to have this in.

Edit: On second thought, some parts of this don’t make sense in the current form.

Counting all the weeks between the minimum and maximum of a list, when the data itself might not have datetime instants in any of those weeks, is pretty unintuitive. The first method, for `AbstractRange`s, makes sense and is useful. The others IMO are confusing and unnecessary. Wanting the number of weeks between the minimum and maximum in your list, whether or not dates in those weeks exist in your list at all, is not a common enough use case to deserve adding here. It’s easy enough for the user to pass `minimum(datelist):Day(1):maximum(datelist)` if they did want that for some reason.
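The distinction can be made concrete with a small sketch. It uses two made-up dates and compares the weeks that actually contain an observation with the weeks between the earliest and latest date; `firstdayofweek` is from `Dates`, and everything else is illustrative.

```julia
using Dates

# Two illustrative observations, both Mondays, nine weeks apart.
datelist = [Date(2022, 1, 3), Date(2022, 3, 7)]

# Weeks that contain at least one observation.
weeks_present = length(unique(firstdayofweek.(datelist)))

# Weeks between the earliest and latest observation, whether or not
# any observation falls in them.
lo, hi = extrema(datelist)
weeks_between = length(unique(firstdayofweek.(lo:Day(1):hi)))

println(weeks_present)  # 2
println(weeks_between)  # 10
```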

The similar functionality that I could find in other languages was `weeksBetween` in Joda-Time for Java and `diff` in Moment.js.
In both of those cases, the function accepts a start and end point and computes the number of weeks between them, so it’s similar to the `AbstractRange` method here. They only count "whole weeks", though, and it seems useful to have that as an option here too.

```julia
function count(::Type{Week}, dates::AbstractRange{Date}; partial = true)
    numweeks = 0
    firstmonday = findfirst(x -> dayofweek(x) == 1, dates)
    lastsunday = findlast(x -> dayofweek(x) == 7, dates)
    if partial
        # Count the incomplete weeks at either end of the range.
        numweeks += firstmonday == 1 ? 0 : 1
        numweeks += lastsunday == lastindex(dates) ? 0 : 1
    end
    # Whole Monday-to-Sunday weeks in between.
    numweeks += div((dates[lastsunday] - dates[firstmonday] + Day(1)).value, 7)
    numweeks
end
```

The `partial` option controls whether only whole weeks are counted. With `partial = false`, this would behave like the `weeksBetween` and `diff` functions mentioned above.
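As a rough illustration of the whole-weeks mode, the sketch below computes that count directly from the end points. It is an alternative formulation under an assumed, hypothetical name (`whole_weeks`), not the `count` method above; `tonext` and `toprev` are existing `Dates` functions.

```julia
using Dates

# Hypothetical helper: complete Monday-to-Sunday weeks inside [lo, hi].
function whole_weeks(lo::Date, hi::Date)
    firstmonday = tonext(lo, Dates.Monday; same=true)  # first Monday on or after lo
    lastsunday = toprev(hi, Dates.Sunday; same=true)   # last Sunday on or before hi
    # The span is negative when no complete week fits, hence the max.
    max(0, (Dates.value(lastsunday - firstmonday) + 1) ÷ 7)
end

println(whole_weeks(Date(2021, 10, 31), Date(2021, 11, 8)))  # Sunday to Monday, 9 days: 1
println(whole_weeks(Date(2021, 11, 2), Date(2021, 11, 4)))   # Tuesday to Thursday: 0
```

The same 9-day range counts as 3 weeks when partial weeks are included, which is the difference the `partial` keyword is meant to expose.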

Note that I changed the order of the arguments too. In the existing `Base.count` methods, it’s usually `count(condition, list)`, e.g. `count(pattern, string)`. So I placed the `Period` we’re searching for first, to be consistent with the other methods of `count`.

Yes, this is a typo and I guess I had this from an earlier incorrect solution (I tried multiple before coming up with this one). Will fix it in the original post. In fact, the real issue which I had faced with doing `unique([(year(x), ...])` was that the order of the resulting vector was incorrect because the `(year, week)` tuple would get repeated.

The reason to include the `count` function in the `Dates` module would be to avoid code duplication in other timeseries packages, because I feel this is pretty standard stuff. The R xts package has `nweeks` (plus `ndays`, `nmonths`, etc.) functions to count the number of weeks the data spans.

Specifically, for timeseries data one needs to count the number of weeks (or any other period), for example as a first-level check of the number of observations and the span of the data. This becomes especially useful when there are missing values in the data (for example, 4 `missing` values but 8 other so-called data holes). And, as you mentioned, counting weeks isn’t trivial, so the method handling `AbstractVector{Union{Missing, Date}}` becomes valuable. Also, `count(Week, skipmissing(dates))` adds to the usability.

`partial` is a good argument for use cases where whole weeks are required and I agree with keeping it `true` by default.

I also agree with the change in the order of arguments to match the existing methods.
