Counting weeks in a timeseries

Hello,

For one of the codes I was working on, I had to count the number of unique weeks in a timeseries. I ended up using the Dates.week() function and found that the result can be incorrect when the data spans more than one calendar year:

julia> dates = range(Date(2008, 8, 1), Date(2009, 1, 1), step=Day(1))
Date("2008-08-01"):Day(1):Date("2009-01-01")

julia> unique([(year(x), week(x)) for x in dates])
24-element Vector{Tuple{Int64, Int64}}:
 (2008, 31)
 (2008, 32)
 (2008, 33)
 (2008, 34)
 (2008, 35)
 (2008, 36)
 (2008, 37)
 (2008, 38)
 (2008, 39)
 (2008, 40)
 (2008, 41)
 (2008, 42)
 (2008, 43)
 (2008, 44)
 (2008, 45)
 (2008, 46)
 (2008, 47)
 (2008, 48)
 (2008, 49)
 (2008, 50)
 (2008, 51)
 (2008, 52)
 (2008, 1)   <-- week not present in data: Date(2008-12-29)
 (2009, 1)

julia> length(unique([(year(x), week(x)) for x in dates]))
24

Here, (2008, 1) is not correct because that week is not present in dates: (2008, 1) is the tuple that Date(2008, 1, 1) produces, and that date is not in the data at all. Counting unique weeks with the (year(x), week(x)) method therefore gives an incorrect result, because the last days of December 2008 fall in the same ISO week as the first days of 2009 but carry a different calendar year. The correct result would be 23.

This is, of course, because week() gives the ISO week number, but it is easy to make a mistake while using the week() function. I feel this is a use case for having a count(::AbstractVector{Date}, ::Type{Week}) method in the Dates module.
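To make the mismatch concrete, here is a minimal sketch that pairs week() with the ISO week-numbering year instead of the calendar year. The helper isoyear is a hypothetical name, not something Dates provides; it relies on the ISO 8601 rule that a week belongs to the year containing its Thursday.

```julia
using Dates

# Hypothetical helper (not part of Dates): the ISO week-numbering year,
# i.e. the calendar year that contains the Thursday of the date's week.
isoyear(d::Date) = year(firstdayofweek(d) + Day(3))

dates = Date(2008, 8, 1):Day(1):Date(2009, 1, 1)

naive = unique((year(d), week(d)) for d in dates)     # calendar year + ISO week
fixed = unique((isoyear(d), week(d)) for d in dates)  # ISO year + ISO week

(length(naive), length(fixed))  # (24, 23)
```

With the ISO year, 2008-12-29 maps to (2009, 1) like the rest of its week, so the spurious (2008, 1) entry disappears.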

I am counting the number of weeks starting from Monday and ending on Sunday. So, a series starting from Sunday and ending next Monday (9 days) will contain 3 weeks.

I have written sample methods for date vector and range while taking care of unsorted, irregular, and missing values:

function count(dates::AbstractRange{Date}, ::Type{Week})
    numweeks::Int = 0
    firstmonday = findfirst(x -> dayofweek(x) == 1, dates)
    lastsunday = findlast(x -> dayofweek(x) == 7, dates)
    numweeks += firstmonday == 1 ? 0 : 1
    numweeks += lastsunday == lastindex(dates) ? 0 : 1
    numweeks += div((dates[lastsunday] - dates[firstmonday] + Day(1)).value, 7)
    numweeks
end

function count(dates::AbstractVector{Date}, ::Type{Week}, sorted::Bool)
    mindate = sorted ? dates[1] : findmin(dates)[1]
    maxdate = sorted ? last(dates) : findmax(dates)[1]
    count(range(mindate, maxdate, step=Day(1)), Week)
end

function count(dates::T, ::Type{Week}) where {T<:Base.SkipMissing{Vector{Union{Missing, Date}}}}
    # collect() yields a Vector{Date}; pass sorted=false so the vector method above is reached
    count(collect(dates), Week, false)
end

function count(dates::T, ::Type{Week}) where {T<:AbstractVector{Union{Missing, Date}}}
    count(skipmissing(dates), Week)
end
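As a cross-check for the methods above, the same Monday-to-Sunday count can be sketched from the two end points alone: snap both to the Monday of their week and count the 7-day steps between them. The name nweeks is hypothetical, and this is only a sketch under the assumption that the first date is not after the last; it also covers ranges shorter than a week, where no Monday or Sunday may be present for findfirst/findlast to find.

```julia
using Dates

# Hypothetical helper: number of Monday-to-Sunday weeks touched by the
# closed interval [a, b]. Assumes a <= b.
function nweeks(a::Date, b::Date)
    days = Dates.value(firstdayofweek(b) - firstdayofweek(a))
    return days ÷ 7 + 1
end

# Unsorted vectors, with or without missing values, need one O(n) pass
# to find the end points.
nweeks(dates) = nweeks(extrema(skipmissing(dates))...)

nweeks(Date(2021, 10, 31), Date(2021, 11, 8))  # Sunday -> next Monday: 3
nweeks(Date(2008, 8, 1), Date(2009, 1, 1))     # 23
nweeks(Date(2022, 8, 23), Date(2022, 8, 25))   # Tuesday -> Thursday: 1
```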

Performance:

For sorted data:

julia> dates = range(Date(1900, 1, 1), Date(2024, 12, 1), step=Day(1))
Date("1900-01-01"):Day(1):Date("2024-12-01")

julia> @btime count(dates, Week)
  182.734 ns (1 allocation: 16 bytes)
6518

julia> dd = collect(dates); @btime count(dd, Week, true)
  191.824 ns (1 allocation: 16 bytes)
6518

For unsorted data:

julia> dates_unsorted = sample(dates, length(dates), replace=true);

julia> @btime count(dates_unsorted, Week, false)
  46.837 μs (1 allocation: 16 bytes)
6518

The algorithm is O(1) for sorted data. For unsorted data it is O(n), since the minimum and maximum have to be found first, and that is the best we can do.

Are the ease of use and this performance good enough for this to be included in the Dates module?

This count() method can be extended to counting months, quarters, or years as well. Here, I am making a case to introduce


For the date range case, why not count weeks as follows:

julia> divrem(length(dates), 7)
(56, 5)

julia> @btime divrem(length($dates), 7);
  6.000 ns (0 allocations: 0 bytes)

A series that starts on a Sunday and ends on the Monday of the following week (9 days) should return 3 as the result:

julia> dt = Date(2021, 10, 31)
julia> dates = range(dt, length=9, step=Day(1)); # Sunday -> Monday
julia> divrem(length(dates), 7)
(1, 2)

If we add one to the quotient for the nonzero remainder, we get 2, which is not what we want.
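The underlying reason is that the count depends on how the range lines up with the Monday-to-Sunday grid, not only on its length. A small sketch, where weeks_touched is a hypothetical helper that counts distinct weeks by the Monday each date belongs to:

```julia
using Dates

# Count distinct Monday-to-Sunday weeks by the Monday each date belongs to.
weeks_touched(r) = length(unique(firstdayofweek(d) for d in r))

mon_start = range(Date(2021, 11, 1), length=9, step=Day(1))   # Monday -> Tuesday
sun_start = range(Date(2021, 10, 31), length=9, step=Day(1))  # Sunday -> Monday

(weeks_touched(mon_start), weeks_touched(sun_start))  # (2, 3)
```

Both ranges have 9 days, yet they touch a different number of weeks, so no formula based on length(dates) alone can be right.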

Sorry, I misunderstood the problem. Thanks for your explanation.

For a simple solution,

length(unique(firstdayofweek(x) for x in dates))

works, but the performance is atrocious compared to your proposal.

This returns 3 for dates = range(Date(2022, 8, 21), Date(2022, 8, 29), step=Day(1)) (i.e. Sunday to Monday).

It also returns 58 for the original date range (range(Date(2022, 12, 1), Date(2024, 1, 1), step=Day(1))), but your proposed count function also returns 58 for that. (And I don’t understand the explanation that concludes "The correct result would be 57".)

In my opinion,

  • the performance benefits compared to the naive solution,
  • the fact that it’s easy to get this wrong
  • the easy interface it provides

are all good reasons to have this in a library.

Other than Dates, TimeSeries.jl and TSx.jl are also potentially good places for this.


Edit: On second thought, some parts of this don’t make sense in the current form.

Counting all the weeks between min and max in a list, when the data itself might not have datetime instants in any of those weeks, is pretty unintuitive. The first method, for AbstractRanges, makes sense and is useful. The others IMO are confusing and unnecessary. Wanting the number of weeks between the min and max in your list - whether or not dates in those weeks exist in your list at all - is not a common enough use case to deserve adding here. It’s easy enough for the user to pass minimum(datelist):Day(1):maximum(datelist) if they did want that for some reason.
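A small sketch of the difference, using a made-up datelist with a gap of several weeks (the variable names are only for illustration):

```julia
using Dates

# Three observations with a gap of several weeks between them.
datelist = [Date(2022, 1, 3), Date(2022, 1, 4), Date(2022, 3, 1)]

# Weeks that actually contain at least one observation.
weeks_with_data = length(unique(firstdayofweek.(datelist)))

# Weeks spanned between the earliest and latest observation.
span = minimum(datelist):Day(1):maximum(datelist)
weeks_in_span = length(unique(firstdayofweek.(span)))

(weeks_with_data, weeks_in_span)  # (2, 9)
```

Only two weeks contain data, while the min-to-max span touches nine, which is the number the vector methods would report.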

The functionality similar to this that I could find in other languages was: weeksBetween in Joda-Time in Java, and diff in Moment.js.
In both of those cases, the function accepts a start and end point, and computes the number of weeks between them. So it’s similar to the AbstractRange method here. They only count whole weeks, though, and it seems useful to have that as an option here too.

function count(::Type{Week}, dates::AbstractRange{Date}; partial = true)
    numweeks = 0
    firstmonday = findfirst(x -> dayofweek(x) == 1, dates)
    lastsunday = findlast(x -> dayofweek(x) == 7, dates)
    if partial
        numweeks += firstmonday == 1 ? 0 : 1
        numweeks += lastsunday == lastindex(dates) ? 0 : 1
    end
    numweeks += div((dates[lastsunday] - dates[firstmonday] + Day(1)).value, 7)
    numweeks
end

The partial option controls whether only whole weeks are counted. With partial = false, this would behave like the weeksBetween and diff functions mentioned above.
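For comparison, the whole-weeks-only count can also be sketched from the end points alone. The name wholeweeks is hypothetical, and this is an illustration of calendar-aligned whole weeks rather than a reproduction of how the other libraries compute their result:

```julia
using Dates

# Hypothetical helper: number of complete Monday-to-Sunday weeks inside
# the closed interval [a, b].
function wholeweeks(a::Date, b::Date)
    firstmonday = firstdayofweek(a + Day(6))  # first Monday on or after a
    lastsunday = lastdayofweek(b - Day(6))    # last Sunday on or before b
    days = Dates.value(lastsunday - firstmonday) + 1
    return max(0, days ÷ 7)
end

wholeweeks(Date(2021, 10, 31), Date(2021, 11, 8))  # Sunday -> next Monday: 1
wholeweeks(Date(2022, 8, 23), Date(2022, 8, 25))   # Tuesday -> Thursday: 0
```

The max(0, ...) guard covers intervals that contain no complete week at all.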

Note that I changed the order of the arguments too. In the existing Base.count methods, it’s usually count(condition, list), e.g. count(pattern, string). So I placed the Period we’re counting first, to be consistent with the other methods of count.
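For reference, these are the two existing Base.count shapes that the argument order follows (the string-pattern form assumes a reasonably recent Julia version):

```julia
# Predicate first, collection second.
n_odd = count(isodd, 1:5)      # 3

# Pattern first, string second.
n_a = count("a", "banana")     # 3
```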

Yes, this is a typo and I guess I had this from an earlier incorrect solution (I tried multiple before coming up with this one). Will fix it in the original post. In fact, the real issue which I had faced with doing unique([(year(x), ...]) was that the order of the resulting vector was incorrect because the (year, week) tuple would get repeated.

The reason to include the count function in the Dates module would be to avoid code duplication in other timeseries packages, because I feel this is pretty standard stuff. The R xts package has nweeks (plus ndays, nmonths, etc.) functions to count the number of weeks the data spans.

Specifically, for timeseries data one often needs to count the number of weeks (or any other period), for example as a first-level check of the number of observations against the span of the data. This becomes especially useful when there are missing values in the data (say, 4 missing values but 8 other gaps, the so-called data holes). And, as you mentioned, counting weeks isn’t trivial, so the method handling AbstractVector{Union{Missing, Date}} becomes valuable. Also, count(Week, skipmissing(dates)) adds to the usability.

partial is a good argument for use cases where whole weeks are required and I agree with keeping it true by default.

I also agree with the change in the order of arguments to match the existing methods.
