julia> x = rand(1000000);
julia> K = 9000;
julia> result2 = @btime movingmax($x, $K);
21.238 ms (4 allocations: 7.58 MiB)
julia> result3 = @btime rolling_max($x, $K);
24.017 ms (5 allocations: 7.64 MiB)
Yes they are close even for bigger N and K, I think the trick my algorithm is doing is no looking backward to the vector in each iteration, cause my Deque is keeping more information in a tuple. We have also a bit different memory access pattern (last/first in Deque). I have no idea why I did it this way. It is a rewrite from my old code in C#, but I recall I spent a lot of time on tuning this thing.
UPDATE: This package implements similar algorithm using custom ring buffers, it’s 2x faster than our approach.