Wow, thanks everyone for engaging with this line of questioning!
Let me try to respond to all of the interesting thoughts folks have thrown out.
@rafael.guerra
Do not know about a single function that does that. For a moving average you may use a 2-step solution: (i) interpolate the series to a constant time step and then (ii) use convolution with the required window lenght.
For some aggregations (like counting the number of trades) that I want to compute, interpolation is not a viable solution. However, on the other hand it might be interesting to interpolate to allow a fixed structure to allow for things like learning aggregations as a convnet. I’ll tuck this idea away!
Stock market is not my cup of tea, but a general question if you will: what happens if within a “bucket” with 2 h span most trade takes place in the first 30 min with sell price ~USD100 and that during the next 1h30 fewer trades are registered selling at ~USD50? Should the average price over the 2h-bucket be weighted by the number of transactions or by the time duration the price stayed at a given level?
Bucketing by time will naturally lose a lot of information like this. One approach that can help address this particular situation is to aggregate first to very granular bars (if you’re not familiar with the lingo, “bars” is a fairly standard finance term for aggregations of trade information across time, total dollar value, etc.). From granular bars, you can compute time-weighted statistics like change in dollar-weighted price over time. Thus if you aggregate to minute bars and then aggregate those minute bars into hour bars, you can get hour bars with the kinds of indicator of sudden price change that you’re interested in.
@Skoffer
You can write iterator, which returns start_id:end_id
of the window.
So true! Once I tried actually starting to write an iterator, returning a range rather than a sequence of cartesian indices occurred to me as well. The tricky part is making sure to return an empty range for time periods with no rows. Luckily something like array[10:9]
returns an empty array!
@andrewdinhobl
I had to do this recently, and after some work got it down to a few lines in DataFrames. It assumes that you have a column called timestamp, and a period from the Dates
packages, e.g. Dates.Minute(5)
[code]
Wow, as a newcomer to DataFrames.jl
this totally blows me away!
It won’t give you buckets without an observation though.
Sadly, this is a bit of a dealbreaker for my particular use-case, since I want to be able to aggregate to things like “total dollar value traded during this period was zero” and also to make sure all time bars are evenly spaced.
Let me share my current solution
I implemented an iterator which will iterate over any array of timestamps and yield UnitRange
objects that match the integer indices of the start and end of every time bucket. If a time bucket is empty, an empty range like 10:9
will be returned.
struct RollingPeriodWindow
timestamps::AbstractArray{Dates.AbstractDateTime}
period::Dates.Period
end
struct RollingPeriodWindowState
index::Int
start_timestamp::Union{Nothing, Dates.AbstractDateTime}
end
RollingPeriodWindowState() = RollingPeriodWindowState(1, nothing)
function Base.iterate(
rolling::RollingPeriodWindow,
state::RollingPeriodWindowState=RollingPeriodWindowState()
)
max_index = length(rolling.timestamps)
if state.index > max_index
return nothing
end
index = state.index
start_index = index
start_timestamp = state.start_timestamp === nothing ? floor(rolling.timestamps[start_index], rolling.period) : state.start_timestamp
end_timestamp = start_timestamp + rolling.period
println(end_timestamp)
while index <= max_index && rolling.timestamps[index] < end_timestamp
index += 1
end
end_index = index - 1
next_state = RollingPeriodWindowState(index, end_timestamp)
return (start_index:end_index, next_state)
end
Base.IteratorSize(IterType::RollingPeriodWindow) = Base.SizeUnknown()
Base.IteratorEltype(IterType::RollingPeriodWindow) = Base.HasEltype()
Base.eltype(IterType::RollingPeriodWindow) = UnitRange{Int}
To use this approach, we can simply use a for-loop. I have a testing DataFrame
object called tmp
in this example.
for x in RollingPeriodWindow(tmp.timestamp[1:100], Dates.Second(1))
# supports views!
slice = view(tmp.timestamp, x)
# this prints something like `11:13`
println(x)
# this prints something like
# `[DateTime("2018-01-03T14:30:06.115"), DateTime("2018-01-03T14:30:06.116"), DateTime("2018-01-03T14:30:06.220")]`
println(slice)
end
Making this code open source and accessible to others?
I’d love to share this functionality with others, since it seems useful (maybe @andrewdinhobl will find it useful when skipping empty buckets is not appropriate). However, I am not sure how best to do this.
Digging into TimeSeries.jl
, it seems that its approach to working with DataFrame
objects with heterogeneous column types is to treat the data as a 2d array of type Any
. This strikes me as probably bad for performance, and in any case I don’t really want to be forced to transform my DataFrame
objects into new TimeArray
objects from a code clarity point of view.
I was thinking perhaps it’s best to create a standalone tiny library like RollingPeriodWindows.jl
, and then from there maybe submit a PR to TimeSeries.jl
to let it use this library to support period-based windows in addition to integer-based windows, as per @Skoffer’s comment:
Theoretically, TimeSeries.jl
can add such a method by dispatching on w::Period
I’m quite new to the culture of Julia, though, so advice on whether this is a good approach or not would be much appreciated.