Time-period-based time series moving windows in Julia?

luke · April 21, 2021, 4:29pm

Looking at windowing in TimeSeries.jl, which seems to be the main package for time series data in Julia, I can’t find anything equivalent to the native time-period-based windowing offered in Pandas, which supports windowing on time periods like “2h” (two hours) even if the number of rows per period varies.

Example use-case

timestamp	symbol	price	trade_id	size	sale_condition_flags
	DateTime	String?	Int64?	Int64?	Int64?	Int64?
1	2018-01-03T14:30:00.098	SPY 	2689650	4664652	40	160
2	2018-01-03T14:30:00.100	SPY 	2689650	4666691	5	160
3	2018-01-03T14:30:00.459	SPY 	2689600	4937338	52	160
4	2018-01-03T14:30:00.672	SPY 	2689700	5073747	77	160
5	2018-01-03T14:30:01.723	SPY 	2690000	5541285	23	160
6	2018-01-03T14:30:01.911	SPY 	2690300	5588358	100	0

The above stock data is an excerpt of the trades of SPY, the ETF (exchange-traded fund) which tracks the S&P 500 index. I pulled it using InvestorsExchange.jl, if anyone is curious. There aren’t a fixed number of stock trades per hour, but it’s useful to compute things like the average hourly price nonetheless. In Pandas, this is as easy as creating a time-based rolling window, but I’m struggling to figure out how to do it in Julia.

My questions

Am I missing an existing package that does this already?
If not, are there some best practices for implementing a decently efficient time-series windowing operation in Julia? I’m thinking of trying to write something lightweight that maps a vector of timestamps and a period to a sequence of cartesian indexes, and then using those cartesian indexes on any generic array/table/dataframe object.

Skoffer · April 21, 2021, 6:17pm

I think there is no need for cartesian indices or something like that. You can write iterator, which returns start_id:end_id of the window. And update step is just moving start_id to the next value and move end_id zero or more steps while the length of the window is less or equal necessary time period. It should be not hard to implement and then you can use this range to extract and transform necessary data.

Theoretically, TimeSeries.jl can add such a method by dispatching on w::Period

andrewdinhobl · April 21, 2021, 10:31pm

I don’t know of a package that natively takes in “2h” as a resample time. I had to do this recently, and after some work got it down to a few lines in DataFrames. It assumes that you have a column called timestamp, and a period from the Dates packages, e.g. Dates.Minute(5). It won’t give you buckets without an observation though.

function bucket(df::DataFrame, stepsize::Period; how=x->mean(skipmissing(x)))
    # transform df by adding new resample bucket column by rounding timestamps in original df to nearest lower bucket
    df2 = transform(df, :timestamp => ByRow(x->round(x,stepsize, RoundDown)), renamecols=false)

    # aggregate by timestamp bucket
    gdf = groupby(df2, :timestamp)
    z = combine(gdf, names(df, Not(:timestamp)) .=> how, :timestamp .=> first, renamecols=false)

    return z
end

I really like how composable the DataFrames pipelines are and how much thought has been put in to making them work fast. That being said, there was a lot more overhead to get this to work than in Pandas, so it would be neat if something similar was added to an existing package. I believe rolling average is supported somewhere, just not bucketing.

luke · April 22, 2021, 12:31am

Wow, thanks everyone for engaging with this line of questioning!

Let me try to respond to all of the interesting thoughts folks have thrown out.

@rafael.guerra

Do not know about a single function that does that. For a moving average you may use a 2-step solution: (i) interpolate the series to a constant time step and then (ii) use convolution with the required window lenght.

For some aggregations (like counting the number of trades) that I want to compute, interpolation is not a viable solution. However, on the other hand it might be interesting to interpolate to allow a fixed structure to allow for things like learning aggregations as a convnet. I’ll tuck this idea away!

Stock market is not my cup of tea, but a general question if you will: what happens if within a “bucket” with 2 h span most trade takes place in the first 30 min with sell price ~USD100 and that during the next 1h30 fewer trades are registered selling at ~USD50? Should the average price over the 2h-bucket be weighted by the number of transactions or by the time duration the price stayed at a given level?

Bucketing by time will naturally lose a lot of information like this. One approach that can help address this particular situation is to aggregate first to very granular bars (if you’re not familiar with the lingo, “bars” is a fairly standard finance term for aggregations of trade information across time, total dollar value, etc.). From granular bars, you can compute time-weighted statistics like change in dollar-weighted price over time. Thus if you aggregate to minute bars and then aggregate those minute bars into hour bars, you can get hour bars with the kinds of indicator of sudden price change that you’re interested in.

@Skoffer

You can write iterator, which returns start_id:end_id of the window.

So true! Once I tried actually starting to write an iterator, returning a range rather than a sequence of cartesian indices occurred to me as well. The tricky part is making sure to return an empty range for time periods with no rows. Luckily something like array[10:9] returns an empty array!

@andrewdinhobl

I had to do this recently, and after some work got it down to a few lines in DataFrames. It assumes that you have a column called timestamp, and a period from the Dates packages, e.g. Dates.Minute(5) [code]

Wow, as a newcomer to DataFrames.jl this totally blows me away!

It won’t give you buckets without an observation though.

Sadly, this is a bit of a dealbreaker for my particular use-case, since I want to be able to aggregate to things like “total dollar value traded during this period was zero” and also to make sure all time bars are evenly spaced.

Let me share my current solution

I implemented an iterator which will iterate over any array of timestamps and yield UnitRange objects that match the integer indices of the start and end of every time bucket. If a time bucket is empty, an empty range like 10:9 will be returned.

struct RollingPeriodWindow
    timestamps::AbstractArray{Dates.AbstractDateTime}
    period::Dates.Period
end

struct RollingPeriodWindowState
    index::Int
    start_timestamp::Union{Nothing, Dates.AbstractDateTime}
end
RollingPeriodWindowState() = RollingPeriodWindowState(1, nothing)

function Base.iterate(
    rolling::RollingPeriodWindow,
    state::RollingPeriodWindowState=RollingPeriodWindowState()
)
    max_index = length(rolling.timestamps)
    if state.index > max_index
        return nothing
    end
    index = state.index
    start_index = index
    start_timestamp = state.start_timestamp === nothing ? floor(rolling.timestamps[start_index], rolling.period) : state.start_timestamp
    end_timestamp = start_timestamp + rolling.period
    println(end_timestamp)
    while index <= max_index && rolling.timestamps[index] < end_timestamp
        index += 1        
    end
    end_index = index - 1
    next_state = RollingPeriodWindowState(index, end_timestamp)
    return (start_index:end_index, next_state)
end

Base.IteratorSize(IterType::RollingPeriodWindow) = Base.SizeUnknown()
Base.IteratorEltype(IterType::RollingPeriodWindow) = Base.HasEltype()
Base.eltype(IterType::RollingPeriodWindow) = UnitRange{Int}

To use this approach, we can simply use a for-loop. I have a testing DataFrame object called tmp in this example.

for x in RollingPeriodWindow(tmp.timestamp[1:100], Dates.Second(1))
    # supports views!
    slice = view(tmp.timestamp, x)
    # this prints something like `11:13`
    println(x)
    # this prints something like
    #  `[DateTime("2018-01-03T14:30:06.115"), DateTime("2018-01-03T14:30:06.116"), DateTime("2018-01-03T14:30:06.220")]`
    println(slice)
end

Making this code open source and accessible to others?

I’d love to share this functionality with others, since it seems useful (maybe @andrewdinhobl will find it useful when skipping empty buckets is not appropriate). However, I am not sure how best to do this.

Digging into TimeSeries.jl, it seems that its approach to working with DataFrame objects with heterogeneous column types is to treat the data as a 2d array of type Any. This strikes me as probably bad for performance, and in any case I don’t really want to be forced to transform my DataFrame objects into new TimeArray objects from a code clarity point of view.

I was thinking perhaps it’s best to create a standalone tiny library like RollingPeriodWindows.jl, and then from there maybe submit a PR to TimeSeries.jl to let it use this library to support period-based windows in addition to integer-based windows, as per @Skoffer’s comment:

Theoretically, TimeSeries.jl can add such a method by dispatching on w::Period

I’m quite new to the culture of Julia, though, so advice on whether this is a good approach or not would be much appreciated.

luke · April 22, 2021, 4:46am

For now, the code lives in RollingTimeWindows.jl. Cheers folks!

mks · April 22, 2021, 7:21am

Thanks, vey helpful!