Common data utilities for time series

I frequently find myself having to clean up and process raw ‘real’ data. I am really not a data person, but I think this is called data plumbing. And to be honest, I don’t have much fun doing it :)

Typical operations include:

  • Finding gaps in the data
  • Imputation
  • ffill / bfill / Interpolation
  • Outlier detection and correction
  • Resampling
  • Identifying usable sections in noisy data
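To make two of these operations concrete, here is a minimal sketch of gap finding and forward fill on plain vectors, using only Base and the Dates standard library. The names `find_gaps` and `ffill!` are hypothetical and do not come from any of the packages mentioned in this thread; the sketch assumes the timestamps are sorted.

```julia
using Dates

# Return the index pairs (i, i + 1) where consecutive timestamps are
# further apart than `maxstep`. Assumes `t` is sorted.
function find_gaps(t::AbstractVector, maxstep)
    return [(i, i + 1) for i in 1:length(t)-1 if t[i+1] - t[i] > maxstep]
end

# Forward fill in place: replace each `missing` with the last
# non-missing value seen so far. Leading missings stay missing.
function ffill!(x::AbstractVector)
    prev = missing
    for i in eachindex(x)
        if ismissing(x[i])
            x[i] = prev
        else
            prev = x[i]
        end
    end
    return x
end

t = DateTime(2024, 1, 1) .+ Minute.([0, 1, 2, 7, 8])
x = [1.0, missing, 3.0, missing, 5.0]

find_gaps(t, Minute(1))   # one gap, between the 3rd and 4th samples
ffill!(x)                 # x is now [1.0, 1.0, 3.0, 3.0, 5.0]
```

Both functions only need a time vector and a data vector, which is the kind of minimal contract described in this post.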

Some low-level functions for these operations are provided by packages like DSP.jl, DataFrames.jl, and Interpolations.jl. But it takes some work to hook them up to your specific time-aware data source and format, and as a result I have ended up writing multiple implementations of resample over the past few years.

So ideally what I want is a very lightweight package that works directly on Tables.jl data sources, preferably lazily or in place with minimal copying. All you would need to know is: what is my data axis? What is my time axis? Which operations do I want to apply? And then just do it. That way we don’t have to maintain our own time-aware types or index columns, like TSFrames.jl and TimeSeries.jl do. It should also be trivial to stack multiple operations (using, for example, Chain.jl), to operate on multiple columns with multithreading, and to add new processing functions, either directly or through extensions.
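A rough sketch of what that interface could look like is shown here. Everything in it is an assumption for illustration: `clip` and `resample_mean` are hypothetical names, the table is a NamedTuple of vectors (which Tables.jl accepts as a column table), and columns are read with `getproperty` rather than through the Tables.jl accessors, so no package beyond the standard library is needed. It copies data and is neither lazy nor in place.

```julia
using Dates, Statistics

# Hypothetical helper: clamp column `datacol` to [lo, hi] as a crude
# outlier correction. Returns a new column table; other columns are kept.
function clip(tbl; datacol::Symbol, lo, hi)
    x = clamp.(getproperty(tbl, datacol), lo, hi)
    return merge(tbl, NamedTuple{(datacol,)}((x,)))
end

# Hypothetical helper: downsample by averaging all samples of `datacol`
# whose timestamp in `timecol` falls into the same `period`-wide bin.
function resample_mean(tbl; timecol::Symbol, datacol::Symbol, period::Period)
    t = getproperty(tbl, timecol)
    x = getproperty(tbl, datacol)
    bins = [floor(ti, period) for ti in t]
    ubins = unique(bins)                      # order of first appearance
    means = [mean(x[bins .== b]) for b in ubins]
    return NamedTuple{(timecol, datacol)}((ubins, means))
end

raw = (time = DateTime(2024, 1, 1) .+ Second.(0:30:150),
       value = [1.0, 2.0, 100.0, 4.0, 5.0, 6.0])

# The caller only names the time axis, the data axis, and the operations.
clipped = clip(raw; datacol = :value, lo = 0.0, hi = 10.0)
result = resample_mean(clipped; timecol = :time, datacol = :value,
                       period = Minute(1))
```

Because each step takes a column table and returns one, the steps could be stacked with `|>` or a chaining macro.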

Does such a package exist, and did I miss it? Should it exist? Any comments or ideas are welcome.

I’m almost sure you can do all that with GMT.jl. Some recipes are already implemented at a high level; for others you would probably need to dig into the docs. See these examples

Thanks! That definitely has some useful features. I didn’t expect to find that in a package for geographic data. My idea was to have a package that is focused on tabular and time-series data.

GMT is Generic. Both geographic and Cartesian.

Give Impute.jl a try. I was quite impressed.

TableTransforms.jl has plenty of transforms for Tables.jl.

Yes, that is a very useful package. It doesn’t provide the data plumbing methods I’m looking for, but it could easily be composed with such methods.

As a side comment, notice that we’ve built TableTransforms.jl on top of the TransformsBase.jl API, which is implemented in various other downstream packages like GeoStatsTransforms.jl for GeoTables.jl. If you come up with new transforms for time series, try to implement the same API.

My impression is that many time series transforms could simply be added to TableTransforms.jl as they only assume order of rows. If the transform relies on actual time differences (e.g. lags), then it makes sense to store them in a separate package.
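The distinction can be sketched with two lag functions, using only Base and the Dates standard library. Both `rowlag` and `timelag` are hypothetical names and not part of TableTransforms.jl or any other package mentioned here. The time-based version assumes sorted timestamps without duplicates.

```julia
using Dates

# Row-based lag: only assumes that the rows are ordered.
rowlag(x::AbstractVector, k::Integer) =
    [i > k ? x[i-k] : missing for i in eachindex(x)]

# Time-based lag: the value observed exactly `delta` earlier, or
# `missing` if there is no sample at that time.
function timelag(t::AbstractVector, x::AbstractVector, delta)
    return map(t) do ti
        target = ti - delta
        j = searchsortedfirst(t, target)
        (j <= length(t) && t[j] == target) ? x[j] : missing
    end
end

t = DateTime(2024, 1, 1) .+ Minute.([0, 1, 2, 5, 6])
x = [10, 11, 12, 15, 16]

rowlag(x, 1)               # [missing, 10, 11, 12, 15]
timelag(t, x, Minute(1))   # [missing, 10, 11, missing, 15]
```

On regularly sampled data the two agree. They differ at the gap between minute 2 and minute 5, where only the time-based lag reports that no sample exists one minute earlier.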

I encounter the names data wrangling and data munging even more frequently. They could help when searching for more discussions.

I work with time series all the time and I am a big fan of the TimeSeries.jl package. It is very basic and lacks most (if not all) of the functions that you need, but it has an interface for adding your own functions. For my needs those were all one-liners.

If you find yourself doing the same thing over and over again and feel that you are capable in Julia, consider trying it out and maybe pushing your changes to add to the package.

I tried it a while back, and I’m just not a fan, sorry. I don’t want to work with a special time-aware type, a time index column, or conversions between TimeArray and DataFrame. These should be simple operations directly on Tables.jl sources.

I tried TimeSeries.jl in the past and liked it. The API isn’t ideal though, and I wish they had transforms like we did with GeoTables.jl.

Depending on the operations, it can be really painful to treat things as simple Tables.jl.

Fair enough.

I rely heavily on exactly these to generate helper matrices for my program, before I build the normal arrays used for calculation.

If the tables you work with have a fixed structure, you could define structs for those column sets to use with StructArrays.jl. Then you can define func(tbl::StructVector{<:MyTableType}). That puts your table structures in the type tree, so you get good code reuse. Of course, it won’t work if you have a large number of columns or a large number of column subsets.

I have a couple of unregistered packages; I’m not promoting them, just including them for illustration:

But if you truly want time series ops to be agnostic to any kind of table, I haven’t found that. Mwbase has some basic moving windows stuff, but it’s not intended for imputation.

TimeBars.jl mainly defines some abstract types / conventions I use. I used to have an impute method there (relying on Impute.jl), but I removed it as I don’t use it and I want the package to be minimal.

Maybe I’m missing something here, but why would you want to dispatch on the table type? If your methods are compatible with the Tables.jl interface (i.e. consuming Tables.jl-compatible sources), you won’t have to care about the type of your table at all.

In general, you don’t need to. There were two reasons for my needs:

  • I was fine with using StructArrays.jl for almost everything. StructVector is basically a columntable, so you get the Tables.jl interface for free.
  • I wanted methods that dispatched based on specific subsets of columns or the type of index. This kind of thing wouldn’t make sense for everyone, but if you know a lot about your tables in advance I’ve found this method useful.

Open to suggestions for a better way, but compared to the other time series packages this works better for my use cases. It’s not often that I have a large number of columns or unpredictable/changing sets of columns in table work.

EDIT: I wrote this a while ago to explain why I came to this way of doing things (kind of out of date though): RATIONALE.md in the kpa28-git/TimeBars.jl repository on GitHub (master branch).
The README has some info about that as well.