Save a "lazy" dataset that is calculated from other saved datasets

data
dataframes

#1

Given datasets A and B in columnar storage (say, DataFrames), and another dataset C that is calculated from A and B, e.g. C =

  • A + B
  • first derivative of A
  • smoothed version of B
  • etc.

How can I save and load the “virtual” dataset C without saving its actual values, just by defining the transform over the original datasets A and B? What is the common solution here: should I serialize a lambda function, or store source code text that is evaluated when the dataset is read? Is that possible in some storage formats?


#2

Not sure I understand your requirements, but something like

could help.


#3

I don’t think there’s a “standard” way to do this, because it’s a problem that can have a number of solutions with various tradeoffs. The two packages I would try are BSON.jl and JLD2.jl. If those don’t work, there are plenty of other storage formats you could try. Alternatively, just store a piece of code together with your non-lazy datasets that automatically loads them and creates the lazy data structures you need. What you choose depends entirely on your project’s requirements and your preferences.
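To make the “store the real data, rebuild the lazy part in code” idea concrete, here is a minimal sketch using JLD2. Everything beyond the JLD2 calls (`make` / `load_lazy_C`, the `C = A + B` transform) is hypothetical and just for illustration:

```julia
using JLD2, DataFrames

A = DataFrame(x = 1:3)
B = DataFrame(x = 4:6)

# Save only the concrete datasets; C is never materialized on disk.
jldsave("data.jld2"; A, B)

# A loader that ships with your project: it reads A and B and returns
# a thunk that computes C = A + B on demand.
function load_lazy_C(path)
    data = load(path)
    return () -> data["A"].x .+ data["B"].x
end

C = load_lazy_C("data.jld2")
C()  # values are computed only when you actually call this
```

The transform lives in your source tree, under version control, so it can be tested and changed like any other code.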


#4

Saving a piece of source code that will be executed on read is one solution, but for anything other than hobby projects, I would advise against it. It has several drawbacks:

  • It’s hard to test or quality-check code stored in this format
  • It’s not safe: someone could manipulate the data file to execute harmful code
  • If you later need to change the calculation of derivatives or smoothing (or any other serialized formula, e.g. because you found a bug), it’s not enough to change your code; you’d need to update every serialized dataset
  • Code might break or change (this happened between Julia 0.6 and 0.7; it should be rare now, but if you also depend on other packages, it can happen more easily)

Therefore, I would opt for adding support for the relevant operations (addition, derivatives, smoothing, etc.) in your code base, and then store only the type of operation (e.g. “smooth B”; you may want some additional parameters too, in a format like JSON or YAML). So when you read dataset C and parse “smooth B”, you call the “smooth” function in your code with column B as an argument.
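As a rough sketch of this descriptor approach (all names here — `smooth`, `OPS`, `materialize`, the JSON fields — are made up for illustration, and JSON3.jl is just one possible parser):

```julia
using JSON3, Statistics

# The operation lives in the code base, not in the data file, so fixing
# a bug here fixes every stored dataset at once.
smooth(col, window) =
    [mean(col[max(1, i - window + 1):i]) for i in eachindex(col)]

# Registry mapping operation names in the descriptor to real functions.
const OPS = Dict(
    "smooth" => (data, spec) -> smooth(data[spec.input], spec.window),
)

# This small descriptor is what gets saved alongside A and B
# instead of C's actual values:
descriptor = """{"op": "smooth", "input": "B", "window": 5}"""

# On read: parse the descriptor and dispatch to the registered function.
function materialize(json, data)
    spec = JSON3.read(json)
    OPS[spec.op](data, spec)
end

data = Dict("B" => collect(1.0:10.0))
C = materialize(descriptor, data)
```

Since the descriptor only names an operation, there is nothing to `eval`, which removes the arbitrary-code-execution risk from the previous list.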