Save a "lazy" dataset that is calculated from other saved datasets

data
dataframes

#1

Given datasets A and B in columnar storage (say, DataFrames), and another dataset C that is calculated from A and B, e.g. C =

  • A + B
  • first derivative of A
  • smoothed version of B
  • etc.

How can I save and load the “virtual” dataset C without saving its actual values, just by defining the transform over the original datasets A and B? What is the common solution here: should I serialize a lambda function, or store source code text that is evaluated when the dataset is read? Is that possible in some storage formats?


#2

Not sure I understand your requirements, but something like

could help.


#3

I don’t think there’s a “standard” way to do this, because it’s a problem that can have a number of solutions with various tradeoffs. The two packages I would try are BSON.jl and JLD2.jl. If those don’t work, there are plenty of other storage formats you could try. Alternatively, just store a piece of code together with your non-lazy datasets that automatically loads them and creates the lazy data structures you need. What you choose depends entirely on your project’s requirements and your preferences.
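To make the “store the real data, rebuild the lazy part in code” idea concrete, here is a minimal sketch using JLD2. Everything beyond the JLD2 calls (`make` / `load_lazy_C`, the `C = A + B` transform) is hypothetical and just for illustration:

```julia
using JLD2, DataFrames

A = DataFrame(x = 1:3)
B = DataFrame(x = 4:6)

# Save only the concrete datasets; C is never materialized on disk.
jldsave("data.jld2"; A, B)

# A loader that ships with your project: it reads A and B and returns
# a thunk that computes C = A + B on demand.
function load_lazy_C(path)
    data = load(path)
    return () -> data["A"].x .+ data["B"].x
end

C = load_lazy_C("data.jld2")
C()  # values are computed only when you actually call this
```

The transform lives in your source tree, under version control, so it can be tested and changed like any other code.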


#4

Saving a piece of source code that will be executed on read is one solution, but for anything other than hobby projects, I would advise against it. It has several drawbacks:

  • It’s hard to test or quality-check code stored in this format
  • It’s not safe: someone could manipulate the data file to execute harmful code
  • If you later need to change the calculation of derivatives or smoothing (or any other serialized formula, e.g. because you found a bug), it’s not enough to change your code; you’d need to update every serialized dataset
  • Code might break or change (this happened between Julia 0.6 and 0.7; it should be rare now, but if you also depend on other packages, it can happen more easily)

Therefore, I would opt for adding support for the relevant operations (addition, derivatives, smoothing, etc.) in your code base, and then store only the type of operation (e.g. “smooth B”; you may want some additional parameters too, in a format like JSON or YAML). So when you read dataset C and parse “smooth B”, you call the “smooth” function in your code with column B as an argument.
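As a rough sketch of this descriptor approach (all names here — `smooth`, `OPS`, `materialize`, the JSON fields — are made up for illustration, and JSON3.jl is just one possible parser):

```julia
using JSON3, Statistics

# The operation lives in the code base, not in the data file, so fixing
# a bug here fixes every stored dataset at once.
smooth(col, window) =
    [mean(col[max(1, i - window + 1):i]) for i in eachindex(col)]

# Registry mapping operation names in the descriptor to real functions.
const OPS = Dict(
    "smooth" => (data, spec) -> smooth(data[spec.input], spec.window),
)

# This small descriptor is what gets saved alongside A and B
# instead of C's actual values:
descriptor = """{"op": "smooth", "input": "B", "window": 5}"""

# On read: parse the descriptor and dispatch to the registered function.
function materialize(json, data)
    spec = JSON3.read(json)
    OPS[spec.op](data, spec)
end

data = Dict("B" => collect(1.0:10.0))
C = materialize(descriptor, data)
```

Since the descriptor only names an operation, there is nothing to `eval`, which removes the arbitrary-code-execution risk from the previous list.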