Why is NPZ so much faster than JLD in this case?

Hello! First post here so sorry in advance :')

I am running numerical simulations on a large dataset over a wide range of parameters, and I was trying to save the results with JLD's save function. The array is around 12 GB when written to disk, but JLD took more than 2 hours to save it (it may have taken even longer — I eventually just canceled the job), whereas NPZ's npzwrite saved it in a couple of minutes.

I need missing values in the array because some (in fact most) of the simulations simply don't work, and I also use NaN to mark a different, specific point of failure. Is the issue just that the array is of type Union{Missing, Float64} and contains many NaN values? If so, I am shocked that JLD handles such a fairly simple array this poorly.
Thanks in advance! :slight_smile:

I don’t know anything about JLD vs NPZ, but your array should not be of type Union{Missing, Float64}. A NaN is a valid floating-point number, so your eltype should just be Float64.
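To illustrate the point — NaN is an ordinary IEEE 754 value, so it lives happily in a plain Float64 array with no Union type needed:

```julia
# NaN is just a Float64 bit pattern, so a concrete Float64 array can hold it:
x = [1.0, NaN, 2.5]

eltype(x)    # Float64 — no Union{Missing, Float64} required
isnan(x[2])  # true
```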


I do need both NaN and missing values, separately. A missing value means the simulation did not work at all, while a NaN means a specific function (run after the simulation) failed. I will eventually go back and try to fix the cases where NaNs arise, but I don’t want to touch the missing cases at all. So yes, I realize NaN counts as a Float64, but I also need the missing values to be there.
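If the Union eltype turns out to be the bottleneck, one way to keep both kinds of failure while still writing concrete arrays (which both NPZ and JLD handle quickly) is to split the data into a plain Float64 array plus a Bool mask before saving. A sketch, with `results` standing in for your actual array:

```julia
# Sketch: split a Union{Missing, Float64} array into a concrete Float64 array
# plus a Bool mask, so both pieces can be saved as plain arrays.
results = Union{Missing, Float64}[1.0, missing, NaN, 2.5]  # stand-in data

mask   = ismissing.(results)        # true where the simulation never ran
values = coalesce.(results, NaN)    # fill missings with NaN; eltype is Float64

# values and mask are what you'd pass to npzwrite / JLD's save.

# To reconstruct after loading:
restored = ifelse.(mask, missing, values)
```

NaNs with `mask == false` are your post-simulation failures; entries with `mask == true` are the simulations that never ran, so no information is lost in the round trip.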

Note that JLD is using HDF5 under the hood. Your answer lies in the underlying representation of the data. I’m not sure what JLD does with a Union{Missing, Float64}.

Another question is whether JLD is trying to use compression.
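It may be worth timing the save with compression explicitly on and off. If I remember the JLD API correctly it accepts a compress keyword (check the JLD.jl docs against your installed version — the exact keyword is an assumption here):

```julia
using JLD  # assumes JLD.jl is installed

A = rand(1000, 1000)  # small stand-in for the real 12 GB array

# Compare save times with and without compression; `compress` is the
# keyword I believe JLD's save accepts — verify against the JLD.jl docs.
@time save("uncompressed.jld", "A", A)
@time save("compressed.jld", "A", A, compress=true)
```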