Efficient Disk Usage and JLD


#1

Suppose I generate the following two files using JLD:

using JLD
X1 = randn(5, 10^4);               # one 5×10^4 matrix
X2 = [randn(5) for i = 1:10^4];    # 10^4 separate vectors of length 5
@save "X1.jld" X1;
@save "X2.jld" X2;

Both data structures hold the same amount of information, but X1.jld takes up 404 KB, while X2.jld takes up 3.6 MB. Is there a way to get data formatted like X2 to be stored more efficiently, or do I just need to rewrite my code?


#2

It looks like each array probably has a few hundred bytes of overhead for storage in JLD, which is a bit high for 5x8 = 40 bytes of data per array.
X3 = [randn(10^4) for i = 1:5] should be pretty close to the size of X1.jld, since it stores only 5 arrays instead of 10^4.
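
As a rough check on the "few hundred bytes" estimate, the per-array overhead can be backed out of the two file sizes quoted in the first post. This is only a back-of-the-envelope sketch: it assumes 1 KB = 10^3 bytes and 1 MB = 10^6 bytes, and it attributes the entire size difference to per-array bookkeeping.

```julia
# File sizes as reported in the first post (unit convention is an assumption)
size_X1 = 404e3      # bytes, one 5×10^4 matrix
size_X2 = 3.6e6      # bytes, 10^4 vectors of length 5
narrays = 10^4

payload  = 5 * 8                               # bytes of Float64 data per small array
overhead = (size_X2 - size_X1) / narrays       # extra bytes per small array

println("payload per array:            ", payload, " bytes")
println("estimated overhead per array: ", round(Int, overhead), " bytes")
```

Under those assumptions the overhead comes out at roughly 320 bytes per array, about eight times the 40 bytes of actual data.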


#3

Yes, I agree, that’s much better, but it does not conform to my problem design. I am imagining problems where I have $(\mathbf{x}_n)_{n=1}^N$, with each $\mathbf{x}_n \in \mathbb{R}^d$, and, typically, $d \ll N$, but $d$ could still be relatively large, say 100. The problem structure is that of a time series of vector-valued quantities. I could rewrite the whole thing in terms of a 2D array, but I was really hoping to avoid that, because keeping a vector of vectors lets me leave $\mathbf{x}_n$ a bit more abstract.
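
One possible middle ground, sketched here with Base Julia only, is to keep a d×N matrix as the storage layout while still handing the rest of the code a collection of vectors. The names `xs` and `Xpacked` are illustrative, and whether views are acceptable in place of plain vectors depends on the downstream code.

```julia
d, N = 5, 10^4
X = randn(d, N)                        # compact layout, same shape as X1

# Non-copying views: xs[n] can be used like the vector x_n
xs = [view(X, :, n) for n in 1:N]

# The other direction: pack an existing vector of vectors into a d×N matrix
X2 = [randn(d) for n in 1:N]
Xpacked = reduce(hcat, X2)
```

Because `xs[n]` is a view, writing to it also modifies the corresponding column of `X`; use `X[:, n]` instead if an independent copy is wanted.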


#4

I’ll add that while the file is still larger than I’d like, JLD2 does appear to be more efficient than JLD (by about a factor of 3).


#5

It might be worth switching the format on load/save for similar ‘many short, fixed-length arrays’ cases:

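function v2transpose(v)
    n = length(v)        # number of short vectors (N)
    m = length(v[1])     # length of each short vector (d)
    # m long vectors of length n instead of n short vectors of length m
    v2 = [zeros(eltype(v[1]), n) for i = 1:m]
    for j = 1:n, i = 1:m
        v2[i][j] = v[j][i]
    end
    return v2
end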

#6

You can give JLD custom serializers and deserializers for your types. Pack your data more efficiently in the serializer and unpack it again in the deserializer.
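
A minimal sketch of what such a serializer pair might look like. The types `VectorSeries` and `PackedSeries` and the functions `pack` and `unpack` are hypothetical names made up for illustration, and the code assumes a non-empty series of equal-length vectors. The JLD hooks are left as comments so the snippet runs without the package; they follow the `JLD.writeas`/`JLD.readas` pattern described in the JLD README and should be checked against the installed version.

```julia
# Hypothetical wrapper for a time series of equal-length vectors
struct VectorSeries
    xs::Vector{Vector{Float64}}
end

# Hypothetical on-disk form: one d×N matrix instead of N tiny arrays
struct PackedSeries
    data::Matrix{Float64}
end

# Serializer side: concatenate the vectors as columns of one matrix
pack(s::VectorSeries) = PackedSeries(reduce(hcat, s.xs))

# Deserializer side: split the columns back into separate vectors
unpack(p::PackedSeries) = VectorSeries([p.data[:, n] for n in 1:size(p.data, 2)])

# With JLD loaded, the hooks would be wired up along these lines (assumption):
#   JLD.writeas(s::VectorSeries) = pack(s)
#   JLD.readas(p::PackedSeries)  = unpack(p)

s = VectorSeries([randn(5) for n in 1:10^4])
@assert unpack(pack(s)).xs == s.xs     # round trip preserves every x_n
```

The wrapper type is there so that the hooks apply only to this data, rather than changing how every `Vector{Vector{Float64}}` is saved.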