File Format for Large Two-Dimensional Dataset

I need to store a moderate-to-large matrix that (essentially) represents a least-squares system. The matrix entries are much more expensive to evaluate than solving the actual system, so I would like to store the system on disk and reload it as needed. What would be a convenient and performant file format / Julia package for this, satisfying the following requirements:

  • The system can be subdivided into variable-size blocks, with the total number of block rows and block columns in the range of thousands.
  • Ideally, I’d like the flexibility for each block to be stored as a Dict or equivalent (with its entries representing different “kinds” of data).
  • I need the ability to load arbitrary sub-systems; both row- and column-slices (representing different subsets of the data, or of the parameters).
  • I need to be able to add rows (data) and columns (parameters) to the system.
  • I am unable to hold the entire system in memory.

I’ve had bad experiences with both JLD and JLD2, which were often unable to load the files I had created. In addition, JLD (a little more reliable than JLD2 for me) seemed slow. I was planning to try HDF5 next. I understand this is what JLD is based on, but I was hoping that restricting myself to the basic HDF5 data types would help with performance and robustness. Maybe I have just been using JLD2 incorrectly and should revisit, but JLD2 doesn’t seem to support loading slices? Maybe I should be looking at databases? (I have zero experience with DBs.)

However, before I try many different approaches, I would appreciate any advice that people on this forum have.

BSON.jl?


Thanks for the suggestion. I was actually unaware of a binary JSON file format. But can it read slices? (It wasn’t clear to me from the README or the tests, but maybe I’ve missed something.)

NetCDF.jl?

https://github.com/JuliaGeo/NetCDF.jl
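
For a sense of the API, here is a minimal sketch of writing a 2-D variable and reading a sub-block with NetCDF.jl (file and variable names are made up; sizes are just illustrative):

```julia
using NetCDF

# Define a 2-D Float64 variable and write the full matrix once.
nccreate("system.nc", "A", "row", 1_000, "col", 500)
ncwrite(rand(1_000, 500), "system.nc", "A")

# Read a sub-block: `start` is the first index per dimension,
# `count` the number of entries to read along each (-1 means "to the end").
sub = ncread("system.nc", "A", start=[101, 1], count=[100, 50])
```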

Thank you - this looks promising. I will need to test it. Is it fast?

If all you’re doing is storing a dense matrix of a fixed element type (presumably Float64), you may want to consider rolling your own dead simple storage format: two Int64 values giving the height and width of the matrix, followed by the data as a contiguous block. You can easily load a slice from this format or load the whole thing and there’s no parsing involved.
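
A minimal sketch of that layout, with hypothetical helper names (column slices only, since Julia stores matrices column-major):

```julia
# Raw layout: an Int64 row count, an Int64 column count, then the Float64
# entries as one contiguous column-major block.

function save_matrix(path::AbstractString, A::Matrix{Float64})
    open(path, "w") do io
        write(io, Int64(size(A, 1)), Int64(size(A, 2)))
        write(io, A)
    end
end

# Column slices are cheap because whole columns are contiguous on disk.
function load_columns(path::AbstractString, cols::UnitRange{Int})
    open(path, "r") do io
        nrows, ncols = read(io, Int64), read(io, Int64)
        @assert last(cols) <= ncols
        seek(io, 16 + (first(cols) - 1) * nrows * 8)  # skip header + earlier columns
        out = Matrix{Float64}(undef, Int(nrows), length(cols))
        read!(io, out)
        out
    end
end
```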


Thank you. I would have to convert the blocks system into a simple dense matrix. I’ll think about whether this would in fact be easier.

Maybe just store it as a sequence of blocks then? Once you implement a writer and a reader, that’s pretty much it. And if the format is simple and efficient, the interop story would be quite good since implementing it should be easy.
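
For example (hypothetical names), appending blocks to one stream and keeping a small index of offsets is already enough to load individual blocks back:

```julia
# Sketch: append Float64 blocks to one stream and keep an index of
# (offset, nrows, ncols) per block, so each block can be re-read with a seek.

function append_block!(io::IO, index::Vector{NTuple{3,Int}}, B::Matrix{Float64})
    push!(index, (position(io), size(B, 1), size(B, 2)))
    write(io, B)
    return index
end

function read_block(io::IO, index::Vector{NTuple{3,Int}}, k::Integer)
    offset, nrows, ncols = index[k]
    seek(io, offset)
    out = Matrix{Float64}(undef, nrows, ncols)
    read!(io, out)
    return out
end
```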

Thank you for the suggestions, Stefan. I understand that this is in principle trivial. However, I want to avoid investing much (ideally any) time in maintaining my own file format, simply because (1) I just don’t have the time to do it well, and (2) I will surely produce bugs that take days/weeks/months to surface, and then I am stuck re-computing the entire database, most likely at the most inconvenient moment, when I need to run tests within the next few minutes.

What I was really hoping for is that there is a simple, light-weight database package that allows me to just add/delete rows and columns as needed, and where each entry of this matrix may be of an arbitrary data type. But unless I have missed something, this is wishful thinking?

Databases are not traditionally known for supporting dense array storage well. I have to confess I don’t understand the format well enough to give any better advice.

Any advice and perspective is very much appreciated. Thank you for taking the time.

At the moment, I am on the fence between trying NetCDF and using HDF5 directly.
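
For reference, HDF5.jl can read a slice of an on-disk dataset by indexing it, without loading the whole array; a minimal sketch with made-up file and dataset names:

```julia
using HDF5

# Writing: store the full matrix once (sizes here are just illustrative).
h5open("system.h5", "w") do f
    write(f, "A", rand(1_000, 500))
end

# Reading: index the on-disk dataset to pull out only the requested slice.
sub = h5open("system.h5", "r") do f
    f["A"][101:200, 1:50]   # partial (hyperslab) read; the rest stays on disk
end
```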

I would be extremely cautious with the roll-your-own-format approach.

My field (NLP) is utterly littered with people thinking, “I’ll just roll my own dead simple format.” So much so that I am maintaining not one but two packages to reconcile the different formats that occur for the same kind of data.

FastText in particular looks like it started with literally what Stefan said, and then went “oh, but we also need that” again and again, until parsing its binary format requires over 100 lines of code.

Don’t propagate new formats. Not when BSON or HDF5 will do.

Looks like BSON.jl can’t read slices, but NetCDF.jl looks like it can.

This piece of advice makes me a bit uneasy. I thought that the default storage format for Julia was .jld or .jld2, and that this is what generally “should be used” unless one needs something custom or text-based.

Coming from Matlab, I’m used to the .mat file format being the built-in, standard file format, and it’s comfortable to know that it is well-maintained, dependable and universally used. I’m also getting used to Python, where the picture isn’t quite as clear, but I thought .jld would be Julia’s .mat. Is that not so?

I cannot tell whether it’s part of stdlib. Is it or will it be?


There are various trade-offs between using a custom format and a more “standard” one. Custom formats may be more efficient (tailored to the problem), and maintaining them can also be a benefit, since this provides control over the package.

I re-learned this lesson recently: after moving my workflow to v0.7, I realized that JLD2.jl is broken on v0.7, and while there are PRs (I also made one), I don’t know when it will be fixed, so I went back to a custom mmapped format. AFAICT BSON.jl has not been updated for v0.7 yet either.

The advantage of more established formats like HDF5 is that they are more time-proof, but they are not very well suited to large data (for a given value of “large”). Some other formats are emerging, but who knows where they will be in 5 years. So for interim data that can be regenerated if necessary, I think custom formats can be a viable choice.

What would you consider large? (My data files are in the range of 1–10 GB and likely won’t get much bigger than 30 GB in the near future. EDIT: actually, this last point is not so clear; it may well get much bigger.)

I’d happily use JLD2, but as I said above I actually found it to still be buggy.

And in fact, JLD2 can’t read slices.

Something that may not fit in memory; this is fuzzy because it depends on what I do with it, but 10–30 GB is “large” because I have to resort to other methods (usually Mmap.mmap).

Right, and this is in fact exactly why I am worrying about it: most of the machines I need to use this data on won’t have 32 GB of memory.

This would also make it possible to simply mmap your file.
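
For instance, with the raw header-plus-data layout sketched earlier (hypothetical file name), Mmap gives lazy slicing essentially for free:

```julia
using Mmap

io = open("system.bin", "r")
nrows = read(io, Int64)
ncols = read(io, Int64)
# Map the data region starting at the current offset; pages are read lazily.
A = Mmap.mmap(io, Matrix{Float64}, (Int(nrows), Int(ncols)))
sub = A[101:200, 1:50]   # only the touched pages are actually pulled from disk
close(io)
```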
