About two weeks ago I posted a question asking for recommendations on a file format to use for high-performance computing. Our use case is training relatively simple neural networks, but on terabytes of data. We do this on a single machine with a GPU, where separate processes load the data from S3 storage and prepare it for the master thread.
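For context, the shape of our loading pipeline is roughly the producer/consumer sketch below (a minimal sketch using Distributed and a RemoteChannel; the function names, batch shapes, and counts are made up for illustration, and the real workers fetch from S3 and do more preprocessing):

```julia
using Distributed
addprocs(2)                              # worker processes that load and prepare batches

@everywhere function load_batch(i)
    # placeholder for "fetch chunk i from S3 and preprocess it"
    rand(Float32, 128, 1_000)
end

# bounded channel so the loaders cannot run far ahead of the GPU
const batches = RemoteChannel(() -> Channel{Matrix{Float32}}(10))

nbatches = 100
nw = nworkers()
for (k, w) in enumerate(workers())
    @spawnat w for i in k:nw:nbatches    # each worker takes a disjoint set of indices
        put!(batches, load_batch(i))
    end
end

# the master process consumes the prepared batches and feeds the GPU
for _ in 1:nbatches
    x = take!(batches)
    # train_step!(model, x)              # hypothetical training step
end
```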
After our research, we came up with the following alternatives:
Feather
Parquet
ProtoBuf
JLD
JLD2
HDF5
Let’s discuss all options.
Feather is interesting, but it cannot handle Unicode characters and sometimes mysteriously crashes when reading large data files (we have filed a bug for the Unicode issue).
We did not dare to try Parquet, since installation of the package is anything but trivial, and reading it using Spark.py means running a separate Java process, which is something I want to avoid.
ProtoBuf is quite slow to read, even after we fixed some type instabilities.
JLD has a memory leak. The bug has been filed, but due to the transition to JLD2 it has not been addressed.
JLD2 crashes when saving a large array of strings (a bug has been filed).
HDF5 seems to be the only format that works (the evaluation is still ongoing; fingers crossed, otherwise I am doomed and have to go to Python, which is something I would rather avoid).
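For reference, the kind of HDF5.jl usage we are evaluating looks roughly like this (a minimal sketch; file names, dataset names, and shapes are illustrative, not our actual schema):

```julia
using HDF5

X = rand(Float32, 128, 100_000)          # feature matrix (illustrative)
y = rand(Int32, 100_000)                 # labels (illustrative)

# writer side: one file per shard
h5open("shard_0001.h5", "w") do file
    write(file, "X", X)
    write(file, "y", y)
end

# loader side: read everything back ...
h5open("shard_0001.h5", "r") do file
    Xr = read(file, "X")
    yr = read(file, "y")
end

# ... or read only a slice, which matters when a single file is very large
h5open("shard_0001.h5", "r") do file
    first_cols = file["X"][:, 1:1_000]   # partial read of the first 1000 columns
end
```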
I have to say that this is not a very nice situation. I like Julia, and having to fall back to a Python / Java environment is not appealing. I think that having a good and stable binary file format is important for any large-scale processing. Again, I am talking about terabyte-scale data.
I do not know where to move next. I am happy to help improve any package if my skills are sufficient, or at least to do the testing.
Tomas