The poor state of file formats for high-performance computing


#1

About two weeks ago I posted a question asking for recommendations on a file format to use with high-performance computing. Our use case is training relatively simple neural networks on terabytes of data. We do that on a single machine with a GPU, where separate processes load the data from S3 storage and prepare it for the master thread.
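To make that setup concrete, here is a minimal sketch of the loader pattern in plain Julia. It is an illustration only: it uses tasks and a bounded `Channel` rather than separate processes, and `load_chunk`, `start_loaders`, and `consume` are hypothetical names, with `load_chunk` standing in for the S3 download and decoding step.

```julia
# Hypothetical stand-in for fetching and decoding one chunk from storage.
load_chunk(i) = fill(Float32(i), 4)

# Loader tasks push prepared chunks into a bounded channel; the bound keeps
# memory use in check when the loaders are faster than the consumer.
function start_loaders(chunk_ids; nloaders = 2, buffer = 4)
    ids = Channel{Int}(length(chunk_ids))
    foreach(i -> put!(ids, i), chunk_ids)
    close(ids)                      # no more work will be added

    batches = Channel{Vector{Float32}}(buffer)
    @async begin
        @sync for _ in 1:nloaders
            @async for i in ids     # each loader takes the next free chunk id
                put!(batches, load_chunk(i))
            end
        end
        close(batches)              # all loaders are done
    end
    return batches
end

# The consumer (the training loop in the real setup) simply iterates.
function consume(batches)
    total = 0.0f0
    for batch in batches
        total += sum(batch)
    end
    return total
end

total = consume(start_loaders(1:10))
```

Whatever file format is chosen only has to plug into the `load_chunk` step, which is why read speed and stability of the reader matter so much here.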

After our research, we came up with the following alternatives:
Feather
Parquet
ProtoBuf
JLD
JLD2
HDF5

Let’s discuss all options.
Feather is interesting, but it cannot handle Unicode characters and sometimes mysteriously crashes when reading large data files (we have filed a bug for the Unicode issue).

We did not dare to try Parquet, since installing the package is anything but trivial, and reading it using Spark.py means running a separate Java process, which is something I want to avoid.

ProtoBuf is quite slow to read, even though we have fixed some type instabilities.

JLD leaks memory. A bug has been filed, but because of the transition to JLD2 it is not being addressed.

JLD2 crashes when saving a large array of strings (a bug has been filed).

HDF5 seems to be the only format that works (the evaluation is still in progress; fingers crossed, otherwise I am doomed and will have to move to Python, which is something I would rather avoid).

I have to say that this is not a very nice situation. I like Julia, and having to work in a Python / Java environment is not appealing. I think that having a good and stable binary file format is important for any large-scale processing. Again, I am talking about terabyte-scale data.

I do not know where to go next. I am happy to help improve any package if my skills are sufficient, or at least to do the testing.

Tomas


#2

JLD2 would be the best first place for you to assist. I am not involved, but send a note via a JLD2 issue.


#3

Check out Parquet.jl – should be a bit easier to get going with than the java stack.


#4

Parquet.jl is actually one of the examples the original post was referring to as the poor state of this stuff: it is currently broken on Julia 0.6 (see https://github.com/JuliaComputing/Parquet.jl/issues/7).


#5

But updating Parquet.jl to work with Julia 0.6 may still be the easiest option.


#6

The installation of Parquet and its dependencies does not seem to be easy, and I have to say I am afraid of it. Meanwhile, we have tried FlatBuffers and they seem to be OK. Reading is reasonably fast, and it is possible to write the files straight from Java / Scala. I just hope the implementation will be stable and free of memory leaks. Otherwise, the backup solution is HDF5.


#7

Yeah, I hope someone tackles that. If we had a working version, it seems it would be the best option.


#8

What operating system are you using? I’ve just got it working on Ubuntu 16.04 - with lots of warnings, but able to load a parquet file generated with Spark.


#9

Made a couple of PRs to fix it:

There are several more warnings when calling using Parquet, but all tests pass on my machine.


#10

Thanks a lot.

I wanted to point this out because it is the kind of thing that drives my people away from Julia.


#11

I looked at the Windows situation today. The first blocker seems to be that Snappy.jl doesn’t have Windows support, right?

With Thrift.jl, am I right that things might actually work even if the thrift compiler stuff doesn’t work on Windows? Isn’t that more of a dev time step to generate julia files, but that wouldn’t have to work on user machines, right? But, I didn’t look at it very long, so I might have completely misunderstood :slight_smile:


#12

I guess another alternative might be the other Snappy.jl package on github, which seems a pure julia implementation without any binary dependency, but it is not clear to me how ready that one is.


#13

Can you use Blosc instead?


#14

The JLD2 issue appears to be fixed now.


#15

Hm, that would be nice. The README of Blosc.jl says that the only algorithms currently supported are blosclz, lz4, and lz4hc, and I think for Parquet.jl we would need the Snappy algorithm. But the original (non-julia) library seems to support snappy, so maybe that is something that could be made to work.


#16

I was wondering more whether it could use the Blosc algorithm itself.


#17

You can add support for Parquet + Blosc, but to read existing files you still need to support Parquet + Snappy.
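A toy sketch of why that is, in plain Julia: the codec is chosen by whoever wrote the file, so the reader has to look it up and fails on anything it does not know. Everything here is hypothetical (`Block`, `DECODERS`, `read_block`), and the "codecs" are placeholders that do no real compression; this is not how Parquet.jl is structured.

```julia
# Toy decoders standing in for real compressors (no actual compression).
const DECODERS = Dict{String,Function}(
    "none"    => identity,
    "reverse" => reverse,   # placeholder for a real codec such as Snappy
)

# A block records which codec the writer used, next to the payload.
struct Block
    codec::String
    payload::Vector{UInt8}
end

# The reader cannot pick the codec; it must support whatever the file names.
function read_block(b::Block)
    haskey(DECODERS, b.codec) || error("unsupported codec: $(b.codec)")
    return DECODERS[b.codec](b.payload)
end
```

Registering a new entry in `DECODERS` lets you read and write new files with it, but files already written with another codec stay unreadable until that codec is registered too.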