About two weeks ago I posted a question asking for recommendations on a file format to use for high-performance computing. Our use case is training relatively simple neural networks, but on terabytes of data. We do this on a single machine with a GPU, where separate processes load the data from S3 storage and prepare it for the master thread.
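For context, the shape of our loading pipeline is roughly the producer/consumer sketch below (a minimal sketch using Distributed and a RemoteChannel; the function names, batch shapes, and counts are made up for illustration, and the real workers fetch from S3 and do more preprocessing):

```julia
using Distributed
addprocs(2)                              # worker processes that load and prepare batches

@everywhere function load_batch(i)
    # placeholder for "fetch chunk i from S3 and preprocess it"
    rand(Float32, 128, 1_000)
end

# bounded channel so the loaders cannot run far ahead of the GPU
const batches = RemoteChannel(() -> Channel{Matrix{Float32}}(10))

nbatches = 100
nw = nworkers()
for (k, w) in enumerate(workers())
    @spawnat w for i in k:nw:nbatches    # each worker takes a disjoint set of indices
        put!(batches, load_batch(i))
    end
end

# the master process consumes the prepared batches and feeds the GPU
for _ in 1:nbatches
    x = take!(batches)
    # train_step!(model, x)              # hypothetical training step
end
```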
After our research, we came up with the following alternatives:
Feather
Parquet
ProtoBuf
JLD
JLD2
HDF5
Let’s discuss all options.
Feather is interesting, but it cannot handle Unicode characters and sometimes mysteriously crashes when reading large data files (we have filed a bug for the Unicode issue).
We did not dare to try Parquet, since installation of the package is anything but trivial, and reading it using Spark.py means running a separate Java process, which is something I want to avoid.
ProtoBuf is quite slow to read, even after we fixed some type instabilities.
JLD has a memory leak. The bug has been filed, but due to the transition to JLD2 it has not been addressed.
JLD2 crashes when saving a large array of strings (a bug has been filed).
HDF5 seems to be the only format that works (the evaluation is still ongoing; fingers crossed, otherwise I am doomed and have to go to Python, which is something I would rather avoid).
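For reference, the kind of HDF5.jl usage we are evaluating looks roughly like this (a minimal sketch; file names, dataset names, and shapes are illustrative, not our actual schema):

```julia
using HDF5

X = rand(Float32, 128, 100_000)          # feature matrix (illustrative)
y = rand(Int32, 100_000)                 # labels (illustrative)

# writer side: one file per shard
h5open("shard_0001.h5", "w") do file
    write(file, "X", X)
    write(file, "y", y)
end

# loader side: read everything back ...
h5open("shard_0001.h5", "r") do file
    Xr = read(file, "X")
    yr = read(file, "y")
end

# ... or read only a slice, which matters when a single file is very large
h5open("shard_0001.h5", "r") do file
    first_cols = file["X"][:, 1:1_000]   # partial read of the first 1000 columns
end
```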
I have to say that this is not a very nice situation. I like Julia, and having to fall back to a Python / Java environment is not appealing. I think that having a good and stable binary file format is important for any large-scale processing. Again, I am talking about terabyte-scale data.
I do not know where to move next. I am happy to help improve any package if my skills are sufficient, or at least to do the testing.
Tomas