Reading large-columned data using Feather.jl is too slow

Jean · May 28, 2020, 8:56pm

Hi, I am working with data sets that have a very large number of columns. One small example has dimensions (1452, 66584) and is about 2GB in size. After conversion to Feather format, it shrank to 778MB. The problem is that Feather.jl is unexpectedly slow on the first read and allocates a very large amount of memory, so the read failed several times on ACF because of memory issues. Subsequent reads are unexpectedly fast. Here is the output from a successful run on the following local machine:

julia> using Feather

julia> @time al=Feather.read("DO_gm_ofa_unadj_alpr_ch1.feather");
1850.519921 seconds (17.67 G allocations: 363.156 GiB, 1.80% gc time)

julia> @time al=Feather.read("DO_gm_ofa_unadj_alpr_ch1.feather");
  3.102820 seconds (17.18 M allocations: 575.066 MiB, 7.20% gc time)

julia> versioninfo()
Julia Version 1.0.5
Commit 3af96bcefc (2019-09-09 19:06 UTC)

Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

This is not the only file I need to read; it is one of several that I work with together. Do you have any ideas for reading data with this many columns quickly?

Thanks.

xiaodai · May 29, 2020, 1:23am

Do you have lots of missings and lots of String columns?

Well you could try JDF.jl if interop with Python and R is not a big priority. It is generally quite fast for me (I am biased cos I developed it).

Parquet.jl’s reader and writer are not the best in terms of performance atm.

tbeason · May 29, 2020, 1:39am

This will be at least partially true for any file format reader. I see this with CSV, SASLib, etc…

Although I will say that 1850 seconds is a bit of a stretch! I read in files that are much larger than 2GB (again, CSV or SASLib) and I never see those kinds of times. 2-3 minutes tops.
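
For anyone who wants to see the first-call overhead in isolation, here is a minimal sketch using only Base Julia. `load_columns` is a hypothetical stand-in for a reader call, not a function from Feather.jl or any other package, and the sizes are arbitrary. The first timing includes compilation of the function; the second typically does not.

```julia
# Hypothetical stand-in for a file reader: returns `ncols` Float64 columns
# of length `nrows`. It is only an illustration, not a real package API.
load_columns(ncols, nrows) = [rand(nrows) for _ in 1:ncols]

# First call: the measured time includes JIT compilation of load_columns.
first_call = @elapsed load_columns(100, 1000)

# Second call with the same argument types: the compiled code is reused.
second_call = @elapsed load_columns(100, 1000)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

This only shows the general first-call effect; it does not explain why one particular reader would take half an hour on a wide table.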

Jean · May 29, 2020, 2:25am

The datasets are all Float64 with no missing values. Compared with data that has many rows, my many-column data is fairly slow to read.

xiaodai · May 29, 2020, 2:28am

In that case JDF.jl will perform very well and is suited to your use case.

MarkovChains · June 27, 2020, 10:48pm

Let us know if JDF.jl solves the issue!

Jean · June 28, 2020, 3:24am

I tried to use JDF.jl but it wasn’t satisfactory. In my group, my colleague developed a new pkg ‘Helium.jl’ to fix this issue. It will soon be released.

xiaodai · June 28, 2020, 4:07am

Do you mean you couldn’t install it, or that the features are not up to scratch in terms of speed or usage? I would be interested to know what the failings are, if you would be so kind as to volunteer your time to answer.

I am trying to make JDF better.

xiaodai · June 28, 2020, 4:09am

I can’t find the repo at all. Is it on github?
