Reading large-columned data using Feather.jl is too slow

Hi, I am working with data sets that have a very large number of columns. One small example has dimensions (1452, 66584) and is about 2 GB. Converted to Feather format, it shrank to 778 MB. The problem is that Feather.jl is unexpectedly slow on the first read and allocates a large amount of memory, so the read failed several times on ACF because of memory issues. Subsequent reads are unexpectedly fast. Here is the output from a run that succeeded on the following local machine:

julia> using Feather

julia> @time al=Feather.read("DO_gm_ofa_unadj_alpr_ch1.feather");
1850.519921 seconds (17.67 G allocations: 363.156 GiB, 1.80% gc time)

julia> @time al=Feather.read("DO_gm_ofa_unadj_alpr_ch1.feather");
  3.102820 seconds (17.18 M allocations: 575.066 MiB, 7.20% gc time)

julia> versioninfo()
Julia Version 1.0.5
Commit 3af96bcefc (2019-09-09 19:06 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
This is not the only file I need to read; it is one of several files I work with together. Do you have any ideas for reading data with this many columns quickly?

Thanks.

Do you have lots of missings and lots of String columns?

Well, you could try JDF.jl if interop with Python and R is not a big priority. It is generally quite fast for me (I am biased because I developed it).

Parquet.jl’s reader and writer are not the best in terms of performance atm.

A slow first read followed by fast later reads will be at least partially true for any file format reader, since the first call includes compilation. I see this with CSV, SASLib, etc.
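
As a rough illustration of why a first read can be much slower than a second one, here is a minimal sketch using only Base Julia. It is not Feather.jl's code: `colsum` and `cols` are hypothetical stand-ins, and the assumption (not confirmed in this thread) is that part of the first-call cost is compilation specialized on a type that lists every column.

```julia
# Hypothetical stand-in for a reader that is specialized on the column types.
colsum(cols) = sum(sum, cols)

# A tuple of 50 small "columns"; the tuple's type spells out all 50 element types,
# so code compiled for it grows with the number of columns.
cols = ntuple(i -> rand(10), 50)

t_first  = @elapsed colsum(cols)   # includes JIT compilation for this tuple type
t_second = @elapsed colsum(cols)   # reuses the already compiled code

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

Running the same comparison with a larger tuple gives a feel for how the first-call time scales with the number of columns, while the second call stays cheap.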

Although I will say that 1850 seconds is a bit of a stretch! I read in files that are much larger than 2 GB (again, CSV or SASLib) and I never see those kinds of times. 2-3 minutes tops.

The datasets are all Float64 with no missing values. Compared with data that has many rows rather than many columns, my case is fairly slow.
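
Since the data is described as all Float64 with no missing values, one option outside the packages named in this thread is to keep it as a single dense matrix in raw binary form, using only Base Julia. The following is a minimal sketch under that assumption. A tiny random matrix and a temporary file stand in for the real data, the dimensions have to be stored or known separately, and the raw file is not portable to Python or R the way Feather is.

```julia
# Tiny stand-in for the real (1452, 66584) data; the values are random, not real data.
A = rand(Float64, 4, 6)
path = tempname()                 # hypothetical temporary file

# Write the raw column-major bytes of the matrix.
open(path, "w") do io
    write(io, A)
end

# Read the bytes back into a preallocated matrix of the known dimensions.
B = Matrix{Float64}(undef, size(A))
open(path, "r") do io
    read!(io, B)
end

println(A == B)                   # check that the round trip preserved the data
rm(path)                          # clean up the temporary file
```

Because the element type is fixed and there are no per-column types, this path does not depend on the number of columns, only on the total number of values.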

In that case JDF.jl will perform very well and is suited to your use case.

Let us know if JDF.jl solves the issue!

I tried JDF.jl, but it wasn’t satisfactory. A colleague in my group developed a new package, ‘Helium.jl’, to fix this issue. It will be released soon.

Do you mean you couldn’t install it, or that the features are not up to scratch in terms of speed or usage? I am interested to know what the failings are, if you would be so kind as to volunteer your time to answer my question.

I am trying to make JDF better.

I can’t find the repo at all. Is it on GitHub?