Hi, I am working with data sets that have a very large number of columns. One small example has dimensions (1452, 66584) and is about 2 GB on disk. When converted to the Feather format, its size drops to 778 MB. The problem is that Feather.jl is unexpectedly slow on the first read and allocates a very large amount of memory, so it failed several times on ACF due to memory issues, yet it is unexpectedly fast after that. Here is the output from a successful run on my local machine:
julia> using Feather
julia> @time al=Feather.read("DO_gm_ofa_unadj_alpr_ch1.feather");
1850.519921 seconds (17.67 G allocations: 363.156 GiB, 1.80% gc time)
julia> @time al=Feather.read("DO_gm_ofa_unadj_alpr_ch1.feather");
3.102820 seconds (17.18 M allocations: 575.066 MiB, 7.20% gc time)
julia> versioninfo()
Julia Version 1.0.5
Commit 3af96bcefc (2019-09-09 19:06 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
This is not the only file I need to handle; it is one of several related files I work with together. Do you have any ideas for reading data with this many columns quickly?
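For reference, here is roughly the pattern I am using now; this is only a sketch, assuming the original table is a CSV, and the paths and file list are illustrative. The conversion is a one-off step, and reading everything in a single long-running session means the expensive first read is paid only once:

using Feather, DataFrames, CSV

# One-off conversion (assuming the original table is a CSV; paths are illustrative).
# This is the step that brought the roughly 2 GB table down to 778 MB on disk.
df = DataFrame(CSV.File("DO_gm_ofa_unadj_alpr_ch1.csv"))
Feather.write("DO_gm_ofa_unadj_alpr_ch1.feather", df)

# Reading the related files in one long-running session: only the first read
# pays the large compilation/allocation cost; the rest behave like the fast
# second read shown above.
files = ["DO_gm_ofa_unadj_alpr_ch1.feather"]  # plus the other files, names omitted here
tables = [Feather.read(f) for f in files]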
This will be at least partially true for any file format reader. I see this with CSV, SASLib, etc.
Although I will say that 1850 seconds is a bit of a stretch! I read in files that are much larger than 2 GB (again, with CSV or SASLib) and I never see those kinds of times. 2-3 minutes tops.
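For what it's worth, the same two-call pattern makes the effect visible with CSV.jl as well (the path here is illustrative): the first call in a fresh session includes compilation, and the second call is much closer to the true read time.

using CSV, DataFrames

@time df = DataFrame(CSV.File("some_large_file.csv"))  # first call: includes compilation
@time df = DataFrame(CSV.File("some_large_file.csv"))  # second call: compiled code is reused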
I tried JDF.jl, but it wasn't satisfactory. A colleague in my group has developed a new package, Helium.jl, to address this issue. It will be released soon.
Do you mean you couldn't install it, or that its features are not up to scratch in terms of speed or usage? I would be interested to know what failed, if you would be so kind as to volunteer your time to answer.