Feather.jl, and understanding when data is loaded into RAM

ExpandingMan · April 11, 2017, 4:50pm

Like a lot of people, I rely heavily on binary serialization formats for storing tabular data. I really love Feather, which is based on the new Apache Arrow format. However, it is of very limited use to me unless I can, at the very least, pull a single data feature from a file at a time.

I’ve taken a look at the current Feather.jl code, and I’m somewhat confused about the circumstances under which data is actually read off the disk and into RAM. The current implementation memory maps a file to a Vector{UInt8}. When reading columns, somewhere in the code unsafe_wrap(Array, ptr, nrows) is called, where ptr points to some location within that Vector{UInt8}. Does Julia necessarily read data off the disk at that point, or does that happen later? At some point T[x for x in A] is called, which I think must definitely read things off disk.
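To make the sequence above concrete, here is a minimal sketch of the three steps as I understand them. It is not Feather.jl's actual code: the temporary file, the raw Float64 layout, and the names path, bytes, col and copied are all assumptions made for illustration. As far as I know, memory mapping only reserves address space, and the operating system pages data in when the memory is first touched, so the comments mark where I would expect disk reads to happen.

```julia
using Mmap

# Hypothetical stand-in for a column file: 1_000 Float64 values as raw bytes.
path = tempname()
nrows = 1_000
write(path, collect(1.0:nrows))

# Step 1: memory map the file to a Vector{UInt8}.
# This sets up the mapping; it should not pull the whole file into RAM.
bytes = open(path) do io
    Mmap.mmap(io, Vector{UInt8}, filesize(path))
end

# Keep `bytes` alive while raw pointers into it are in use.
GC.@preserve bytes begin
    # Step 2: wrap a pointer into the mapped bytes as an Array.
    # No copy is made here, so no element has been accessed yet.
    ptr = convert(Ptr{Float64}, pointer(bytes))
    col = unsafe_wrap(Array, ptr, nrows)

    # Touching one element should fault in only the page that holds it.
    first_value = col[1]

    # Step 3: the comprehension visits every element and copies it into an
    # ordinary heap-allocated Vector, so the whole column is read at this point.
    copied = Float64[x for x in col]
end
```

If this picture is right, the copy in step 3 is what forces a full read, and anything that works on col directly (or on a slice of it) would only touch the pages it actually accesses.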

Admittedly my understanding of how memory mapping works is quite poor, so that’s probably causing much of my confusion.

Any help that could move me toward reading chunks of my feather files at a time off disk would be much appreciated! Thanks.
