Feather.jl, and understanding when data is loaded into RAM



Like a lot of people, I have a great deal of use for binary serialization formats for storing tabular data. I really love Feather, which is based on the new Apache Arrow format. However, its usefulness to me is very limited unless I can, at the very least, pull a single feature (column) from a file at a time.

I’ve taken a look at Feather.jl, and I’m somewhat confused about when data is actually read off the disk and into RAM. The current implementation memory-maps a file to a Vector{UInt8}. When reading columns, somewhere in the code unsafe_wrap(Array, ptr, nrows) is called, where ptr points to some location within the Vector{UInt8}. Does Julia necessarily read data off the disk at that point, or does that happen later? At some point T[x for x in A] is called, which I think must read things off disk, since it has to visit every element of A.
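To make the question concrete, here is a minimal sketch of the pattern as I understand it. This is not Feather.jl's actual code: the temporary file, the Float64 column, and all variable names are made up for illustration, and I'm assuming a recent Julia where memory mapping lives in the Mmap standard library.

```julia
using Mmap

# Hypothetical stand-in for one Feather column: a throwaway file of Float64 values.
path, io = mktemp()
nrows = 1_000
write(io, collect(1.0:nrows))
close(io)

# Step 1: memory-map the whole file as raw bytes, like the current implementation.
bytes = open(path) do f
    Mmap.mmap(f, Vector{UInt8}, filesize(path))
end

# Keep `bytes` rooted while raw pointers into it are in use.
copied = GC.@preserve bytes begin
    # Step 2: reinterpret the start of the mapping as a column. This builds an
    # Array header around the pointer and does not copy any elements.
    ptr = convert(Ptr{Float64}, pointer(bytes))
    col = unsafe_wrap(Array, ptr, nrows)

    # Step 3: the comprehension allocates a fresh Vector and visits every
    # element, so the whole column has to have been read by the end of this line.
    Float64[x for x in col]
end

println(sum(copied))
```

What I'd like to know is whether anything is read from disk at steps 1 or 2, or only once the elements are touched in step 3.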

Admittedly my understanding of how memory mapping works is quite poor, so that’s probably causing much of my confusion.

Any help that could move me toward reading my Feather files off disk one chunk at a time would be much appreciated! Thanks.