I’d love to file issues, although my problem is that generally I’m working on proprietary data. Also, I struggle to get a good sense of what I should expect, i.e. what could be considered “really off”.
If you’ll indulge me, maybe take the operation I’m doing here. I’ve got this 28m-by-34 table, and am filtering on a single column. Column types are mainly strings, dates, and a few floats and ints. I then do the filter operation above, which removes around 7m of the 28m rows. What should I expect for this? Should the memory footprint be that of a 21m-by-34 table fully materialized?
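To make the operation concrete, here is roughly what I'm doing; the file name and the `filtered` binding are placeholders for this sketch, not my actual code:

```julia
using Arrow, DataFrames, Dates

# Open the 28m-by-34 Arrow file; with copycols=false the DataFrame simply
# wraps the memory-mapped Arrow columns instead of copying them into RAM.
at = DataFrame(Arrow.Table("mytable.arrow"); copycols=false)

# The filter operation in question: keep rows up to the cutoff date,
# which drops roughly 7m of the 28m rows.
filtered = at[at.Week .<= Date(2016, 11, 14), :]
```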
For context, I can do

```julia
@time at = @view at[at.Week .<= Date(2016, 11, 14), :];
  0.574345 seconds (399.45 k allocations: 185.803 MiB, 29.78% gc time, 44.69% compilation time)
```
on the same table. So here it appears I have successfully filtered out everything without allocating much memory at all. If I then do
```julia
@time at = copy(at);
  215.217447 seconds (486.05 M allocations: 20.402 GiB, 79.84% gc time, 0.46% compilation time)
```
again all my memory is eaten up (the 17 GB I had spare when calling `copy`) and most of the time of the operation is spent in swap. The table in the view was only 21/28 the size of the original table, so I would have expected it to be ~4 GB on disk. Why would it take 17 GB plus plenty of swap to copy this table? And why was this copy operation a minute slower than the direct `filter` above, if `filter` materializes the same thing in the end?
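For what it's worth, the ~4 GB figure above is only a back-of-envelope estimate along these lines (file name again a placeholder; I can't measure the materialized copy directly because it blows past my available RAM):

```julia
# Size of the full 28m-row Arrow file on disk, in GiB
ondisk_gib = filesize("mytable.arrow") / 2^30

# Naive expectation for the filtered table: 21/28 of the original
expected_gib = ondisk_gib * 21 / 28

# What I would like to compare against, if the copy fit in RAM:
# Base.summarysize(copy(at)) / 2^30
```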
Sorry if I’m not being super helpful; I just really struggle to develop a good mental model of Arrow + DataFrames performance and of which operations should and shouldn’t be fast. Overall I find the progress that’s been made on this front mind-blowing in terms of what it lets me do without resorting to low-level trickery or to much more complicated out-of-memory/distributed tools; it’s just that sometimes things randomly eat up my memory or take forever and I can’t figure out what’s going on.