I’d love to file issues, although my problem is that generally I’m working on proprietary data. Also, I struggle to get a good sense of what I should expect, i.e. what could be considered “really off”.
If you’ll indulge me, maybe take the operation I’m doing here. I’ve got this 28m-by-34 table, and am filtering on a single column. Column types are mainly strings, dates, and a few floats and ints. I then do the filter operation above, which removes around 7m of the 28m rows. What should I expect for this? Should the memory footprint be that of a 21m-by-34 table fully materialized?
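To make the operation concrete, here is roughly what I'm doing; the file name and the `filtered` binding are placeholders for this sketch, not my actual code:

```julia
using Arrow, DataFrames, Dates

# Open the 28m-by-34 Arrow file; with copycols=false the DataFrame simply
# wraps the memory-mapped Arrow columns instead of copying them into RAM.
at = DataFrame(Arrow.Table("mytable.arrow"); copycols=false)

# The filter operation in question: keep rows up to the cutoff date,
# which drops roughly 7m of the 28m rows.
filtered = at[at.Week .<= Date(2016, 11, 14), :]
```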
For context, I can do

```julia
@time at = @view at[at.Week .<= Date(2016, 11, 14), :];
  0.574345 seconds (399.45 k allocations: 185.803 MiB, 29.78% gc time, 44.69% compilation time)
```
on the same table. So here it appears I have successfully filtered out everything without allocating much memory at all. If I then do
```julia
@time at = copy(at);
  215.217447 seconds (486.05 M allocations: 20.402 GiB, 79.84% gc time, 0.46% compilation time)
```
again all my memory is eaten up (the 17 GB I had spare when calling `copy`) and most of the time of the operation is spent in swap. The table in the view was only 21/28 the size of the original table, so I would have expected it to be ~4 GB on disk. Why would it take 17 GB plus plenty of swap to copy this table? And why was this copy operation a minute slower than the direct `filter` above, if `filter` materializes the same thing in the end?
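For what it's worth, the ~4 GB figure above is only a back-of-envelope estimate along these lines (file name again a placeholder; I can't measure the materialized copy directly because it blows past my available RAM):

```julia
# Size of the full 28m-row Arrow file on disk, in GiB
ondisk_gib = filesize("mytable.arrow") / 2^30

# Naive expectation for the filtered table: 21/28 of the original
expected_gib = ondisk_gib * 21 / 28

# What I would like to compare against, if the copy fit in RAM:
# Base.summarysize(copy(at)) / 2^30
```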
Sorry if I’m not being super helpful; I just really struggle to develop a good mental model of Arrow + DataFrames performance and of which operations should and shouldn’t be fast. Overall I find the progress that’s been made on this front mind-blowing in terms of what it lets me do without resorting to low-level trickery or to much more complicated out-of-memory/distributed tools; it’s just that sometimes things randomly eat up my memory or take forever and I can’t figure out what’s going on.