Efficiently merge multiple .arrow files (12 files totaling 2.5 TB) into a single DataFrame for data analysis while minimizing memory overhead

Could someone please guide me on how to merge multiple large .arrow files (12 files totaling 2.5 TB) into a single DataFrame with minimal memory overhead for big data analysis? I have a total of 1 TB of RAM; can I manage data analysis with the existing RAM? Here’s a two-line snippet of my code:

filenames = ["ct1_first.arrow", "ct1_snd.arrow"]
append!(filenames, ["ct$(i).arrow" for i in 2:11])

you can do something like:

using Arrow, DataFrames

df = DataFrame()
for file in filenames
    # Arrow.Table memory-maps the file; append! copies its rows into df
    append!(df, Arrow.Table(file))
end

This should be efficient enough hopefully.

One more thing: all the .arrow files have the same columns, as shown in the attached screenshot.

I think Polar.jl might help; it is about to be registered (unless the suggested name change goes through; you can still install it under that name).

It has out-of-core (and Arrow) support. There’s also Arrow.jl.

I wasn’t sure; I thought it (and Arrow.jl? and JuliaDB.jl) doesn’t have out-of-core support. Am I right? Polars has been faster in some cases, and I believe it avoids running out of memory because of that.

Why do you think that will be efficient then, or work at all? I guess you mean relying on paging/swapping. That is likely ok, and efficient enough for the one-time job this seems to be.

[FYI: there is also the intriguing, already-registered wrapper RustRegex.jl, which likewise wraps fast Rust code, and other packages in the pipeline, e.g. ObjectSystem.jl.]


This code runs out of memory. Could you please suggest a more efficient approach that uses less memory, preferably staying within 1 TB of RAM?

Which OS are you on? If you are on Linux, did you try zram?

I am using Debian 11 OS.

So try enabling zram and play with the settings, e.g. the compression algorithm and the size of the compressed memory…

Two things:

  • make sure you are using the --heap-size-hint flag when you start Julia
  • look into using DuckDB (see the sketch below)
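
A minimal sketch of both ideas, assuming the filenames vector from the first post. The heap size value is a placeholder, the column-free count(*) query stands in for your real analysis (the actual schema is only in the screenshots), and I am assuming DuckDB.jl’s register_data_frame is acceptable here since the registered data frames just wrap the memory-mapped Arrow columns rather than copying them:

# start Julia with a heap size hint (placeholder value), e.g.
#   julia --heap-size-hint=900G script.jl
# so the GC collects more aggressively before the process nears the limit

using Arrow, DataFrames, DuckDB

con = DBInterface.connect(DuckDB.DB, ":memory:")

# expose each memory-mapped Arrow file to DuckDB as a named view, without copying
for (i, file) in enumerate(filenames)
    df_i = DataFrame(Arrow.Table(file); copycols = false)
    DuckDB.register_data_frame(con, df_i, "ct$(i)")
end

# query across all files; only the result of the query is materialized
union_sql = join(("SELECT * FROM ct$(i)" for i in eachindex(filenames)), " UNION ALL ")
result = DBInterface.execute(con, "SELECT count(*) AS nrows FROM ($union_sql) AS all_ct")
DataFrame(result)

Whether DuckDB’s scanner over registered Julia tables is fast enough at this scale is something you would have to test; converting the files to Parquet and letting DuckDB scan them natively is another option.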

Your suggestion requires 637 GB to process a single file. I don’t think processing 12 files will fit within 1 TB. The screenshot of the code is…

Isn’t it so that Arrow.jl uses mmapped arrays, and those materialize in RAM when concatenated into the big DataFrame?

I have tried using an ArrayPartition from RecursiveArrayTools.jl to prevent this, column by column, roughly as in the sketch below.

In my case the extra compile-time overhead made me abandon the attempt before getting to the bottom of whether it helped with memory to any significant degree in practice. One problem could, for example, be that almost any operation on the data frame would cause the arrays to materialize.
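
For reference, the idea looks roughly like this (a sketch only, assuming all files share the same schema as stated above; I never verified how well it holds up in practice):

using Arrow, DataFrames, RecursiveArrayTools, Tables

# memory-map every file; nothing is loaded into RAM yet
tables = [Arrow.Table(f) for f in filenames]
colnames = Tables.columnnames(tables[1])

# for each column, "concatenate" the per-file vectors lazily with an
# ArrayPartition instead of copying them into one big array
big = DataFrame(
    (name => ArrayPartition((Tables.getcolumn(t, name) for t in tables)...)
     for name in colnames)...;
    copycols = false,
)

As noted above, though, many DataFrames operations will still copy these columns into ordinary vectors, so the savings may be short-lived.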

If you have access to multiple machines there is DTables.jl (JuliaParallel/DTables.jl, distributed table structures and data manipulation operations built on top of Dagger.jl), although I have never used it.

DTables.jl would also work on a single machine and is designed for these cases.
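
A minimal sketch of that route (untested at this scale; it assumes the filenames vector from above and that one partition per file is acceptable, and the row count is just a placeholder for your real analysis):

using DTables, Arrow

# one lazy partition per file; Arrow.Table memory-maps, so little RAM is used up front
dt = DTable(Arrow.Table, filenames)

# operations run partition by partition; only the reduced result is materialized
rowcount = fetch(reduce(+, map(row -> (n = 1,), dt)))
# note: DataFrame(dt) / fetch(dt) would materialize everything, which is exactly what does not fit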

Your suggestion requires 637 GB to process a single file. I don’t think processing 12 files will fit within 1 TB.

But this means that you cannot count on fitting all 12 files into a single data frame. You need an out-of-core solution then (like DTables.jl). Also, maybe it is enough for you to process the data table by table and then merge the results?

I want to merge all the dataframes into a single dataframe before processing.

I understand, but following what you have said, you do not have enough RAM to do so, as your 12 tables hold more than 1 TB of data in total (unless I misunderstood something). That is why I ask if processing them can be done using a map-reduce (or similar) pattern.

Note that any out-of-core solution (like DTables.jl or other technologies, e.g. databases) will anyway have to do some kind of map-reduce (or a similar approach) on your data to process it, as it does not fit into RAM.
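
In plain Julia the pattern looks roughly like this (a sketch; the column :x and the mean are placeholders, since your real columns are only visible in the screenshots):

using Arrow

# map step: compute small per-file summaries (here: count and sum of a column :x)
partials = map(filenames) do file
    t = Arrow.Table(file)        # memory-mapped, not loaded into RAM
    x = t.x                      # placeholder column name
    (n = length(x), s = sum(x))
end

# reduce step: merge the per-file summaries into the global result
n_total = sum(p.n for p in partials)
global_mean = sum(p.s for p in partials) / n_total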

I think you have a bit more to learn about analyzing big data. You are trying to build the entire table in memory, and you just can’t do that. What you want to do is query the whole dataset but materialize only the result of the query.
