Efficiently merge multiple .arrow files (12 files totaling 2.5 TB) into a single DataFrame for effective data analysis while minimizing memory overhead

Could someone please guide me on how to merge multiple large .arrow files (12 files totaling 2.5 TB) into a single DataFrame with minimal memory overhead for big data analysis? I have a total of 1 TB of RAM; can I manage data analysis with the existing RAM? Here’s a two-line snippet of my code:

filenames = ["ct1_first.arrow", "ct1_snd.arrow"]
append!(filenames, ["ct$(i).arrow" for i in 2:11])


you can do (pseudocode):

df = DataFrame()
for arrow_table in vector_of_arrow_tables
    append!(df, arrow_table)
end

This should be efficient enough hopefully.


One more thing: all the .arrow files have the same columns, as shown in the attached screenshot

I think Polar.jl might help; it is about to be registered (unless the suggested name change goes through; you can still install it under that name):

It has out-of-core (and Arrow) support. There’s also Arrow.jl.

I wasn’t sure; I thought it (and also Arrow.jl and JuliaDB.jl?) doesn’t have out-of-core support. Am I right? Polars has been faster in some cases, and I believe it avoids running out of memory for that reason.

Why do you think it would work at all, then? I guess you mean relying on paging/swapping. That is likely OK, and efficient enough for the one-time task this seems to be.

[FYI: there is also an intriguing wrapper already registered, RustRegex.jl, likewise wrapping fast Rust code; and other packages in the pipeline, e.g. [Julia] ObjectSystem.jl.]

This code is causing memory overload. Could you please suggest a more efficient solution that uses less memory, preferably within 1 TB of RAM?

On which OS are you? If you are on Linux, did you try zram?


I am using Debian 11 OS.

So try to enable zram and play with the settings, e.g. compression algorithm and size of compressed memory…


Two things:

  • make sure you are using the `--heap-size-hint` flag when you start Julia
  • look into using DuckDB
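
On the first point, the flag is passed when launching Julia (available since Julia 1.9); the value and script name below are illustrative only, not taken from the question:

```shell
# Ask the GC to stay below a soft memory ceiling, leaving headroom
# under the 1 TB of physical RAM (900G is an example value)
julia --heap-size-hint=900G my_analysis.jl
```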

Your suggestion requires 637GB for processing a single file. I don’t think processing 12 files will fit within 1TB. The screenshot of the code is…

Isn’t it the case that Arrow.jl uses mmapped arrays, and that those materialize in RAM when concatenated into the big DataFrame?

I have tried using an ArrayPartition from RecursiveArrayTools.jl to try to prevent this column by column.

In my case the extra compile-time overhead made me abandon the attempt before getting to the bottom of whether it helped with memory to any significant degree in practice. One problem, for example, could be that almost any operation on the DataFrame would cause the arrays to materialize.

If you have access to multiple machines there is DTables.jl (GitHub: JuliaParallel/DTables.jl — distributed table structures and data manipulation operations built on top of Dagger.jl), although I have never used it.

DTables.jl would also work on a single machine and is designed for these cases.
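
A minimal sketch of that, assuming the `DTable(loader, files)` constructor and per-partition `reduce` behave as in the DTables.jl README; the column name `:x1` is a placeholder for one of the (identical) columns from the question:

```julia
using Dagger, DTables, DataFrames, Arrow

filenames = ["ct1_first.arrow", "ct1_snd.arrow"]
append!(filenames, ["ct$(i).arrow" for i in 2:11])

# One lazy partition per file: no single giant in-memory table is built.
dt = DTable(Arrow.Table, filenames)

# Reductions run partition by partition; only the small summary
# (here, the sum of the placeholder column :x1) is materialized.
summary = fetch(reduce(+, dt; cols = [:x1]))
```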

Your suggestion requires 637GB for processing a single file. I don’t think processing 12 files will fit within 1TB.

But this means that you cannot count on fitting all 12 files into a single data frame. You need an out-of-core solution then (like DTables.jl). Also, the question is whether it might be enough for you to process the data table by table and then merge the results?

I want to merge all the dataframes into a single dataframe before processing.

I understand, but following what you have said, you do not have enough RAM to do so, as your 12 tables in total hold more than 1 TB of data (unless I misunderstood something). That is why I ask whether processing them can be done using a map-reduce (or similar) pattern.

Note that any out-of-core solution (like DTables.jl or other technologies, e.g. databases) will anyway have to do some kind of map-reduce (or a similar approach) on your data to process it, as it does not fit into RAM.
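
A sketch of that per-file map-reduce pattern with plain Arrow.jl (the column name `some_numeric_column` and the aggregation are placeholders, not from the question):

```julia
using Arrow

# "Map": summarize one memory-mapped file at a time; the Arrow data
# stays on disk, only the tiny per-file summary lives in RAM.
function summarize(file)
    tbl = Arrow.Table(file)          # memory-mapped, near-zero-copy
    col = tbl.some_numeric_column    # placeholder column name
    (n = length(col), total = sum(col))
end

# "Reduce": merge the small per-file results into the final answer.
parts      = [summarize(f) for f in filenames]
n          = sum(p.n for p in parts)
total      = sum(p.total for p in parts)
mean_value = total / n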


I think you have a bit more to learn regarding analyzing big data. You are trying to build the entire table in memory, and you just can’t do that. What you want to be doing is to query the whole dataset but materialize only the result of the query.
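
For example, a sketch of that query-then-materialize idea with Arrow.jl alone, assuming the files can be read batch-wise with `Arrow.Stream` (if each file is one huge record batch, this degrades to one batch per file); the column `:x1` and the predicate are placeholders:

```julia
using Arrow, DataFrames

result = DataFrame()
for file in filenames
    for batch in Arrow.Stream(file)   # read one record batch at a time
        df = DataFrame(batch)
        # append only the (small) rows the query selects, never the full table
        append!(result, filter(:x1 => >(0.99), df))
    end
end
```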