Any general ideas about reducing GC time when working with DataFrames?

Say I have thousands of CSV files to process. For each one, I read the file into a DataFrame, process it, and store the result in another DataFrame. During this process, two main allocations (one for reading and one for storing) seem unavoidable. I use multiple threads to process the files, and the result DataFrame for each file is pushed to a Channel; I then collect all the result DataFrames with take! and vcat them. It turns out that most of the time (70%+) is spent on GC. How can I improve this situation?
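For reference, a minimal sketch of the pattern being described, using CSV.jl and DataFrames.jl; `process` here is just a placeholder for whatever per-file transformation is actually applied:

```julia
using CSV, DataFrames

# Hypothetical per-file processing step; stands in for the real transformation.
process(df::DataFrame) = df

function process_files(files::Vector{String})
    ch = Channel{DataFrame}(length(files))  # buffered so put! never blocks
    @sync for f in files
        Threads.@spawn put!(ch, process(CSV.read(f, DataFrame)))
    end
    close(ch)
    # reduce(vcat, ...) hits DataFrames.jl's specialized method, which
    # preallocates the output once instead of repeatedly copying.
    reduce(vcat, collect(ch))
end
```

One small note on the last line: `reduce(vcat, dfs)` is generally preferable to `vcat(dfs...)`, since DataFrames.jl has a specialized `reduce` method that allocates the result in one go.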

One possibility: do you need to vcat them all together? This is problem dependent, but you can often do your analysis on the chunks separately (see the sketch below). The other thing that might help is Julia 1.10 (currently in beta), which adds multi-threaded GC.
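For instance, if the downstream analysis only needs a small summary per file, you can push that summary instead of the full result table, so the large intermediate DataFrames become garbage quickly and never need to be concatenated. A sketch, where the column name `:value` is a stand-in for whatever the real data contains:

```julia
using CSV, DataFrames

function summarize_files(files::Vector{String})
    ch = Channel{NamedTuple}(length(files))
    @sync for f in files
        Threads.@spawn begin
            df = CSV.read(f, DataFrame)
            # :value is hypothetical; summarize whatever columns matter.
            put!(ch, (file = f, nrows = nrow(df), total = sum(df.value)))
        end
    end
    close(ch)
    DataFrame(collect(ch))  # small summary table instead of one huge vcat
end
```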

I am already on Julia 1.10.

Sometimes I don’t need to vcat them: each chunk is just a single training sample. I will try it. Thanks!