Say I have thousands of CSV files to process. For each one, I read the file into a `DataFrame`, process it, and store the results in another `DataFrame`. During this process, two main allocations (one for reading and one for storing) seem unavoidable. I use multiple threads to process the files, and the result `DataFrame` for each file is pushed to a `Channel`; I then collect all the result `DataFrame`s with `take!` and `vcat` them together. It seems almost all the time (70%+) is spent on GC. How can I improve this situation?
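Roughly, the setup looks like this (a minimal sketch; `process` and the `"data"` directory are placeholders for my actual workload):

```julia
using CSV, DataFrames

files = readdir("data"; join=true)           # thousands of CSV paths
results = Channel{DataFrame}(length(files))  # buffered so put! never blocks

Threads.@threads for file in files
    df = CSV.read(file, DataFrame)           # allocation 1: read into a DataFrame
    put!(results, process(df))               # allocation 2: the result DataFrame
end

# Collect every result with take! and concatenate into one big DataFrame.
combined = reduce(vcat, [take!(results) for _ in files])
```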
One possibility: do you need to `vcat` them all together? This is problem dependent, but you can often do your analysis on the chunks separately. The other thing that might help is Julia 1.10 (currently in beta), which adds multi-threaded GC.
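As a sketch of what I mean (assuming a hypothetical `analyze` that reduces each file to a small per-chunk result, and a `files` vector of paths):

```julia
using CSV, DataFrames

# Keep only a small summary per file, so the full DataFrames never
# accumulate and each one becomes collectible right after its iteration.
summaries = Vector{Any}(undef, length(files))
Threads.@threads for i in eachindex(files)
    df = CSV.read(files[i], DataFrame)
    summaries[i] = analyze(df)  # placeholder for your per-chunk analysis
end                             # df is unreachable after each iteration
```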
I am already using Julia 1.10.
Sometimes I don’t need to `vcat` them: each chunk is just a single training sample. I will try it. Thanks!