Parallel feeding of a dataframe

Hi,

This is my problem:
I have files of the same type in year folders, e.g.:

./2024/file1.csv
./2024/file2.csv
./2023/file1.csv
./2023/file2.csv
etc.

I want to fill a dataframe with all this data.

I already use readdir in two loops (over folders, then over files) to load the data, process it on the fly, and push!() the rows into the df.
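
Roughly, the loop looks like this (a minimal sketch; the per-row processing is application-specific):

```julia
using CSV, DataFrames

df = DataFrame()
for year in readdir(".")
    isdir(year) || continue                 # skip stray files at the top level
    for file in readdir(year; join=true)
        endswith(file, ".csv") || continue
        for row in CSV.File(file)
            # ... on-the-fly processing of `row` goes here ...
            push!(df, row)
        end
    end
end
```
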
As I have a lot of files to load, I would like to speed up the process.
I tried to use threads, without success.
I also considered using an array of dfs and merging them afterwards, but I fear that it would consume too much memory during the merge.

What would be a clean and solid method?


If you just CSV.read them all (CSV.read accepts a vector of file paths), CSV.jl will create a single DataFrame for you and already uses all available threads?
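
For example, with the folder layout from the question (a minimal sketch):

```julia
using CSV, DataFrames

# CSV.read accepts a vector of paths: the files are parsed (multithreaded)
# and vertically concatenated into a single DataFrame
files = ["./2024/file1.csv", "./2024/file2.csv",
         "./2023/file1.csv", "./2023/file2.csv"]
df = CSV.read(files, DataFrame)
```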


I need to process all lines in all the files. Wouldn’t it be slower to reprocess the df afterwards?

Maybe? I’m not sure I understand what you are trying to do. Your original question seemed to me to be just about reading a bunch of CSV files in parallel, to which my answer was that CSV.jl already uses all threads when reading each file, so I wouldn’t expect any speedup from parallelizing across files.


Perhaps it’s more disk / operating system / filesystem bound? Maybe a faster disk? Maybe a RAM drive? I’m a great fan of BeeGFS: it’s relatively easy to set up and use, especially when the size and number of CSV files is really large.

Also, do you have to do this more than once (and if so, why)? It sounds more like something you’d do once, then save the resulting data in a more useful format like Arrow.

Maybe don’t use CSV files in this case: Arrow.jl or HDF5.jl.
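
If the data is read more than once, a one-time conversion pays off quickly. A minimal sketch with Arrow.jl (the file names are placeholders):

```julia
using CSV, DataFrames, Arrow

# One-time conversion: parse the CSV once, store it as Arrow
df = CSV.read("./2024/file1.csv", DataFrame)
Arrow.write("./2024/file1.arrow", df)

# Later sessions: Arrow files are memory-mapped, so reloading is nearly instant
df2 = DataFrame(Arrow.Table("./2024/file1.arrow"))
```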


Maybe? Also, if there is a need for data-format transformations and those are recurring workflows, how about a database? For example, QuestDB natively supports Parquet, Umbra combines in-memory and SSD tricks, and there are also DuckDB and ClickHouse. What do you think?

The bottleneck is that I load and process the files sequentially, where it could be done in parallel, i.e. loading each year simultaneously.
But it seems that writing to the same df in parallel is not a good idea.
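
To be concrete, the “array of dfs” variant I considered would look roughly like this (a sketch; the per-file processing is a placeholder):

```julia
using CSV, DataFrames

# Collect all paths first (years, then files, as in my current loops)
files = String[]
for year in readdir(".")
    isdir(year) || continue
    append!(files, filter(f -> endswith(f, ".csv"), readdir(year; join=true)))
end

# One DataFrame per file; each task writes only to its own slot
parts = Vector{DataFrame}(undef, length(files))
Threads.@threads for i in eachindex(files)
    part = CSV.read(files[i], DataFrame; ntasks=1)  # ntasks=1 avoids nested threading
    # ... per-file processing goes here ...
    parts[i] = part
end

df = reduce(vcat, parts)  # one final concatenation
```

As far as I understand, reduce(vcat, parts) has a specialized method in DataFrames.jl that preallocates the result, so the merge costs roughly one extra copy rather than repeated reallocations.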

So, maybe Distributed.jl? What do you think, if I may ask? P.S. BTW, are you already running a distributed filesystem? If not, as I understand your original question, even if you manage to process in parallel, you still won’t be able to read the data in parallel.
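
A minimal sketch of the Distributed route (the worker count and the per-file processing are placeholders):

```julia
using Distributed
addprocs(4)                        # worker count: adjust to your machine

@everywhere using CSV, DataFrames

files = ["./2024/file1.csv", "./2023/file1.csv"]  # built e.g. via readdir

# Each worker parses and processes one file; results are merged at the end
parts = pmap(files) do f
    df = CSV.read(f, DataFrame)
    # ... per-file processing goes here ...
    df
end
df = reduce(vcat, parts)
```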

Hey @Ju_ska, in case you have additional information or need further assistance, please let us know; we are always ready to help. 🙂

There’s not much I can add.
Parallel load + process + df fill, to avoid CPU starvation.

So far, I don’t see any additional information provided, so perhaps I’ll clarify my point of view. As I understand it, the question was about “parallel feeding of a dataframe,” with the additional detail that the data is “inside CSV files.” I still maintain that a parallel filesystem is the closest solution that comes to mind to address this problem. As for the other suggestions, a short summary is below:

Apache Arrow is an in-memory columnar data format. Very good Julia support.

Apache Parquet is a columnar storage file format. Very good Julia support.

HDF5 is a file format designed for storing hierarchical data structures. Very good Julia support.

QuestDB is a high-performance, open-source time-series database optimized for time-stamped data. Julia support could be slightly better.

DuckDB is an embedded, in-process OLAP database designed for analytical workloads. Really great Julia support (see the sketch after this list).

Umbra is a research-oriented, high-performance relational database derived from HyPer, focusing on in-memory processing and advanced query execution. As far as I know there is no Julia support. However, it is PostgreSQL compatible.

ClickHouse is a distributed, columnar OLAP database designed for large-scale analytics. It is very fast and versatile; however, achieving its full performance can be really time-consuming. As far as I know, two Julia packages are available.

Distributed is a Julia standard library providing tools for distributed parallel processing.
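
For instance, the DuckDB route could look like this (a sketch; as far as I know, read_csv_auto accepts glob patterns, so all year folders can be scanned in one query):

```julia
using DuckDB, DataFrames

# In-memory DuckDB instance
con = DBInterface.connect(DuckDB.DB, ":memory:")

# Scan every CSV under every year folder in one query
df = DataFrame(DBInterface.execute(con,
    "SELECT * FROM read_csv_auto('./*/*.csv')"))
```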

I’m very sorry, I’m afraid I don’t have anything to add at this very moment. However, a MWE (Minimal Working Example) would help, or maybe my colleagues have additional suggestions complementing the ones provided above.