Parallel feeding of a dataframe

Hi,

This is my problem:
I have files of the same type in year folders, e.g.:

./2024/file1.csv
./2024/file2.csv
./2023/file1.csv
./2023/file2.csv
etc.

I want to fill a dataframe with all this data.

I already use readdir in two loops (over folders, then over files) to load the data, process it on the fly, and push!() the rows into the df.
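
Roughly, the loop looks like this (a minimal sketch; the per-row processing is application-specific):

```julia
using CSV, DataFrames

df = DataFrame()
for year in readdir(".")
    isdir(year) || continue                 # skip stray files at the top level
    for file in readdir(year; join=true)
        endswith(file, ".csv") || continue
        for row in CSV.File(file)
            # ... on-the-fly processing of `row` goes here ...
            push!(df, row)
        end
    end
end
```
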
As I have a lot of files to load, I would like to speed up the process.
I tried to use threads, without success.
I also considered using an array of dfs and merging them afterwards, but I fear that it would consume too much memory during the merge.

What would be a clean and solid method?


If you just CSV.read them all (CSV.read accepts a vector of file paths), CSV.jl will create a single DataFrame for you and already uses all available threads?
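
For example, with the folder layout from the question (a minimal sketch):

```julia
using CSV, DataFrames

# CSV.read accepts a vector of paths: the files are parsed (multithreaded)
# and vertically concatenated into a single DataFrame
files = ["./2024/file1.csv", "./2024/file2.csv",
         "./2023/file1.csv", "./2023/file2.csv"]
df = CSV.read(files, DataFrame)
```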


I need to process all lines in all the files. Wouldn’t it be slower to reprocess the df afterwards?

Maybe? I’m not sure I understand what you are trying to do. Your original question seemed to me to be just about reading a bunch of CSV files in parallel, to which my answer was that CSV.jl already uses all threads when reading each file, so I wouldn’t expect any speedup from parallelizing across files.


Perhaps it’s more disk / operating system / filesystem bound? Maybe a faster disk? Maybe a RAM drive? I’m a great fan of BeeGFS: it’s relatively easy to set up and use, especially when the size and number of CSV files is really large.

Also, do you have to do this more than once (and if so, why)? It sounds more like something you’d do once, then save the resulting data in a more useful format like Arrow.

Maybe don’t use CSV files in this case: Arrow.jl or HDF5.jl.
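
If the data is read more than once, a one-time conversion pays off quickly. A minimal sketch with Arrow.jl (the file names are placeholders):

```julia
using CSV, DataFrames, Arrow

# One-time conversion: parse the CSV once, store it as Arrow
df = CSV.read("./2024/file1.csv", DataFrame)
Arrow.write("./2024/file1.arrow", df)

# Later sessions: Arrow files are memory-mapped, so reloading is nearly instant
df2 = DataFrame(Arrow.Table("./2024/file1.arrow"))
```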


Maybe? Also, if there is a need for data-format transformations and those are recurring workflows, how about a database? For example, QuestDB natively supports Parquet, Umbra combines in-memory and SSD tricks, and there are also DuckDB and ClickHouse. What do you think?

The bottleneck is that I load and process the files sequentially, where it could be done in parallel, i.e. loading each year simultaneously.
But it seems that writing to the same df in parallel is not a good idea.
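
To be concrete, the “array of dfs” variant I considered would look roughly like this (a sketch; the per-file processing is a placeholder):

```julia
using CSV, DataFrames

# Collect all paths first (years, then files, as in my current loops)
files = String[]
for year in readdir(".")
    isdir(year) || continue
    append!(files, filter(f -> endswith(f, ".csv"), readdir(year; join=true)))
end

# One DataFrame per file; each task writes only to its own slot
parts = Vector{DataFrame}(undef, length(files))
Threads.@threads for i in eachindex(files)
    part = CSV.read(files[i], DataFrame; ntasks=1)  # ntasks=1 avoids nested threading
    # ... per-file processing goes here ...
    parts[i] = part
end

df = reduce(vcat, parts)  # one final concatenation
```

As far as I understand, reduce(vcat, parts) has a specialized method in DataFrames.jl that preallocates the result, so the merge costs roughly one extra copy rather than repeated reallocations.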

So, maybe Distributed.jl? What do you think, if I may ask? P.S. BTW, are you already running a distributed filesystem? If not, as I understand your original question, even if you manage to process in parallel, you still won’t be able to read the data in parallel.
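
A minimal sketch of the Distributed route (the worker count and the per-file processing are placeholders):

```julia
using Distributed
addprocs(4)                        # worker count: adjust to your machine

@everywhere using CSV, DataFrames

files = ["./2024/file1.csv", "./2023/file1.csv"]  # built e.g. via readdir

# Each worker parses and processes one file; results are merged at the end
parts = pmap(files) do f
    df = CSV.read(f, DataFrame)
    # ... per-file processing goes here ...
    df
end
df = reduce(vcat, parts)
```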

Hey @Ju_ska, in case you have additional information or need further assistance, please let us know; we are always ready to help. 🙂

There’s not much I can add.
Parallel load + process + df fill, to avoid CPU starvation.

So far, I don’t see any additional information provided, so perhaps I’ll clarify my point of view. As I understand it, the question was about “parallel feeding of a dataframe,” with the additional detail that the data is “inside CSV files.” I still maintain that a parallel filesystem is the closest solution that comes to mind to address this problem. As for the other suggestions, a short summary is below:

Apache Arrow is an in-memory columnar data format. Very good Julia support.

Apache Parquet is a columnar storage file format. Very good Julia support.

HDF5 is a file format designed for storing hierarchical data structures. Very good Julia support.

QuestDB is a high-performance, open-source time-series database optimized for time-stamped data. Julia support could be slightly better.

DuckDB is an embedded, in-process OLAP database designed for analytical workloads. Really great Julia support (see the sketch after this list).

Umbra is a research-oriented, high-performance relational database derived from HyPer, focusing on in-memory processing and advanced query execution. As far as I know there is no Julia support. However, it is PostgreSQL compatible.

ClickHouse is a distributed, columnar OLAP database designed for large-scale analytics. It is very fast and versatile; however, achieving its full performance can be really time-consuming. As far as I know, two Julia packages are available.

Distributed is a Julia standard library providing tools for distributed parallel processing.
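
For instance, the DuckDB route could look like this (a sketch; as far as I know, read_csv_auto accepts glob patterns, so all year folders can be scanned in one query):

```julia
using DuckDB, DataFrames

# In-memory DuckDB instance
con = DBInterface.connect(DuckDB.DB, ":memory:")

# Scan every CSV under every year folder in one query
df = DataFrame(DBInterface.execute(con,
    "SELECT * FROM read_csv_auto('./*/*.csv')"))
```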

I’m very sorry, I’m afraid I don’t have anything to add at this very moment. However, a MWE (Minimal Working Example) would help, or maybe my colleagues have additional suggestions complementing the ones provided above.