This is my problem:
I have files of the same type in year folders, e.g.:
./2024/file1.csv
./2024/file2.csv
./2023/file1.csv
./2023/file2.csv
etc.
I want to fill a DataFrame with all of this data.
I currently use readdir in two nested loops (over folders, then over files) to load the data, process it on the fly, and push!() each row into the df.
As I have a lot of files to load, I would like to speed up the process.
I tried to use threads without success.
I also considered building an array of DataFrames and merging them afterwards, but I fear that the merge would consume too much memory.
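For context, a minimal sketch of what the current sequential loop looks like (`process` is a placeholder for the actual per-row processing):

```julia
using CSV, DataFrames

df = DataFrame()
for year in filter(isdir, readdir("."))           # loop over the year folders
    for file in readdir(year)                     # loop over the files in each folder
        endswith(file, ".csv") || continue
        for row in CSV.File(joinpath(year, file)) # read one file
            push!(df, process(row))               # `process` stands in for the on-the-fly work, returning a NamedTuple
        end
    end
end
```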
Maybe? I’m not sure I understand what you are trying to do. Your original question seemed to me to just be about reading a bunch of CSV files in parallel, to which my answer was that CSV.jl already uses all threads when reading each file, so I wouldn’t expect any speed-up from parallelizing across files.
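For illustration, a minimal sketch (passing a vector of files to CSV.read is supported in recent CSV.jl versions; check the docs for the version you have installed):

```julia
using CSV, DataFrames

# A single file is already chunked across all available threads by CSV.jl.
df = CSV.read("2024/file1.csv", DataFrame)

# Recent CSV.jl versions also accept a vector of files with the same schema
# and vertically concatenate them while reading.
files = ["2024/file1.csv", "2024/file2.csv"]
df_all = CSV.read(files, DataFrame)
```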
Perhaps it’s more disk / operating system / filesystem bound? Maybe a faster disk? Maybe a RAM drive? I’m a great fan of BeeGFS: it is relatively easy to set up and use, especially when the size and number of CSV files is really large.
Also, do you have to do this more than once (and if so, why)? It sounds more like something you’d do once and then save the resulting data in a more useful format like Arrow.
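A minimal sketch of that one-off round trip through Arrow.jl (the file name is just an example):

```julia
using Arrow, DataFrames

# Do the slow CSV assembly once, then persist the result...
Arrow.write("alldata.arrow", df)

# ...and on later runs load it back cheaply (Arrow.Table is memory-mapped;
# copycols=true gives you ordinary mutable columns).
df = DataFrame(Arrow.Table("alldata.arrow"); copycols=true)
```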
Maybe? Also, if there is a need for data format transformations and those are recurring workflows, how about a database? For example, QuestDB natively supports Parquet, Umbra combines in-memory and SSD tricks, and there are also DuckDB and ClickHouse. What do you think?
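As an illustration of the database route, a hedged sketch with DuckDB.jl (the glob pattern and the exact DBInterface calls are assumptions to verify against the DuckDB.jl docs):

```julia
using DuckDB, DataFrames

con = DBInterface.connect(DuckDB.DB)  # in-memory DuckDB instance

# DuckDB can scan many CSV files via a glob and hand back a Tables.jl source,
# which DataFrames can materialize directly.
df = DataFrame(DBInterface.execute(con,
    "SELECT * FROM read_csv_auto('./20*/*.csv')"))
```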
The bottleneck is that I load and process the files sequentially, where it could be done in parallel, i.e. loading each year simultaneously.
But it seems that writing to the same df from several threads is not a good idea.
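One pattern that avoids that is to give each task its own DataFrame and combine them only once at the end; a sketch, assuming the per-year results can simply be stacked:

```julia
using CSV, DataFrames

year_dirs = filter(isdir, readdir("."; join=true))

# One task per year folder; every task fills only its own DataFrame,
# so no DataFrame is ever written to from two threads at once.
tasks = map(year_dirs) do dir
    Threads.@spawn begin
        files = filter(endswith(".csv"), readdir(dir; join=true))
        reduce(vcat, (CSV.read(f, DataFrame) for f in files))
    end
end

# Stack the per-year DataFrames once at the end.
df = reduce(vcat, fetch.(tasks))
```

The final vcat briefly holds the per-year pieces and the combined result at the same time, but it is a single concatenation rather than repeated merging. Also note that each CSV.read may spawn its own threads; recent CSV.jl versions expose a keyword to limit that per file if oversubscription becomes an issue.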
So, maybe Distributed.jl? What do you think, if I may ask? P.S. BTW, are you already running a distributed filesystem? If not, then as I understand your original question, even if you manage to process the files in parallel, you still won’t be able to read them in parallel from a single disk.
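For completeness, a hedged Distributed.jl sketch (the worker count and the omitted per-file processing are placeholders):

```julia
using Distributed
addprocs(4)                              # adjust to your machine
@everywhere using CSV, DataFrames

# Flat list of every CSV file across the year folders.
files = [joinpath(y, f) for y in filter(isdir, readdir("."))
                        for f in readdir(y) if endswith(f, ".csv")]

# Each worker reads (and could process) whole files independently;
# only the resulting DataFrames travel back to the main process.
df = reduce(vcat, pmap(f -> CSV.read(f, DataFrame), files))
```

All workers still share the same disk, though, so if reading rather than processing is the bottleneck this won’t help much, which is the point about the filesystem above.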