After n iterations, I can process all of my 100n columns and collect all the partial results.
The main problem is that I cannot find a package I can use for column-wise chunking. Can someone point me in the right direction?
Background: I’m using MPI for parallel computing, and I want each process to read part of the matrix (a column-wise block-striped matrix).
For example, if I have a 50k-by-50k array and 2 MPI processes, then each process should read a 50k-by-25k block.
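Just to illustrate the split I have in mind, each rank would work out its own column range along these lines (rough sketch with MPI.jl; the matrix size is only the example above):

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)           # 0-based rank id
nprocs = MPI.Comm_size(comm)

nrows, ncols = 50_000, 50_000        # full matrix size from the example
block = cld(ncols, nprocs)           # columns per process
cols = (rank * block + 1):min(ncols, (rank + 1) * block)

# ...each rank would then read and process only columns `cols` of the matrix...

MPI.Finalize()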
Yes, it’d be helpful if you provided some more details here: what kind of format is your data in? csv? feather? excel? something else? Why is processing > 100 columns at a time too big? I ask because on my 5-year-old laptop, I can process certain csv files with 20,000 columns without much trouble.
In the CSV.jl package, a recent addition is the CSV.Rows type, which allows efficient iteration, row by row, over the values in a csv file. It even allows a reusebuffer=true keyword argument so that a single buffer is allocated once and re-used for each row while iterating. So you could process an entire file by doing something like:
using CSV

for row in CSV.Rows(filename; reusebuffer=true)
    # do things with row values: row.col1, row.col2, etc., where `col1` is a column name in the csv file
end
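If the goal is to pull in only a block of columns at a time, it might also be worth trying the select keyword (CSV.File supports it, and I believe CSV.Rows does too); a rough sketch, with the column range just an example:

using CSV

cols = 1:100                                    # hypothetical column block for one pass
tbl = CSV.File(filename; select=collect(cols))  # only the selected columns are parsed/materialized
for row in tbl
    # row only exposes the selected columns
end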
Thanks for your reply. The data are available as both a .txt file and a .jld file.
I’m using MPI for parallel computing, and I want each process to read part of the matrix (a column-wise block-striped matrix).
For example, if I have a 50k-by-50k array and 2 MPI processes, then each process should read a 50k-by-25k block.
Is there a way to achieve that?
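For the .jld file, something like this is what I’m hoping for (rough, untested sketch: since .jld files are HDF5 containers, I’m assuming the matrix is stored as a plain dataset, called "A" here, that HDF5.jl can slice; the file name is made up):

using HDF5

nrows, ncols = 50_000, 50_000
cols = 25_001:50_000                                   # column block for the second of 2 processes
A_block = h5read("matrix.jld", "A", (1:nrows, cols))   # read only that slice from disk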