How to read big data chunk by chunk (column-wise chunking)?

Hi, could you please tell me how to read data chunk by chunk?

For example, read 100 columns at a time, like:

sub_data = read_data(file, start_column=1, end_column=100)

After n iterations, I will have processed all 100n columns and can combine the partial results.

The main problem is that I cannot find a package that supports column-wise chunking. Can someone point me in the right direction?


Background: I’m using MPI for parallel computing, and I want each process to read part of the matrix (a column-wise block-striped decomposition).
For example, if I have a 50k by 50k array and 2 MPI processes, then each process should read a 50k by 25k block.
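
Concretely, each process would compute its own column range from its rank; a sketch with MPI.jl, where the sizes and the even divisibility are just for illustration:

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)            # 0-based id of this process
nprocs = MPI.Comm_size(comm)

ncols = 50_000
cols_per_proc = div(ncols, nprocs)    # assumes nprocs divides ncols evenly
first_col = rank * cols_per_proc + 1
last_col = first_col + cols_per_proc - 1
# this process should read columns first_col:last_col of the matrix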

Thanks

Please read the following informative post and then update your post accordingly:


Yes, it’d be helpful if you provided some more details here: what format is your data in? csv? feather? excel? something else? And why is more than 100 columns at a time too big? I ask because on my 5-year-old laptop, I can process certain csv files with 20,000 columns without much trouble.

In the CSV.jl package, a recent addition is the CSV.Rows type, which allows efficient row-by-row iteration over the values in a csv file. It even accepts a reusebuffer=true keyword argument, which allocates a single buffer that is re-used for every row while iterating. So you could process an entire file by doing something like:

using CSV

for row in CSV.Rows(filename; reusebuffer=true)
    # do things with row values: row.col1, row.col2, etc., where `col1` is a column name in the csv file
end
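
That said, CSV.Rows is row-wise rather than column-wise. If you specifically want column-wise chunks, recent versions of CSV.jl also accept a select keyword argument that parses only the columns you ask for; a sketch, where data.csv and the column range are placeholders:

using CSV

# parse only columns 1 through 100; all other columns are skipped entirely
sub_data = CSV.File("data.csv"; select=(i, name) -> 1 <= i <= 100)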

Hope that helps?


The JuliaDB package is supposed to be useful for performing operations on large datasets.
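
For example, a minimal sketch (data.csv and the column names are placeholders):

using JuliaDB

t = loadtable("data.csv")           # load the file into a table
cols = select(t, (:col1, :col2))    # pull out just a subset of columns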

Thanks for your reply. The data are available both as a .txt file and as a .jld file.
I’m using MPI for parallel computing, and I want each process to read part of the matrix (a column-wise block-striped decomposition).
For example, if I have a 50k by 50k array and 2 MPI processes, then each process should read a 50k by 25k array.
Is there a way to achieve that?

Thanks

HDF5 is designed for precisely this sort of thing.
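
With the HDF5.jl package you can read an arbitrary block of a dataset without touching the rest of the file; a sketch, assuming the matrix was saved to data.h5 under the placeholder name "A":

using HDF5

h5open("data.h5", "r") do file
    dset = file["A"]             # handle to the on-disk dataset; nothing is read yet
    nrows, ncols = size(dset)
    half = div(ncols, 2)
    block = dset[:, 1:half]      # reads only these columns from disk
    # the second MPI process would read dset[:, half+1:ncols]
end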
