How to read big data chunk by chunk (column-wise chunking)?

Hi, could you please tell me how to read data chunk by chunk?

For example, read 100 columns at a time, like:

sub_data = read_data(file, start_column=1, end_column=100)

After n iterations, I can process all 100n columns and combine the partial results.

The main problem is that I cannot find a package that supports column-wise chunking. Could someone point me in the right direction?


Background: I’m using MPI for parallel computing, and I want each process to read part of the matrix (column-wise block-striped matrix).
For example, if I have a 50k by 50k array and 2 MPI processes, then each process should read a 50k by 25k array.
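
Concretely, the MPI side would look something like this (a sketch using MPI.jl; read_data is the hypothetical column-range reader I’m asking about):

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)           # 0-based rank of this process
nprocs = MPI.Comm_size(comm)

ncols = 50_000                       # total number of columns
cols_per_proc = div(ncols, nprocs)   # assuming nprocs divides ncols evenly
start_col = rank * cols_per_proc + 1
end_col = start_col + cols_per_proc - 1

# each process would read only its own column block:
# sub_data = read_data(file, start_column=start_col, end_column=end_col)

MPI.Finalize()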

Thanks

Please read the following informative post and then update your post accordingly:


Yes, it’d be helpful if you provided some more details here: what kind of format is your data in? csv? feather? excel? something else? Why is processing > 100 columns at a time too big? I ask because on my 5-year-old laptop, I can process certain csv files with 20,000 columns without much trouble.

In the CSV.jl package, a recent addition is the CSV.Rows type, which allows efficient row-by-row iteration over the values in a csv file. It even accepts a reusebuffer=true keyword argument that allocates a single buffer which is re-used while iterating over the entire file. So you could process an entire file by doing something like:

using CSV

for row in CSV.Rows(filename; reusebuffer=true)
    # do things with row values: row.col1, row.col2, etc., where `col1` is a column name in the csv file
end
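
Also, since your question is specifically about column-wise chunking: CSV.File supports a select keyword argument (column names, indices, or a selector function), so, assuming your file is a csv, you could read one range of columns at a time. A rough sketch, with the filename and range as placeholders:

using CSV

# read only columns 1:100; repeat with the next range for the next chunk
chunk = CSV.File(filename; select=collect(1:100))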

Hope that helps?


The JuliaDB package is supposed to be useful for performing operations on large datasets.
https://github.com/JuliaComputing/JuliaDB.jl

Thanks for your reply. The data are stored as both .txt and .jld files.
As I mentioned above, I’m using MPI for parallel computing and want each process to read its own column block of the matrix, e.g. each of 2 processes reading a 50k by 25k block of a 50k by 50k array.
Is there a way to achieve that?

Thanks

HDF5 is designed for precisely this sort of thing.
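
For instance, a sketch with HDF5.jl, where the file and dataset names are just placeholders and the matrix would first have to be written out as an HDF5 dataset:

using HDF5

h5open("matrix.h5", "r") do file
    dset = file["matrix"]    # the on-disk 50k by 50k dataset
    # indexing a dataset reads only that hyperslab from disk,
    # so each MPI rank can grab just its own column block
    block = dset[:, 1:25_000]
end

Incidentally, .jld files are HDF5 files under the hood, so you may be able to open yours directly this way.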
