How to read big data chunk by chunk (column-wise chunking)?

Hi, could you please tell me how to read data chunk by chunk?

For example, read 100 columns at a time, like:

sub_data = read_data(file, start_column=1, end_column=100)

After n iterations, I will have processed all 100n columns and can combine the partial results.

The main problem is that I cannot find a package that supports column-wise chunking. Can someone point me in the right direction?


Background: I’m using MPI for parallel computing, and I want each process to read part of the matrix (a column-wise block-striped decomposition).
For example, if I have a 50k by 50k array and 2 MPI processes, then each process should read a 50k by 25k block.
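
Concretely, each process would compute its own column range from its rank; a sketch with MPI.jl, where the sizes and the even divisibility are just for illustration:

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)            # 0-based id of this process
nprocs = MPI.Comm_size(comm)

ncols = 50_000
cols_per_proc = div(ncols, nprocs)    # assumes nprocs divides ncols evenly
first_col = rank * cols_per_proc + 1
last_col = first_col + cols_per_proc - 1
# this process should read columns first_col:last_col of the matrix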

Thanks

Please read the following informative post and then update your post accordingly:


Yes, it’d be helpful if you provided some more details here: what format is your data in? csv? feather? excel? something else? And why is more than 100 columns at a time too big? I ask because on my 5-year-old laptop, I can process certain csv files with 20,000 columns without much trouble.

In the CSV.jl package, a recent addition is the CSV.Rows type, which allows efficient row-by-row iteration over the values in a csv file. It even accepts a reusebuffer=true keyword argument, which allocates a single buffer that is re-used for every row while iterating. So you could process an entire file by doing something like:

using CSV

for row in CSV.Rows(filename; reusebuffer=true)
    # do things with row values: row.col1, row.col2, etc., where `col1` is a column name in the csv file
end
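
That said, CSV.Rows is row-wise rather than column-wise. If you specifically want column-wise chunks, recent versions of CSV.jl also accept a select keyword argument that parses only the columns you ask for; a sketch, where data.csv and the column range are placeholders:

using CSV

# parse only columns 1 through 100; all other columns are skipped entirely
sub_data = CSV.File("data.csv"; select=(i, name) -> 1 <= i <= 100)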

Hope that helps?


The JuliaDB package is supposed to be useful for performing operations on large datasets.
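
For example, a minimal sketch (data.csv and the column names are placeholders):

using JuliaDB

t = loadtable("data.csv")           # load the file into a table
cols = select(t, (:col1, :col2))    # pull out just a subset of columns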

Thanks for your reply. The data are available both as a .txt file and as a .jld file.
I’m using MPI for parallel computing, and I want each process to read part of the matrix (a column-wise block-striped decomposition).
For example, if I have a 50k by 50k array and 2 MPI processes, then each process should read a 50k by 25k array.
Is there a way to achieve that?

Thanks

HDF5 is designed for precisely this sort of thing.
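
With the HDF5.jl package you can read an arbitrary block of a dataset without touching the rest of the file; a sketch, assuming the matrix was saved to data.h5 under the placeholder name "A":

using HDF5

h5open("data.h5", "r") do file
    dset = file["A"]             # handle to the on-disk dataset; nothing is read yet
    nrows, ncols = size(dset)
    half = div(ncols, 2)
    block = dset[:, 1:half]      # reads only these columns from disk
    # the second MPI process would read dset[:, half+1:ncols]
end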
