I am trying to create a large data array from data in many input files by
- reading in the individual data file arrays
- concatenating a subset of columns to generate a large, single array
The problem is that it is blazingly fast for the first, say, 100 files (depending on how I try to concatenate), but each subsequent iteration then slows down dramatically, so I'm obviously doing something very inefficient. I assumed that using vcat() in a loop would be slow, because it would keep reallocating and copying the growing array on every iteration. For this reason, I instead allocate the entire final array up front and fill in the data as it is read from the individual files.
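To make the reasoning concrete, here is a minimal sketch (with hypothetical sizes) contrasting the two strategies, pairwise vcat versus preallocate-and-fill:

```julia
n_rows, n_cols = 100, 10
chunks = [rand(Float32, n_rows, n_cols) for _ in 1:50]

# Growing with repeated pairwise vcat copies the whole accumulated array
# on every iteration, so total copying is quadratic in the number of chunks:
grown = reduce((a, b) -> vcat(a, b), chunks)

# Preallocating once and filling in place copies each chunk exactly once:
filled = Matrix{Float32}(undef, n_rows * length(chunks), n_cols)
for (k, c) in enumerate(chunks)
    filled[(k-1)*n_rows+1:k*n_rows, :] .= c
end

grown == filled   # identical result; only the allocation pattern differs
```

(As an aside, `reduce(vcat, chunks)` with the bare `vcat` has a specialized method that preallocates the result internally, so it avoids the quadratic copying; the anonymous function above deliberately forces the naive pairwise path.)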
I include an MWE below - because all the test arrays are small, it runs quickly - but I'm hoping someone will see what is bad about the approach, as I need it to run quickly for much larger arrays (the individual arrays read in are on the order of 15000 x 1000).
using Glob
n_columns = 100
n_rows = 100
# create fake data (Float32, so the on-disk layout matches the eltype read back below)
for i in 1:100
    d = rand(Float32, n_rows, n_columns)
    write("file_$(i).bin", d)
end
idx_cols_keep = 10:20
data_in = Array{Float32,2}(undef,n_rows,n_columns);
# yes, the order will be "wrong" compared to the indices, but this is just an MWE to demo the idea
filelist = glob("*.bin",".");
nrows_total = n_rows*length(filelist)
data_total = Array{Float32,2}(undef,nrows_total,length(idx_cols_keep));
i0 = 1
nmax = length(filelist)
for i in 1:nmax
    read!(filelist[i], data_in)
    data_total[i0:i0+n_rows-1, :] .= data_in[:, idx_cols_keep]
    i0 += n_rows
end
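Two things worth ruling out in a loop like this: it runs at global scope on untyped globals, and `data_in[:, idx_cols_keep]` on the right-hand side allocates a fresh temporary array every iteration. A sketch of the same loop wrapped in a function and using `@views` to slice without copying (`fill_total!` is just a name chosen here, not from the original):

```julia
# Same fill loop as above, but inside a function (so the compiler can
# specialize on the argument types) and with @views so the column slice
# is a view into data_in rather than a freshly allocated copy.
function fill_total!(data_total, data_in, filelist, idx_cols_keep, n_rows)
    i0 = 1
    for f in filelist
        read!(f, data_in)                                     # reuse the same buffer for each file
        @views data_total[i0:i0+n_rows-1, :] .= data_in[:, idx_cols_keep]
        i0 += n_rows
    end
    return data_total
end
```

This keeps the preallocate-and-fill structure of the original; it only removes the per-iteration allocation and the global-scope penalty.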