Optimizing concatenation of data from a large number of files

I am trying to create a large data array from data in many input files by

  1. reading in the individual data file arrays
  2. concatenating a subset of columns to generate a large, single array

The problem is that the loop is very fast for the first 100 or so files (the exact number depends on how I concatenate), but then each subsequent iteration slows down dramatically. So I’m obviously doing something very inefficient. I assumed that vcat() would be slow because it would have to reallocate the growing array on every call. For this reason, I instead allocate the entire final array up front and fill in the data as it is read from the individual files.

I include an MWE below. Because all the test arrays are small, it runs very quickly, but I’m hoping someone will see what is bad about the approach, as I need it to run quickly for much larger arrays (the individual arrays read in are on the order of 15000 x 1000).

using Glob

n_columns = 100
n_rows = 100
# create fake data
for i in 1:100
    # Float32 so the bytes on disk match the Float32 buffer passed to read! below
    d = rand(Float32,n_rows,n_columns)
    write("file_$(i).bin",d)
end

idx_cols_keep = 10:20

data_in = Array{Float32,2}(undef,n_rows,n_columns);
# yes, the order will be "wrong" compared to the indices, but this is just an MWE to demo the idea
filelist = glob("*.bin",".");

nrows_total = n_rows*length(filelist)

data_total = Array{Float32,2}(undef,nrows_total,length(idx_cols_keep));

i0 = 1
nmax = length(filelist)
for i in 1:nmax
    read!(filelist[i],data_in)
    data_total[i0:i0+n_rows-1,:] .= data_in[:,idx_cols_keep]
    i0 += n_rows
end

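I would suggest putting the code in a function rather than running it in global scope. Non-constant global variables can change type at any time, so the compiler cannot specialize code that uses them; inside a function it can infer the types and will usually generate much faster code with fewer allocations.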
See https://docs.julialang.org/en/v1/manual/performance-tips/
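
As a rough sketch of what that could look like for the loop in the question: the function name concat_columns and its argument list are my own invention, and the file list is built by hand in a temporary directory here instead of with glob, just so the snippet runs on its own. The @views macro is an extra suggestion on top of moving to a function; it should avoid the temporary copy made by data_in[:,idx_cols_keep].

```julia
# Hypothetical helper: the name and signature are assumptions, not from the original post.
function concat_columns(filelist, n_rows, n_columns, idx_cols_keep)
    # Reusable buffer for one file, plus the preallocated output array
    data_in = Array{Float32,2}(undef, n_rows, n_columns)
    data_total = Array{Float32,2}(undef, n_rows * length(filelist), length(idx_cols_keep))
    i0 = 1
    for file in filelist
        read!(file, data_in)
        # @views makes the right-hand-side slice a view instead of a fresh copy
        @views data_total[i0:i0+n_rows-1, :] .= data_in[:, idx_cols_keep]
        i0 += n_rows
    end
    return data_total
end

# Small fake data set written to a temporary directory
dir = mktempdir()
files = [joinpath(dir, "file_$(i).bin") for i in 1:5]
for f in files
    write(f, rand(Float32, 100, 100))
end

data_total = concat_columns(files, 100, 100, 10:20)
size(data_total)  # (500, 11)
```

Because everything inside concat_columns is a local variable or an argument, the types are known when the function is compiled, which is the point of the advice above.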

You were right - I was stupidly assuming that this was a simple enough operation that there was no need to create a function even though I “knew” this rule.

Sigh.