Optimizing concatenation of data from large number of files

lwhitefox · September 14, 2020, 11:58pm

I am trying to create a large data array from data in many input files by

1. reading in the individual data file arrays
2. concatenating a subset of columns to generate a large, single array

The problem is that it is blazingly fast for the first, say, 100 files (depends on how I try to concatenate) but then each subsequent iteration slows down dramatically. So, I’m obviously doing something very inefficient. I assumed that using vcat() would be slow because each call allocates a new, larger array and copies the existing data into it. For this reason, I instead allocate the entire final array up front and fill in the data as it is read from the individual files.

I include an MWE below - because all test arrays are small, it is very quick - but I’m hoping someone will see what is bad about the approach, as I need it to run quickly for much larger arrays (individual arrays read in are on the order of 15000 x 1000).

using Glob

n_columns = 100
n_rows = 100
# create fake data
for i in 1:100
    # Float32 so the bytes on disk match the Float32 read buffer used below
    d = rand(Float32,n_rows,n_columns)
    write("file_$(i).bin",d)
end

idx_cols_keep = 10:20

data_in = Array{Float32,2}(undef,n_rows,n_columns);
# yes, the file order will be "wrong" compared to the indices, but this is just an MWE to demo the idea
filelist = glob("*.bin",".");

nrows_total = n_rows*length(filelist)

data_total = Array{Float32,2}(undef,nrows_total,length(idx_cols_keep));

i0 = 1
nmax = length(filelist)
for i in 1:nmax
    read!(filelist[i],data_in)
    data_total[i0:i0+n_rows-1,:] .= data_in[:,idx_cols_keep]
    i0 += n_rows
end

baggepinnen · September 15, 2020, 4:59am

I would suggest trying to put the code in a function as opposed to in global scope. The compiler will think much harder about the code that way, and can possibly optimize memory usage better.
See Performance Tips · The Julia Language
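
A minimal sketch of what that could look like for the MWE above. The function name `concat_columns` and its argument list are made up for illustration and are not from the original post. Inside a function, the buffer, the output array, and the loop counter are local variables with known types, so the loop can be compiled to specialized code. At global scope they are untyped, non-constant globals, which the Performance Tips page identifies as a common cause of slow loops. The `@views` macro is an extra, optional change that avoids allocating a temporary copy of the column slice on each iteration.

```julia
# Hypothetical helper (name and signature are illustrative only).
# Reads each file into a reusable buffer and copies the selected
# columns into a preallocated output array.
function concat_columns(filelist, n_rows, n_columns, idx_cols_keep)
    buffer = Matrix{Float32}(undef, n_rows, n_columns)
    out = Matrix{Float32}(undef, n_rows * length(filelist), length(idx_cols_keep))
    i0 = 1
    for file in filelist
        read!(file, buffer)
        # @views turns the right-hand slice into a view instead of a copy
        @views out[i0:i0+n_rows-1, :] .= buffer[:, idx_cols_keep]
        i0 += n_rows
    end
    return out
end

# Small demo with fake Float32 files in a temporary directory
dir = mktempdir()
files = [joinpath(dir, "file_$(i).bin") for i in 1:3]
for f in files
    write(f, rand(Float32, 4, 5))
end

data_total = concat_columns(files, 4, 5, 2:3)
size(data_total)  # (12, 2)
```

With the original MWE, the equivalent call would presumably be `concat_columns(filelist, n_rows, n_columns, idx_cols_keep)`.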

lwhitefox · September 15, 2020, 5:05pm

You were right - I was stupidly assuming that this was a simple enough operation that there was no need to create a function even though I “knew” this rule.

Sigh.
