Reading multiple text files from a directory

Hello all, I am very new to Julia and I have a question about reading files. I need to read 12500 .txt files from the same directory and store them all in one array, but I'm having performance issues. Is there a fast way of doing this? My code takes around 60 seconds, which is far more than I can afford. Here is what I have:

function load_train(directory)
    data = []
    dir = joinpath("./aclImdb/train/",directory)
    for f in readdir(dir)
        s = read(joinpath(dir,f),String)
        push!(data,s)
    end
    data
end

trainPos = load_train("pos/")

One thing you could do is use multiple threads for this (be sure to set JULIA_NUM_THREADS=$(nproc) when you start Julia). Here, a simple Threads.@threads will do the job. The only thing to note is that you have to preallocate the whole array instead of using push!, since resizing an array is not thread-safe. On my notebook with two physical cores and a SATA SSD, I get pretty decent scaling:

julia> for i in 'a':'z'
           write(string(i), rand(['a':'z'; 'A':'Z'; '0':'9'], 1000))
       end

julia> function load_train(dir)
           data = []
           for f in readdir(dir)
               s = read(joinpath(dir,f),String)
               push!(data,s)
           end
           data
       end
load_train (generic function with 1 method)

julia> function load_train2(dir)
           files = readdir(dir)
           data = similar(files, String)
           Threads.@threads for i in eachindex(files)
               s = read(joinpath(dir, files[i]), String)
               data[i] = s
           end
           data
       end
load_train2 (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime load_train(".");
  174.742 μs (455 allocations: 133.97 KiB)

julia> @btime load_train2(".");
  102.481 μs (471 allocations: 136.61 KiB)

At some point, you will probably end up bottlenecked by either memory management or your disk speed. You might want to think about whether you really need to load all files into memory at once, since in a lot of cases you can just read your data from disk when you need it. You could also use Mmap for this, but that is of course a lot more involved.
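To make the "read it when you need it" idea a bit more concrete, here is a rough sketch of both options. The helper names lazy_texts and mmap_bytes are made up for illustration, and readdir(dir; join=true) assumes Julia 1.4 or newer. Whether either one actually helps depends on how you access the data afterwards, so treat this as a starting point rather than a recommendation:

```julia
using Mmap

# Hypothetical helper: build a lazy generator over the files in `dir`.
# Only the paths are collected up front; each file is read when the
# generator reaches it, so only one file's text needs to be held at a time.
lazy_texts(dir) = (read(path, String) for path in readdir(dir; join=true))

# Hypothetical helper: memory-map a single file as raw bytes.
# The OS pages the data in on access instead of copying it all at once.
mmap_bytes(path) = Mmap.mmap(path)   # returns a Vector{UInt8}

# Example use: count newlines across all files without keeping them in memory.
function count_newlines(dir)
    total = 0
    for text in lazy_texts(dir)
        total += count(==('\n'), text)
    end
    return total
end
```

If you later need a String from the mapped bytes, String(copy(bytes)) works, but the copy gives up most of the benefit, so Mmap mainly pays off when you can work on the bytes directly.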
