Slow and Memory intensive For Loop

Hi Julians,

I am still a novice in Julia. I am writing code to read a total of 16 data (.txt) files, each containing a 504×1025 matrix. Essentially, I am trying to build a 3D data cube from the 2D datasets using a for loop, but the code takes 6.9 seconds and makes 22 M allocations. I would really appreciate any suggestions and advice to make the code faster. Please see the code below:

using Gtk4
using DelimitedFiles

function LoopScans()
   file_path = open_dialog("My Open dialog")
   file_location, filename = dirname(file_path), basename(file_path)
   Rawdata = readdlm(file_path)'

   file_name, file_ext = split(filename, ".")
   filestring, ScanIndex = match(r"[A-Za-z]+", file_name).match, parse(Int, match(r"\d+", file_name).match)

   files = readdir(file_location)
   # Filter files based on the pattern
   matching_files = filter(f -> occursin(filestring * r"\d+\." * file_ext, f), files)
   isempty(matching_files) && throw(ArgumentError("No files found matching pattern"))

   sample_data = readdlm(joinpath(file_location, matching_files[1]))'
   allScans = Array{Float64, 3}(undef, size(sample_data)..., length(matching_files))

   for (ii, file) in enumerate(matching_files)
       allScans[:, :, ii] = readdlm(joinpath(file_location, file))'
   end

   return allScans
end
@time LoopScans()

 6.968473 seconds (23.50 M allocations: 961.212 MiB, 3.03% gc time, 5.37% compilation time)
1025×504×16 Array{Float64, 3}:

Welcome to the forum!

Some remarks:

  1. please, enclose any code you share in triple back-ticks like ```
  2. the first call to a function includes its compilation, so call it a second time to measure the actual execution time
  3. avoid global variables if you need fast execution time
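
Remark 2 can be illustrated with a minimal sketch. The function `sum_of_squares` is a hypothetical stand-in for the real workload and is not part of the original code; the point is only that the first timed call includes compilation, while the second one does not.

```julia
# Hypothetical stand-in for the real workload (not from the original post)
function sum_of_squares(n)
    total = 0.0
    for i in 1:n
        total += i^2
    end
    return total
end

# The first call triggers JIT compilation, so its time includes compile time
first_call = @elapsed sum_of_squares(10^6)

# The second call runs the already compiled code
second_call = @elapsed sum_of_squares(10^6)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

The same applies to `@time LoopScans()`: the second run is the one that reflects steady-state performance.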

Finally, in this line:

allScans[:, :, ii] = readdlm(joinpath(file_location, file))

try to replace = with .= to avoid allocations.

That makes no difference in this example. a[:, :, i] = function_returns_array() allocates an array for the right-hand-side and then writes it in-place into a slice of a. Changing = to .= does the same thing.
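
A small sketch of this point, assuming a hypothetical `make_block` that stands in for `readdlm(...)`: in both variants the right-hand side is allocated before anything is written into the slice, so the measured allocations should be of the same order.

```julia
# Hypothetical stand-in for readdlm(...): returns a freshly allocated matrix
@noinline make_block() = rand(100, 100)

function fill_assign!(a)
    a[:, :, 1] = make_block()    # allocates the right-hand side, then copies it
    return a
end

function fill_broadcast!(a)
    a[:, :, 1] .= make_block()   # also allocates the right-hand side, then copies it
    return a
end

a = zeros(100, 100, 2)
fill_assign!(a); fill_broadcast!(a)   # warm up so compilation is not measured

bytes_assign = @allocated fill_assign!(a)
bytes_broadcast = @allocated fill_broadcast!(a)
println("=  allocated ", bytes_assign, " bytes")
println(".= allocated ", bytes_broadcast, " bytes")
```

Avoiding the temporary would require a reader that writes directly into a preallocated buffer, which `readdlm` does not do.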

It depends on where the time is spent. If it’s spent on reading from the file system, there is little you can do. If it’s spent on parsing the input into floats, you can read the files in parallel, provided you’ve got more than one CPU core. Remember to start Julia with threads, e.g. $ julia -t auto on Linux.

using Base.Threads
...
...
@threads for i in eachindex(matching_files)
    # transposed as in the original post, so each slice matches size(sample_data)
    allScans[:, :, i] = readdlm(joinpath(file_location, matching_files[i]))'
end

(note that @threads is a bit picky about what you loop over, it should be a vector).
On my box with 8 cpus this gets the time down from 2.7 to 0.6 seconds. But, this depends on disk speed, if the files are in the disk cache, cpu speed, etc, etc.
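
For readers who want to try the threaded loop in isolation, here is a self-contained sketch. The file names, matrix sizes, and temporary directory are made up for the demonstration; only `readdlm`, `writedlm`, and `@threads` come from the discussion above. Each iteration writes to its own slice of the array, so the threads do not interfere with each other.

```julia
using DelimitedFiles
using Base.Threads

# Create a few small dummy scan files (hypothetical names and sizes)
dir = mktempdir()
nrows, ncols, nfiles = 4, 6, 3
paths = [joinpath(dir, "scan$(k).txt") for k in 1:nfiles]
for (k, p) in enumerate(paths)
    writedlm(p, fill(Float64(k), nrows, ncols))
end

println("running with ", nthreads(), " thread(s)")

# Preallocate the cube; each file is stored transposed, as in the original post
cube = Array{Float64, 3}(undef, ncols, nrows, nfiles)

# Loop over indices so that @threads can split the range between threads
@threads for k in eachindex(paths)
    cube[:, :, k] = readdlm(paths[k])'
end

println(size(cube))
```

If `nthreads()` reports 1, Julia was started without extra threads and the loop runs serially.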

You should also try a more optimized file-import package like CSV.jl rather than the DelimitedFiles stdlib (which is simple and convenient but relatively slow).
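
A minimal sketch of reading a purely numeric text file into a matrix with CSV.jl. The tab delimiter and the absence of a header row are assumptions about the data, and the temporary file is made up for the demonstration. By default `CSV.File` treats the first row as column names, so `header=false` is passed here to keep it as data.

```julia
using CSV

# Write a tiny dummy file (hypothetical content; tab-delimited, no header row)
path = joinpath(mktempdir(), "scan1.txt")
write(path, "1.0\t2.0\t3.0\n4.0\t5.0\t6.0\n")

# header=false keeps the first row as data; types=Float64 requests Float64 columns
file = CSV.File(path; header=false, delim='\t', types=Float64)

# Convert the table to a plain matrix via the Tables.jl interface that CSV.jl depends on
scan = CSV.Tables.matrix(file)

println(size(scan))
```

Going straight to a matrix this way skips the intermediate `DataFrame`, which may reduce allocations, although that has not been measured here.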

Thank you all for your suggestions and advice. Using CSV.jl and threading, as suggested by @stevengj and @sgaure, does indeed improve the performance. On my Intel Core i7-12700 (12 cores, 20 threads), the time is reduced from 6.96 seconds to 2.18 seconds (49.92 M allocations: 1.441 GiB). The allocation count and memory usage still seem too high, though. Below is the latest code:

using DataFrames
using CSV
using Gtk4
using Base.Threads

function LoopScans()::Tuple{Array{Float64,3}, String}
    file_path = open_dialog("Select data scans")
    file_location, filename = dirname(file_path), basename(file_path)
    file_name, file_ext = split(filename, ".")
    filestring, ScanIndex = match(r"[A-Za-z]+", file_name).match, parse(Int, match(r"\d+", file_name).match)

    files = readdir(file_location)
   
    matching_files = filter(f -> occursin(filestring * r"\d+\." * file_ext, f), files) # Filter files based on the pattern
    ScanNumber=length(matching_files)
    isempty(matching_files) && throw(ArgumentError("No files found matching pattern"))

    Rawdata = DataFrame(CSV.File(joinpath(file_location, matching_files[1])))

    # Orient the sample so that it has more rows than columns
    if size(Rawdata, 1) <= size(Rawdata, 2)
        Rawdata = permutedims(Rawdata)
    end
    allScans = Array{Float64, 3}(undef, size(Rawdata)..., ScanNumber)

    @threads for (ii, file) in collect(enumerate(matching_files))
        allScans[:, :, ii] .= permutedims(DataFrame(CSV.File(joinpath(file_location, file))))
    end

    return allScans, file_location 
end
@time LoopScans() 
   

2.183225 seconds (49.92 M allocations: 1.441 GiB, 3.95% gc time)

You could also try InMemoryDatasets.jl.
The author(s) write:

The package performance is tuned with two goals in mind, a) low overhead of allowing missing values everywhere, and b) the following priorities - in order of importance:

  • Low compilation time
  • Memory efficiency
  • High performance

I have used it with good success, but I do not know whether it is better for your use case.
