HDF5 utilising only a fraction of available system resources

I’ve recently written a simple script that converts the OCVS (OGLE Collection of Variable Stars) database (which is basically a single text file containing a list of attributes for every star in the collection) and a large number of TSV files (containing photometry data for each of them) into a single HDF5 file.

I’ve used the HDF5.jl package because JLD2.jl (which didn’t have the same issue) needs to load the whole database into RAM before editing it (please correct me if I’m wrong).

My code looks exactly like this (except that the table, full_headers, output_file, input_dir_I, and input_dir_V variables are initialized to the values used by the database I’m trying to convert).

For some reason, the code doesn’t utilize more than 20% of the available CPU processing power (when launched with the ‘julia --optimize=3 --threads=12’ command), and the disk’s read/write rates are significantly below 10% of its capacity.

The code runs for ~25 minutes (on a Lenovo TUF Gaming A15 with a Ryzen 4600H and an Intel 660p SSD) before it finishes creating the new database (from the ~28 GB old one, which is copied). The code without @simd performed a little worse, taking ~35 minutes for the same job.

Currently it’s not an issue, but it’s possible I will be doing this with a database of the same type that is ~1000 times bigger, so without any modifications it would take somewhere around 3 weeks.

Is there any simple way of making this code utilise close to 100% of the available system resources? Or would I be better off rewriting the script in C++ (with the HDF5 C++ API/HighFive and boost::spirit::qi)?

Here is the code:

using DataFrames
using CSV
using HDF5
using Base.Filesystem


h5open(output_file, "w") do output_file
    @simd for row in eachrow(table)
        # skip rows with no identifier in the second column
        if !ismissing(row[2])
            filename = row[2]
            println(filename)

            g = create_group(output_file, filename)

            if isfile(joinpath(input_dir_I, "$filename.dat"))
                data = CSV.read(joinpath(input_dir_I, "$filename.dat"), DataFrame, header=false, types=Float32)
                data = Matrix(coalesce.(data, NaN))
                compressed_data = Matrix{Float32}(data)
                g["I"] = compressed_data
            end

            if isfile(joinpath(input_dir_V, "$filename.dat"))
                data = CSV.read(joinpath(input_dir_V, "$filename.dat"), DataFrame, header=false, types=Float32)
                data = Matrix(coalesce.(data, NaN))
                compressed_data = Matrix{Float32}(data)
                g["V"] = compressed_data
            end


            for i in [1, 5, 6, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]
                if ismissing(row[i])
                    attributes(g)[full_headers[i]] = Float32(NaN)
                elseif isapprox(row[i], -99.99, atol=0.02)
                    attributes(g)[full_headers[i]] = Float32(NaN)
                else
                    attributes(g)[full_headers[i]] = Float32(row[i])
                end
            end


            for i in [3, 4, 7, 8, 9, 10, 37, 38]
                if ismissing(row[i])
                    attributes(g)[full_headers[i]] = ""
                elseif length(string(row[i])) > 1 && !isnumeric(row[i][2])
                    attributes(g)[full_headers[i]] = string(row[i])
                else
                    attributes(g)[full_headers[i]] = ""
                end
            end
        end
    end
end

HDF5, the C library, only supports a single thread. If you want parallelism with HDF5, the supported path is to use MPI. See MPI.jl.

That said, I already see some immediate issues with your Julia code. It would probably be advantageous to put this code into a function where it can be compiled and analyzed. I also see the use of some globals, such as filename or full_headers, which may not be bound to a particular type. From there, you may want to use @code_warntype to examine the type stability of your code. Another specific example is indexing full_headers[i] multiple times within the same loop. It might be best to do this once, and perhaps also declare it @inbounds if you know the indices are valid. Additionally, I do not see any parallelism on the Julia side: you have not spawned tasks or used a threaded for loop. Overall, I think there is a lot to optimize in your Julia code, and you are not using Julia to its full potential. In general, I would avoid writing large scripts like this in Julia and would build as much into functions as possible.
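For illustration only, here is a rough sketch of the kind of restructuring I mean. The function name convert_star! is made up, the index list is abbreviated, and only the I-band branch of your loop body is shown:

using DataFrames, CSV, HDF5

# Everything the function needs is passed as an argument, so nothing relies on
# untyped globals, and repeated lookups like full_headers[i] are hoisted out.
function convert_star!(g, row, full_headers, input_dir_I)
    path = joinpath(input_dir_I, "$(row[2]).dat")
    if isfile(path)
        data = CSV.read(path, DataFrame, header=false, types=Float32)
        g["I"] = Matrix{Float32}(coalesce.(data, NaN32))
    end
    for i in (1, 5, 6)                   # abbreviated index list
        header = full_headers[i]         # indexed once per iteration
        val = row[i]
        attributes(g)[header] = ismissing(val) ? Float32(NaN) : Float32(val)
    end
end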

For other tips, see the Performance Tips page.

https://docs.julialang.org/en/v1/manual/performance-tips/

A rewrite in C++ might be faster, mainly because it would force you to actually compile it. However, you can likely achieve the same performance in Julia. In some ways, this would entail writing the Julia code more like how you would write the C++ code (use functions, avoid globals, type your globals, and type your variables). Ultimately, you run into the restrictions of the HDF5 C library.

2 Likes

Thanks for the help. I misinterpreted the documentation of the thread-safe version of HDF5 and mistakenly assumed that it supports parallel writes to different groups within the same file, rather than just to different files. If that’s the only issue, then launching the script with GNU Parallel (each instance on different files) should allow for 100% CPU utilization (btw, thanks for the MPI suggestion, I will read a bit more about it as well).

Because of that, I assumed that the first loop was threaded. I guess @simd did speed up the loop execution because it allowed the TSV files to be read while the data from previous files was being written to the database, which resulted in a small speedup (although --threads=2 should be enough for that, instead of letting it use all threads at once). I’ve already seen the guide, and thanks for the rest of the suggestions; I’ll try to implement them later.

Btw, about C++: I’ve tested different ways of writing code, and almost always writing long scripts without functions, or at least writing functions that are as long as possible (often 100+ lines of code), gave the best performance, because it gives the compiler much more freedom than code split into a lot of short functions. At least that was the case with the -O3 option; with -O1 it wasn’t always true. Because of that, I tried to use the same strategy here.

EDIT: So I guess that for systems with more than enough RAM, the C++ speed advantage would mostly be limited to the slightly better efficiency of qi as a parser and the lack of garbage-collection spikes. That probably still wouldn’t be enough of an advantage to make a rewrite worthwhile.

The thread-safe version of HDF5 mainly just installs a lock around all API calls. If you are using our HDF5_jll, then you are probably not using the thread-safe version of HDF5. However, we do use a lock on the Julia side in HDF5.jl.

There is no implicit threading in Julia itself, although some libraries such as BLAS do use threads. If you want to use threads in Julia, use Threads.@threads. SIMD is distinct from using threads. See the threading docs.

https://docs.julialang.org/en/v1/manual/multi-threading/
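As a very rough sketch of what explicitly threaded code could look like for your task, assuming the CSV parsing is the expensive part: read_photometry and write_group! are hypothetical helpers standing in for the corresponding pieces of your script, filenames is a vector of star identifiers, fid is an open HDF5 file handle, and the HDF5 writes stay behind a lock because of the single-threaded C library.

using Base.Threads

const h5_write_lock = ReentrantLock()

Threads.@threads for filename in filenames
    # CPU-bound work (parsing the .dat files) runs in parallel across threads
    data_I = read_photometry(joinpath(input_dir_I, "$filename.dat"))
    data_V = read_photometry(joinpath(input_dir_V, "$filename.dat"))

    # the HDF5 calls themselves stay serialized behind one lock
    lock(h5_write_lock) do
        write_group!(fid, filename, data_I, data_V)
    end
end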

At least in C++, you would need a main(). Writing a main() may also have benefits here because it would create a local scope and a compilation unit. Here you are slightly saved by the do block, which creates an anonymous function. However, your current mode of use involves your code getting recompiled every time you run the script. You might want to measure with @time how much time is being spent compiling versus running your code. The other concept I wanted to raise is function inlining, which applies to both C++ and Julia. The compiled code would effectively act as if your code were in a single function.
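For example, assuming the work is wrapped in a function (convert_database is a made-up name here):

# first call includes compilation of convert_database and everything it calls
@time convert_database(table, full_headers, output_file, input_dir_I, input_dir_V)

# second call in the same session is already compiled, so the difference
# between the two timings is roughly the compilation cost
@time convert_database(table, full_headers, output_file, input_dir_I, input_dir_V)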

An additional issue is that type stability in Julia is important due to dynamic dispatch. If types can be determined at compile time, then dynamic dispatch can be avoided.
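A hypothetical illustration, unrelated to your script itself:

headers = ["ID", "RA", "Decl"]        # non-const global: its type can change at any time,
f() = headers[1] * "_attr"            # so f() is inferred as Any and dispatches dynamically

const HEADERS = ["ID", "RA", "Decl"]  # const global: the element type is known,
g() = HEADERS[1] * "_attr"            # so g() infers String and avoids dynamic dispatch

# @code_warntype f() flags the Any; @code_warntype g() is clean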

Note that some optimization here is also possible in Julia by avoiding dynamic allocation. The concept is that there is no garbage collection to do if you do not create any garbage to collect. See StaticArrays.jl, for example.
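A minimal illustration of the idea with StaticArrays.jl:

using StaticArrays

v = SVector{3,Float32}(1, 2, 3)   # fixed-size and stack-allocated: no heap allocation
w = 2 .* v .+ 1                   # elementwise math returns another SVector, still no allocation
sum(w)                            # none of this produces garbage for the GC to collect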

1 Like

I’ve been able to optimize my code a little bit according to your tips.

What’s interesting is that the execution time difference was barely noticeable (from ~25 minutes before the changes to ~23 minutes and 50 seconds now). As expected, the compilation time increased significantly (from around 8 seconds before to ~20 seconds now).

What is most interesting is the change in resource usage. The original code induced a load of about 180-250% according to htop, but now that value has dropped to the 80-105% range. RAM usage dropped slightly as well (from ~1.4 to ~1.1 GB).

The compilation time also shouldn’t be an issue with GNU Parallel, even more so if the script is precompiled.

Thanks for the help again. Now I know what to do, and that kind of speed should be more than enough, at least for this case.

I suspect the primary bottleneck here is the HDF5 C library itself.

You may also want to consider the use of the Distributed module rather than GNU Parallel if you want to stay within one language:
https://docs.julialang.org/en/v1/manual/distributed-computing/

Instead of the -t option for threads, you can use the -p option to add worker processes. You can then use the @distributed macro to run a parallel for loop over the processes.

help?> Distributed.@distributed
  @distributed

  A distributed memory, parallel for loop of the form :

  @distributed [reducer] for var = range
      body
  end

  The specified range is partitioned and locally executed across all workers. In case an optional reducer
  function is specified, @distributed performs local reductions on each worker with a final reduction on the
  calling process.

  Note that without a reducer function, @distributed executes asynchronously, i.e. it spawns independent
  tasks on all available workers and returns immediately without waiting for completion. To wait for
  completion, prefix the call with @sync, like :

  @sync @distributed for var = range
      body
  end
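Applied to your conversion, a rough sketch might look like the following. This is only an outline under a few assumptions: each worker writes its own HDF5 file (parallel writes into a single file would require the MPI route), process_star! is a hypothetical function wrapping the per-star work from your script, convert.jl is a made-up file defining it, and reopening the output file on every iteration is kept only for brevity.

using Distributed
addprocs(4)                        # or start Julia with -p 4

@everywhere using CSV, DataFrames, HDF5
@everywhere include("convert.jl")  # hypothetical file defining process_star!

rows = collect(eachrow(table))
@sync @distributed for i in eachindex(rows)
    # one output file per worker, so no two processes touch the same HDF5 file
    h5open("ocvs_part_$(myid()).h5", "cw") do fid
        process_star!(fid, rows[i], full_headers, input_dir_I, input_dir_V)
    end
end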
1 Like