I’ve recently written a simple script that converts the OCVS (OGLE Collection of Variable Stars) database — basically a single text file containing a list of attributes for every star in the collection, plus a large number of TSV files containing photometry data for each of those stars.
I used the HDF5.jl package because JLD2.jl (which didn’t exhibit the same issue) needs to load the whole database into RAM before editing it (please correct me if I’m wrong).
My code looks exactly like this (except that the table, full_headers, output_file, input_dir_I, and input_dir_V variables are initialized to the values used by the database I’m converting).
For some reason, the code doesn’t utilize more than 20% of the available CPU processing power (when launched with ‘julia --optimize=3 --threads=12’), and the disk’s read/write rates stay significantly below 10% of its capacity.
The code runs for ~25 minutes (on a Lenovo TUF Gaming A15 with a Ryzen 4600H and an Intel 660p SSD) before it finishes creating the new database (the old one being copied is ~28 GB). Without @simd the code performed a little worse, taking ~35 minutes for the same job.
Currently this isn’t an issue, but I may end up doing the same with a database of the same type that is ~1000 times bigger, and without any modifications that would take somewhere around 3 weeks.
Is there any simple way to make this code utilize close to 100% of the available system resources? Or would I be better off rewriting the script in C++ (with the HDF5 C++ API/HighFive and boost::spirit::qi)?
Here is the code:
using DataFrames
using CSV
using HDF5
using Base.Filesystem
h5open(output_file, "w") do output_file
    @simd for row in eachrow(table)
        if !ismissing(row[2])  # skip rows without an identifier
            filename = row[2]
            println(filename)
            g = create_group(output_file, filename)
            # Read the I-band photometry file, if present
            if isfile(joinpath(input_dir_I, "$filename.dat"))
                data = CSV.read(joinpath(input_dir_I, "$filename.dat"), DataFrame, header=false, types=Float32)
                data = Matrix(coalesce.(data, NaN))
                compressed_data = Matrix{Float32}(data)
                g["I"] = compressed_data
            end
            # Read the V-band photometry file, if present
            if isfile(joinpath(input_dir_V, "$filename.dat"))
                data = CSV.read(joinpath(input_dir_V, "$filename.dat"), DataFrame, header=false, types=Float32)
                data = Matrix(coalesce.(data, NaN))
                compressed_data = Matrix{Float32}(data)
                g["V"] = compressed_data
            end
            # Numeric attribute columns; -99.99 is the database's missing-value sentinel
            for i in [1; 5; 6; 11:36]
                if ismissing(row[i]) || isapprox(row[i], -99.99, atol=0.02)
                    attributes(g)[full_headers[i]] = Float32(NaN)
                else
                    attributes(g)[full_headers[i]] = Float32(row[i])
                end
            end
            # String attribute columns
            for i in [3, 4, 7, 8, 9, 10, 37, 38]
                if !ismissing(row[i]) && length(string(row[i])) > 1 && !isnumeric(row[i][2])
                    attributes(g)[full_headers[i]] = string(row[i])
                else
                    attributes(g)[full_headers[i]] = ""
                end
            end
        end
    end
end
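One direction I’ve been considering (please tell me if this is the wrong approach) is parsing the per-star files on multiple threads while serializing the writes, since as far as I know HDF5.jl is not safe to call from several threads at once. Below is a minimal sketch of that pattern, with a plain Dict and a fake parser standing in for the HDF5 file and CSV.read — parse_photometry and results are names I made up for illustration:

```julia
using Base.Threads

# Stand-in for CSV.read + the Matrix conversion: any pure, thread-safe
# work can run in parallel without coordination.
parse_photometry(id) = Float32[id, 2id, 3id]

ids = collect(1:100)                    # stand-in for the filenames in `table`
results = Dict{Int,Vector{Float32}}()   # stand-in for the HDF5 output file
lk = ReentrantLock()                    # serializes access to the shared sink

@threads for id in ids
    data = parse_photometry(id)  # the heavy parsing runs in parallel
    lock(lk) do                  # only the final write is serialized
        results[id] = data
    end
end
```

With --threads=12 the parsing would then scale across cores, and only the (comparatively cheap) write into the output stays serialized; the same shape should apply with CSV.read inside the loop and the HDF5 writes behind the lock.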