Creating a struct from a YAML file

I want to provide a struct with configuration variables (parameters) to my simulation.

I have a yaml file like this one:

system:
    log_file: "data/log_8700W_8ms" # filename without extension  [replay only]
    time_lapse:   1.0              # relative replay speed
    sim_time:   100.0              # simulation time             [sim only]

(Well, in reality much bigger, see: https://github.com/ufechner7/KiteUtils.jl/blob/main/data/settings.yaml)

Now I would like to create the following struct automatically:

using Parameters
@with_kw mutable struct Settings @deftype Float64
    log_file::String      = ""
    time_lapse            = 0
    sim_time              = 0
end

const SETTINGS = Settings()

In addition I would like to autogenerate the code that reads the yaml file values
into the struct, like this:

function se(project)
    dict = YAML.load_file(joinpath(DATA_PATH[1], project))
    SETTINGS.log_file    = dict["system"]["log_file"]
    SETTINGS.time_lapse  = dict["system"]["time_lapse"]
    SETTINGS.sim_time    = dict["system"]["sim_time"]
    SETTINGS
end

What would be a good approach to do that?

Background:
I would like to provide a function se() to many modules and users such that they can
easily use the same settings in many modules of the project. It should also be
possible for them to add or delete parameters, but that would never happen frequently,
so if this generation process is slow that doesn’t matter.
What does matter is that the runtime access is fast, and in my inner loop I save
100ns if I have my parameters defined as constant mutable struct.

JSON3.jl has this functionality, but I’m not sure why it’s not in a generic, format-agnostic package so it could handle YAML, TOML, etc. @quinnj


Interesting!

Matlab code for this task (well, not enough features, but still…):
https://github.com/llerussell/ReadYAML

I am also looking for the best solution to this. Some time ago I had a look at Configurations.jl and StructTypes.jl, but I am not fully convinced I want to treat my simulation settings this way.

Now I’m curious, do you get the desired runtime access speed from the Matlab code? Ignoring the performance, it’s easy enough to emulate a subset of the Matlab struct functionality by wrapping a dictionary in a type with property overloading and reading your YAML into that type.

For what it’s worth, I usually handle the type instability in these kinds of scenarios with a function barrier in front of the inner loops.
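The dictionary-wrapping idea could be sketched like this (`ConfigWrapper` is a hypothetical name, not from any package; note this only changes the syntax, each access is still a Dict lookup):

```julia
# Minimal sketch of wrapping a Dict in a type with property overloading.
struct ConfigWrapper
    dict::Dict{String,Any}
end

# Translate `cfg.sim_time` into `getfield(cfg, :dict)["sim_time"]`.
Base.getproperty(c::ConfigWrapper, s::Symbol) =
    s === :dict ? getfield(c, :dict) : getfield(c, :dict)[String(s)]

cfg = ConfigWrapper(Dict("time_lapse" => 1.0, "sim_time" => 100.0))
cfg.sim_time  # looks like struct access, but is still type-unstable
```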

I did a small benchmark:

using BenchmarkTools, Parameters, YAML

@with_kw mutable struct Settings @deftype Float64
    log_file::String      = ""
    time_lapse            = 0
    sim_time              = 0
end

const set = Settings()
const DATA_PATH = ["./data"]
const dict = YAML.load_file(joinpath(DATA_PATH[1], "settings.yaml"))

function se(project="settings.yaml")
    dict = YAML.load_file(joinpath(DATA_PATH[1], project))
    set.log_file    = dict["system"]["log_file"]
    set.time_lapse  = dict["system"]["time_lapse"]
    set.sim_time    = dict["system"]["sim_time"]
    set
end

function simulate1()
    res=1.0
    for i in 1:1000
        res += set.time_lapse
        res += set.sim_time
    end
    res
end

function simulate2()
    res=1.0
    for i in 1:1000
        res += dict["system"]["time_lapse"]
        res += dict["system"]["sim_time"]
    end
    res
end

se()

Results:

julia> @benchmark simulate1()
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.740 μs …  10.901 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.742 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.755 μs ± 157.401 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █              ▂                                            ▁
  █▆▁▄▄▃▁▃▁▁▁▁▁▁▁█▆▃▄▁▃▁▁▁▁▃▁▃▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▅▅▄ █
  1.74 μs      Histogram: log(frequency) by time      1.89 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark simulate2()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  127.447 μs … 944.569 μs  ┊ GC (min … max): 0.00% … 81.83%
 Time  (median):     130.597 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   133.943 μs ±  25.210 μs  ┊ GC (mean ± σ):  0.40% ±  2.21%

   ▁▄▇██▇▅▃▂▂▁    ▁▃▄▄▄▂▁  ▁▁▁▁                                 ▂
  ▅████████████▆▆▇███████▇▇███████▇▆▆▅▅▃▄▁▆▆▄▄▅▅▅▅▆▅▇▇█▇▆▇▇▆▆▅▆ █
  127 μs        Histogram: log(frequency) by time        163 μs <

 Memory estimate: 31.25 KiB, allocs estimate: 2000.

So access via a struct is about 75 times faster than dictionary access…


That is normal for type-unstable code, which indeed is the drawback of pulling things out of a dict in a hot loop.

What I alluded to was, and I acknowledge that it’s just a workaround,

# Function barrier.
function simulate3()
    inner_loop(dict["system"]["time_lapse"], dict["system"]["sim_time"])
end

function inner_loop(time_lapse, sim_time)
    res=1.0
    for i in 1:1000
        res += time_lapse
        res += sim_time
    end
    res
end

A similar workaround is to eliminate the type instability by a type assertion:

function simulate4()
    time_lapse::Float64 = dict["system"]["time_lapse"]
    sim_time::Float64 = dict["system"]["sim_time"]
    res=1.0
    for i in 1:1000
        res += time_lapse
        res += sim_time
    end
    res
end

I agree that your workaround would help, even though it is still slower than the first solution:

julia> repr(@benchmark simulate1())
"Trial(1.740 μs)"

julia> repr(@benchmark simulate3())
"Trial(1.907 μs)"

julia> repr(@benchmark simulate4())
"Trial(2.072 μs)"

But I have about 30 parameters; passing all of them as function arguments is not practical. In addition, I have about 10 functions that need these parameters, so defining local, typed variables in each of them would add a lot of lines of code.

I think a const global mutable struct is still the best solution (fastest with least amount of code), but I want to autogenerate it.

A simple solution is to convert the dict to a namedtuple:

dicts_to_nt(x) = x
dicts_to_nt(d::Dict) = (; (Symbol(k) => dicts_to_nt(v) for (k, v) in d)...)

params = YAML.load(...) |> dicts_to_nt

# then pass params to all your functions that need them:
function simulate(params)
...
end

simulate(params)

NamedTuples are basically “ad-hoc structures”; it makes sense to use them when you think you need to autogenerate a struct.
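For the YAML snippet from the original post, the conversion could look roughly like this (a sketch reusing the `dicts_to_nt` definition above):

```julia
using YAML

# Recursively convert (nested) Dicts into (nested) NamedTuples.
dicts_to_nt(x) = x
dicts_to_nt(d::Dict) = (; (Symbol(k) => dicts_to_nt(v) for (k, v) in d)...)

yaml = """
system:
    time_lapse: 1.0
    sim_time: 100.0
"""
params = dicts_to_nt(YAML.load(yaml))
params.system.sim_time  # plain, type-stable field access
```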


Named tuple looks nice… very little code to write, very fast …

But they are not mutable… And that means, if I define a const global named tuple I cannot change the values at runtime…

Passing it as a function argument is not easy for me, because I work with solvers and have callback functions where I cannot always add custom parameters…

One example is NLsolve.jl. It expects the following signature for the callback function:
f!(F::AbstractArray, x::AbstractArray)

I would not know how to add a named tuple here…

Immutability is a feature and not a drawback (:

I think this should work for callbacks:

solve(x -> f(x, params))

Btw, a convenient way to set nested fields is available in the Accessors.jl package:

using Accessors
new_params = @set params.some.deep.field = 123

Your solution for callbacks is working, but it slows down the solution by a factor of 600:

using NLsolve, YAML

const DATA_PATH = ["./data"]
const dict = YAML.load_file(joinpath(DATA_PATH[1], "settings.yaml"))

dicts2nt(x) = x
dicts2nt(d::Dict) = (; (Symbol(k) => dicts2nt(v) for (k, v) in d)...)
const nt = dicts2nt(dict)

function f!(F, x)
    F[1] = (x[1]+3)*(x[2]^3-7)+18
    F[2] = sin(x[2]*exp(x[1])-1)
end

function j!(J, x)
    J[1, 1] = x[2]^3-7
    J[1, 2] = 3*x[2]^2*(x[1]+3)
    u = exp(x[1])*cos(x[2]*exp(x[1])-1)
    J[2, 1] = x[2]*u
    J[2, 2] = u
end

function f1!(F, x, params)
    F[1] = (x[1]+3)*(x[2]^3-7)+18
    F[2] = sin(x[2]*exp(x[1])-1)
end

function j1!(J, x, params)
    J[1, 1] = x[2]^3-7
    J[1, 2] = 3*x[2]^2*(x[1]+3)
    u = exp(x[1])*cos(x[2]*exp(x[1])-1)
    J[2, 1] = x[2]*u
    J[2, 2] = u
end

nlsolve(f!, j!, [ 0.1; 1.2])
@time nlsolve(f!, j!, [ 0.1; 1.2])

const params = nt
nlsolve(((F, x) -> f1!(F, x, params)), ((J, x) -> j1!(J, x, params)), [ 0.1; 1.2] )
@time nlsolve(((F, x) -> f1!(F, x, params)), ((J, x) -> j1!(J, x, params)), [ 0.1; 1.2] )

Output:

julia> include("src/Solve.jl")
  0.000029 seconds (57 allocations: 3.938 KiB)
  0.018826 seconds (36.23 k allocations: 2.016 MiB, 99.69% compilation time)
Results of Nonlinear Solver Algorithm
 * Algorithm: Trust-region with dogleg and autoscaling
 * Starting Point: [0.1, 1.2]
 * Zero: [-3.487552479724522e-16, 1.0000000000000002]
 * Inf-norm of residuals: 0.000000
 * Iterations: 4
 * Convergence: true
   * |x - x'| < 0.0e+00: false
   * |f(x)| < 1.0e-08: true
 * Function Calls (f): 5
 * Jacobian Calls (df/dx): 5

So for now named tuples do not work for me.

Namedtuples have the same performance as custom structs.
Looks like you benchmarked the code in global scope. What happens when it is executed from a function, as it should be?

function solve1()
    nlsolve(f!, j!, [ 0.1; 1.2])
end
solve1()
@time solve1()

const params = nt
function solve2()
    nlsolve(((F, x) -> f1!(F, x, params)), ((J, x) -> j1!(J, x, params)), [ 0.1; 1.2] )
end
solve2()
@time solve2()

Output:

julia> include("src/Solve.jl")
  0.000028 seconds (57 allocations: 3.938 KiB)
  0.000030 seconds (57 allocations: 3.938 KiB)
Results of Nonlinear Solver Algorithm
 * Algorithm: Trust-region with dogleg and autoscaling
 * Starting Point: [0.1, 1.2]
 * Zero: [-3.487552479724522e-16, 1.0000000000000002]
 * Inf-norm of residuals: 0.000000
 * Iterations: 4
 * Convergence: true
   * |x - x'| < 0.0e+00: false
   * |f(x)| < 1.0e-08: true
 * Function Calls (f): 5
 * Jacobian Calls (df/dx): 5

OK, this solves the performance issue in the simplest case…
Still to be tested is the performance when I change values in the named tuple. Does that cause recompilation? It does not for a const mutable struct.
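One possible way to keep a const binding while still allowing runtime updates (my own sketch, not tested against your solver setup) is to hold the NamedTuple in a const `Ref`. The `Ref`'s element type is the same concrete NamedTuple type before and after an update, so functions that read it stay type-stable and should not be recompiled when values change:

```julia
# Hold the settings in a const Ref to a NamedTuple.
const PARAMS = Ref((; time_lapse = 1.0, sim_time = 100.0))

read_speed() = PARAMS[].time_lapse   # type-stable read

# Update at runtime by replacing the whole tuple; `merge` keeps the
# remaining fields and the concrete type unchanged.
PARAMS[] = merge(PARAMS[], (; time_lapse = 2.0))
```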

Autogenerating structs is also not so difficult:

using Parameters, YAML, OrderedCollections

settings_yaml="""
system:
    log_file: "data/log_8700W_8ms" # filename without extension  [replay only]
    time_lapse:   1.0              # relative replay speed
    sim_time:   100.0              # simulation time             [sim only]
"""
const DATA_PATH = ["./data"]
const dict = YAML.load(settings_yaml; dicttype=OrderedDict{String,Any})

function parse_dict(dict)
    res = "@with_kw mutable struct Settings\n"
    for (name, value) in dict["system"]
        res *= name * "::" * repr(typeof(value)) * " = " * repr(value) * "\n"
    end
    return res * "\nend"
end

code = parse_dict(dict)
ast  = Meta.parse(code)
eval(ast)
const set = Settings()

Output:

Settings
  log_file: String "data/log_8700W_8ms"
  time_lapse: Float64 1.0
  sim_time: Float64 100.0

Not yet implemented: Nested structs.
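A recursive variant (untested sketch; `gen_structs` is a made-up name, and it assumes all keys are valid, unique Julia identifiers) could generate one struct per nested mapping:

```julia
using Parameters, YAML, OrderedCollections

# Untested sketch: generate one @with_kw struct per nested mapping.
# Struct names are derived from the keys by capitalizing them.
function gen_structs(dict, name="Settings")
    inner = ""
    res   = "@with_kw mutable struct $name\n"
    for (key, value) in dict
        if value isa AbstractDict
            sname  = uppercasefirst(key)
            inner *= gen_structs(value, sname)
            res   *= "    $key::$sname = $sname()\n"
        else
            res   *= "    $key::$(typeof(value)) = $(repr(value))\n"
        end
    end
    return inner * res * "end\n"   # inner structs must be defined first
end

yaml = """
system:
    time_lapse: 1.0
    sim_time: 100.0
"""
dict = YAML.load(yaml; dicttype=OrderedDict{String,Any})
eval(Meta.parse("begin\n" * gen_structs(dict) * "end"))
const set = Settings()   # then access as set.system.sim_time
```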

@mcmcgrath13 and I had briefly discussed splitting the type-generation code out from JSON3.jl. We can still do it; we’ve just been letting that code mature a bit in JSON3.jl (as noted by a few recent enhancement issues).


Great!

Hi,

I’m also trying to figure out some sort of simulation configuration from YAML (or JSON3). I already have a module with default parameters as:

# Simulation Configuration(s)
module SimConf

using YAML
using Configurations
using JSON3
using Logging

# Params (Note! Defaults given here!)
@option "Sim" struct Sim
    name::String = "unspecified"
    seed::Int64 = 0x1234
    n_frames::Int64 = 10
end
@option "Tx" struct Tx
    n_bytes::Int64 = 13
    rnti::Int16 = 0x5555
end
@option "Channel" struct Channel
    snr_db::Vector{Float64} = [snr_db for snr_db in 2:7]
end
@option "Params" struct Params
    sim::Sim = Sim()
    tx::Tx = Tx()
    channel::Channel = Channel()
end

function read(fn_yml)
    read_yml = YAML.load_file(fn_yml; dicttype = Dict{String,Any})
    p = Configurations.from_dict(Params, read_yml)
    @info "Params:"
    @info "  name: $(p.sim.name)"
    @info "  seed: $(p.sim.seed)"
    @info "  n_frames: $(p.sim.n_frames)"
    @info "  n_bytes: $(p.tx.n_bytes)"
    @info "  rnti: $(p.tx.rnti)"
    @info "  snr_db: $(p.channel.snr_db)"
    return p
end

function write(fn_json, p)
    d = Configurations.to_dict(p, JSONStyle)
    JSON3.write(fn_json, d)
end

end

This works already quite nice with YAML as:

# Simulator Parameters
sim:
  name:     test
  seed:     0x123456
  n_frames: 2

# Tx Parameters
tx:
  n_bytes:  13
  rnti:     0x4567

# Channel Parameters
channel:
  snr_db:   [1.0, 2.0, 3.0]

But I wasn’t able to use a StepRange like 1.0:0.5:3.0 or a comprehension for snr_db. Do you think there would be some way to use these in the YAML file?


In my not so humble opinion, don’t do that even though it is possible. Either spell out the list or store the start, stop, step values as separate parameters. Limiting yourself to basic data in configuration files wins in the long run.
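The separate-parameters variant could be sketched like this (hypothetical key names, rebuilding the range on the Julia side):

```julia
using YAML

# Store the range endpoints as plain scalars in the YAML file...
yaml = """
channel:
  snr_db_start: 1.0
  snr_db_step:  0.5
  snr_db_stop:  3.0
"""
d = YAML.load(yaml)["channel"]
# ...and reconstruct the StepRange in code:
snr_db = d["snr_db_start"]:d["snr_db_step"]:d["snr_db_stop"]
collect(snr_db)  # [1.0, 1.5, 2.0, 2.5, 3.0]
```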
