Reading model parameters and variables from excel

Hello everyone!

I have been migrating slowly some part of my work from Matlab to Julia and came across a more specific issue. Over the years I have automated some part of my matlab workflow using excel files for reading model parameters and defaults. It looks like this:

I have an excel file with a setup sheet. The columns are my “models”. The rows on the top part are model parameters. The second part is variables, which will be later loaded in the model.

In Matlab you work with this very simply by reading it into a table, let’s call it stab and then picking out whatever you need with one-liners. For example stab('nburn','bg_jul') will give me the parameter nburn, while stab(31:end,'bg_jul') gives me a vector that can tell me which variables I need to load from my dataset, later to be used such as VarNames( stab(31:end,'bg_jul') ), which will give me list of the variable names and they can as a direct input in another table with the dataset to select only the relevant variables. You can also always call the original row and get the relevant entries, irrespective of how they are ordered, because you are using the variable names.

Naturally I started trying to write a code that does this but I am unsure of what is the best way to proceed, specifically what are the best type to try to use? I started, by using the XLSX package and then trying to put stuff in dictionaries and using Symbol but it feels quite cumbersome. Should I use tables? DataFrames seem very powerful but a large dependency and I am trying to use a more light-weight approach.

Also, is there a better way? Porting matlab ideas is not necessarily the best. The top part can be done with structures, because it is always identical, but the lower part with the variable names is always dynamic.

I can provide a minimal working example if somebody wants to play around with but I guess my question is more about the design of the code.

On my laptop (Ryzen 7840U CPU), on battery,

using Pkg
@time Pkg.add("DataFrames")

takes less than 30s.
And

@time using DataFrames

takes 0.89 s.

I would not call this a heavy dependency. Your mileage might very.

I’d say, what I see in your screenshot looks like a DataFrame :wink:

nd = Dict(p.second => p.first for p in pairs(df.FullNames))

And now you can access rows both by name and by index:

nburn = df[nd["nburn"], "bg_jul"]
v = df[31:end, "bg_jul"]

Could try this directly in XLSX.jl.
I have no missing rows, but otherwise identical data:

using XLSX
f=XLSX.open_xlsx_template("extracted_data.xlsx")
stab=f[1]
cols = Dict(last(p) => first(p) for p in (enumerate(stab[1,:])))
rows = Dict(last(p) => first(p) for p in (enumerate(stab[:,1])))

Create two Dicts for row name and col name, indexing the row number and column number.

You can then index like this:

julia> stab[rows["nburn"], cols["bg_jul"]]
5000

julia> stab[2:end, cols["bg_jul"]]
15×1 Matrix{Any}:
 5000
 5000
    1
    0
    1
    4
   12
    0
    0
    0
    0
    0
    0
    0
    0

With a bit of contrivance, you can also extract a block of values this way, too:

julia> stab[XLSX.CellRange(XLSX.CellRef(rows["nburn"], cols["dgp12"]), XLSX.CellRef(rows["lastRow"], cols["bg_jul"]))]
8×6 Matrix{Any}:
 1000  1000  1000  1000  20000  5000
 1000  1000  1000  1000  10000  5000
    1     1     1     1      1     1
    0     0     0     0      0     0
    1     1     1     1      1     1
    4     4     4     4      4     4
   12    12    12    12     12    12
    0     0     0     0      0     0

(Probably easier just to use a DataFrame, though!)

1 Like

If you just want the parameters and you also want the variable names:

julia> hcat(s[rows["nburn"]:rows["lastRow"], cols["FullNames"]], s[rows["nburn"]:rows["lastRow"], cols["bg_jul"]])
8×2 Matrix{Any}:
 "nburn"          5000
 "nsim"           5000
 "suppressout"       1
 "crossRestrict"     0
 "freqMix"           1
 "withinQ_iter"      4
 "H"                12
 "lastRow"           0

Have you seen the XLSX.readtable example and docs? They show the usage as converting to a DataFrame, but you can try the Tables.jl utility functions if you want something more lightweight. I don’t understand the stab(31:end,'bg_jul') part though (that range in the sheet seems just 0s, but maybe I’m misunderstanding Matlab indexing), so I’m not sure how much of your requirements will be accomplished with this, but it should at least be more convenient than handling the dictionaries and Symbol code yourself.

Even with DataFrames.jl, there is not a convenience syntax for accessing the cell values similar to your Matlab syntax (see this issue for discussion). There are other syntaxes that also work (subsetting is particularly effective), but the as-designed method to index into a DataFrame by values is (unfortunately) this:

gstab = groupby(stab, :FullNames)
freqMix_bg_jul = only(gstab[(freqMix,)]).bg_jul

I think the best data structure here is not 2D though. Julia lets you organize (nest) deeper than Excel and Matlab, and you should take advantage of that. Maybe something like:

struct Model
    p::Parameters
    v::Variables
    type::String
end
struct Parameters
    nburn::Int
    nsim::Int
    suppressout::Bool
    ...
end

The indexing gets easier then too:

julia> dgp11.type
"dgp"

julia> dgp11.p.nsim
1000 

julia> dgp11.v.R1q
2

julia> typeof(dgp11)
Model

I’d write a pre-processing function to read your data in with XLSX.jl and transform it into more nested custom containers like above. You could use vectors or named tuples for some fields if you want more flexibility, but you need to be careful to define structs with concrete types. You could also give the struct fields union types e.g. Union{Int, Nothing} to make them optional.

Apologies for the radio silence while you are trying to fix my problems. Let me try to answer.

@ufechner7 and @Eben60 I have been trying to keep my dependencies as few as possible. Also my post is whether there is a neat built-in julia way for this, because I don’t think it is very sophisticated thing. I might try to take the plunge and finally start learning DataFrames at some point but if there is a lightweight fast way to not use it, I would prefer to.

@TimG I have already written a function using the XLSX package, I can share my code in a subsequent post.I converted all strings to a symbol, I converted the numbers to floats, etc. You did it more elegantly, I didn’t know I can just call the column, e.g. cols["bg_jul"], I will keep that in mind. My issues started at the point where I want to make a sub-selection of rows based on the columns (e.g. all rows in “bg_jul” model where the number is unequal 0), and then map these numbers to those names so that if I have a list of those names in different order everything would work. This is where I started using Dicts and where I asked myself if there is an easier way to go about it in Julia.

@digital_carver I haven’t looked at Tables.jl yet, this is a good tip. For the Matlab syntax that you mentioned that you didn’t understand, I shortened it a bit for expositional purposes, so it doesn’t make sense this way :slight_smile: . In the o.g. code, the first row of the matlab table is a ‘special’ row with the RowNames property (not that special, just a cell array of strings with some constraints on how the string is allowed to look like). The full code finds the row in the RowNames with an entry “lastRow” and takes its index (30 in the example above). Then, the RowNames below (31:end) are all variable names, whose list I can define with the matlab command rn=Table.Properties.RowNames(31:end). The nice thing about this rnvariable is, that it is a vector of properly converted names that can be used to call variables in another table. For a dataset data_tab, where each variable is a column I can simply do data_tab(:,rn) and that would call the relevant variables. In the same way, if I have a subset of rn, I can only call that subset.

@Nathan_Boyer hmm, thanks for this! Another reason not to want to use DF is that it is another package that I have to learn and I’ve been trying to keep the number of new things at once to a minimum. It would have taken quite a while to find the proper syntax!
I really like your suggestion with named tuples (I have been using dicts for this so far), but I have to read more on union types, never have used those with structs, so I don’t know the behaviour.