I am wondering how everyone properly defines structs for their own modules. I read that in Julia, one usually defines one base module where general packages are imported, the files containing one's own functions are included via include, and finally the relevant functions are exported.
Usually, there is also an included file that first defines the structs that serve as the output of user-defined functions. Here comes my question: how do you best define structs if you write several functions that produce similar output?
2 examples:
One might be interested in generating hidden Markov models; there are many variants (basic HMM, semi HMM, factorial HMM…), and, depending on the model, there are different kinds of interesting output.
One might be interested in doing Bayesian inference via MCMC, but depending on the algorithm (HMC, SMC, Gibbs, etc.) there are different output statistics of interest.
I have looked through several popular packages, and, as far as I can tell, people usually first define a general type and then a subtype for each individual algorithm via struct. One could, alternatively, use the Parameters module, define only one struct with the fields of all similar algorithms (i.e. output statistics of HMC, SMC and the Gibbs sampler all in one type) and assign a default value of nothing to each field. You then populate this struct with only the relevant fields for each algorithm; the rest stay at nothing.
I have not seen anyone using the second method, but it seems to me it has some advantages. For instance, I can populate the struct by keyword (in any order?) when I create it, and everything that is not named will just have nothing as its default value. This is also an advantage if I later have another idea and want to add an additional field to the struct. With the first, popular method, I would need to add that new field in every file where I previously defined such a struct. With the “Parameters” method, I would only need to set it where it matters, and in all other cases it defaults to nothing. This seems especially nice if the module is still very new and experimental.
So, why is the first method so dominant in all popular modules? What would you recommend for the two examples I stated above? I would love it if the functionality of the Parameters module were included in the base version of Julia; it seems super helpful.
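For reference, the "one struct with nothing defaults" idea can be sketched without any extra package using `Base.@kwdef`, which ships with Julia. This is only an illustration: the type name `SamplerOutput` and its fields are hypothetical and not taken from any existing package.

```julia
# Hypothetical single result type covering several samplers. Fields that a
# given sampler does not produce keep their default value `nothing`.
Base.@kwdef struct SamplerOutput
    chain::Vector{Float64}                             # required keyword, no default
    acceptance_rate::Union{Nothing,Float64} = nothing  # e.g. only set by Metropolis/HMC
    ess::Union{Nothing,Float64} = nothing              # e.g. only set by SMC
end

# Keywords can be given in any order; unnamed fields fall back to their defaults.
hmc_out = SamplerOutput(acceptance_rate = 0.65, chain = [0.1, 0.4, 0.3])
gibbs_out = SamplerOutput(chain = [1.0, 2.0])

# Code that consumes the result has to check for `nothing` explicitly.
gibbs_out.acceptance_rate === nothing   # true
```

The price of this design is visible in the last line: every function that reads an optional field has to handle the `nothing` case itself, whereas separate types let dispatch do that work.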
I am not sure I understand what you mean here. Most packages have a single module. Complex and large packages may get organized into submodules. include simply includes code into the same module.
Sometimes programmers define an abstract type, and then subtypes for it (which may be abstract or concrete types). Abstract types are mostly useful for dispatch. In your particular examples, I would just try to see if composition helps (search for “composition inheritance”). Making a general struct for which fields may be nothing could also work, it is hard to say more without details.
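As a rough sketch of what "abstract types are mostly useful for dispatch" means in practice (all names here are made up for illustration): generic methods are written once against the abstract type, and more specific methods are added only where a concrete type needs different behaviour.

```julia
abstract type AbstractSamplerOutput end

struct MetropolisOutput <: AbstractSamplerOutput
    chain::Vector{Float64}
    acceptance_rate::Float64
end

struct GibbsOutput <: AbstractSamplerOutput
    chain::Vector{Float64}
end

# Generic method: works for every subtype, assuming it has a `chain` field.
chain_mean(out::AbstractSamplerOutput) = sum(out.chain) / length(out.chain)

# Fallback plus a specialised method; Julia picks the most specific one.
describe(out::AbstractSamplerOutput) = "chain of length $(length(out.chain))"
describe(out::MetropolisOutput) = "Metropolis, acceptance rate $(out.acceptance_rate)"

describe(GibbsOutput([1.0, 3.0]))            # uses the fallback
describe(MetropolisOutput([1.0, 3.0], 0.5))  # uses the specialised method
```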
Yes, absolutely. Apologies for my vague explanation. I read about your “composition inheritance” tips, and a nested composition structure seems to work for my example. I made a more specific one below. Let's say I want to implement several MCMC algorithms. They all share that, at the end of sampling, one wants access to the corresponding Markov chain output, but other statistics might be of interest depending on the specific sampler. So I created a general struct with the fields that are of interest for every sampler, and then structs that hold this struct as a field plus the sampler-specific fields:
abstract type AbstractMCMC end
abstract type AbstractMetropolis <: AbstractMCMC end
abstract type AbstractGibbs <: AbstractMCMC end
struct MCMCGeneral <: AbstractMCMC
    Chain::Array{Float64}
    Iterations::Integer
end
struct MCMCMetropolis <: AbstractMetropolis
    mcmc::MCMCGeneral
    MetroplolisSpecific1::Float64
    MetroplolisSpecific2::Float64
end
struct MCMCGibbs <: AbstractGibbs
    mcmc::MCMCGeneral
    GibbsSpecific1::Float64
    GibbsSpecific2::Float64
end
(1) Is this considered okay in Julia? AFAIK, in your DynamicHMC package, you defined the NUTS struct for the output of your sampler. What would you do if you added a separate sampler to your package? (2) If I want to define some functions for objects of type MCMCMetropolis and MCMCGibbs, I assume I should try to make them work for AbstractMCMC and only do additional computations if fields like MetroplolisSpecific1 or GibbsSpecific1 exist? Something like:
using Plots
function outputsummary(x::T; sampler::String) where {T<:AbstractMCMC}
    # do something with x
    plot(1:x.mcmc.Iterations, x.mcmc.Chain)
    # do additional things depending on the sampler
    if sampler == "Gibbs"
        println(1)
    end
    if sampler == "Metropolis"
        println(2)
    end
end
Please note that DynamicHMC is a package I wrote when I was a Julia newbie. Its organization is by no means exemplary, I would do a lot of things differently now (and will do it, when I refactor over the summer).
Sure, working with a hierarchy of types is a valid design pattern in Julia. But frequently, there are multiple such hierarchies, so this becomes unwieldy. Another possible pattern is traits (again, you will find many useful discussions).
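To give a flavour of the trait pattern, here is a minimal sketch. All names (`AcceptanceStyle`, `MetropolisResult`, `GibbsResult`, `acceptance_summary`) are hypothetical; the point is only that the types do not need a common supertype, because a trait function maps each type to a small trait object and dispatch happens on that.

```julia
# Trait: does a sampler result carry acceptance information?
abstract type AcceptanceStyle end
struct HasAcceptance <: AcceptanceStyle end
struct NoAcceptance <: AcceptanceStyle end

# Two unrelated result types (no shared abstract supertype needed).
struct MetropolisResult
    chain::Vector{Float64}
    accepted::Int
end

struct GibbsResult
    chain::Vector{Float64}
end

# Trait function: NoAcceptance by default, types opt in with one method.
AcceptanceStyle(::Type) = NoAcceptance()
AcceptanceStyle(::Type{MetropolisResult}) = HasAcceptance()

# The public function forwards to a method selected by the trait.
acceptance_summary(x) = acceptance_summary(AcceptanceStyle(typeof(x)), x)
acceptance_summary(::HasAcceptance, x) = x.accepted / length(x.chain)
acceptance_summary(::NoAcceptance, x) = nothing
```

A type can take part in several such traits at once, which is what makes this easier to manage than several overlapping type hierarchies.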
Another good principle to keep in mind for Julia is that ideally, only a very small set of functions should need to know what the internals of an object are. The rest should use the latter functions as accessors. This way you can easily refactor code and keep things organized.
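A small sketch of that accessor idea, with made-up type and function names: only `chain` touches the fields, so the layout of the structs can change later without rewriting the functions built on top.

```julia
struct ChainCore
    chain::Vector{Float64}
end

struct MetropolisRun
    core::ChainCore
    step_size::Float64
end

struct GibbsRun
    core::ChainCore
    sweeps::Int
end

# Only these accessor methods know how the structs are laid out.
chain(c::ChainCore) = c.chain
chain(r::Union{MetropolisRun,GibbsRun}) = chain(r.core)

# Everything else is written in terms of the accessors.
iterations(r) = length(chain(r))
posterior_mean(r) = sum(chain(r)) / iterations(r)

run = GibbsRun(ChainCore([1.0, 2.0, 3.0]), 10)
posterior_mean(run)   # 2.0
```

If the chain were later stored directly in `GibbsRun` instead of in a nested `ChainCore`, only the `chain` method for that type would need to change.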
Finally, note that code organization for Julia is something that is learned by exploring a lot of possible variations and finding a style that works for you. Reading code from others is usually helpful. So I would not worry too much about it; just get a prototype working and then keep improving it as the opportunity arises.
Just in case you didn’t know, in your code example MCMCGeneral isn’t concretely typed since Array has two type parameters, number type and number of dimensions - you might get a bit more speed by specifying it (or parameterising by it).
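To make this concrete, here is a small sketch with hypothetical type names that compares the field declarations from the example with two concretely typed alternatives. Note that `Integer` is an abstract type as well, so the same remark applies to the iteration count.

```julia
# Field types as in the example above: neither is concrete.
struct LooseChain
    Chain::Array{Float64}   # number of dimensions left open
    Iterations::Integer     # abstract type
end

# Option 1: fix the dimension and use a concrete integer type.
struct VectorChain
    Chain::Vector{Float64}  # same as Array{Float64,1}
    Iterations::Int
end

# Option 2: parameterise by element type and number of dimensions.
struct ParamChain{T<:Real,N}
    Chain::Array{T,N}
    Iterations::Int
end

isconcretetype(fieldtype(LooseChain, :Chain))    # false
isconcretetype(fieldtype(VectorChain, :Chain))   # true

p = ParamChain(zeros(3, 2), 3)                   # T and N are inferred
isconcretetype(fieldtype(typeof(p), :Chain))     # true
```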