Application Code Organization

I am developing what I think would be called an application. It is meant for a novice end user who can just change some inputs and then run it. It computes some quantities and then generates a report with text and plot outputs. I would like some input on proper code organization that will 1.) be performant, 2.) organize namespaces/dependencies, and 3.) maintain usability.

Current Structure
This application includes a package and the code to run it in the same repository. It is currently organized as below.

├──run # Part of the repository but not formally part of the package
│   ├──main.jl # using MyPackage; using OtherPackages1; define inputs; include("run*.jl")
│   ├──runfracture.jl # read fracture data; call fracture functions; produce nice outputs;
│   ├──runlocalfailure.jl # read local failure data; call local failure functions; produce nice outputs;
│   ├──runpackagemanager.jl # import Pkg; Pkg.activate("PathToMyPackage"); Pkg.instantiate()
│   ├──runvalidation.jl # read validation data; call validation functions; produce nice outputs;
├──src
│   ├──MyPackage.jl # module MyPackage; export functions; using OtherPackages2; include("*.jl")
│   ├──auxiliary.jl # contains functions needed by other package files/functions
│   ├──fracture.jl # contains fracture functions
│   ├──localfailure.jl # contains local failure functions
│   ├──validation.jl # contains validation functions
├──test
│   ├──manualtesting.jl # where I tinker
│   ├──runtest.jl # formal test wrapper
│   ├──testset.jl # formal test set

The end user would typically only change inputs in main.jl, but could also modify run*.jl to change plot labels or other small tweaks.

Thoughts and Issues

  1. Functional Programming and Global Variables
    The run files are growing into large procedural scripts with lots of global variables as I make the outputs prettier and the inputs more flexible. I know these scripts should be broken down into small functions, but then I don’t know where to put them. What is an acceptable number/size of global variables? Do I just make the whole script a function and call it from main.jl with runvalidation() instead of include("runvalidation.jl")? I could move run functions into the package src, but they would be too specific to be reusable. Maybe I need to make a RunMyPackage sub-package? Maybe wrapping everything in one big main() function is the easiest solution? Ideally, the code inside run would be simple enough for someone who doesn’t know Julia to still be able to follow along.

  2. Namespaces and Dependencies
    So far I haven’t bothered with separate namespaces, but I think I need to start. The obvious choice is to make fracture.jl, localfailure.jl, and validation.jl their own modules since they are independent apart from auxiliary.jl, but then I wondered if that was necessary since they just contain function definitions. Likewise, should the independent runfracture.jl, runlocalfailure.jl, and runvalidation.jl be separated into modules even if those scripts become functions when I fix Issue #1? Does passing inputs through functions eliminate the need to worry about collisions and modules? How would modules in the run scripts interact with modules in the package src?

  3. Ease of Use
    My current instructions for users say to copy the run directory from the package location to a project folder containing the required data, modify main.jl with proper local paths to the data, and then run main.jl. This works well enough, but I wasn’t sure if there was a more common/streamlined procedure. The DrWatson package probably has the best approach, but my end users won’t necessarily know Julia or Git.

Couldn’t these be standard functions accepting a possibly large number of optional inputs? I have organized one package, also meant to be run by non-Julia users, with something like:

The data of the problem is defined by the constructor of a struct, which receives at least the mandatory options and sets everything that is optional to standard values:

fracture_data = FractureData("data_file.dat",optional=1.,other_option=2.)

Then some function runs whatever has to be run, and produces results (in my case this takes a long time, so I have decoupled it from the nice outputs, and allowed the user to save the result to a file for possible later reading):

results = compute_results(fracture_data)

And then I have some functions to produce beautiful results:
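
A minimal sketch of this three-step pattern (input struct, computation, post-processing) could look like the following. It is only an illustration: the field names, the default values, the FractureResults type, and the bodies of compute_results and plotresult are all assumptions, and plotresult returns a string here instead of drawing an actual plot.

```julia
# Hypothetical sketch: field names, defaults, and computations are illustrative only.

# Input data: one mandatory file name plus optional settings.
struct FractureData
    file::String
    optional::Float64
    other_option::Float64
end

# Keyword constructor: only the file name is required, the rest have defaults.
FractureData(file::String; optional=1.0, other_option=2.0) =
    FractureData(file, optional, other_option)

# Results live in their own struct so they can be saved and reloaded later.
struct FractureResults
    values::Vector{Float64}
end

# The expensive step, decoupled from any plotting or reporting.
function compute_results(data::FractureData)
    # Placeholder computation standing in for the real analysis.
    return FractureResults([data.optional, data.other_option,
                            data.optional + data.other_option])
end

# Post-processing step (returns report text instead of a real plot).
plotresult(results::FractureResults) = "max value = $(maximum(results.values))"

# What the end user would write:
fracture_data = FractureData("data_file.dat", optional=1.0, other_option=2.0)
results = compute_results(fracture_data)
report = plotresult(results)
```

The user-facing part is just the last three lines; everything above them would live inside the package.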


Finally, I provide an example script, which is more or less “Julia-independent” and self-explanatory. Here is one example: Quick Guide · ComplexMixtures. In the end, the user writes such an example to a file and runs it with Julia. It has worked nicely so far, although most of the users are still my students.

Where do you keep the definitions of FractureData, compute_results, and plotresult; at the top of some “run” file or split off into separate files/directories?

Uhm… my package is probably not as complex as yours. The FractureData in my case is a structure which is defined in the main module of the package (in my case the data is what I call the Trajectory, and the main module is called ComplexMixtures). The functions are also all at the level of the main module, defined in individual files. Thus I have a simple module like:

module ComplexMixtures
  include("Trajectory.jl")
  ... etc.
end

and Trajectory.jl defines the Trajectory struct and its constructors, which have many optional parameters. The user only sees that he/she has to call this constructor with, at least, a file name:

mytraj = Trajectory("./mytraj.dcd")

I do not have a run directory. I will write some examples and save them in an examples directory, but mostly the running script is simple and can be copied/pasted from the docs directly.

I do have one subdirectory src/trajectory_formats in which I define some functions to read different types of “Trajectories”, which a more experienced user might want to modify or complement with a new type. But most (if not all) users will never look there.

Thus, my package is quite “flat”. What you might take from it is that the user can read the sequence of commands on how to use it without knowing anything about Julia. Except for the using statements, it could be Python or anything else; no programming knowledge is needed to follow the example.

My MyPackage.jl and main.jl files are already simple and, I think, similar to your approach. I am having trouble simplifying and organizing the code in run*.jl, which is one step deeper. The additional complexity in my run*.jl files comes from having to organize the data and do different things based on what data is available. For example, runvalidation.jl will compute and plot theoretical values if no file is available. However, if there is a file to compare against, then the theoretical values are computed using the same discretization, several error metrics are computed, and some additional plots comparing the values are shown. For runfracture.jl, the user can provide a single file, a pair of files, a vector of single files, or a vector of pairs of files. The report needs to be pretty and make sense in all instances, so I need logic to handle all of that. I don’t know where and how to store functions that handle that kind of pre- and post-processing complexity. Right now, that logic is all laid out sequentially in these run*.jl files, but they are long to read. I am also afraid that the number of global variables in these long logic puzzles is hurting my performance.

In my case, all the complexity of the possible variations of the user input is handled in the Trajectory generator.

For example (in my case) the user can provide information about two different species (the solute and the solvent, or only one of them). I deal with that with multiple dispatch. If the user calls


one generator is called which deals with that kind of data. If the user calls


another generator is called which does what it has to do.

This Trajectory constructor also receives possibly many optional parameters.
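
As a rough illustration of constructors selected by multiple dispatch, here is a small sketch. The types Species and Traj, their fields, and the stride option are hypothetical and chosen only for this example; they are not the ComplexMixtures API.

```julia
# Hypothetical types for illustration only.
struct Species
    name::String
end

struct Traj
    file::String
    solute::Union{Species,Nothing}  # `nothing` when only a solvent is given
    solvent::Species
    stride::Int
end

# Method 1: both solute and solvent are provided.
Traj(file::String, solute::Species, solvent::Species; stride::Int=1) =
    Traj(file, solute, solvent, stride)

# Method 2: only the solvent is provided.
Traj(file::String, solvent::Species; stride::Int=1) =
    Traj(file, nothing, solvent, stride)

# The user picks a method simply by the arguments passed:
t1 = Traj("mytraj.dcd", Species("water"))
t2 = Traj("mytraj.dcd", Species("protein"), Species("water"); stride=10)
```

Each method can do its own setup, and optional parameters are handled by keyword arguments with defaults.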

Of course I don’t know if that applies to your case, but one can imagine that you could define a single function or constructor which, if it does not receive the data file as input, calls runvalidation to generate the theoretical values, or does what it has to do if it does receive the file as input. Also, runfracture could have different methods depending on the number of files provided. I would try to hide all that from the user using multiple method definitions.
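
For the fracture case, the four input shapes mentioned earlier (a single file, a pair of files, a vector of files, a vector of pairs) could be covered by three methods. This is only a sketch: the returned strings are placeholders for the real report logic, and the signatures are an assumption about how the inputs might be represented.

```julia
# Hypothetical sketch; return values stand in for the real report.

# A single file.
runfracture(file::String) = "report for $file"

# A pair of files.
runfracture(pair::Tuple{String,String}) =
    "comparison of $(pair[1]) and $(pair[2])"

# A vector of single files or of pairs: reuse the methods above per element.
runfracture(inputs::AbstractVector) = [runfracture(x) for x in inputs]

single = runfracture("a.dat")
paired = runfracture(("a.dat", "b.dat"))
many = runfracture(["a.dat", "b.dat"])
many_pairs = runfracture([("a.dat", "b.dat"), ("c.dat", "d.dat")])
```

The user always calls runfracture; which branch of the logic runs is decided by the type of the argument rather than by if/else blocks in the run script.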

I think I am confused about the common advice to avoid global variables in general though. Won’t any variable you want returned from a function to the user appear in the global scope? In that sense, wrapping the entire code in main() seems to be the only way to truly avoid using global variables.

Okay, I could see multiple dispatch on functions helping to solve the pre- and post-processing issue. You chose to put all that functionality inside the package itself, correct? I am wondering if I should do the same or do something like create a separate “helper” package. I guess it depends on how generally applicable I want to keep the original package.

The important part is to avoid global variables in the part of the code that must be fast. For example:

julia> module MyPackage
         struct A
           x :: Vector{Float64}
         end
         function mysum(a::A)
           s = 0.
           for i in 1:length(a.x)
             s += a.x[i]
           end
           return s
         end
         export A, mysum
       end

julia> using .MyPackage

julia> myA = A(rand(10000)); # myA is global

myA is global for the user, but that does not imply that it will be global when passed to the mysum function, which will run type-stable and fast:

julia> using BenchmarkTools

julia> @btime mysum($myA)
  10.029 μs (0 allocations: 0 bytes)

julia> @code_warntype mysum(myA)
  #self#::Core.Compiler.Const(Main.MyPackage.mysum, false)
  @_4::Union{Nothing, Tuple{Int64,Int64}}

1 ─       (s = 0.0)
│   %2  = Base.getproperty(a, :x)::Array{Float64,1}
│   %3  = Main.MyPackage.length(%2)::Int64
│   %4  = (1:%3)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(1, false), Int64])
│         (@_4 = Base.iterate(%4))
│   %6  = (@_4 === nothing)::Bool
│   %7  = Base.not_int(%6)::Bool
└──       goto #4 if not %7
2 ┄ %9  = @_4::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%9, 1))
│   %11 = Core.getfield(%9, 2)::Int64
│   %12 = s::Float64
│   %13 = Base.getproperty(a, :x)::Array{Float64,1}
│   %14 = Base.getindex(%13, i)::Float64
│         (s = %12 + %14)
│         (@_4 = Base.iterate(%4, %11))
│   %17 = (@_4 === nothing)::Bool
│   %18 = Base.not_int(%17)::Bool
└──       goto #4 if not %18
3 ─       goto #2
4 ┄       return s

Thus, the user can define all their variables in the global scope, as long as you provide interfaces (function barriers) to which the user must pass the data. Those functions will run fast.
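
Applied to a run script, the same idea might look like the sketch below. The names loads, scale, and run_report are hypothetical; the point is only that the inputs stay as editable globals at the top while all the work happens inside a function that receives them as arguments.

```julia
# Hypothetical run script: names and numbers are illustrative only.

# User-editable inputs, defined in global scope.
loads = [1.0, 2.5, 4.0]
scale = 2.0

# All the work happens inside a function that receives the inputs as
# arguments, so their concrete types are known when it is compiled.
function run_report(loads::Vector{Float64}, scale::Float64)
    total = 0.0
    for x in loads
        total += scale * x
    end
    return total
end

# Global variables are only touched at this single call site.
total = run_report(loads, scale)
```

The result total is itself a global for the user, which is fine: nothing performance-critical reads it from inside a function.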

Well, I find that confusing. I prefer examples.

Advice regarding global variables usually refers to situations like this:

A = 1
function f(x)
    return A + x
end

f(1) # not ok

julia> @code_warntype f(1)

1 ─ %1 = (Main.A + x)::Any
└──      return %1

i.e., when you use variables from the global scope inside the function. If you pass the value as an argument, it is not an issue:

A = 1
function f(x, B)
    return B + x
end

f(1, A) # ok

julia> @code_warntype f(1, A)

1 ─ %1 = (x + B)::Int64
└──      return %1
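
The same comparison can be made programmatically instead of reading the @code_warntype output. The sketch below uses Base.return_types, which is an unexported reflection helper, and the names f_global and f_arg are chosen here only to keep both variants side by side.

```julia
A = 1                      # non-constant, untyped global

f_global(x) = A + x        # reads the global directly
f_arg(x, B) = B + x        # receives the value as an argument

# Return types inferred by the compiler for Int arguments.
t_global = only(Base.return_types(f_global, (Int,)))   # not concrete
t_arg = only(Base.return_types(f_arg, (Int, Int)))     # concrete

println("f_global infers ", t_global, ", f_arg infers ", t_arg)
```

Because the compiler cannot assume anything about the type of an untyped non-constant global, the first function is inferred as Any, while the second is inferred as Int64 on a 64-bit machine.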

Sorry, I am just wondering again about the best place to put function calls and definitions that deal with pre-processing or post-processing. These function calls will likely live inside runfracture.jl etc. However, the definitions could exist in the same file as the call, a separate file outside of the package, a separate file inside of the package src, or just be added to the bottom of the existing src files. I wasn’t sure if these types of functions/structs should be packaged separately or added to the existing package.

I would define precise functions for each of these operations, write each one in its own file, put those files in the src directory, and include them in the main module of the package.

module MyPackage
   export preprocess, computethings, postprocess
   # include the file that defines each function here
end

so that the user can do the following (which I would also provide as an example):

using MyPackage
preprocessed_data = preprocess("userfile.dat")
result = computethings(preprocessed_data)
postprocessed_data = postprocess(result)

where preprocessed_data, result and postprocessed_data are instances of corresponding structs that contain the data organized as you think is reasonable.

There is no such thing as methods for structs, right? Would you just set fields to nothing if a file does not exist? I’m not sure how to make data structures work as flexibly as function method definitions.

That is one alternative. Another is to keep the struct types “invisible” to the user. For example:

This would be implemented in the package:

julia> struct WithFile
         f :: String
       end

julia> struct WithoutFile end

julia> preprocess_data() = WithoutFile()
preprocess_data (generic function with 1 method)

julia> preprocess_data(f::String) = WithFile(f)
preprocess_data (generic function with 2 methods)

julia> compute(data::WithoutFile) = "Computation without file"
compute (generic function with 1 method)

julia> compute(data::WithFile) = "Computation with file: $(data.f)"
compute (generic function with 2 methods)

The “user” starts here:

julia> user_data_without_file = preprocess_data()

julia> compute(user_data_without_file)
"Computation without file"

julia> user_data_with_file = preprocess_data("")

julia> compute(user_data_with_file)
"Computation with file:"

Actually there are, the functors.
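
A functor, also called a function-like object in the Julia manual, is a struct whose instances can be called like functions. A minimal sketch, with the hypothetical type Line chosen only for illustration:

```julia
# Hypothetical example of a functor (callable struct).
struct Line
    slope::Float64
    intercept::Float64
end

# A method attached to instances of Line: calling the object evaluates the line.
(l::Line)(x) = l.slope * x + l.intercept

line = Line(2.0, 1.0)
y = line(3.0)   # 2.0 * 3.0 + 1.0
```

Here the data (slope and intercept) and the behavior travel together, and different struct types can each carry their own call method.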


Woah, I never would have thought of that!

Any comments/suggestions for 2.) submodules or 3.) usage instructions?

In my particular case I have not used submodules.

My way of writing usage instructions is through examples, but that is me. This is one example of how I write manuals. Of course I like that style, but others might prefer different ways to explain things. At the same time, while I do not consider myself a great programmer, what I usually do well is make things easy for others to learn and use.