How does one initialize data into a module only once?

I need to load and parse a large file and have that data available before a module is used. I thought to put it in __init__ but discovered it was being called multiple times.

module A
function __init__()
  println("Loading data")
  @assert isempty(Data)
  Data[1] = "1"
  merge!(Data, loadData()) # load lots of data from a file
end
const Data = Dict()
end

The above module is used in various modules (via using), prints out “Loading data” multiple times (10 times in the case I just tried) during precompilation and the assert never fails. After some searching, I see this is expected behavior with an explanation here: module __init__() is called multiple times · Issue #30873 · JuliaLang/julia · GitHub

However, I’m a bit confused. This seems to invalidate what I had expected to be the primary use of __init__. If that isn’t the way to load some data before the module is used, what is? Is there some other idiom to load data into a module only once before it’s used?

Currently on Julia 1.7.3.

How are you “including” the above module? Do you literally mean include?

You should be loading the module via using or import. include should be used exactly once.

See Code Loading · The Julia Language for details.

Sorry for the confusion. I fixed the question. It’s used in various modules via using.

Thank you for clarifying.

The explanation refers to precompilation. This is the process that caches type inference information to disk. Initially, precompilation occurs in parallel. This process should only occur once for a given environment when you use Pkg.add, but it may also be done implicitly in an interactive session via using.

There are two environment variables that can be used to influence precompilation behavior, as documented here:
https://pkgdocs.julialang.org/dev/api/#Pkg.precompile

In particular, see the note about:

  • JULIA_PKG_PRECOMPILE_AUTO
  • JULIA_NUM_PRECOMPILE_TASKS

Could you tell us a bit more about how your code is structured and how you are loading your code? How are you building your environments? How do you add packages? Are you manipulating LOAD_PATH?

This is happening only in my own modules. There is no explicit Pkg.add call involved. I have added standard packages via ]add packagename but this is not related to any of those.

I have lots of modules that using and import each other. There are no dependency cycles (I wrote some code to check that). I’m working in the repl using Revise. Process is: edit module files, run things and the modules get reloaded by Revise.

I have added to LOAD_PATH. I have many modules defined across a hierarchy of directories for organization.

This process should only occur once for a given environment

Revise kicks off precompilation, right? So, it can happen any number of times.

I can understand why __init__() might be called like this. I’m just trying to find a new idiom that will do what I need. In particular, I was surprised that the data wasn’t there. When I use Revise in the REPL, store some data in a global in the module, and edit that module file, Revise reloads the module and the data I stored in it is still there. So, even though it reloaded the module, it kept the data.

But apparently in this precompilation case, it is loading these in separate processes and throwing away the data, which makes sense if it’s only trying to analyze types, but… why call __init__ then? I guess I expected __init__ to be called at runtime (after all (pre)compilation is completed) rather than at compile time (precompilation), though I understand there may be some blurring of those lines.

Parallel precompilation should only be done when manipulating the environment, for example via Pkg.add.

Could you show the output of a REPL session where you see the problem? I’m confused what is triggering the multiple invocations of __init__ in your case.

If this is a Revise.jl issue, then the problem is a bit different. As far as I know Revise.jl does not do parallel precompilation, but there is a lot of magic happening behind the scenes. Do you see the issue once after each time you modify a file?

What we may need is an if statement to figure out if we are merely generating and caching code to disk or not. For example, the output of ccall(:jl_generating_output, Cint, ()) may be relevant here. See these two issues:

The way one might use this is as follows.

module A
function __init__()
  println("Loading data")
  @assert isempty(Data)
  Data[1] = "1"
  if ccall(:jl_generating_output, Cint, ()) != 1
    # We are actually loading the module for runtime, not caching code to disk.
    merge!(Data, loadData()) # load lots of data from a file
  end
end
const Data = Dict()
end

This is now on Julia 1.8.0.

I set the __init__ function to the following (added an else print):

const TestData = Dict()
function __init__()
    println("Running init")
    @assert isempty(TestData)
    TestData[1] = "1"
    if ccall(:jl_generating_output, Cint, ()) != 1
        println("We are actual loading the module for runtime, not caching code to disk. TestData keys: ", keys(TestData))
        TestData[2] = "2"
    else
        println("Hit the else. TestData keys: ", keys(TestData))
        TestData[3] = "3"
    end
end

and here’s the output:

[ Info: [2022-08-29T06:22:26] Precompiling A [top-level]
Running init
We are actual loading the module for runtime, not caching code to disk. TestData keys: Any[1]
Loading mark time/dur
[ Info: [2022-08-29T06:22:57] Precompiling B [top-level]
[ Info: [2022-08-29T06:22:59] Precompiling C [top-level]
Running init
Hit the else. TestData keys: Any[1]
[ Info: [2022-08-29T06:23:01] Precompiling D [top-level]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
[ Info: [2022-08-29T06:23:06] Precompiling E [top-level]
Running init
Hit the else. TestData keys: Any[1]
[ Info: [2022-08-29T06:23:08] Precompiling F [top-level]
Running init
Hit the else. TestData keys: Any[1]
[ Info: [2022-08-29T06:23:26] Precompiling G [top-level]
Running init
Hit the else. TestData keys: Any[1]
[ Info: [2022-08-29T06:23:33] Precompiling H [top-level]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
[ Info: [2022-08-29T06:25:32] Precompiling I [top-level]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
[ Info: [2022-08-29T06:26:15] Precompiling J [top-level]
Running init
Hit the else. TestData keys: Any[1]
Running init
Hit the else. TestData keys: Any[1]
[ Info: [2022-08-29T06:26:58] Precompiling K [top-level]
Running init
Hit the else. TestData keys: Any[1]

Here’s a self contained example.

julia> module A
           function __init__()
               println("A init")
           end
       end
A init
Main.A

julia> module B
           using ..A
       end
Main.B

julia> module C
           using ..A
       end
Main.C

I only see “A init” once here.

Based on your output, perhaps you are manipulating the LOAD_PATH? How are your files organized on disk and how are you getting them into Julia?

Let me try to recreate your situation. In a temporary folder I have created files A.jl, B.jl, C.jl, D.jl, and E.jl as follows:

$ find . -type f -print -exec cat {} \;
./A.jl
module A

const TestData = Dict()
function __init__()
    println("Running init")
    @assert isempty(TestData)
    TestData[1] = "1"
    if ccall(:jl_generating_output, Cint, ()) != 1
        println("We are actual loading the module for runtime, not caching code to disk. TestData keys: ", keys(TestData))
        TestData[2] = "2"
    else
        println("Hit the else. TestData keys: ", keys(TestData))
        TestData[3] = "3"
    end
end

end

./B.jl
module B
    using A
end

./C.jl
module C
    using A
end

./D.jl
module D
    using A
end

./E.jl
module E
    using A, B, C, D
end

In a Julia REPL, I then do the following.

$ julia --project=. --banner=no

julia> readdir()
5-element Vector{String}:
 "A.jl"
 "B.jl"
 "C.jl"
 "D.jl"
 "E.jl"

julia> push!(LOAD_PATH, pwd())
4-element Vector{String}:
 "@"
 "@v#.#"
 "@stdlib"
 "~/src/julia_module_test"

julia> using A
[ Info: Precompiling A [top-level]
Running init
We are actual loading the module for runtime, not caching code to disk. TestData keys: Any[1]

julia> using B
[ Info: Precompiling B [top-level]
Running init
Hit the else. TestData keys: Any[1]

julia> using C
[ Info: Precompiling C [top-level]
Running init
Hit the else. TestData keys: Any[1]

julia> using D
[ Info: Precompiling D [top-level]
Running init
Hit the else. TestData keys: Any[1]

julia> using E
[ Info: Precompiling E [top-level]
Running init
Hit the else. TestData keys: Any[1]

If I then run this again, I do not see the same output. A.__init__ only runs once.

$ julia --project=. --banner=no

julia> push!(LOAD_PATH, pwd())
4-element Vector{String}:
 "@"
 "@v#.#"
 "@stdlib"
 "~/src/julia_module_test"

julia> using A
Running init
We are actual loading the module for runtime, not caching code to disk. TestData keys: Any[1]

julia> using B

julia> using C

julia> using D

julia> using E

To recreate the initial situation, I need to remove the A.ji file in my $JULIA_DEPOT_PATH/compiled/v#.#, which defaults to ~/.julia/compiled/v1.7 since I am using Julia 1.7.

$ rm ~/.julia/compiled/v1.7/A.ji # modify to v1.8 if you are using v1.8
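The cache directory named above can also be computed from inside Julia rather than typed by hand. This is a small sketch that assumes the first entry of DEPOT_PATH is the user depot (the default); it only prints the directory and does not delete anything.

```julia
# Directory where precompile cache (.ji) files for the running Julia version live,
# assuming the first depot in DEPOT_PATH is the user depot (the default setup).
cache_dir = joinpath(first(DEPOT_PATH), "compiled", "v$(VERSION.major).$(VERSION.minor)")

println(cache_dir)
println(isdir(cache_dir) ? "cache directory exists" : "no cache directory yet")
```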

$ julia --project=. --banner=no
julia> push!(LOAD_PATH, pwd())
4-element Vector{String}:
 "@"
 "@v#.#"
 "@stdlib"
 "/home/mkitti/src/julia_module_test"

julia> using A
[ Info: Precompiling A [top-level]
Running init
We are actual loading the module for runtime, not caching code to disk. TestData keys: Any[1]

julia> using B
[ Info: Precompiling B [top-level]
Running init
Hit the else. TestData keys: Any[1]

julia> using C
[ Info: Precompiling C [top-level]
Running init
Hit the else. TestData keys: Any[1]

julia> using D
[ Info: Precompiling D [top-level]
Running init
Hit the else. TestData keys: Any[1]

julia> using E
[ Info: Precompiling E [top-level]
Running init
Hit the else. TestData keys: Any[1]

So, in your example, notice it’s running init every time you call “using A/B/C/etc.” the first time around. That’s the problem. It might be more obvious if you don’t run using A first. Use one of the other modules first and then you will see two init’s at once. Also note that using Revise and then modifying a file is equivalent to removing the cache file and will cause init to be called multiple times again.

The example can be simplified to just two files in the current directory:

A.jl:

module A
function __init__()
    if ccall(:jl_generating_output, Cint, ()) != 1
        println("A init generating output")
    else
        println("A init else")
    end
end
end

B.jl:

module B
using A
function __init__()
    println("B init")
end
end

Run julia:

julia> push!(LOAD_PATH, ".")
4-element Vector{String}:
 "@"
 "@v#.#"
 "@stdlib"
 "."

julia> using B
[ Info: Precompiling B [top-level]
A init else
A init generating output
B init

In your particular situation, we only encounter the “We are actual loading the module for runtime, not caching code to disk. TestData keys: Any[1]” case once. That’s when you should load your large data set. In the else block I recommend loading an abbreviated data set since this is only for caching some compiled code to disk.
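Putting that recommendation together, a minimal sketch might look like the following. The names GuardedData, is_generating_output, load_full_data and load_sample_data are hypothetical stand-ins (the real loader would parse the large file), and the check relies on the same internal ccall used earlier in this thread, which is not a documented public API.

```julia
module GuardedData

const Data = Dict{Int,String}()

# True while Julia is writing compiled output (e.g. a precompile cache) to disk.
is_generating_output() = ccall(:jl_generating_output, Cint, ()) == 1

# Hypothetical stand-ins for parsing the real file.
load_full_data() = Dict(i => string(i) for i in 1:1000)
load_sample_data() = Dict(1 => "1")

function __init__()
    empty!(Data)
    # Full data at runtime; a small sample while only caching code to disk.
    merge!(Data, is_generating_output() ? load_sample_data() : load_full_data())
    return nothing
end

end # module

println(length(GuardedData.Data))
```

The empty! call replaces the @assert from the original, so that __init__ stays safe even if it is invoked again in the same process.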

In the larger scheme of things, the push!(LOAD_PATH, pwd()) method of loading a module is not recommended. The better way to do this would probably be to use Pkg.develop for your all-encompassing project and perhaps make all of these modules submodules of that project. In that case, the files would live within the src subdirectory. At the root, you would have a Project.toml and a Manifest.toml. Let me know if you need more details on how this should work.
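As a rough illustration of that layout, here is a sketch that builds a throwaway package in a temporary directory and environment. The package name MyProject and the submodules A and B are hypothetical, and the source is written inline only to keep the example in one piece; in a real project the submodules would normally be separate files under src that the top-level file pulls in with include.

```julia
using Pkg

# Work in a temporary environment and directory so nothing permanent is modified.
Pkg.activate(; temp=true)
pkgdir = joinpath(mktempdir(), "MyProject")   # hypothetical package name

# Creates MyProject/Project.toml and MyProject/src/MyProject.jl.
Pkg.generate(pkgdir)

# Replace the generated source with one top-level module holding submodules.
write(joinpath(pkgdir, "src", "MyProject.jl"), """
module MyProject

module A
const Data = Dict{Int,String}()
__init__() = (Data[1] = "1"; nothing)
end

module B
using ..A
first_value() = A.Data[1]
end

end # module MyProject
""")

# Track the package by path; edits to its files are picked up without reinstalling.
Pkg.develop(path=pkgdir)

using MyProject
println(MyProject.B.first_value())
```

With this arrangement A lives inside a single package, so B reaches it through the relative using ..A instead of a LOAD_PATH lookup.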

Thanks! That will work. It seems a little odd that I have to use a ccall for a situation like this, but that’s ok.
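For comparison, one idiom that avoids the ccall entirely (it was not proposed in this thread) is to skip __init__ and load the data lazily on first access. The sketch below assumes hypothetical names LazyData, load_data and data; LOAD_COUNT exists only to show that the loader runs once, and the code is not thread-safe as written.

```julia
module LazyData

# Holds the data once loaded; `nothing` means "not loaded yet".
const DATA = Ref{Union{Nothing,Dict{Int,String}}}(nothing)
const LOAD_COUNT = Ref(0)

# Hypothetical stand-in for parsing the large file.
function load_data()
    LOAD_COUNT[] += 1
    return Dict(1 => "1", 2 => "2")
end

# Accessor: loads on the first call, returns the cached Dict afterwards.
function data()
    if DATA[] === nothing
        DATA[] = load_data()
    end
    return DATA[]::Dict{Int,String}
end

end # module

LazyData.data()
LazyData.data()
println(LazyData.LOAD_COUNT[])
```

The trade-off is that callers must go through data() rather than a plain global, and the first call pays the full loading cost.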

I have read that messing with LOAD_PATH is not recommended, but I haven’t seen anything that explains how to manage things better than doing that. This allows me to have ~100 files organized across 30 directories nested in any way, all under a single src folder that makes it easy to navigate and modify in a single project in VScode. Then in any given repl, I can run ‘using MyModuleX’ for just the things I’m working on at the time, edit any files anywhere, and it all “just works” (except constant redefining, but that’s understandable). My understanding is that doing this in packages would be much more complicated.

I’ll explain how to use Pkg.develop later today if no one beats me to it.