Closure: life of the referred-to object?

I’m not sure whether an object created in a function persists (is not GC’ed) when it is referred to in a closure:

function ref_i(ref)
  function inner(i)
    ref[i] * 10
  end
  return inner
end

function get_func(r)
  a = [r,2r,3r,4r] # <--
  func = ref_i(a)
  return func
end

f = get_func(pi)
g = get_func(-1.5)
@show f(3)
@show g(3)

Will the Vector objects created in get_func() stay available from f() and g()? The above code appears to work as intended, but if the Vectors are GC’ed, it would stop working, I would think.

The above code is of course a toy program. In my real problem, the object I want to keep alive is a file handle (fh = NCDataset( . . . )). Because the file can be big, I don’t want to read all the data into memory at once. So, I want a function that reads a portion of the data, processes it, and returns the result.

But, at some point during the execution of my program, the file seems to be closed and the file handler seems to become invalid. I’m trying to fix this problem.

You pass a as the argument of ref_i(a); it is captured by inner, which is returned as func and eventually becomes f. So the name f still has a connection to the a you created.
As long as the name f is not rebound to another object, the Vector object a will not be GC-ed (I think).
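A quick sketch (not from the original posts) to check this claim: a WeakRef does not keep its referent alive, so if the Vector is still reachable after a forced GC, the closure’s capture must be what keeps it alive. make_closure here is a hypothetical helper mirroring get_func/ref_i above.

```julia
# Sketch: check that a closure's captured Vector survives GC.
function make_closure()
    a = [1.0, 2.0, 3.0]
    w = WeakRef(a)           # weak reference: does not keep `a` alive by itself
    inner(i) = a[i] * 10     # strong capture of `a`
    return inner, w
end

f, w = make_closure()
GC.gc(); GC.gc()             # force full collections
@show w.value === nothing    # false: the closure still holds `a`
@show f(2)                   # 20.0
```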


That was exactly my thought! Assuming our conclusion is correct, I’ll look for other reasons why my file handler becomes invalid.

Module boundary seems to be doing something.

Here is a minimal, self-contained example that fails:

#--- file tmp_Debug.jl ---
module tmp_Debug

using NCDatasets

function ref_i(ref)
  @show ref["lat"][:] # -> Values are printed.
  function inner(i)
    @show ref # -> "Closed dataset".
    return ref["lat"][i]
  end
  return inner
end

function get_func(fnam)
  a = NCDataset(fnam, "r")
  return ref_i(a)
end

const f = get_func("http://apdrc.soest.hawaii.edu:80/dods/public_data/WOA/WOA18/5_deg/annual/temp")
println("In the module: f(1) = $(f(1))")

end

# --- another file try-debug.jl ---
push!(LOAD_PATH,pwd())
using tmp_Debug
val = tmp_Debug.f(1)

When function f() is called outside the module, the file handle ref says “closed Dataset”. I guess this is the message the finalizer of the ref object left behind.

When function f() is called inside the module, it works as intended.

So, perhaps GC happens at using tmp_Debug ?


ref having a usable object at all proved the file handle was not GC’ed, just closed in some way. I evaluated the module directly and ran using .tmp_Debug instead, and it worked fine. The starkest difference there is that I didn’t precompile the module, while your use of LOAD_PATH did.

Precompilation is a per-package AOT compilation: what you execute in the module actually happens then and gets saved in precompile files to be loaded later at runtime. A file handle can’t remain open perpetually just because of a precompile file, obviously. From the docs:

Other known potential failure scenarios include:

3. Depending on compile-time side-effects persisting through load-time. Example include: modifying arrays or other variables in other Julia modules; maintaining handles to open files or devices; storing pointers to other system resources (including memory);

Precompilation can’t detect side effects or runtime-initialize the file handle for you; you have to assign f in __init__ … somehow. global const wouldn’t work, and although eval into the same module is nominally allowed, it has problems when triggered by being imported by another package that is being precompiled. A plain global f = ... would work, but you were probably trying to avoid that performance problem. It could help to annotate it with ::NCDatasets.NCDataset{Nothing, Missing}, but that still has overhead compared to const.
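A minimal sketch of that __init__ pattern, with hypothetical names and a Vector standing in for the NCDataset handle:

```julia
module MyMod

# Hypothetical stand-in for a function that opens a resource and returns
# a closure over it (imagine NCDataset(fnam, "r") inside).
function get_func()
    data = [10.0, 20.0, 30.0]
    return i -> data[i]
end

global f   # plain untyped global, assigned at load time

function __init__()
    # __init__ runs every session at load time, after any precompile
    # cache has been loaded, so the resource is created fresh.
    global f = get_func()
end

end # module
```

Because __init__ runs after the cached module image is loaded, the captured resource lives in the current session instead of being a stale object deserialized from the cache.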


Thank you for elucidating the problem!!! But, whose bug is this?

“Some way” . . . is the mystery. If it’s not the finalizer of the file-handle object, what closed the file?

Is modifying LOAD_PATH allowed to alter the semantics of the program? Or does the language specification allow the state of the object captured by const f to be destroyed by modifying LOAD_PATH?

I thought precompilation is just an optimization, which shouldn’t alter the semantics of the program.

Is the document you quote discussing a known “bug”?

The limits of precompilation are documented, not a bug, and they are rooted in fundamental limitations of what can be cached AOT. Just look at the still-evolving specification of C++’s constexpr; we have much easier lines to deal with.

You’re thinking of method-call compilation, not module precompilation, which does a lot more than just optimization. It’s not really altering the language semantics; you’re just losing features to AOT limitations, whether that manifests as a thrown error or undefined behavior.

I don’t know either. For all I know, maybe the GC and thus the finalizer does run during the precompilation process after the object is cached. Or maybe just loading the cached object without its external state is enough to manifest a “closed” object. I’m guessing this could be found out if we removed the finalizer from NCDataset.

LOAD_PATH isn’t the direct reason for this, it’s precompilation. You could disable precompilation to evaluate the module at runtime from scratch and the issue would disappear.

It’s also not recommended anymore to do implicit environments or modify LOAD_PATH like this. There are still reasons for modifying LOAD_PATH, but that should be an edge case, not routine. If you want to work with precompilation, make an explicit package and dev/add it to an explicit environment with a project file. You’re also free to just make a module that you’ll only ever evaluate at runtime (manually include the file), never to be precompiled and distributed as part of any package.

The fact that it’s documented doesn’t solve this problem in practice. What is the resolution of this problem? If I understand what you are saying correctly:

  • When your program depends on an object that doesn’t survive precompilation, you should disable precompilation on a per-module basis.

If this is the final resolution of the issue,

  1. The programmer needs to know whether the object returned by a function will survive precompilation or not.
  2. We need a switch to disable precompilation on a per-module basis.

How do you achieve 1?

How do you switch off precompilation on the module side? As you explain, I could use include on the “using” side, but the decision to disable precompilation should be on the module side because it’s an implementation detail.

Who should do what, in practice, to avoid a problem like the one I encountered? Should the designer of the NCDatasets package do something differently?

That is not among the things I said. I actually think that’s (specifically, __precompile__(false)) generally a terrible idea because 1) precompilation can save time even if your code doesn’t serve as a dependency, and 2) it prevents your code from being a dependency of packages that are themselves precompilable. I said you should try to use __init__ to initialize things at runtime that fundamentally need to be, which doesn’t obstruct precompilation. AFAIK __precompile__(false) is only justified if you intend a module to be strictly evaluated interactively with include (not a package), and in that case, I wouldn’t even bother because storing the files in a clearly labeled scripts folder is enough. Note that by modifying LOAD_PATH (again, not recommended compared to dev/add to explicit environments), you’re trying to treat the module as a package, so it’s not being evaluated interactively.

Note that runtime initialization isn’t just about objects surviving; it’s about anything you need to do at runtime. For example, if you need a package to generate a random Int every session, doing it in the global scope will fail because it only runs during precompilation. The Int survives the AOT cache just fine; you’ll just keep getting the same cached value every session.
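A sketch of that pitfall (hypothetical module; the difference only shows up across precompiled sessions, and the typed global needs Julia 1.8 or newer):

```julia
module RandPkg

# Computed during precompilation: the same Int would be reloaded from
# the cache in every session.
const cached_rand = rand(Int)

# Recomputed at load time: fresh every session.
global session_rand::Int

function __init__()
    global session_rand = rand(Int)
end

end # module
```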

The docs explain most of the principles, but I don’t think it’s comprehensive or obvious. Sometimes people really just find out by running into objects that don’t work like this.

I think I now understand your point!

Before proceeding . . . our premise should be: The same module should work whether it’s precompiled or not, because precompilation is purely an optimization and shouldn’t change the semantics of the program.

From this premise, we can conclude that we should be writing:

module MyMod
# . . .
# const f = get_func() # May not work.
f = nothing
function __init__()
  global f
  f = get_func()
end
end # module

if the object returned by get_func() may not survive precompilation.

So, this is the resolution of the problem.

Here, whether the module is treated as a package or just to be included is irrelevant, because the behavior of the module shouldn’t change between the two cases.


By the way, I have the impression that the distinction between compile time and run time was initially vague in the history of Julia, because the behavior (except for performance) of the program didn’t change whenever compilation happened. But with the introduction of precompilation, one needs to be aware of it.

A module that works fine when not precompiled might fail if precompiled, because some initialization steps fundamentally can’t be cached; this post is one example.

Again, no, precompilation does much more than optimization.

Technically yes, but language semantics don’t concern lost features and undefined behavior. I think your repeated misconceptions are rooted in conflating semantics with behavior. Semantics are about what the expressions mean; they don’t dictate what happens. If I allocate a 1GB Vector{Int}, it’s going to mean that regardless of whether it runs fine on my 8GB laptop or crashes a 1GB desktop from the 2000s. Precompilation is another context that influences behavior: it errors on a few un-cache-able actions, like method overwriting, but otherwise lets un-cache-able programs run, leaving you to encounter these issues after importing.

Maybe; I don’t know if that’s the best way to do it. f = nothing will have poor type inference; I think substituting global f::NCDataset{Nothing, Missing} will at least restore that, if that is indeed the intended type of your example’s call result.

Module precompilation and its limits have been around for a long time in Julia’s history. Again, it’s not the same as (but involves) a method call’s compile-time we distinguish from its runtime.


I can see that my wording was confusing, but I can also see that you understand me and I understand you.

First, replace my “semantics” with “behavior”. Then, my words will be less confusing.

Second, interpret my “should” in “precompilation should be a pure optimization” as

  • What’s the best practice to ensure that the behavior (except for performance) of a module doesn’t change whether precompilation is involved or not?
  • How to write your source code to ensure that precompilation is “transparent”?

That was my “should”. If possible, you usually want to write code for which precompilation is transparent, don’t you?

Finally,

That was my question, too. I prefer the behavior of const for that reason, but I don’t know how to ensure the correct typing of f. Remember the example I gave above is highly simplified. Is something like this possible?

module MyMod
# . . .
global f :: typeof(get_func())
function __init__()
  global f
  f = get_func()
end
end # module

Untested. I don’t know how to use the global keyword.

I don’t really know either, and there isn’t really an official guide either. In fact, the Modules page in the Manual has blatantly outdated language and references.

If I’m interpreting your suggestion correctly, an unassigned global variable is annotated with the runtime return type of a call (get_func() is just an example) executed in the precompilation phase, and __init__ assigns that variable with the value of a repeated call. That’s possible, but there are a few possible problems:

  • if get_func() is type-unstable (in other words, has an abstract inferred return type) and isn’t pure, then the 2nd call could have an incompatible type. Obviously, manually annotating the type as I suggested earlier would also be vulnerable to this.
  • if get_func() is expensive, that’s contributing significant overhead to precompilation for a value you have to recompute at __init__ anyway. If you design get_func() to be type-stable in a particular version, you can compute the type in development and manually annotate the variable to save this time. Precompilation is used in practice to compute things at installation, but it’s just as acceptable to do it even sooner in development.
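The pattern in these bullets can be sketched like this, with a cheap, pure, type-stable get_func() standing in for the real (possibly expensive) call:

```julia
module TypedGlobal

# Hypothetical stand-in; imagine a call that opens a file and returns a
# closure over it.
get_func() = (i -> i * 10)

# This call runs once during precompilation (or plain evaluation) just
# to compute the annotation's type; a captureless closure has a stable,
# singleton type across calls.
global f::typeof(get_func())

function __init__()
    global f = get_func()   # repeated at load time, every session
end

end # module
```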

For this scenario more specifically, I’m skeptical that a file handle should be opened in the global scope of a package, a dependency to distribute and precompile. This case involves a read-only one at least, but if that file handle gets closed, f needs to be manually reassigned to get_func(). If we have to go that far to keep things working, why not use the standard practice of manually opening various file handles and inputting them as arguments? It really does appear as if you intend this module to be a namespace in a script you execute as part of a wider interactive session, not a true package (the Modules page often failing to distinguish packages from wider modules is one example of outdated language).

That is a good direction of discussion, but the answer depends on specifics. In general, your skepticism is right.

In my particular case, however, I don’t know what better solutions exist.

I compare various methods of calculation in a single program:

using Meth1
using Meth2

func1 = Meth1.f # function of integer
func2 = Meth2.f # function of integer

ns = 0:30
lines!(ns, func1.(ns))
lines!(ns, func2.(ns))

Meth1 internally reads a set of data from a file, applies some calculation, and presents the result as a function of an integer. Meth2 doesn’t depend on a data file; it just defines and returns the function.

In other words, I use modules to hide implementation details and to give a unified interface to the main program.

I’m sorry, I don’t understand this part of your message. How do you write a module-as-namespace differently from a module-as-package? If the distinction is fundamental, shouldn’t Julia introduce package as part of the language specification?

Without quite understanding your point about package vs namespace, I continue the discussion . . . I start a module as a namespace, just to put related data and methods together and present a well defined interface to the outside world. But, eventually, some of my modules turn out to be useful enough to be used in other programs. . . . In such a case, these modules acquire a flavor of “packages”, even though I stay away from the formal packaging convention (Manifest etc.) as all my projects are small and solo.

Also, as in the above example, I sometimes create another module to give the same API as the existing one. This is a traditional and proper use of modules. I don’t think, however, the design of these modules should be affected by whether they are “packages” or “namespaces” . . . or should they?

Modules are always namespaces, a space to isolate names is the fundamental characteristic of a Julia module.

It does, just not as a distinct keyword. A package is a matter of how a module is loaded, managed, and distributed. Julia also specifies package extensions, precompilable modules loaded in a different way from packages; extensions can’t even be using/import-ed. The Manual could probably explain these concepts in a better way, but it does.

All the design limitations mentioned here so far are about precompilation, so if a module will never be precompiled, then there’s no reason to write it with those limitations.

I was speaking of interactivity and scripts as the typical situations where you wouldn’t precompile a module, but some terminology and examples might make this clearer. Interactivity means manual interaction with the Julia REPL. A script is a (usually shorter) file containing Julia code to be evaluated on demand.

julia> include("ModX.jl") # evaluate a script
Main.ModX

julia> module ModX # directly type into REPL
         x = 1
       end
WARNING: replacing module ModX.
Main.ModX

julia> using .ModX: x as x2 # preceding dots, unlike package imports

julia> x2
1

We could also just run the entire Julia process only for the script, no interaction with the REPL:

PS C:\Users\Benny> julia ModX.jl

In all these cases, the module was just a namespace in code evaluated on our demand. It wasn’t precompiled (in more words, executed during a precompilation phase to be cached) as part of a package or a package extension. Note that submodules of a package or package extension would be precompiled along with them, so dots in imports don’t generally indicate lack of precompilation.

When you added your directory of .jl files to LOAD_PATH, you were treating those files as packages inside environments for using/import (no dots). Again, there are other proper ways to develop a package, and this is not how anybody should routinely load a module. I’m surprised you ever ran into this approach; I don’t know of a source that suggests it. The Manual does mention package directories (also no longer recommended), but those have src folders.

I had a similar problem some time back. I used the following solution:

module tmp_Debug

using NCDatasets

mutable struct Mydataset
    const url::String
    ref::NCDatasets.NCDataset{Nothing,Missing}
    Mydataset(url) = new(url)
end
init(d::Mydataset) = (isdefined(d, :ref) && isopen(d.ref)) || (d.ref = NCDataset(d.url,"r"))

struct FunType{T} <: Function
    data::Mydataset
    FunType{T}(s) where T = new{T}(s)
end

function (f::FunType{:latindex})(i)
    init(f.data)
    f.data.ref["lat"][i]::Float64
end

function (f::FunType{:lonindex})(i)
    init(f.data)
    f.data.ref["lon"][i]::Float64
end

const dataset = Mydataset("http://apdrc.soest.hawaii.edu:80/dods/public_data/WOA/WOA18/5_deg/annual/temp")
const lat = FunType{:latindex}(dataset)
const lon = FunType{:lonindex}(dataset)

println("In the module: lat(1) = $(lat(1))")
println("In the module: lon(1) = $(lon(1))")

end

with the script:

push!(LOAD_PATH,pwd())
using tmp_Debug: lat, lon

println("In script: lat(1) = $(lat(1))")
println("In script: lon(1) = $(lon(1))")

It’s also possible to make FunType{T} <: AbstractVector and define Base.getindex for them instead, so you can access the values with lat[i] and lon[i]. Note that the indexing ref["lat"] is type-unstable, so for performance it might be an idea to do it only once, so that lat(i) indexes into a stored reflat.
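A sketch of that AbstractVector variant, with made-up latitude values in a plain Vector standing in for a reflat read once from the dataset:

```julia
# Thin wrapper so a coordinate can be accessed as lat[i].
struct CoordVec <: AbstractVector{Float64}
    data::Vector{Float64}   # read once, e.g. ref["lat"][:], then stored
end

Base.size(v::CoordVec) = size(v.data)
Base.getindex(v::CoordVec, i::Int) = v.data[i]

lat = CoordVec([-87.5, -82.5, -77.5])   # made-up values
lat[2]                                   # -82.5
```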