Mysterious memory allocation

Running julia with --track-allocation identifies the top memory-allocation hotspots in my code:

       - function zSQdensity(z::Float64, wa::WorkArea)
        0     ev = wa.evaluator
        0     dat::DataFrame = wa.dat
        0     objective::Objective = wa.objective
        0     if objective == justZ || objective == just1
        0         d = exp(-0.5 * z^2)
        -     else
1356970560         d = exp(-0.5 * (1.0 - 2.0 * ev.Ξ») * z^2)
        -     end
190084928     for i in wa.i_start:wa.i_end
2062343712         Y = dat.Y[i]
        - 
        -         # conditional Y=1 | z
        -         # next line gets most of the CPU time
11654907264         cd = logistic(z*ev.Οƒ + ev.k)
        0         if Y
748204864             d *= cd
        -         else
4331043904             d *= (1.0-cd)
        -         end
        - 
        0     end
        0     if objective == justZ || objective == WZ
442565376         d *= z
        -     end
        0     return d
        - end

logistic is from StatsFuns and computes 1/(1+e^{-x}). I am baffled that any allocation is going on at all. The top spot is cd = logistic(z*ev.Οƒ + ev.k), and all the quantities involved are scalars. I figured the compiler would be doing the math in registers.

Is it just that d, Y, and cd are allocated fresh at each iteration? But Y and cd are allocated the same number of times and yet have very different total bytes allocated. For that matter, d *= cd multiplies an existing number; why does that cause any new memory allocation?

Or maybe this is the profiling version of a Heisenbug, in which I’m actually tracking the overhead of the profiler?

These functions are in an inner loop that is called a lot, and so it’s not surprising they should account for a lot of the load. But I don’t see how they are allocating memory, and so I don’t see how to reduce it.

It may be relevant that the work is going on within a thread, although julia was running single-threaded for the test. That is, Threads.nthreads() was 1, but Threads.@spawn started some tasks within which the computation above ran.

Thanks.
Ross

This looks like type instability. What does @code_warntype say?

I’m not sure which function you’re asking about. For the outer function shown above, the result is

MethodInstance for getproperty(::Module, ::Symbol)
  from getproperty(x::Module, f::Symbol) in Base at Base.jl:31
Arguments
  #self#::Core.Const(getproperty)
  x::Module
  f::Symbol
Body::Any
1 ─      nothing
β”‚   %2 = Base.getfield(x, f)::Any
└──      return %2

and for StatsFuns.logistic

MethodInstance for getproperty(::Module, ::Symbol)
  from getproperty(x::Module, f::Symbol) in Base at Base.jl:31
Arguments
  #self#::Core.Const(getproperty)
  x::Module
  f::Symbol
Body::Any
1 ─      nothing
β”‚   %2 = Base.getfield(x, f)::Any
└──      return %2

which is the same; it looks more as if it’s telling me about looking up the symbol (e.g., logistic) than about what it does…

I guess I need to invoke @code_warntype on a particular call.

julia> @code_warntype logistic(1.3)
MethodInstance for LogExpFunctions.logistic(::Float64)
  from logistic(x::Union{Float16, Float32, Float64}) in LogExpFunctions at C:\Users\rdboylan\.julia\packages\LogExpFunctions\1QuJn\src\basicfuns.jl:106
Arguments
  #self#::Core.Const(LogExpFunctions.logistic)
  x::Float64
Locals
  @_3::Int64
  upper::Float64
  lower::Float64
  e::Float64
Body::Float64
1 ─       (e = LogExpFunctions.exp(x))
β”‚   %2  = LogExpFunctions._logistic_bounds(x)::Core.Const((-744.4400719213812, 36.7368005696771))
β”‚   %3  = Base.indexed_iterate(%2, 1)::Core.Const((-744.4400719213812, 2))
β”‚         (lower = Core.getfield(%3, 1))
β”‚         (@_3 = Core.getfield(%3, 2))
β”‚   %6  = Base.indexed_iterate(%2, 2, @_3::Core.Const(2))::Core.Const((36.7368005696771, 3))
β”‚         (upper = Core.getfield(%6, 1))
β”‚   %8  = (x < lower::Core.Const(-744.4400719213812))::Bool
└──       goto #3 if not %8
2 ─ %10 = LogExpFunctions.zero(x)::Core.Const(0.0)
└──       return %10
3 ─ %12 = (x > upper::Core.Const(36.7368005696771))::Bool
└──       goto #5 if not %12
4 ─ %14 = LogExpFunctions.one(x)::Core.Const(1.0)
└──       return %14
5 ─ %16 = e::Float64
β”‚   %17 = LogExpFunctions.one(x)::Core.Const(1.0)
β”‚   %18 = (%17 + e)::Float64
β”‚   %19 = (%16 / %18)::Float64
└──       return %19

Not sure what all that means, in particular whether it is type-stable. It looks as if it depends on whether Core.Const(x.y) is returning Float64.

This is not a perfect reproduction of my code flow, since the debugger isn’t letting me evaluate it in the context from which logistic is called. However, the debugger shows that all 3 variables in z*ev.Οƒ + ev.k are Float64, so I think calling it with 1.3 as an argument will have the same type behavior.

This looks just fine; give us @code_warntype zSQdensity(z, wa). Is wa.ev concrete? Can you show the definition of wa?

Here are the results for zSQdensity. But is the type stability of the β€œouter” function going to affect the behavior of the line with logistic()? Hmm, the 3 variables associated with so much allocation (cd, d, and Y) are all of type Any. That probably has something to do with the trouble …

julia> @code_warntype zSQdensity(1.0, wa)
MethodInstance for MSEP.zSQdensity(::Float64, ::MSEP.WorkArea)
  from zSQdensity(z::Float64, wa::MSEP.WorkArea) in MSEP at C:\Users\rdboylan\Documents\BP\MSEP\src\logistic_simple_evaluator.jl:86
Arguments
  #self#::Core.Const(MSEP.zSQdensity)
  z::Float64
  wa::MSEP.WorkArea
Locals
  @_4::Union{Nothing, Tuple{UInt64, UInt64}}
  d::Any
  objective::MSEP.Objective
  dat::DataFrame
  ev::Evaluator
  i::UInt64
  cd::Any
  Y::Any
Body::Any
1 ──       Core.NewvarNode(:(@_4))
β”‚          Core.NewvarNode(:(d))
β”‚          (ev = Base.getproperty(wa, :evaluator))
β”‚    %4  = Base.getproperty(wa, :dat)::DataFrame
β”‚    %5  = Base.convert(MSEP.DataFrame, %4)::DataFrame
β”‚          (dat = Core.typeassert(%5, MSEP.DataFrame))
β”‚    %7  = Base.getproperty(wa, :objective)::MSEP.Objective
β”‚    %8  = Base.convert(MSEP.Objective, %7)::MSEP.Objective
β”‚          (objective = Core.typeassert(%8, MSEP.Objective))
β”‚    %10 = (objective == MSEP.justZ)::Bool
└───       goto #3 if not %10
2 ──       goto #4
3 ── %13 = (objective == MSEP.just1)::Bool
└───       goto #5 if not %13
4 ┄─ %15 = Core.apply_type(Base.Val, 2)::Core.Const(Val{2})
β”‚    %16 = (%15)()::Core.Const(Val{2}())
β”‚    %17 = Base.literal_pow(MSEP.:^, z, %16)::Float64
β”‚    %18 = (-0.5 * %17)::Float64
β”‚          (d = MSEP.exp(%18))
└───       goto #6
5 ── %21 = Base.getproperty(ev, :Ξ»)::Any
β”‚    %22 = (2.0 * %21)::Any
β”‚    %23 = (1.0 - %22)::Any
β”‚    %24 = Core.apply_type(Base.Val, 2)::Core.Const(Val{2})
β”‚    %25 = (%24)()::Core.Const(Val{2}())
β”‚    %26 = Base.literal_pow(MSEP.:^, z, %25)::Float64
β”‚    %27 = (-0.5 * %23 * %26)::Any
└───       (d = MSEP.exp(%27))
6 ┄─ %29 = Base.getproperty(wa, :i_start)::UInt64
β”‚    %30 = Base.getproperty(wa, :i_end)::UInt64
β”‚    %31 = (%29:%30)::UnitRange{UInt64}
β”‚          (@_4 = Base.iterate(%31))
β”‚    %33 = (@_4 === nothing)::Bool
β”‚    %34 = Base.not_int(%33)::Bool
└───       goto #12 if not %34
7 ┄─ %36 = @_4::Tuple{UInt64, UInt64}
β”‚          (i = Core.getfield(%36, 1))
β”‚    %38 = Core.getfield(%36, 2)::UInt64
β”‚    %39 = Base.getproperty(dat, :Y)::AbstractVector
β”‚          (Y = Base.getindex(%39, i))
β”‚    %41 = Base.getproperty(ev, :Οƒ)::Any
β”‚    %42 = (z * %41)::Any
β”‚    %43 = Base.getproperty(ev, :k)::Any
β”‚    %44 = (%42 + %43)::Any
β”‚          (cd = MSEP.logistic(%44))
└───       goto #9 if not Y
8 ──       (d = d * cd)
└───       goto #10
9 ── %49 = d::Any
β”‚    %50 = (1.0 - cd)::Any
└───       (d = %49 * %50)
10 β”„       (@_4 = Base.iterate(%31, %38))
β”‚    %53 = (@_4 === nothing)::Bool
β”‚    %54 = Base.not_int(%53)::Bool
└───       goto #12 if not %54
11 ─       goto #7
12 β”„ %57 = (objective == MSEP.justZ)::Bool
└───       goto #14 if not %57
13 ─       goto #15
14 ─ %60 = (objective == MSEP.WZ)::Bool
└───       goto #16 if not %60
15 β”„       (d = d * z)
16 β”„       return d

Here’s the definition of the WorkArea. The zSQdensity() function is stored inside the evaluator:

mutable struct  WorkArea
    """
    This is the entire data frame.  An individual run will only work with
    a few rows.

    This only needs to be set once at the start of the thread
    """
    const dat::DataFrame

    """
    An evaluator, such as that above.
    Also only set once and shared between threads
    """
    const evaluator::Evaluator

    "working space for integrator
    This is created at the start but written to constantly."
    segs
    
    # The following are set on each evaluation
    "dirty trick to determine whether to integrate over 1, z, w, or wz"
    objective::Objective

    "first row index of cluster of current interest"
    i_start::UInt

    "last row index of cluster, inclusive"
    i_end::UInt

    "index of cluster for output"
    i_cluster::UInt

end
  • julia fails to in infer the type of d probably because the type of ev.Ξ» is not known at compile time (i.e. it’s type instability)
  • i don’t personally work with DataFrames, so I’m not sure why is Y not inferred correctly;
  • since ev.Οƒ is probably also Any, so is cd.

what is the definition of Evaluator?

I don’t know DataFrames well either, but how would the compiler possibly know the type of the field Y? Afaik, DataFrames are deliberately type-unstable.

Unless every field of Evaluator is concretely specified, those types, too, are inaccessible and indeterminate.
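A tiny, self-contained sketch of that point (hypothetical Loose/Tight structs, not the actual MSEP types): an untyped field makes everything computed from it infer as Any, while a concrete field type lets the whole body infer.

```julia
# Hypothetical structs for illustration only (not the MSEP types).
struct Loose
    Ξ»               # no annotation: field type is Any
end

struct Tight
    Ξ»::Float64      # concrete field type
end

f(e) = exp(-0.5 * (1.0 - 2.0 * e.Ξ»))   # same body for both

# @code_warntype f(Loose(0.4)) infers Body::Any (dynamic dispatch, boxing),
# while @code_warntype f(Tight(0.4)) infers Body::Float64.
```

Both calls compute the same value; only the Tight version lets the compiler keep the math in registers.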

Evaluator is an abstract type; in this case it is the type shown below. Ξ» and k do not have types specified, but they are const and, I would think, known by the time the JIT gets to them. I guess formally they remain untyped. f is what holds the zSQdensity.

"""
Evaluates data as produced by `maker()` with a simple mixed logistic model
We only use `Y`, a binary indicator, since the model has no observed covariates.
"""
mutable struct  LogisticSimpleEvaluator <: Evaluator
    "parameter for weight function"
    # const requires julia 1.8+
    const Ξ»

    "parameters for the regression part of the model"
    const k

    "parameters for random effect distn"
    const Οƒ

    "order for the numerical integration"
    const integration_order::Integer

    ## The constructor is responsible for the following
    "f(z, workarea)= w(z)*conditional density*normal density
    or, if withZ is true, z*w(z)*...."
    const f

    "Short name of primary estimand, e.g., zSQ"
    const targetName::String

    "used to integrate f(z) over the real line"
    const integrator

    "short name of numerical integration method"
    const integratorName::String

    "fuller description integration method"
    const integratorDescription::String
end

The JIT only knows types, not values, so const is no help here. Everything will be dynamically dispatched, which means slowness and allocations.

There’s your culprit.


I thought adding types inside the function would help, but it made very little difference. My reading of the manual is that the local declarations at the top should set the type even inside inner scopes such as for loops.

I haven’t yet followed @DNF’s suggestion to fix the type of the parameters. I’m still baffled why all the memory allocation is going on.
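For what it's worth, here is a hedged minimal sketch of why the local annotations change little (hypothetical UntypedEv, not the real evaluator): the annotation only inserts a convert and typeassert at each assignment; the right-hand side is still computed through Any.

```julia
# Hypothetical minimal example, not the original MSEP code.
struct UntypedEv
    Ξ»               # untyped field, so e.Ξ» infers as Any
end

function g(e::UntypedEv)
    local d::Float64
    # exp is dynamically dispatched on an Any argument; only afterwards
    # is the result converted to Float64 to satisfy the annotation.
    d = exp(-0.5 * e.Ξ»)
    return d
end
# @code_warntype g(UntypedEv(0.4)) still shows the Any intermediates.
```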

       - function zSQdensity(z::Float64, wa::WorkArea)
        0     ev = wa.evaluator
        0     dat::DataFrame = wa.dat
        0     objective::Objective = wa.objective
        -     local d::Float64
        -     local cd::Float64
        -     local Y::Bool
        0     if objective == justZ || objective == just1
        -         # if this doesn't work may want Gauss-Hermite quadrature
        0         d = exp(-0.5 * z^2)
        -     else
1354818432         d = exp(-0.5 * (1.0 - 2.0 * ev.Ξ») * z^2)
        -     end
        0     for i in wa.i_start:wa.i_end
2058822304         Y = dat.Y[i]
        - 
        -         # conditional Y=1 | z
        -         # next line gets most of the CPU time
11635914752         cd = logistic(z*ev.Οƒ + ev.k)
        0         if Y
        0             d *= cd
        -         else
        0             d *= (1.0-cd)
        -         end
        - 
        0     end
        0     if objective == justZ || objective == WZ
        0         d *= z
        -     end
        0     return d
        - end

I tried giving the parameters types, but it only seems to have made things worse:

mutable struct  LogisticSimpleEvaluator{TParam} <: Evaluator
    const Ξ»::TParam
    const k::TParam
    const Οƒ::TParam
   # rest as before
end

Key line in the result:

17461086720         cd = logistic(z*ev.Οƒ + ev.k)

So allocation is now 17.5G, up from about 11.6G before.
Note that the WorkArea from which the evaluator is pulled remains abstractly typed as const evaluator::Evaluator.

Giving WorkArea a more specific (parameterized) type for the evaluator made a huge difference.

mutable struct  WorkArea{TEvaluator}
    const dat::DataFrame
    const evaluator::TEvaluator
   # rest as before
end

This reduced allocations to 0 in the key line as well as some others. The line setting the Y value remains very active; that may be from the type looseness of DataFrame that @DNF referred to.
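One hedged way to attack the remaining dat.Y allocation (assuming the column really does hold Bools) is to pull the column out once with a concrete type, so the loop body sees Vector{Bool} rather than AbstractVector. A self-contained sketch, with a Vector{Any} standing in for the untyped column:

```julia
# The loop body, factored out as a function barrier.
function weight(Ycol::AbstractVector, cd::Float64)
    d = 1.0
    for Y in Ycol
        d *= Y ? cd : (1.0 - cd)
    end
    return d
end

raw = Any[true, false, true]        # stands in for a type-unstable column
col = convert(Vector{Bool}, raw)    # one-time typed copy
weight(col, 0.8)                    # specializes with Y::Bool inside the loop

# In the real code the analogue would be something like
#   Ycol = dat.Y::Vector{Bool}
# before the loop (assuming that is the column's actual element type).
```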

What general lessons should I take away from this? The whole thing is starting to feel a lot like C++ templates, i.e., ugly and complex.

The original declaration gave WorkArea.evaluator an abstract type, Evaluator; I thought julia’s machinery would then deal with the specific types I ran with. But it seems not to do that too well unless the type is parametric. I guess this is an example of the principle that one should avoid containers with abstract types.

        - function zSQdensity(z::Float64, wa::WorkArea)
        0     ev = wa.evaluator
        0     dat::DataFrame = wa.dat
        0     objective::Objective = wa.objective
        -     local d::Float64
        -     local cd::Float64
        -     local Y::Bool
        0     if objective == justZ || objective == just1
        -         # if this doesn't work may want Gauss-Hermite quadrature
        0         d = exp(-0.5 * z^2)
        -     else
        0         d = exp(-0.5 * (1.0 - 2.0 * ev.Ξ») * z^2)
        -     end
        0     for i in wa.i_start:wa.i_end
2058731840         Y = dat.Y[i]
        - 
        0         cd = logistic(z*ev.Οƒ + ev.k)
        0         if Y
        0             d *= cd
        -         else
        0             d *= (1.0-cd)
        -         end
        - 
        0     end
        0     if objective == justZ || objective == WZ
        0         d *= z
        -     end
        0     return d
        - end

P.S. The switch to a parametric type for WorkArea made the constructor pickier about the other arguments. I used to be able to construct a WorkArea using 0 for the last 3 arguments. But 0 is Int while those fields are declared as UInt, and when I made the type of the evaluator a parameter, the mismatch on the last 3 arguments became a problem (invalid type signature). I changed to properly typed arguments.

  1. That type annotations in functions are generally irrelevant for performance.
  2. That (concrete or parametric) type annotations in struct definitions are crucial for performance.
  3. That the compiler does not have access to an instance of a type during compilation. Only the type tag and the source code definition of the type.

Points 1 and 2 are consequences of point 3.

As far as I can tell, you are implicitly assuming a violation of point 3. You are assuming that the compiler has access to an instance of WorkArea and can inspect the types of its field members. On the contrary: only the type tag, WorkArea is available.

Think of it like this: suppose I give you only this piece of information: β€œwa has type WorkArea”, and then ask: β€œWhat is the type of wa.evaluator.Οƒ?”

That’s all you get. Can you figure it out? I cannot, and neither can the compiler. (You would perhaps say: β€œThat depends on the type of the field evaluator.” But you do not know that type, because it is not encoded in the type tag WorkArea, nor is it concretely specified in the source code definition of WorkArea.)

If, on the other hand, you have this information: β€œwa has type WorkArea{LogisticSimpleEvaluator{Float64}}”, and then ask for the type of wa.evaluator.Οƒ, you can reason in a straightforward manner:

In a WorkArea{LogisticSimpleEvaluator{Float64}}, the field evaluator has type LogisticSimpleEvaluator{Float64}. And in that type the fields Οƒ and k have type Float64. Now, it is clear to the compiler what the input type to the call logistic(z*ev.Οƒ + ev.k) is.
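One way to check this reasoning yourself is fieldtype, which, like the compiler, works from the type tag alone. A toy sketch with hypothetical stand-in types:

```julia
abstract type Ev end

struct ConcEv <: Ev
    Οƒ::Float64
end

struct LooseWA
    evaluator::Ev           # abstract field annotation
end

struct TightWA{T}
    evaluator::T            # concrete once the parameter is supplied
end

fieldtype(LooseWA, :evaluator)           # Ev, an abstract type
fieldtype(TightWA{ConcEv}, :evaluator)   # ConcEv, concrete
```

Given only LooseWA, the best answer to β€œwhat is wa.evaluator?” is the abstract Ev; given TightWA{ConcEv}, the answer is concrete.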


Default type constructors will normally convert input arguments to the correct type (if possible):

julia> struct A
           a::Float64
           b::UInt
           c::UInt
       end

julia> A(2,0,0)  # works
A(2.0, 0x0000000000000000, 0x0000000000000000)

julia> A(2,0,0.1)  # fails
ERROR: InexactError: UInt64(0.1)

Probably, you have defined new internal constructors that override the default one. I cannot say for sure without seeing the type definition with constructors.

I would only add that there is one way around that, if you need to keep the fields abstract for some reason, which is to use function barriers:

julia> struct A
           x
       end

julia> f(a) = sum(2 * a.x[i] for i in eachindex(a.x))
f (generic function with 1 method)

julia> g(x) = sum(2 * x[i] for i in eachindex(x))
g (generic function with 1 method)

julia> a = A(rand(1000));

julia> @btime f($a)
  35.933 ΞΌs (3491 allocations: 54.56 KiB)
994.9618445503377

julia> @btime g($(a.x))
  842.268 ns (0 allocations: 0 bytes)
994.9618445503377

Meaning, pass the value of the fields directly to the functions, instead of their container. Then the functions will specialize to the types of the fields. That can solve the performance issue if the fields are concrete.


Reinforcing that, I tried removing the 3 local declarations, with types, at the top of the code for zSQdensity. The number of allocations did not change.