In-place Cholesky solve with zero allocation (simplicial vs. supernodal)

I am unable to avoid allocation when performing an in-place solve against a simplicial Cholesky factorisation, but have no such problem with a supernodal factorisation.

Note that:

  • the allocation is small, but since I am performing many millions of solves (as part of an eigenproblem using @stabbles' excellent ArnoldiMethod.jl package), this adds up to hundreds of GB with associated GC overhead;
  • simply sidestepping the problem by using the supernodal factorisation is suboptimal, since the size and sparsity of the matrices are such that the simplicial factorisation provides a much faster solve.
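For reference, the standard solve path I am trying to beat looks like the sketch below: the factorisation is computed once and reused, but every backslash solve still allocates a fresh result vector, and that per-solve allocation is what adds up.

```julia
using SparseArrays
using LinearAlgebra

N = 300
A = sprand(N, N, 0.005)
A = A + A' + N*I      # symmetric positive definite
b = rand(N)

F = cholesky(A)       # CHOLMOD factorisation, computed once
x = F \ b             # reused across solves, but each `\` allocates a new x
@assert A * x ≈ b
```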

I understand that CHOLMOD hooks into Julia's allocation routines, so I am assuming that the allocations are internal to CHOLMOD, and thus this might have to be a question for upstream. However, I thought I would ask here first, as I am touching the internals of various stdlibs in order to call the cholmod_solve2 interface without moving between the Julia and CHOLMOD representations of the various matrices; there could easily be a Julia-side implementation detail that I am missing, perhaps relating to the common struct. As such, any thoughts from those who've touched the core implementation of cholmod.jl, e.g. @sacha or @andreasnoack, would be appreciated.

The following code demonstrates the problem (much of this is hacked from the SuiteSparse stdlib).

using SparseArrays
using LinearAlgebra
using SuiteSparse
using BenchmarkTools

import SuiteSparse.CHOLMOD: Sparse, Dense, Factor, C_Dense, C_Factor, C_Sparse, SuiteSparse_long
import SuiteSparse.CHOLMOD: common_struct, common_supernodal, common_nmethods, change_factor!, set_print_level, defaults
import SuiteSparse.CHOLMOD: VTypes, @cholmod_name, fact_, cholesky!, CHOLMOD_A
import SparseArrays: getcolptr, getrowval

function supercholesky(A::Sparse; shift::Real=0.0, check::Bool = true,
  perm::Union{Nothing,AbstractVector{SuiteSparse_long}}=nothing)
  
  cm = defaults(common_struct[Threads.threadid()])
  set_print_level(cm, 0)
  
  # Force a supernodal solution (eliminates alloc on solve)
  unsafe_store!(common_supernodal[Threads.threadid()], 2)
  
  # Compute the symbolic factorization
  F = fact_(A, cm; perm = perm)
  
  # Compute the numerical factorization
  cholesky!(F, A; shift = shift, check = check)
  
  return F
  
end

function _solve!(X::Dense{Tv}, Yref::Ref{Ptr{C_Dense{Tv}}}, Eref::Ref{Ptr{C_Dense{Tv}}}, F::Factor{Tv}, B::Dense{Tv}) where Tv<:VTypes
  
  # Pointer to pre-allocated dense matrix
  Xref = Ptr{C_Dense{Tv}}(pointer(X))
  sys = CHOLMOD_A # Solve type
  Bset = C_NULL   # Unused parameter
  Xset = C_NULL   # Unused parameter
  
  if size(F,1) != size(B,1)
    throw(DimensionMismatch("LHS and RHS should have the same number of rows. " *
    "LHS has $(size(F,1)) rows, but RHS has $(size(B,1)) rows."))
  end
  
  if !issuccess(F)
    s = unsafe_load(pointer(F))
    if s.is_ll == 1
      throw(LinearAlgebra.PosDefException(s.minor))
    else
      throw(LinearAlgebra.ZeroPivotException(s.minor))
    end
  end
  
  res = ccall((@cholmod_name("solve2"), :libcholmod), Cint,
    (Cint,                                                    # sys (input)
     Ptr{C_Factor{Tv}}, Ptr{C_Dense{Tv}}, Ptr{C_Sparse{Tv}},  # F, B, Bset (input)
     Ref{Ptr{C_Dense{Tv}}}, Ref{Ptr{C_Sparse{Tv}}},           # Xref, Xset (output)
     Ref{Ptr{C_Dense{Tv}}}, Ref{Ptr{C_Dense{Tv}}},            # Yref, Eref (workspace)
     Ptr{UInt8}),                                             # common
    sys, F, B, Bset, Xref, Xset, Yref, Eref,
    SuiteSparse.CHOLMOD.common_struct[Threads.threadid()])
  
  if res != 1
    throw(ErrorException("CHOLMOD solve failure"))
  end
  
  return 
  
end


T = Float64
N = 300

X = Dense(Array{T}(undef, N))
B = Dense(rand(N))
Yref = Ref(Ptr{C_Dense{T}}(C_NULL))
Eref = Ref(Ptr{C_Dense{T}}(C_NULL))

A = sprand(N, N, 0.005)
A = A + A' + N*I
As = Sparse(A)

F_simple = cholesky(A)
F_super = supercholesky(As);

_solve!(X, Yref, Eref, F_simple, B);
@show F_simple
A\B ≈ X ? println("Simplicial pass") : error("Simplicial fail")

_solve!(X, Yref, Eref, F_super, B);
@show F_super
A\B ≈ X ? println("Supernodal pass") : error("Supernodal fail")

@benchmark _solve!($X, $Yref, $Eref, $F_simple, $B)

@benchmark _solve!($X, $Yref, $Eref, $F_super, $B)

Simplicial result:

@benchmark _solve!($X, $Yref, $Eref, $F_simple, $B)
BenchmarkTools.Trial:
  memory estimate:  9.46 KiB
  allocs estimate:  2
  --------------
  minimum time:     4.200 μs (0.00% GC)
  median time:      4.443 μs (0.00% GC)
  mean time:        4.476 μs (0.72% GC)
  maximum time:     55.157 μs (82.83% GC)
  --------------
  samples:          10000
  evals/sample:     7

Supernodal result:

@benchmark _solve!($X, $Yref, $Eref, $F_super, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.299 μs (0.00% GC)
  median time:      18.800 μs (0.00% GC)
  mean time:        19.150 μs (0.00% GC)
  maximum time:     88.299 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I don’t see why the default version would cause allocation. I suspect it is happening inside CHOLMOD, though I’m not sure why it would be; it would be good if you could figure out exactly where it happens.

Thanks @andreasnoack. Indeed, according to the CHOLMOD user manual it should not allocate once the workspaces Yref and Eref have been initialised on the first call. I believe there are functions in the library that return the number of allocations performed, so I will see if I can narrow this down further by that route.

The cholmod_common structure contains some pertinent fields, including malloc_count, which tracks the cumulative number of mallocs minus frees, and the sizes of the internally allocated workspace, iworksize and xworksize, which are what they sound like with one’s Fortran hat on.

Evidently this doesn’t allow one to determine whether memory has been allocated and released during the call. All I am able to say is that when running the following function with the default factorisation F_simple as the input F (which appears to cause allocation according to BenchmarkTools), the workspace has not grown. Benchmarking the same call gives 5002 allocations, which (accounting for initialisation, the copies, and the sum) amounts to two allocations per call to _solve!, as previously found.

I’m not sure how best to proceed. Given that we provide CHOLMOD with a pointer to Julia’s malloc, I suppose I could swap that out for a different function which logs calls before allocating. This would at least prove that the allocation is happening within CHOLMOD. But it feels as though I need to reproduce this in C to make progress. I'm reluctant because… two-language problem!


function changingman(X, Yref, Eref, F, B)

  # Allow CHOLMOD to do its allocations due to change of Factor
  _solve!(X, Yref, Eref, F, B);

  for i in 1:1000

    c1 = copy(SuiteSparse.CHOLMOD.common_struct[Threads.threadid()])
    _solve!(X, Yref, Eref, F, B)
    c2 = copy(SuiteSparse.CHOLMOD.common_struct[Threads.threadid()])

    if sum(c2-c1) > 0
      error("common struct has changed")
    end
  
  end

end

Yeah. I agree that it might be a good idea to reproduce the allocation in C at this stage.

I replaced CHOLMOD’s allocation routines with a self-managed buffer and can confirm that CHOLMOD is allocating. And yes, I am cognisant of the sanity (or lack thereof) of this debugging approach, I blame the ease of Julia’s C FFI!

cholmodptrs = Vector{Ptr{Cvoid}}()
cholmodheap = Vector{Vector{UInt8}}()

function logmalloc(size::Csize_t)
  mem = Vector{UInt8}(undef, size)
  ptr = Ptr{Cvoid}(pointer(mem))
  push!(cholmodheap, mem)
  push!(cholmodptrs, ptr)
  println("CHOLMOD requests malloc of $size bytes, given $ptr")
  return ptr
end

function logfree(ptr::Ptr{Cvoid})
  i = findall(x->x == ptr, cholmodptrs)
  if length(i) > 0
    sz = length(cholmodheap[i[1]])
    println("CHOLMOD requests free of $sz bytes from pointer $ptr")
  else
    println("CHOLMOD requests free of unmanaged pointer")
  end
  return nothing
end

logmalloc_ptr = @cfunction(logmalloc, Ptr{Cvoid}, (Csize_t,))
logfree_ptr = @cfunction(logfree, Cvoid, (Ptr{Cvoid},))

# Point SuiteSparse_config's malloc_func (slot 1) and free_func (slot 4)
# at the logging routines
cnfg = cglobal((:SuiteSparse_config, :libsuitesparseconfig), Ptr{Cvoid})
unsafe_store!(cnfg, logmalloc_ptr, 1)
unsafe_store!(cnfg, logfree_ptr, 4)

This shows that:

julia> _solve!(X, Yref, Eref, F_simple, B);
CHOLMOD requests free of unmanaged pointer
CHOLMOD requests free of unmanaged pointer
CHOLMOD requests malloc of 56 bytes, given Ptr{Nothing} @0x0000000118c5c200
CHOLMOD requests malloc of 9600 bytes, given Ptr{Nothing} @0x00007fcff2a04c80

julia> _solve!(X, Yref, Eref, F_simple, B);
CHOLMOD requests free of 9600 bytes from pointer Ptr{Nothing} @0x00007fcff2a04c80
CHOLMOD requests free of 56 bytes from pointer Ptr{Nothing} @0x0000000118c5c200
CHOLMOD requests malloc of 56 bytes, given Ptr{Nothing} @0x0000000118c5c4d0
CHOLMOD requests malloc of 9600 bytes, given Ptr{Nothing} @0x00007fcff2af7680

Thus on each call a (perfectly valid AFAIK) buffer is released and reallocated. I note that 9600 bytes is (4 x sizeof(double) x N) where N is the size of the RHS in the solve. Off to read CHOLMOD’s source code to determine where this is happening.
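As a quick check of that arithmetic (N = 300 in the example above, and sizeof(double) is 8 bytes):

```julia
# The logged 9600-byte workspace matches 4 × sizeof(double) × N
N = 300
workspace_bytes = 4 * sizeof(Float64) * N
println(workspace_bytes)   # prints 9600
```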


Okay, I think I found out what is happening: https://github.com/DrTimothyAldenDavis/SuiteSparse/issues/45

Given the above, and in this specific case, one can avoid allocation by manipulating the workspace array after the call, such that CHOLMOD does not needlessly reallocate it.

Inserting the following within the solve function:

Y = unsafe_load(Yref[])
# Restore the nrow and leading-dimension fields that cholmod_solve2 mutates,
# so the next call finds a correctly shaped workspace and reuses it
nrow = Int(Y.nzmax ÷ Y.ncol)
d = nrow
unsafe_store!(Yref[], C_Dense{Float64}(nrow, Y.ncol, Y.nzmax, d, Y.x, Y.z, Y.xtype, Y.dtype))
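
The benchmark below calls the patched function as _msolve!, which I haven't spelled out in full; here is a sketch of it, assuming (my assumption, not shown in the thread) that it is simply _solve! from above with the patch applied after the call, which is equivalent to inserting it after the ccall inside the function. It relies on the definitions of _solve!, Dense, Factor, etc. given earlier.

```julia
# Sketch (assumed): _solve! plus the workspace patch applied afterwards
function _msolve!(X::Dense{Tv}, Yref::Ref{Ptr{C_Dense{Tv}}},
                  Eref::Ref{Ptr{C_Dense{Tv}}}, F::Factor{Tv},
                  B::Dense{Tv}) where Tv<:VTypes
  _solve!(X, Yref, Eref, F, B)
  # Restore the fields of Y that cholmod_solve2 mutates, so the next call
  # finds a correctly shaped workspace and does not free/reallocate it
  Y = unsafe_load(Yref[])
  nrow = Int(Y.nzmax ÷ Y.ncol)
  unsafe_store!(Yref[], C_Dense{Tv}(nrow, Y.ncol, Y.nzmax, nrow,
                                    Y.x, Y.z, Y.xtype, Y.dtype))
  return
end
```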

With this change in place, benchmarking yields the expected behaviour:

julia> @benchmark  _msolve!($X, $Yref, $Eref, $F_simple, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.386 μs (0.00% GC)
  median time:      4.671 μs (0.00% GC)
  mean time:        4.795 μs (0.00% GC)
  maximum time:     20.474 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

Note that this run is on a different machine, so the absolute timings are also slightly faster.