Memory allocations after reading arrays from .mat file

Dear all,

I have written script that reads arrays from a .mat file and allocated new arrays based on the size of the arrays read from the .mat file.
The issue is that when the new arrays are allocated using the size of read arrays, operations that are applied to these arrays cause allocations. This is not the case if the size of the new arrays is determined independently of the read arrays.

I could repeat this behaviour in a MWE:

sing MAT

function GenerateDummyData()
    a0   = zeros(400)
    file = matopen(string(@__DIR__, "/DummyData.mat"), "w")
    write(file, "a0", a0)
    close(file)
end

function maintest(n)
    #------------#
    file = matopen( string(@__DIR__, "/DummyData.mat") )
    a0   = Array(read(file, "a0"))
    close(file)
    #------------#
    array_length = length(a0)
    # array_length = 400
    a1 = zeros(array_length)
    a2 = zeros(array_length)
    @time for i=1:n
        for j in eachindex(a1)
            a1[j] = a2[j]
        end
    end
end

GenerateDummyData()
maintest(1)
maintest(10)
maintest(100)

Here the size of the new arrays (array_length) is derived from the read arrays (array_length = length(a0)).
This causes allocations that scale with n, the number of sweeps through the new arrays:

  0.000245 seconds (800 allocations: 18.734 KiB)
  0.000726 seconds (8.00 k allocations: 187.344 KiB)
  0.045652 seconds (80.00 k allocations: 1.830 MiB, 67.29% gc time)

This is in contrast with setting array_length = 400 (same size as read array but set independently), which causes no allocation:

  0.000001 seconds
  0.000005 seconds
  0.000043 seconds

I wonder what is the cause (and remedy) for this behaviour, would anyone have a hint?

You can use the @code_warntype macro to pin down type instabilities. When you run this in a REPL you get colored output showing you where Julia was unable to infer a concrete type in that call. Usually, you should aim to avoid all red lines in that output. Applying it to this example gave me the following output (unfortunately without colors here):

julia> @code_warntype maintest(1)                                                                                                                                                                                                
MethodInstance for maintest(::Int64)                                                                                                                                                                                             
  from maintest(n) in Main at ....                                                                                                                                                   
Arguments                                                                                                                                                                                                                        
  #self#::Core.Const(maintest)                                                                                                                                                                                                   
  n::Int64                                                                                                                                                                                                                       
Locals                                                                                                                                                                                                                           
  has_msg::Bool                                                                                                                                                                                                                  
  _msg::Nothing                                                                                                                                                                                                                  
  diff::Base.GC_Diff                                                                                                                                                                                                             
  @_6::Union{Nothing, Tuple{Int64, Int64}}                                                                                                                                                                                       
  val::Nothing                                                                                                                                                                                                                   
  compile_elapsedtimes::Tuple{UInt64, UInt64}                                                                                                                                                                                    
  elapsedtime::UInt64                                                                                                                                                                                                            
  stats::Base.GC_Num                                                                                                                                                                                                             
  a2::Any                                                                                                                                                                                                                        
  a1::Any                                                                                                                                                                                                                        
  array_length::Any                                                                                                                                                                                                              
  a0::Any                                                                                                                                                                                                                        
  file::Union{MAT.MAT_HDF5.MatlabHDF5File, MAT.MAT_v5.Matlabv5File}                                                                                                                                                              
  @_16::Any                                                                                                                                                                                                                      
  i::Int64                                                                                                                                                                                                                       
  j::Any                                                                                                                                                                                                                         
  @_19::Int64                                                                                                                                                                                                                    
  @_20::Nothing                                                                                                                                                                                                                  
Body::Nothing                                                                                                                                                                                                                    
1 ──       Core.NewvarNode(:(has_msg))                                                                                                                                                                                           
│          Core.NewvarNode(:(_msg))                                                                                                                                                                                              
│          Core.NewvarNode(:(diff))                                                                                                                                                                                              
│          Core.NewvarNode(:(@_6))                                                                                                                                                                                               
│          Core.NewvarNode(:(val))                                                                                                                                                                                               
│    %6  = Main.string("....", "/DummyData.mat")::String                                                                                                                                               
│          (file = Main.matopen(%6))                                                                                                                                                                                             
│    %8  = Main.read(file, "a0")::Any                                                                                                                                                                                            
│          (a0 = Main.Array(%8))                                                                                                                                                                                                 
│          Main.close(file)                                                                                                                                                                                                      
│          (array_length = Main.length(a0))                                                                                                                                                                                      
│          (a1 = Main.zeros(array_length))                                                                                                                                                                                       
│          (a2 = Main.zeros(array_length))                                                                                                                                                                                       
│          $(Expr(:meta, :force_compile))                                                                                                                                                                                        
│          (stats = Base.gc_num())                                                                                                                                                                                               
│          (elapsedtime = Base.time_ns())                                                                                                                                                                                        
│          Base.cumulative_compile_timing(true)                                                                                                                                                                                  
└───       (compile_elapsedtimes = Base.cumulative_compile_time_ns())                                                                                                                                                            
2 ──       $(Expr(:enter, #11))                                                                                                                                                                                                  
3 ──       (@_19 = -1)                                                                                                                                                                                                           
│    %21 = (1:n)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])                                                                                                                                                
│          (@_6 = Base.iterate(%21))                                                                                                                                                                                             
│    %23 = (@_6 === nothing)::Bool                                                                                                                                                                                               
│    %24 = Base.not_int(%23)::Bool                                                                                                                                                                                               
└───       goto #9 if not %24                  

If you study the above (with colors) you will see that the first appearance of a type instability is from the line a0 = Array(read(file, "a0")). (I think) This is because read is type instable, because it cannot infer the type of its return value from its arguments alone.

Changing the line to a0 = read(file, "a0")::Vector{Float64} cures the problem in this case, because it promises to Julia that the returned value will be a Vector{Float64}, which is a concrete type. If read does not return a Vector{Float64}, then this line will throw an error, hence, it acts like an assertion.

However, to my own surprise, using a0 = Vector{Float64}(read(file, "a0")) does not help here, although a0 is then also of type Vector{Float64}. Perhaps somebody else can explain why that is the case…

Hope that helps.

1 Like

Thanks a lot! Specifying type with ::Vector{Float64} does solve the issue, at least in the MWE. I will definitely use @code_warntype in the future, thanks for the hint!
What I find puzzling is that if you one checks typeof(a0) without additional type specification it already returns Vector{Float64}. So why is the additional type specification is needed? Also, what is weird is that the allocations appear when touching arrays that have the size of a0 (not directly touching a0). So it seems that just using the size of a “non-properly typed” array (instead of the array itself) is troublesome…

Yes, but this type is only known at run-time. The compiler cannot predict what, exactly, typeof(a0) will be, hence the type instability.

1 Like

OK, so if I understand well, it would a good practice to type annotate every array which is read from external files (using MAT.jl, CSV.jl etc…)?

Type annotation only works if there is truly only ever one type coming out of the read operation. Imagine you could read Float64 or Float32 or Int vectors, then type annotation wouldn’t help you.

The trick in such situations is to just not let the type instability proliferate further than it needs to.
You have one function that does both a read and also does a computation with the read values. By splitting the computation out, you are making a “function barrier”. When that function is called, even though it has to be determined dynamically which method is needed (because the array type is unknown before runtime) the function will be compiled for that specific type the first time a new type appears.
As the function barrier is called only rarely (just once in your case) its dynamic lookup overhead doesn’t matter (it’s somewhere on the order of nano to microseconds).

So the rule of thumb is, split your functions up to avoid overly large impact of unavoidable type instabilities, and don’t mind type instabilities if the code they affect is short or only executed a couple of times.

function maintest(n)
    #------------#
    file = matopen( string(@__DIR__, "/DummyData.mat") )
    a0   = Array(read(file, "a0"))
    close(file)
    # at this point in time, a0 has a concrete value, although it wasn't known to the compiler beforehand
    # so now we call the rest of the logic as a separate function
    rest(n, a0)
end

function rest(n, a0)
    # for this function, n and a0 will always have concrete types
    # so it doesn't suffer from the read type instability
    array_length = length(a0)
    # array_length = 400
    a1 = zeros(array_length)
    a2 = zeros(array_length)
    @time for i=1:n
        for j in eachindex(a1)
            a1[j] = a2[j]
        end
    end
end
5 Likes

Thank you @fatteneder, @DNF and @jules. This really helped a lot !