Why is this type unstable?

using Glob, NCDatasets

const colpath="path_to_data"

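# typed loader: read variable "T" in full and assert the requested concrete Array type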
ncload(ds, ::Type{Array{T,N}}) where {T,N} = ds["T"][:,:,:,:]::Array{T,N}

function foo()
    files=glob("*.??????.nc4",colpath)
    
    NCDataset(files[1]) do ds
        ncv=ds["T"]
        nctype=eltype(ncv)
        ncdims=size(ncv)
        @show nctype ncdims
        ncload(ds,Array{nctype,length(ncdims)})
    end
end

@code_warntype foo()
MethodInstance for foo()
  from foo() @ Main ~/tmp/julia/foo.jl:7
Arguments
  #self#::Core.Const(foo)
Locals
  #10::var"#10#11"
  files::Vector{String}
Body::Array
1 ─      (files = Main.glob("*.??????.nc4", Main.colpath))
│        (#10 = %new(Main.:(var"#10#11")))
│   %3 = #10::Core.Const(var"#10#11"())
│   %4 = Base.getindex(files, 1)::String
│   %5 = Main.NCDataset(%3, %4)::Array
└──      return %5

It knows that the return value is an Array, but it does not know its dimensionality or element type, even though both nctype and ncdims are already defined by the time ncload is called. If I call ncload as

ncload(ds,Array{Union{Missing,Float32},4})

then @code_warntype shows that everything is type-stable:

@code_warntype foo()
MethodInstance for foo()
  from foo() @ Main ~/tmp/julia/foo.jl:7
Arguments
  #self#::Core.Const(foo)
Locals
  #12::var"#12#13"
  files::Vector{String}
Body::Array{Union{Missing, Float32}, 4}
1 ─      (files = Main.glob("*.??????.nc4", Main.colpath))
│        (#12 = %new(Main.:(var"#12#13")))
│   %3 = #12::Core.Const(var"#12#13"())
│   %4 = Base.getindex(files, 1)::String
│   %5 = Main.NCDataset(%3, %4)::Array{Union{Missing, Float32}, 4}
└──      return %5

I want to get the element type and size of the data from a dataset, pass them to the ncload function, and get a type-stable result that is efficiently allocated as a single contiguous chunk of memory. What am I doing wrong?

My guess is that because nctype can be any type and length(ncdims) can be any integer >= 0, the compiler cannot guarantee a particular specialized Array type.
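A minimal sketch of the same pattern with no NetCDF involved (all names are illustrative) shows why: the element type and dimensionality are only runtime values, so Array{T,N} is assembled at runtime and the call becomes a dynamic dispatch whose result inference cannot narrow to a concrete array type.

make_array(::Type{Array{T,N}}) where {T,N} = zeros(T, ntuple(_ -> 2, N))

function unstable(spec::Dict)
    T = spec["type"]::DataType    # element type known only at runtime
    N = spec["ndims"]::Int        # dimensionality known only at runtime
    make_array(Array{T,N})        # dynamic dispatch, just like the ncload call
end

@code_warntype unstable(Dict{String,Any}("type" => Float32, "ndims" => 4)) should report Body::Array, mirroring foo() above.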

From the Performance Tips section of the NCDatasets.jl documentation:

Reading data from a file is not type-stable, because the type of the output of the read operation depends on the type defined in the NetCDF file and on the values of various attributes (like scale_factor, add_offset, and units for time conversion). All this information cannot be inferred from a static analysis of the source code. It is therefore recommended to use a type annotation if the resulting type of a read operation is known:
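The docs then give an example of such an annotation; a sketch along those lines, with a hypothetical file and variable name:

ds = NCDataset("file.nc")
# annotate the read if you already know what the file contains
temp = ds["temperature"][:,:]::Array{Float64,2}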

Yes, the only way to make things type-stable is to hard-code type annotations, which I do not want to do. But I don’t understand the difference between this

ncload(ds,Array{nctype,length(ncdims)})

and this

ncload(ds,Array{Union{Missing,Float32},4})

I am passing the same values to ncload. Shouldn’t the compiler derive the types from the values of the arguments? Or does this mean that the type stability of foo and the type stability of ncload are different things?

For type stability, you have to be able to determine the type at compile time, before you know what is in the file.
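You can check this split directly (a sketch reusing the definitions above): inference succeeds for ncload once its argument types are concrete, while foo only constructs those types from runtime values.

Base.return_types(ncload, (NCDataset, Type{Array{Union{Missing,Float32},4}}))
# should report a single inferred return type: Array{Union{Missing, Float32}, 4}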

It looks like this can’t be avoided. If I am doing computationally heavy analysis, how important is it to be type-stable? Should I hard-code the data types each time, or just let the compiler do what it wants with the data?

Search for “Function Barrier” in the Julia docs (I’d link it, but I’m on my phone).

Yes, I’ve read about function barriers, and I am trying to implement one. My concern is that whatever arrays I pass through the function barrier should be contiguously allocated in memory.

The trick here is to manage type instability at the function call barrier.

Notice that ncload here is type-stable after foo dynamically dispatches to it. The return type of ncload can be determined from the types of its arguments.

I would split my code into functions where I do not know the type and functions where I do know the type.
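For example, here is a sketch of that split (heavy_analysis is a hypothetical kernel): the read stays type-unstable, but the single dynamic dispatch at the barrier hands heavy_analysis a concrete array type to compile against.

# type-stable kernel: compiled for the concrete array type it receives
function heavy_analysis(data::AbstractArray)
    acc = 0.0
    for x in data
        acc += ismissing(x) ? 0.0 : Float64(x)
    end
    return acc
end

# type-unstable loader: the result type of the read is unknown here,
# but the call to heavy_analysis is the barrier where dispatch resolves it
function load_and_process(file)
    NCDataset(file) do ds
        data = ds["T"][:,:,:,:]   # type-unstable read
        heavy_analysis(data)      # function barrier
    end
end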

So, if I do something like

data = NCDataset(file) do ds
    ds["T"][:,:,:,:] # this is type-unstable anyway
end

result = do_heavy_analysis_type_stable_function(data)

will I be OK in terms of performance? Or do I still need to write

data = NCDataset(file) do ds
    ds["T"][:,:,:,:]::Array{Union{Missing,Float32},4} # this is type-stable
end

result = do_heavy_analysis_type_stable_function(data)

each time, if I really care about performance?

The difference between the two should be pretty negligible, though the first might be slightly faster. The annotation only adds the cost of a single runtime type check, comparable to an if statement.
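For illustration, the annotated read is roughly equivalent to the following, so the entire extra cost is the one typeassert check:

data = NCDataset(file) do ds
    x = ds["T"][:,:,:,:]    # read; inferred only loosely
    Core.typeassert(x, Array{Union{Missing,Float32},4})    # the single runtime check
end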

Indeed, in most cases I would just use the function barrier without cluttering the code with type annotations. I noticed a formatting issue in the NCDatasets documentation; I fixed it and slightly expanded this section:
https://alexander-barth.github.io/NCDatasets.jl/dev/performance/

Thanks! This makes things quite easy, since the computational core is usually enclosed in some sort of function anyway. And thanks for fixing the documentation!
