IndexedTables and type-stability of IndexedTables.groupby

Hi,

I am a bit confused by IndexedTable being referred to as a “type-stable” alternative to DataFrames. If I write a function which does some operations and returns an IndexedTable, the @code_warntype still flags the IndexedTable return variable in red. What am I misunderstanding here?

I am also unsure whether common operations on IndexedTables, such as IndexedTables.groupby, are type-stable. Is it possible to assert the type of the output of IndexedTables.groupby if I know the type of each column beforehand? Can someone provide an example?

Below is an example of a typical use case in my setting. Is there any way to make data_1, d and d_1 type-stable? The @code_warntype output flags them as Any. Type assertion doesn’t work either, and the type IndexedTable is marked in red in @code_warntype.

using DataFrames, IndexedTables

function test(meta_data::R)::DataFrame where {R<:DataFrame}

     # Convert the DataFrame to an IndexedTable
     data_1 = table(meta_data)
     # groupby is qualified because DataFrames also exports a groupby
     d = IndexedTables.groupby(df -> 1 ./ (1 .+ df.eV_observed[1]), data_1, (:std_id, :act_yr, :draws_index, :cmp_id), select = (:V_observed, :eV_observed))
     # Rebuild the table so that the computed column is named :Prob
     d_1 = table(IndexedTables.columns(d)..., names = (:std_id, :act_yr, :draws_index, :cmp_id, :Prob), copy = false)
     d_final = DataFrames.DataFrame(d_1)::DataFrame
     return d_final::DataFrame
end

@code_warntype test(meta_data) 

Variables
  #self#::Core.Compiler.Const(prob_stage2_create_groupby_new1, false)
  meta_data::DataFrame
  #3209::getfield(Main, Symbol("##3209#3210"))
  data_1::Any
  d::Any
  d_1::Any
  d_final::DataFrame

Body::DataFrame
1 ─ %1  = Main.DataFrame::Core.Compiler.Const(DataFrame, false)
│         (data_1 = Main.table(meta_data))
│         (#3209 = %new(Main.:(##3209#3210)))
│   %4  = #3209::Core.Compiler.Const(getfield(Main, Symbol("##3209#3210"))(), false)
│   %5  = (:V_observed, :eV_observed)::Core.Compiler.Const((:V_observed, :eV_observed), false)
│   %6  = (:select,)::Core.Compiler.Const((:select,), false)
│   %7  = Core.apply_type(Core.NamedTuple, %6)::Core.Compiler.Const(NamedTuple{(:select,),T} where T<:Tuple, false)
│   %8  = Core.tuple(%5)::Core.Compiler.Const(((:V_observed, :eV_observed),), false)
│   %9  = (%7)(%8)::NamedTuple{(:select,),Tuple{Tuple{Symbol,Symbol}}}
│   %10 = IndexedTables.groupby::Core.Compiler.Const(IndexedTables.groupby, false)
│   %11 = Core.kwfunc(%10)::Core.Compiler.Const(getfield(IndexedTables, Symbol("#kw##groupby"))(), false)
│   %12 = IndexedTables.groupby::Core.Compiler.Const(IndexedTables.groupby, false)
│   %13 = data_1::Any
│   %14 = (:std_id, :act_yr, :draws_index, :cmp_id)::Core.Compiler.Const((:std_id, :act_yr, :draws_index, :cmp_id), false)
│         (d = (%11)(%9, %12, %4, %13, %14))
│   %16 = IndexedTables.columns::Core.Compiler.Const(IndexedTables.columns, false)
│   %17 = (%16)(d)::Any
│   %18 = (:std_id, :act_yr, :draws_index, :cmp_id, :Prob)::Core.Compiler.Const((:std_id, :act_yr, :draws_index, :cmp_id, :Prob), false)
│   %19 = (:names, :copy)::Core.Compiler.Const((:names, :copy), false)
│   %20 = Core.apply_type(Core.NamedTuple, %19)::Core.Compiler.Const(NamedTuple{(:names, :copy),T} where T<:Tuple, false)
│   %21 = Core.tuple(%18, false)::Core.Compiler.Const(((:std_id, :act_yr, :draws_index, :cmp_id, :Prob), false), false)
│   %22 = (%20)(%21)::NamedTuple{(:names, :copy),Tuple{NTuple{5,Symbol},Bool}}
│   %24 = Core.tuple(%22, Main.table)::Core.Compiler.PartialStruct(Tuple{NamedTuple{(:names, :copy),Tuple{NTuple{5,Symbol},Bool}},typeof(table)}, Any[NamedTuple{(:names, :copy),Tuple{NTuple{5,Symbol},Bool}}, Core.Compiler.Const(IndexedTables.table, false)])
│         (d_1 = Core._apply(%23, %24, %17))
│   %26 = DataFrames.DataFrame::Core.Compiler.Const(DataFrame, false)
│   %27 = (%26)(d_1)::Any
│         (d_final = Core.typeassert(%27, Main.DataFrame))
│   %29 = Core.typeassert(d_final, Main.DataFrame)::DataFrame
│   %30 = Base.convert(%1, %29)::DataFrame
│   %31 = Core.typeassert(%30, %1)::DataFrame
└──       return %31

But your input is a DataFrame, which is untyped: its type does not record the types of its columns. Your input needs to be an IndexedTable as well, and note that IndexedTable is not a subtype of DataFrame.

Thanks for the reply, though I think the issue persists. Please see the modified output below. Variables d and d_1 still have red flags whereas d_final (a DataFrame) is coded in blue.

Given that I know the exact types of the columns of the tables d and d_1, I should be able to improve on the inferred Any for d and d_1, but I am unable to see how to do so.

function test(meta_data_table::R)::DataFrame where {R<:IndexedTable}

     d = IndexedTables.groupby(df -> 1 ./ (1 .+ df.eV_observed[1]), meta_data_table, (:std_id, :act_yr, :draws_index, :cmp_id), select = (:V_observed, :eV_observed))
     d_1 = table(IndexedTables.columns(d)..., names = (:std_id, :act_yr, :draws_index, :cmp_id, :Prob), copy = false)
     d_final = DataFrames.DataFrame(d_1)::DataFrame
     return d_final::DataFrame
end

@code_warntype test(meta_data_table)
Variables
  #self#::Core.Compiler.Const(test, false)
  meta_data_table::IndexedTable{StructArrays.StructArray{NamedTuple{(:std_id, :act_yr, :draws_index, :draws_q, :cmp_id, :cgpa, :salary, :category_id, :cutoff, :V_observed, :eV_observed),Tuple{Union{Missing, String},Union{Missing, String},Int64,Float64,Int64,Float64,Union{Missing, Float64},Int64,Float64,Float64,Float64}},1,NamedTuple{(:std_id, :act_yr, :draws_index, 
:draws_q, :cmp_id, :cgpa, :salary, :category_id, :cutoff, :V_observed, :eV_observed),Tuple{Array{Union{Missing, String},1},Array{Union{Missing, String},1},Array{Int64,1},Array{Float64,1},Array{Int64,1},Array{Float64,1},Array{Union{Missing, Float64},1},Array{Int64,1},Array{Float64,1},Array{Float64,1},Array{Float64,1}}},Int64}}
  #783::getfield(Main, Symbol("##783#784"))
  d::Any
  d_1::Any
  d_final::DataFrame

Body::DataFrame
1 ─ %1  = Main.DataFrame::Core.Compiler.Const(DataFrame, false)
│         (#783 = %new(Main.:(##783#784)))
│   %3  = #783::Core.Compiler.Const(getfield(Main, Symbol("##783#784"))(), false)
│   %4  = (:V_observed, :eV_observed)::Core.Compiler.Const((:V_observed, :eV_observed), false)
│   %5  = (:select,)::Core.Compiler.Const((:select,), false)
│   %6  = Core.apply_type(Core.NamedTuple, %5)::Core.Compiler.Const(NamedTuple{(:select,),T} where T<:Tuple, false)
│   %7  = Core.tuple(%4)::Core.Compiler.Const(((:V_observed, :eV_observed),), false)
│   %8  = (%6)(%7)::NamedTuple{(:select,),Tuple{Tuple{Symbol,Symbol}}}
│   %9  = IndexedTables.groupby::Core.Compiler.Const(IndexedTables.groupby, false)
│   %10 = Core.kwfunc(%9)::Core.Compiler.Const(getfield(IndexedTables, Symbol("#kw##groupby"))(), false)
│   %11 = IndexedTables.groupby::Core.Compiler.Const(IndexedTables.groupby, false)
│   %12 = (:std_id, :act_yr, :draws_index, :cmp_id)::Core.Compiler.Const((:std_id, :act_yr, :draws_index, :cmp_id), false)
│         (d = (%10)(%8, %11, %3, meta_data_table, %12))
│   %14 = IndexedTables.columns::Core.Compiler.Const(IndexedTables.columns, false)
│   %15 = (%14)(d)::Any
│   %16 = (:std_id, :act_yr, :draws_index, :cmp_id, :Prob)::Core.Compiler.Const((:std_id, :act_yr, :draws_index, :cmp_id, :Prob), false)
│   %17 = (:names, :copy)::Core.Compiler.Const((:names, :copy), false)
│   %18 = Core.apply_type(Core.NamedTuple, %17)::Core.Compiler.Const(NamedTuple{(:names, :copy),T} where T<:Tuple, false)
│   %20 = (%18)(%19)::NamedTuple{(:names, :copy),Tuple{NTuple{5,Symbol},Bool}}
│   %21 = Core.kwfunc(Main.table)::Core.Compiler.Const(getfield(IndexedTables, Symbol("#kw##table"))(), false)
│   %22 = Core.tuple(%20, Main.table)::Core.Compiler.PartialStruct(Tuple{NamedTuple{(:names, :copy),Tuple{NTuple{5,Symbol},Bool}},typeof(table)}, Any[NamedTuple{(:names, :copy),Tuple{NTuple{5,Symbol},Bool}}, Core.Compiler.Const(IndexedTables.table, false)])
│         (d_1 = Core._apply(%21, %22, %15))
│   %24 = DataFrames.DataFrame::Core.Compiler.Const(DataFrame, false)
│   %25 = (%24)(d_1)::Any
│         (d_final = Core.typeassert(%25, Main.DataFrame))
│   %27 = Core.typeassert(d_final, Main.DataFrame)::DataFrame
│   %28 = Base.convert(%1, %27)::DataFrame
│   %29 = Core.typeassert(%28, %1)::DataFrame
└──       return %29

I see. I think it’s not possible to infer types in this case, because the by variables are passed in as a tuple of Symbols, so it cannot dispatch on the actual type of each column.
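A minimal sketch of the general idea, using only Base Julia. It is an illustration, not a description of how IndexedTables.groupby is implemented, and the helper names pick_runtime and pick_static are hypothetical. When a column is chosen by a Symbol value, the compiler only sees the type Symbol and cannot tell which column comes back. When the name is lifted into the type domain with Val, it can.

```julia
# Columns with different element types, stored in a NamedTuple
cols = (id = [1, 2, 3], score = [0.5, 1.5, 2.5])

# Hypothetical helper: the column name is an ordinary runtime value
pick_runtime(nt, name::Symbol) = getfield(nt, name)

# Hypothetical helper: the column name is part of the argument's type
pick_static(nt, ::Val{name}) where {name} = getfield(nt, name)

# Ask inference what each method returns for these argument types
runtime_type = first(Base.return_types(pick_runtime, (typeof(cols), Symbol)))
static_type  = first(Base.return_types(pick_static, (typeof(cols), Val{:score})))

println("by Symbol value: ", runtime_type)
println("by Val type:     ", static_type)
```

The first inferred type is not concrete because either column could be returned, while the second is the concrete vector type of the score column.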

I see. This definitely helps clarify. Is there a simple example where it is possible to infer types while using IndexedTables.groupby?

I think a good question to ask is why you want the types to be inferred. Sometimes code that appears type-unstable can still be efficient, because the key pieces of code that need to be fast run behind a function boundary.

E.g.

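using DataFrames

# The column types of df are not known when a(df) is compiled
a(df) = begin
  sum(df.a), sum(df.b)
end

df = DataFrame(a = rand(100_000_000), b = rand(100_000_000))

@time a(df)  # first call includes compilation time
@time a(df)  # second call measures run time only

@code_warntype a(df)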

You should be happy with the speed even though type inference isn’t possible. It’s still fast because sum is fast. sum knows the types of df.a and df.b at the time sum(df.a) runs, but the compiler does not know them when analysing a(). The call to sum is the function boundary (often called a function barrier).
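The same effect can be sketched without DataFrames. In this illustration LooseTable, sum_of_squares and column_sum_of_squares are all hypothetical names, and LooseTable is only a stand-in for any table whose column types are hidden from the compiler. The outer function is not inferable, but the loop lives in a separate kernel function that Julia specializes on the concrete vector type it receives at run time.

```julia
# Hypothetical stand-in for a table whose column types are hidden from the compiler
struct LooseTable
    columns::Dict{Symbol,Any}
end

# Kernel: specialized separately for each concrete vector type it is called with
function sum_of_squares(v::AbstractVector)
    total = zero(eltype(v))
    for x in v
        total += x * x
    end
    return total
end

# Outer function: `col` is inferred as Any, but the call below dispatches at
# run time on the concrete type of `col`, so the loop runs in specialized code
function column_sum_of_squares(t::LooseTable, name::Symbol)
    col = t.columns[name]
    return sum_of_squares(col)
end

t = LooseTable(Dict{Symbol,Any}(:a => [1.0, 2.0, 3.0], :b => [1, 2, 3]))
println(column_sum_of_squares(t, :a))
println(column_sum_of_squares(t, :b))
```

The cost of the untyped lookup is one dynamic dispatch per call to column_sum_of_squares, which is negligible when the kernel does a lot of work per call.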

Yes, that makes sense. I am happy with the speeds, especially since IndexedTables.groupby seems to be the go-to method for doing many groupby operations (which is my situation), following “Group-by performance benchmarks and recommendations”.

The above example is useful but probably not a good analogy for problems involving groupby in practice where the operations are often cumbersome and not as simple as just summing over a column. But, if you’re saying I don’t really need to worry too much about the return-type of IndexedTables.groupby, then it’s fine.

That’s what I am saying. So don’t assume that failed type inference means unoptimised speed. That’s not true of groupby unless your groupby is very complicated.

Thanks, yes that makes sense and makes me worry less.

Out of curiosity, what does the “type-stability” of IndexedTables referred to here mean, then? Does the concept of type-stability apply to DataFrames or Tables?

I asked that when I didn’t understand the function boundary. I thought type stability was needed for speed. That is not so, as I explained above.