When is it better to use non-concrete types in structs?

pablosanjose · September 22, 2019, 11:24am

I recently read the excellent post by @quinnj :

I was surprised to learn that DataFrames.jl uses structs with AbstractArrays fields to cut down in compilation time without hurting runtime performance. This is quite a subversive idea to me. I’ve learned to make all my code inferable and my structs concrete, but it seems there is a brand new world beyond this. Is there any good post somewhere on when it is good to do this, and how to do it efficiently?

kristoffer.carlsson · September 22, 2019, 11:36am

You don’t really have to worry about this. For DataFrames, they can have arbitrary many columns and encoding the types of each column into the .column field for a huge number of types will just make the compiler blow up.

This rarely happens when “manually” writing code, only when dynamically specializing on things (like specializing on columns from reading an CSV).

pablosanjose · September 22, 2019, 1:52pm

Thanks for the input @kristoffer.carlsson. So just to be clear, imagine I have a type with “many” (say, seven or eight) parameters, perhaps even complex parametric types as parameters… You are saying that it is always preferable to keep them all than to make some/all fields ::Any? Is there not some threshold for advisable type complexity? (I’m guessing the answer is a matter balancing compile-time and runtime performance, but I would appreciate any expert input since I have never actually experimented with this)

ChrisRackauckas · September 22, 2019, 2:09pm

It’s not always better, and it won’t necessarily always compile faster too, but in the large limit the compiler will explode with too much type information. If you expect to have hundreds of columns with possibly different types and want df.col_i to be type-stable for every column, this is a case that can explode. There’s a really large design space here:

You can use Vector{AbstractVector} to store it all, with the instability. Vector{Any} is probably nicer to the compiler.
You can use NamedTuple{...} to store it all type-stable, but hopefully not get too many columns. Also, switching columns would require you to change your dataframe type.
You can use Vector{Union{Few,Choices,Available}}, and only allow a few choices, but specialize on them. This is essentially what non-Julia packages are doing.

And you can probably find a few other setups for column-based DF storage, each with their own pros and cons.

The pro of (1) is that, if you use function-barriers for your calculations, the runtime cost isn’t all that much. The con is that naive code that just loops over values without going through a barrier will be painfully slow.
The pro of (2) is that you don’t have to worry: everything is type-stable. But the con of (2) is that changing column names around can be a pain since it’s now inherently non-mutable, and also too many columns will give the compiler some issues.
The pro of (3) is that everything will specialize well at compile time, and you can have a ton of columns like (1), but the con is that you have chosen the column types that are allowed and it’s inflexible to other choices.

It would be interesting if there was more local control over the compiler heuristics to try a version of 3 with larger unions.

oxinabox · September 22, 2019, 2:15pm

I’ve heard it said that julia users progress through stages as they learn

All Abstract typed fields: after all it is better than concrete tyeps since more flexible
All fields being type-parameterized: now it is always concrete so optimizer can work
Some Abstract, Some Type-param: when the bulk of work is behind function barrier taking field (not struct) as input it doesn’t matter

In all cases there are some concrete, for things like bools and ints at least.
Also possible a point 0 where everything is concrete.

kristoffer.carlsson · September 22, 2019, 2:30pm

Unless you have compile time problems then yes.

oxinabox · September 22, 2019, 2:41pm

For mutable structs, having them abstract (inc ::Any) also has the advantage of letting you assign something of a different type to them than what they were constructed with.

pablosanjose · September 23, 2019, 5:24am

Thanks to all for the pointers, very instructive. I’ll summarize here what I’ve gathered so far from this. Do correct me if I got anything wrong.

Non-concrete fields in structs is a valid strategy to deal with a problem of poor compile performance specifically due to combinatoric explosion of type parameter values
Therefore having a humungous number of arguments in a type is not a real problem. Rather, it is having them take an exponentially large number of combinations (or having the compiler think they may).
In such case, leaving fields as abstract as possible (::Any) will probably make the compiler sweat less than narrowing them down (to e.g. `::AbstractArray’)
One may also try fields as Unions of a finite set of concrete options
Runtime performance will never be better using this strategy (duh!)
Runtime overhead, however, can be minimized by placing function barriers (i.e. pass fields with a specific runtime type, instead of the whole struct, through a function call), as early as possible in a call stack so that instabilities do not propagate
A question to keep in mind when doing this is: will the function barrier need to be recompiled for an exponentially large number of signature types? If not, problem solved
For mutable structs abstract fields give an added advantage: one can actually mutate the struct even if one needs to change the runtime type of fields. This can be a huge advantage in some situations, like e.g. in DataFrames when exchanging columns or changing their name

Tamas_Papp · September 23, 2019, 6:18am

I wonder if you are aware of

https://docs.julialang.org/en/v1/manual/performance-tips/

pablosanjose · September 23, 2019, 6:28am

Yes, of course. The point is that this strategy goes against some of those recommendations, but it is still valid in some scenarios (to my surprise!)

Tamas_Papp · September 23, 2019, 6:40am

combinatorial explosition of types is mentioned in the context of

https://docs.julialang.org/en/v1/manual/performance-tips/#The-dangers-of-abusing-multiple-dispatch-(aka,-more-on-types-with-values-as-parameters)-1

but I agree that some additional notes about it would be useful.

oxinabox · September 23, 2019, 8:00am

I am not convinced of this. Maybe, maybe not.
I suspect they are equal.

Topic		Replies	Views
Are fields with structtypes to be avoided as well? Performance	3	488	February 12, 2020
Field types New to Julia struct	29	898	May 20, 2022
Composing types with non-concrete fields Internals & Design question	1	535	May 3, 2018
Improving performance due to setting types of the fields of a struct Performance	10	1174	July 12, 2019
Parametric Structs New to Julia performance , parametric-types	10	276	July 23, 2025

When is it better to use non-concrete types in structs?

Related topics