When is it better to use non-concrete types in structs?

I recently read the excellent post by @quinnj :

I was surprised to learn that DataFrames.jl uses structs with AbstractArrays fields to cut down in compilation time without hurting runtime performance. This is quite a subversive idea to me. I’ve learned to make all my code inferable and my structs concrete, but it seems there is a brand new world beyond this. Is there any good post somewhere on when it is good to do this, and how to do it efficiently?

2 Likes

You don’t really have to worry about this. For DataFrames, they can have arbitrary many columns and encoding the types of each column into the .column field for a huge number of types will just make the compiler blow up.

This rarely happens when “manually” writing code, only when dynamically specializing on things (like specializing on columns from reading an CSV).

2 Likes

Thanks for the input @kristoffer.carlsson. So just to be clear, imagine I have a type with “many” (say, seven or eight) parameters, perhaps even complex parametric types as parameters… You are saying that it is always preferable to keep them all than to make some/all fields ::Any? Is there not some threshold for advisable type complexity? (I’m guessing the answer is a matter balancing compile-time and runtime performance, but I would appreciate any expert input since I have never actually experimented with this)

It’s not always better, and it won’t necessarily always compile faster too, but in the large limit the compiler will explode with too much type information. If you expect to have hundreds of columns with possibly different types and want df.col_i to be type-stable for every column, this is a case that can explode. There’s a really large design space here:

  1. You can use Vector{AbstractVector} to store it all, with the instability. Vector{Any} is probably nicer to the compiler.
  2. You can use NamedTuple{...} to store it all type-stable, but hopefully not get too many columns. Also, switching columns would require you to change your dataframe type.
  3. You can use Vector{Union{Few,Choices,Available}}, and only allow a few choices, but specialize on them. This is essentially what non-Julia packages are doing.

And you can probably find a few other setups for column-based DF storage, each with their own pros and cons.

  • The pro of (1) is that, if you use function-barriers for your calculations, the runtime cost isn’t all that much. The con is that naive code that just loops over values without going through a barrier will be painfully slow.
  • The pro of (2) is that you don’t have to worry: everything is type-stable. But the con of (2) is that changing column names around can be a pain since it’s now inherently non-mutable, and also too many columns will give the compiler some issues.
  • The pro of (3) is that everything will specialize well at compile time, and you can have a ton of columns like (1), but the con is that you have chosen the column types that are allowed and it’s inflexible to other choices.

It would be interesting if there was more local control over the compiler heuristics to try a version of 3 with larger unions.

4 Likes

I’ve heard it said that julia users progress through stages as they learn

  1. All Abstract typed fields: after all it is better than concrete tyeps since more flexible
  2. All fields being type-parameterized: now it is always concrete so optimizer can work
  3. Some Abstract, Some Type-param: when the bulk of work is behind function barrier taking field (not struct) as input it doesn’t matter

In all cases there are some concrete, for things like bools and ints at least.
Also possible a point 0 where everything is concrete.

3 Likes

Unless you have compile time problems then yes.

1 Like

For mutable structs, having them abstract (inc ::Any) also has the advantage of letting you assign something of a different type to them than what they were constructed with.

Thanks to all for the pointers, very instructive. I’ll summarize here what I’ve gathered so far from this. Do correct me if I got anything wrong.

  1. Non-concrete fields in structs is a valid strategy to deal with a problem of poor compile performance specifically due to combinatoric explosion of type parameter values
  2. Therefore having a humungous number of arguments in a type is not a real problem. Rather, it is having them take an exponentially large number of combinations (or having the compiler think they may).
  3. In such case, leaving fields as abstract as possible (::Any) will probably make the compiler sweat less than narrowing them down (to e.g. `::AbstractArray’)
  4. One may also try fields as Unions of a finite set of concrete options
  5. Runtime performance will never be better using this strategy (duh!)
  6. Runtime overhead, however, can be minimized by placing function barriers (i.e. pass fields with a specific runtime type, instead of the whole struct, through a function call), as early as possible in a call stack so that instabilities do not propagate
  7. A question to keep in mind when doing this is: will the function barrier need to be recompiled for an exponentially large number of signature types? If not, problem solved
  8. For mutable structs abstract fields give an added advantage: one can actually mutate the struct even if one needs to change the runtime type of fields. This can be a huge advantage in some situations, like e.g. in DataFrames when exchanging columns or changing their name
1 Like

I wonder if you are aware of

https://docs.julialang.org/en/v1/manual/performance-tips/

Yes, of course. The point is that this strategy goes against some of those recommendations, but it is still valid in some scenarios (to my surprise!)

combinatorial explosition of types is mentioned in the context of

https://docs.julialang.org/en/v1/manual/performance-tips/#The-dangers-of-abusing-multiple-dispatch-(aka,-more-on-types-with-values-as-parameters)-1

but I agree that some additional notes about it would be useful.

1 Like

I am not convinced of this. Maybe, maybe not.
I suspect they are equal.

2 Likes