How to optimize processing over an array of heterogeneous information?

Consider a simple example where I perform some operation on an array A based on coordinate information (abstract type CoOrdinateInfo).

I have two possible cases.

  1. Regular view of an array, with all elements that have to be processed.
struct CoOrdinateInfoUnMasked <: CoOrdinateInfo
      Begin::NTuple{2,Int}
      Size::NTuple{2,Int}
end
  2. Masked view of an array, with only the valid elements to be processed.
struct CoOrdinateInfoMasked <: CoOrdinateInfo
      Begin::NTuple{2,Int}
      Mask::Array{Bool,2}
end

Now I create an array CoOrdinateInfos::Array{CoOrdinateInfo,2}, which is either updated by the user by modifying elements in place, or used to do something on A via an element CoOrdinateInfos[dim1,dim2]. In both cases we can extract the size (in the masked case, size(Info.Mask) gives it).
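For concreteness, here is a minimal sketch of how that uniform size extraction could look, using a hypothetical infosize helper with one method per variant (the helper name is my invention, not part of the original code):

```julia
abstract type CoOrdinateInfo end

struct CoOrdinateInfoUnMasked <: CoOrdinateInfo
    Begin::NTuple{2,Int}
    Size::NTuple{2,Int}
end

struct CoOrdinateInfoMasked <: CoOrdinateInfo
    Begin::NTuple{2,Int}
    Mask::Array{Bool,2}
end

# Hypothetical helper: one method per variant, so the caller never
# needs to know which variant it holds.
infosize(info::CoOrdinateInfoUnMasked) = info.Size
infosize(info::CoOrdinateInfoMasked) = size(info.Mask)

CoOrdinateInfos = CoOrdinateInfo[
    CoOrdinateInfoUnMasked((1, 1), (4, 4)),
    CoOrdinateInfoMasked((5, 1), fill(true, 2, 3)),
]
map(infosize, CoOrdinateInfos)  # [(4, 4), (2, 3)]
```

Because the element type is the abstract CoOrdinateInfo, each infosize call here dispatches dynamically, which is the source of the allocations discussed below.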

This of course creates allocations, since none of the code knows the concrete type of each element.

Note: the performance even with allocations is very good for my use case, but I would like to remove allocations to guarantee better timing (it's used in a reactive program) and to be able to potentially compile to smaller binaries.

I am trying to see what’s the best way to optimize my code further.

One simple way is to just collapse both into one struct

struct CoOrdinateInfo
     Begin::NTuple{2,Int}
     Size::NTuple{2,Int}
     Mask::Array{Bool,2}
end

and set Mask to a zero-size array when there is no masking, with a simple check in the calling function (most of these choices are runtime anyway, since the data is very adaptive). But this can be hard to scale if I have more parameters or more variants to account for. Additionally, I have found most of my Julia code to be self-documenting, and this goes against that principle a little bit.
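A sketch of that collapsed layout, with hypothetical convenience constructors and an is_masked check implementing the zero-size-Mask convention (all three helper names are my own, for illustration):

```julia
struct CoOrdinateInfo
    Begin::NTuple{2,Int}
    Size::NTuple{2,Int}
    Mask::Array{Bool,2}
end

# Hypothetical convenience constructors: an empty Mask encodes
# the unmasked case.
unmasked(b, sz) = CoOrdinateInfo(b, sz, Matrix{Bool}(undef, 0, 0))
masked(b, mask) = CoOrdinateInfo(b, size(mask), mask)

# The runtime check the calling function would perform.
is_masked(info::CoOrdinateInfo) = !isempty(info.Mask)

is_masked(unmasked((1, 1), (4, 4)))          # false
is_masked(masked((1, 1), fill(true, 2, 2)))  # true
```

The struct is now concrete, so an Array{CoOrdinateInfo,2} stores it without per-element type tags, at the cost of carrying a redundant field in each variant.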

I am sure that some kind of packing of heterogeneous information into a homogeneous container is required (unless there are optimizations that can look at the worst-case size of all possible combinations and allocate that along with a tag like a symbol, somewhat like a Lisp machine), but I'm curious about elegant ways to do it while keeping the general code very readable. Something like a macro that combines the fields for a given set of information types (essentially a macro that emulates a Lisp machine on my current machine). Is what I'm looking for called sum types/tagged unions? (And are Enums and traits related in any way?)

If there's a package that can handle these scenarios with additional configurability, could I have an example of how to replicate the above case, and, if possible, of additional things I could do?

There are a few packages implementing sum types, like SumTypes.jl and Moshi.jl. You may also find this recent PR interesting as a demonstration of a relevant technique: it implements a manual union-split for a similar use case (fast processing of an array of heterogeneous types).


Looks like for my use case there's a simpler way that does check the worst-case size of all possible combinations and allocates along with a tag (suggested by the Claude Code AI). It's called a Union type.

The code now becomes

struct CoOrdinateInfoUnMasked
      Begin::NTuple{2,Int}
      Size::NTuple{2,Int}
end
struct CoOrdinateInfoMasked
      Begin::NTuple{2,Int}
      Mask::Array{Bool,2}
end
const CoOrdinateInfo = Union{CoOrdinateInfoUnMasked,CoOrdinateInfoMasked}

The array instantiation and the rest of my program remain the same.
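As a sketch of why this can help: with a small concrete Union element type, the compiler can union-split a loop over the array into concretely typed branches. The infosize helper and the loop body here are hypothetical, just to make the example runnable:

```julia
struct CoOrdinateInfoUnMasked
    Begin::NTuple{2,Int}
    Size::NTuple{2,Int}
end
struct CoOrdinateInfoMasked
    Begin::NTuple{2,Int}
    Mask::Array{Bool,2}
end
const CoOrdinateInfo = Union{CoOrdinateInfoUnMasked,CoOrdinateInfoMasked}

infosize(i::CoOrdinateInfoUnMasked) = i.Size
infosize(i::CoOrdinateInfoMasked) = size(i.Mask)

infos = CoOrdinateInfo[
    CoOrdinateInfoUnMasked((1, 1), (4, 4)),
    CoOrdinateInfoMasked((5, 1), fill(true, 2, 3)),
]

# Union-splitting: the compiler turns the dynamic dispatch inside
# this loop into an isa branch with a concrete type on each side.
function total_area(infos)
    total = 0
    for info in infos
        total += prod(infosize(info))
    end
    return total
end

total_area(infos)  # 16 + 6 = 22
```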

Are you aware of StructArrays.jl? I am not sure what your use case looks like, but it sounds like you are dealing with arrays of structs, which StructArrays should help with.

That won't scale well in general. Variables inferred with small Union types are turned into small branches (union-splitting) where each variable has a concrete type, which can enable more optimizations. That doesn't go far by default (3 types), and increasing that limit bloats compilation to unworkable levels. That's exactly why manually specified sum types exist. We usually stick to union-splitting over small Unions that don't propagate into more types, like those involving nothing or missing.
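For contrast, a manual union-split is just an explicit isa branch, which keeps each arm concretely typed regardless of any automatic splitting limit. A minimal sketch (the fill_region! function and its fill values are invented for illustration):

```julia
struct CoOrdinateInfoUnMasked
    Begin::NTuple{2,Int}
    Size::NTuple{2,Int}
end
struct CoOrdinateInfoMasked
    Begin::NTuple{2,Int}
    Mask::Array{Bool,2}
end
const CoOrdinateInfo = Union{CoOrdinateInfoUnMasked,CoOrdinateInfoMasked}

# Manual union-split: branch once on the concrete type, so each
# arm is compiled with a concretely typed `info`.
function fill_region!(A::Matrix{Float64}, info::CoOrdinateInfo)
    i0, j0 = info.Begin
    if info isa CoOrdinateInfoUnMasked
        h, w = info.Size
        A[i0:i0+h-1, j0:j0+w-1] .= 1.0
    else  # info isa CoOrdinateInfoMasked
        h, w = size(info.Mask)
        for j in 1:w, i in 1:h
            info.Mask[i, j] && (A[i0+i-1, j0+j-1] = 1.0)
        end
    end
    return A
end

A = zeros(4, 4)
fill_region!(A, CoOrdinateInfoUnMasked((1, 1), (2, 2)))
fill_region!(A, CoOrdinateInfoMasked((3, 3), [true false; false true]))
```

Sum-type packages mechanize this pattern (plus the storage layout) so it scales past the point where writing the branches by hand stays readable.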

It's worth noting that you can run into situations with larger Unions where the compiler still infers things well, but only in brittle and unscalable circumstances, like the applicable methods being few in number or all having the same concrete return type.


I would say relying on Union-splitting here is entirely reasonable. The current proposal only includes 2 types, for which this definitely works and is probably pretty close to optimal. Meanwhile, it’s much more idiomatic. Sum types are fine too, but not necessary in this case (in my opinion).

I can’t recall the history any more and didn’t spend enough time reading #37378, but it looks like there is no longer a MAX_UNION_SPLITTING limit for simple cases (e.g., at least in cases where single-dispatch is sufficient). Here’s an example showing perfectly respectable performance for a 20-element Union:

types = DataType[]
for i in 1:20
    type = Symbol(:A, i)
    quote
        struct $type end
        getvalue(::$type) = ($i)^2
    end |> eval
    push!(types, eval(type)) # add to our list of types
end

UnionAs = Union{types...} # a Union of all the types
UnionVec = UnionAs[t() for t in types] # a vector of a large Union
A1Vec = fill(A1(), length(UnionVec)) # a homogeneous vector

using BenchmarkTools
@btime sum(getvalue, $A1Vec) # the compiler optimizes this to getvalue(A1()) * length(A1Vec)
#   2.539 ns (0 allocations: 0 bytes)
# 20
@btime sum(getvalue, $UnionVec)
#   8.759 ns (0 allocations: 0 bytes)
# 2870

Granted that branch prediction might be making this benchmark slightly better than “typical” (given the short vector), but a sum type would have the same issue.

Even with a Union of 100 types I benchmark 58ns, which is only a bit slower per element than 20 was.


Thanks for the suggestion. For my case, I'm unsure whether splitting the fields into separate arrays (and using StructArrays to make them viewable as a single array) helps, since my internal code branches quite a bit anyway. But there's another part of the code (not shared above) that stores one more field; maybe it helps there.

That explains a few weird things I’ve run into over the last few years, thanks. I really missed that memo. Just a couple comments to add onto that:

  • It really does turn into a massive branch; check foo(A) = getvalue(first(A)); @code_llvm foo(UnionVec). In this specific example, most of it is actually stored as an external table of values for easy lookup (not sure how it really works), but the code branches do show up if you go beyond i in 1:126, use a mutable struct to defeat the isbits-Union storage optimization (Base.allocatedinline(UnionAs) == true && sizeof(UnionVec) == 0), or branch to code that can't be compiled away.
  • If “linear signature” really means one argument with a Union, then WrappedUnions.jl aims to detect and split concretely wrapped unions at multiple arguments of a call site.

Large, branchy code size is a drawback of both; neither generally works like a pointer to an external table of pointers to virtual functions. It's infeasible to build such a table for ad hoc Unions, so I expect any future approach to wrap unions as well, possibly multiple ones for a particular call site.