Automatically fusing together several for-loops

If there are 100-200 functions, I don’t think using a tuple makes sense. I’d use a tuple only if there are a “handful” of elements (let’s say < 16, as that’s the heuristic Base uses; the compiler can handle more than that, though).
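For contrast, here is a minimal sketch (the `apply_all` helper and the sample functions are just illustrations) of why a small tuple stays type-stable while a `Vector{Function}` does not:

```julia
# A tuple keeps the concrete type of every element, so the compiler
# can unroll this map and inline each call -- good for a handful of
# functions, bad for 100-200 (compile time blows up).
apply_all(fs::Tuple, x) = map(f -> f(x), fs)

fs_tuple = (sin, cos, abs)   # concrete Tuple{typeof(sin),typeof(cos),typeof(abs)}
apply_all(fs_tuple, 1.0)     # fully inferred

# A Vector{Function} erases the element types, so every call below
# goes through run-time dispatch.
fs_vec = Function[sin, cos, abs]
[f(1.0) for f in fs_vec]
```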

Note that there is no way to express a “signature” (input types and output type) in Julia’s type system. That’s why you need the ccall hack to get decent performance.
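In Base alone, that hack looks roughly like this sketch: convert each (top-level, non-closure) function to a C function pointer with a fixed signature via `@cfunction`, so the container of pointers is concretely typed (the `call` helper and the sample functions are made up for illustration):

```julia
# Sketch of the "ccall hack" without any packages: @cfunction fixes
# the call signature, so the Vector of pointers has a concrete element
# type and calling through it avoids run-time dispatch -- at the cost
# of an indirect call, and of not supporting closures.
add_one(x::Int) = x + 1
double(x::Int) = 2x

fptrs = [@cfunction(add_one, Int, (Int,)), @cfunction(double, Int, (Int,))]

call(p::Ptr{Cvoid}, x::Int) = ccall(p, Int, (Int,), x)

sum(call(p, 3) for p in fptrs)  # add_one(3) + double(3) = 10
```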

This is why I asked if you have closures or some callable objects. That is to say, do you have 100 functions with completely different implementations? Or, are they actually some parameterized functions? If they are closures generated by the same function, their type is identical:

julia> create_adder(value) = (x) -> x + value;

julia> typeof(create_adder(1)) === typeof(create_adder(2))
true

So, you can put them in a vector without invoking run-time dispatch:

julia> callfirst(fs, x) = first(fs)(x);

julia> @code_warntype callfirst([create_adder(1), create_adder(2)], 0)
Variables
  #self#::Core.Const(callfirst, false)
  fs::Array{var"#1#2"{Int64},1}
  x::Int64

Body::Int64
1 ─ %1 = Base.getindex(fs, 1)::var"#1#2"{Int64}
│   %2 = (%1)(x)::Int64
└──      return %2

Even if not all function types are identical, I’d imagine there are only a handful of function types. If that’s the case, you can use Iterators.flatten to group closures/functions by their type, e.g., Iterators.flatten(([create_adder(1), create_adder(2)], [create_adder(1im)])).
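A sketch of that grouping idea (the `group_by_type` helper is hypothetical, not from any package):

```julia
# Hypothetical helper: split a mixed collection of callables into one
# vector per concrete type. Each inner vector ends up concretely
# typed, so iterating it needs no run-time dispatch.
function group_by_type(fs)
    groups = Dict{DataType,Vector{Any}}()
    for f in fs
        push!(get!(groups, typeof(f), Any[]), f)
    end
    # identity.(g) re-infers a concrete element type for each group
    [identity.(g) for g in values(groups)]
end

create_adder(value) = x -> x + value
mixed = Any[create_adder(1), create_adder(2), create_adder(1.5)]
groups = group_by_type(mixed)          # two groups: Int and Float64 adders
flat = Iterators.flatten(groups)
sum(f(0) for f in flat)                # 1 + 2 + 1.5
```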

Unfortunately, Julia’s native for loop is not powerful enough to completely optimize a complex iterator like Iterators.flatten. You’d need to use Base.foldl or some external packages like FLoops.jl (ref [RFC/ANN] FLoops.jl: fast generic for loops (foldl for humans™)) to eliminate dynamic dispatches.
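For example, a `Base.foldl` version might look like this sketch (the adder closures are just placeholders):

```julia
create_adder(value) = x -> x + value
int_adders   = [create_adder(1), create_adder(2)]  # one concrete closure type
float_adders = [create_adder(0.5)]                 # another concrete type

fs = Iterators.flatten((int_adders, float_adders))

# The intent: foldl can process each sub-collection with code
# specialized for its element type, instead of the per-element
# dynamic dispatch you'd get from a naive `for f in fs` loop.
total = foldl(fs; init = 0.0) do acc, f
    acc + f(1)
end
# total == (1+1) + (1+2) + (1+0.5) == 6.5
```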

Yes, I think it’d be a better way to parallelize the computation. FWIW, if your graph object already defines Base.iterate (and, preferably, Base.length), you can simply add SplittablesBase.halve to support parallel computations via Transducers.jl, aforementioned FLoops.jl, ThreadsX.jl, etc.
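A rough sketch of what that looks like, assuming SplittablesBase.jl is installed and with a made-up `MyEdges` wrapper standing in for your graph’s edge iteration:

```julia
using SplittablesBase

# Hypothetical wrapper over a graph's edge list, for illustration only.
struct MyEdges
    edges::Vector{Tuple{Int,Int}}
end

# The usual iteration protocol (iterate + length) ...
Base.iterate(e::MyEdges, i = 1) =
    i > length(e.edges) ? nothing : (e.edges[i], i + 1)
Base.length(e::MyEdges) = length(e.edges)

# ... plus halve, which splits the collection into two pieces so that
# Transducers.jl/FLoops.jl/ThreadsX.jl can recursively divide the work
# across threads.
function SplittablesBase.halve(e::MyEdges)
    mid = length(e.edges) ÷ 2
    (MyEdges(e.edges[1:mid]), MyEdges(e.edges[mid+1:end]))
end
```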

Regarding ThreadPools.jl… I hope I don’t sound like I’m trivializing @tro3’s hard work, but I think it’s important to understand that the primary motivation for ThreadPools.jl is to “undo” the design of composable multi-threading in Julia (see Announcing composable multi-threaded parallelism in Julia). IIUC, ThreadPools.jl exists to separate latency-critical code (executed on the primary thread) from throughput-oriented code (executed on non-primary threads). This is very useful if you are writing, e.g., a GUI application, but it is not desirable in a library or in throughput-oriented user code. ThreadPools.jl is a very clever and useful workaround for the current state of multi-threading in Julia. However, I think it’s a good idea to avoid it if you mainly care about the “overall speed” (i.e., throughput) of your computation and composability with the rest of the ecosystem.