I’ll follow your suggestion and take a step back.
I’m currently involved in a field called probabilistic graphical models. Within this field, I’m focusing on problems that involve discrete random variables (variables that can take a finite number of values). A probabilistic graphical model is defined by a set of factors that represent the relationship of the different random variables with each other. Such models can be used to answer interesting questions such as what is the probability that something occurs given that a subset of the variables is observed.
A factor consists of a table of numbers for each instantiation of the random variables involved in that factor, e.g. this is a factor that is a function of 2 binary random variables A and B:

In other words, each factor is a function of a given set of variables in the model, and each of these variables has a given cardinality (number of states). The table size is the multiplication of the cardinality of each of the random variables in the factor.
At this moment, I’m trying to implement an algorithm called Variable Elimination which mainly consists of performing products and marginalizations between the factors. Here is an example of a factor product between 2 factors:
Such factor product generalizes to N factors, where the resulting scope is the union of the scope of the factor operands.
and here is an example of a factor marginalization:
which corresponds to adding the values that correspond to all possible instantiations of the variables that are being marginalized (summed out), which is one (B) in this case.
The function that I shared above and which I’m trying to optimize corresponds to the factor product of N factors. It reshapes the operand factor tables to a common size by adding 1-sized dimensions for those variables that are not present in a given operand but that are present in the resulting scope (the union of variables of all operand factors). This allows me to perform a broadcast multiplication of the reshaped tables to obtain the result.
The problems that I’m focusing on have variables with low cardinalities (2 or 3) but the factors can grow quite large, e.g. a factor with 25 binary variables (33554432 states). The factor products and marginalizations described above are computed many times, hence the importance to optimize these two operations.
My overall goal is to see if I can compete (performance-wise) with the state of the art libraries for this type of problems which are mostly available in C++. My thought is that a Julia implementation could open very interesting possibilities for optimization by making use of the metaprogramming capabilities.

