All,

I have a critical portion of code that essentially involves the multiplication of many pre-computed matrices by many pre-computed vectors. Each matrix multiplies the same, entire list of vectors. For now every matrix is the same size (N1xN2), but eventually I’d like to remove this requirement so that each matrix-vector combination can have arbitrary size. In Matlab I might think of this as a cell array of matrices and a cell array of vectors. *But for now I’m just trying to understand performant Julia on this “array of matrices” case.*

**Pseudocode:**

```
M(N1,N2,nMatrix) x V(N2,nVector) ==> R(N1,nVector,nMatrix)
```

**MWE:**

```
## Julia 1.0.2
N1 = 7; # Always
N2 = 8; # Always
nMatrix = 1000; # Usually varies between 1e4 - 1e7, sometimes as high as 1e8
nVector = 1000; # Usually varies between 1e4 and 2e5
M = rand(N1, N2, nMatrix);
V = rand(N2, nVector);

function Naive(M, V)
    N1 = size(M, 1);
    nMatrix = size(M, 3);
    nVector = size(V, 2);
    R = zeros(N1, nVector, nMatrix);
    for i = 1:nMatrix     # Iterate through the matrices
        for j = 1:nVector # Iterate through the vectors
            R[:, j, i] = M[:, :, i] * V[:, j];
        end
    end
    return R
end

function myBest(M, V)
    N1 = size(M, 1);
    nMatrix = size(M, 3);
    nVector = size(V, 2);
    R = zeros(N1, nVector, nMatrix);
    for i = 1:nMatrix # Iterate through the matrices
        R[:, :, i] = M[:, :, i] * V; # One matrix-matrix product per matrix; equivalent to the naive approach
    end
    return R
end
```
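Most of the remaining allocation in `myBest` comes from the `M[:,:,i]` slices (each one copies) and from each `*` allocating a fresh result. A sketch of one further step I would try (the function name `myBest_inplace` is mine): combine `@views` with `LinearAlgebra.mul!` so each product is written directly into its slab of `R`.

```
using LinearAlgebra

function myBest_inplace(M, V)
    N1 = size(M, 1)
    nMatrix = size(M, 3)
    nVector = size(V, 2)
    R = zeros(N1, nVector, nMatrix)
    for i = 1:nMatrix
        # mul!(C, A, B) computes A*B into the preallocated C; with @views
        # the slices are non-copying, so R is the only large allocation.
        @views mul!(R[:, :, i], M[:, :, i], V)
    end
    return R
end
```

This still calls BLAS once per matrix, so the math is identical to `myBest`; only the temporaries go away.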

**Benchmarking these:**

```
import BenchmarkTools
BenchmarkTools.@btime Naive(M,V);
#389.538 ms (3000002 allocations: 846.86 MiB)
BenchmarkTools.@btime myBest(M,V);
#64.830 ms (3002 allocations: 107.41 MiB)
```

Which isn’t too shabby: while not quite as fast as *single-threaded* Matlab 2018b, it’s on the same order of magnitude, so I chalk this up to different math libraries:

**Matlab 2018b Comparison**

```
LASTN = maxNumCompThreads(1);
CURRN = maxNumCompThreads();
disp(['Number of Computational Threads = ' num2str(CURRN)])
% Number of Computational Threads = 1

function R = myBest_Matlab(M, V)
    N1 = size(M, 1);
    nMatrix = size(M, 3);
    nVector = size(V, 2);
    R = zeros(N1, nVector, nMatrix);
    for i = 1:nMatrix
        R(:, :, i) = M(:, :, i) * V;
    end
end

f = @() myBest_Matlab(M, V);
t = timeit(f, 1);
disp(['Time Elapsed = ' num2str(t * 1e3) ' ms'])
% Time Elapsed = 49.5614 ms
```

Are there any thoughts on this? As a long-time Matlab user, that’s the Julia that seems most natural to me… but are there other things I’ve missed… preferably without completely obfuscating my code?
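One idea I have seen suggested for small fixed sizes like 7x8: StaticArrays.jl. Converting the slices to `SMatrix`/`SVector` up front lets the compiler fully unroll each product instead of paying BLAS call overhead. A sketch, not a benchmarked claim — the name `static_mult`, the convert-up-front layout, and the hard-coded 7/8 sizes are mine, and it assumes StaticArrays is installed:

```
using StaticArrays

# Convert each 7x8 slice to an SMatrix and each column of V to an SVector
# once, then do fully unrolled static mat-vec products.
function static_mult(M::Array{Float64,3}, V::Matrix{Float64})
    nMatrix = size(M, 3)
    nVector = size(V, 2)
    Ms = [SMatrix{7,8,Float64}(view(M, :, :, i)) for i = 1:nMatrix]
    Vs = [SVector{8,Float64}(view(V, :, j)) for j = 1:nVector]
    R = Array{Float64}(undef, 7, nVector, nMatrix)
    for i = 1:nMatrix, j = 1:nVector
        R[:, j, i] = Ms[i] * Vs[j]  # 7x8 * 8 product, unrolled at compile time
    end
    return R
end
```

Whether the conversion cost pays off presumably depends on how many times each matrix set is reused, so this would need benchmarking against `myBest` on realistic sizes.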

**Follow-up question: Parallel-computing**

Assuming that I have mostly maxed out performance of Julia, I’ve looked into parallel computations. I have a 12 physical-core workstation (and access to a 72-core node) and it sure would seem like I could parallelize across the matrices. But when I *naively* try to use multi-threading:

```
## export JULIA_NUM_THREADS=6
import Base.Threads

function myNaiveThreaded(M, V)
    N1 = size(M, 1);
    nMatrix = size(M, 3);
    nVector = size(V, 2);
    R = zeros(N1, nVector, nMatrix);
    Threads.@threads for i = 1:nMatrix # Parallelize across the matrices
        R[:, :, i] = M[:, :, i] * V;
    end
    return R
end

BenchmarkTools.@btime myNaiveThreaded(M,V);
# 34.264 ms (2976 allocations: 106.60 MiB)
```

And tracking the `julia.exe` process in Task Manager shows >14 GB of RAM utilization (128 GB available), plus `@btime` takes about a minute to run. So not only is the code not “6 threads = ~6x faster”, it’s eating up a lot of RAM too. As mentioned in the comments on the first Julia code block, the total number of matrices is usually very large (1e4 - 1e8), so I think I’d rather use shared memory, to avoid paying nProc * sizeof(M) for multiple processes that each hold their own copy of the data. Thoughts here?
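For what it’s worth, my understanding is that `Threads.@threads` already shares memory: all threads live in one process and read the same `M` and `V`, so the footprint shouldn’t scale with thread count. My guess (I haven’t profiled this exact case) is that the RAM growth comes from the per-iteration temporaries — the `M[:,:,i]` slice copies and the `M[:,:,i] * V` results — piling up faster than the GC reclaims them. A sketch combining `@threads` with the in-place `mul!` pattern, which should remove those temporaries entirely (the function name is mine):

```
using LinearAlgebra
import Base.Threads

# Each iteration writes a disjoint R[:,:,i] slab, so no locking is needed;
# mul! into views means the loop body allocates essentially nothing.
function myThreadedInplace(M, V)
    N1 = size(M, 1)
    nMatrix = size(M, 3)
    nVector = size(V, 2)
    R = zeros(N1, nVector, nMatrix)
    Threads.@threads for i = 1:nMatrix
        @views mul!(R[:, :, i], M[:, :, i], V)
    end
    return R
end
```

One caveat I’d check: OpenBLAS has its own thread pool, so it may be worth setting `BLAS.set_num_threads(1)` when parallelizing at this outer level, to keep the two levels of threading from fighting each other.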