I have a critical portion of code that essentially involves the multiplication of many pre-computed matrices by many pre-computed vectors. Each matrix multiplies the same, entire list of vectors. Each matrix is the same size (N1xN2) in my current form, but eventually I’d like to remove this requirement, so that each matrix-vector combination can have arbitrary size. In Matlab I might think of this as a cell-array of matrices and a cell array of vectors. But for now I’m just trying to understand performant Julia on this “array of matrices” case.
M(N1,N2,nMatrix) x D(N2,nVector) ==> R(N1,nVector,nMatrix)
## Julia 1.0.2 N1 = 7; # Always N2 = 8; # Always nMatrix = 1000; # Usually varies between 1e4 - 1e7, sometimes as high as 1e8 nVector = 1000; # Usually varies between 1e4 and 2e5 M = rand(N1,N2,nMatrix); V = rand(N2,nVector); function Naive(M,V) N1 = size(M,1); nMatrix = size(M,3); nVector = size(V,2); R = zeros(N1,nVector ,nMatrix); for i = 1:nMatrix # Iterate through the matrices for j = 1:nVector # Iterate through the vectors R[:,j,i] = M[:,:,i] * V[:,j]; end end return R end function myBest(M,V) N1 = size(M,1); nMatrix = size(M,3); nVector = size(V,2); R = zeros(N1,nVector ,nMatrix); for i = 1:nMatrix # Iterate through the matrices R[:,:,i] = M[:,:,i] * V; # Matrix-Matrix multiplication is equivalent to naive approach end return R end
import BenchmarkTools BenchmarkTools.@btime Naive(M,V); #389.538 ms (3000002 allocations: 846.86 MiB) BenchmarkTools.@btime myBest(M,V); #64.830 ms (3002 allocations: 107.41 MiB)
Which isn’t too shabby; While not quite as fast as single-threaded Matlab 2018b, we’re on the same order of magnitude - so I chalk this up to different math libraries:
Matlab 2018b Comparison
LASTN = maxNumCompThreads(1); CURRN = maxNumCompThreads(); disp(['Number of Computational Threads = ' num2str(CURRN)]) # Number of Computational Threads = 1 function R = myBest_Matlab(M,V) N1 = size(M,1); nMatrix = size(M,3); nVector = size(V,2); R = zeros(N1,nVector,nMatrix); for i = 1:nMatrix R(:,:,i) = M(:,:,i) * V; end end f = @() myBest_Matlab(M,V); t = timeit(f,1); disp(['Time Elapsed = ' num2str(t * 1e3) ' ms']) # Time Elapsed = 49.5614 ms
Are there any thoughts on this? As a long-time Matlab user, that’s the Julia that seems most natural to me… but are there other things I’ve missed… preferably without completely obfuscating my code?
Follow-up question: Parallel-computing
Assuming that I have mostly maxed out performance of Julia, I’ve looked into parallel computations. I have a 12 physical-core workstation (and access to a 72-core node) and it sure would seem like I could parallelize across the matrices. But when I naively try to use multi-threading:
## export JULIA_NUM_THREADS=6 import Base.Threads function myNaiveThreaded(M,V) N1 = size(M,1); nMatrix = size(M,3); nVector = size(V,2); R = zeros(N1,nVector ,nMatrix); Threads.@threads for i = 1:nMatrix # Iterate through the matrices R[:,:,i] = M[:,:,i] * V; # Matrix-Matrix multiplication is equivalent to naive approach end return R end BenchmarkTools.@btime myNaiveThreaded(M,V); #34.264 ms (2976 allocations: 106.60 MiB)
And tracking the
julia.exe process in task manager shows >14GB RAM utilization (128GB available), plus @btime takes about a minute to run. So not only is the code not “6-threads = ~6x faster” but it’s eating up alot of RAM too. As mentioned in my comments on the first Julia code, the total number of matrices is usually very large (1e4 - 1e8) so I think I’d rather use shared memory so that I don’t have nProc * sizeof(M) from multiple processes with their own copies of the data. Thoughts here?