There is definitely something wrong here. Anything more than a factor of 2 difference is suspicious.
Can you make an absolute minimal example in both Matlab and Julia to show both code and benchmarking? Right now there is a bit too much code and things going on in your example, so I tried to condense it.
Here is my MWE, using Julia 1.6 (OpenBLAS) and Matlab R2021a (MKL). Does it capture your situation?
Julia:
using BenchmarkTools
N, M = 1000, 256
A = rand(ComplexF64, N, M^2)
B = rand(ComplexF64, M, M)
testmul(X, Y) = X * vec(Y)
Matlab:
N = 1000;
M = 256;
A = rand(N, M^2) + 1i * rand(N, M^2);
B = rand(M,M) + 1i * rand(M,M);
Benchmarks:
jl> @btime testmul($A, $B);
39.787 ms (3 allocations: 15.83 KiB)
>> timeit(@()A * B(:))
ans =
0.0410
These speeds are basically identical.
(BTW: Instead of this: typeof(Test_mat[1,1]), it is more idiomatic to use eltype(Test_mat))