Hey thanks for the reply. I had indeed looked at TensorOperations, but I do need everything in 2D matrix form. I improved the code further (with some help) and managed to get the partial trace even faster:
julia> using Schrodinger, BenchmarkTools
julia> A = rand(36,36); out = (2,); sysdims = (2,9,2);
julia> @benchmark ptrace($A,$out,$sysdims)
BenchmarkTools.Trial:
memory estimate: 480 bytes
allocs estimate: 5
--------------
minimum time: 1.275 μs (0.00% GC)
median time: 1.431 μs (0.00% GC)
mean time: 1.520 μs (2.17% GC)
maximum time: 181.442 μs (97.48% GC)
--------------
samples: 10000
evals/sample: 10
The code is now available here (the package itself is not fully ready for prime time) for those interested.