Thanks all. Okay well I learned something from this exchange and now I can show comparable timings:
```
In [1]: N = 1000

In [2]: from flint import fmpz_mat

In [3]: m = fmpz_mat([[i*N + j for j in range(N)] for i in range(N)])

In [4]: %timeit ok = m**2
1.41 s ± 52.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
That’s compared to
```
julia> using Nemo, BenchmarkTools
Welcome to Nemo version 0.20.1
Nemo comes with absolutely no warranty whatsoever

julia> N = 1000; A = matrix(ZZ, reshape(1:N^2, N, N)*BigInt(1));

julia> @btime B = A^2;
  1.393 s (1268 allocations: 94.52 MiB)
```
What that shows is that similar performance is possible in both cases. (Installing and compiling Nemo took a long time, though, and I don’t entirely trust either timing, since running the same operation only once gives a longer time in both cases.)
My real point, though, was that the same speed improvements are achievable from within Python. I didn’t want to argue about speed but rather to point out in some detail how sympy can be slow and that it is possible to make it faster.
There are more relevant discussions to be had about speed in a CAS. Personally I think that automatic evaluation is the major design mistake in sympy (and many other CAS): it is what currently slows down ordinary operations in sympy. Automatic evaluation is also extremely hard to change once backwards compatibility comes into play, because any change to what an expression automatically evaluates to is a breaking change.
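To make the point concrete, here is a minimal sketch of what automatic evaluation means in sympy, using the public `Add` constructor and its `evaluate=False` flag (both are real sympy API; the cost framing is my own reading):

```python
# Sketch: sympy's automatic evaluation in action.
# Constructing Add(x, x) immediately rewrites the sum as 2*x. That
# rewriting work happens on every construction, which is part of why
# ordinary operations can be slow. Passing evaluate=False skips it
# and keeps the raw, unevaluated expression tree.
from sympy import Symbol, Add

x = Symbol('x')

evaluated = Add(x, x)                    # automatically simplified
print(evaluated)                         # -> 2*x

unevaluated = Add(x, x, evaluate=False)  # kept as a raw sum
print(unevaluated)                       # -> x + x
```

Changing when that simplification happens would alter what `Add(x, x)` returns, which is exactly why backwards compatibility makes it so hard to revisit.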