Speaking as a SymPy maintainer I feel like I can offer a lot that is useful to this conversation, but I have only skimmed this (extremely long) thread and I'm not sure I've followed everything. I don't know whether it's useful to add my thoughts to a thread that is already so lengthy (maybe a new thread would be better). I have some more constructive comments to make, but this doesn't seem like the right place for them because any comment here will just get buried.

In any case, let me explain that SymPy can be made much faster while still being implemented in Python. It is slow in some cases because of basic design decisions, and the same problems could afflict a CAS implemented in Julia. SymPy is slow in other cases because the intention was always that those parts would at some point be implemented in another language in supporting libraries.

Let me give an example to be clear about what we mean by "slow". The example here is intended to be illustrative, so don't read too much into the details.

Suppose we have a matrix of integers and we want to do matrix multiplication. In SymPy we can do:

```
In [1]: from sympy import Matrix
In [2]: M = Matrix(range(1, 100+1)).reshape(10, 10)
In [3]: %time ok = M**2
CPU times: user 75.1 ms, sys: 8.59 ms, total: 83.6 ms
Wall time: 103 ms
```

There is now a more efficient internal matrix class in sympy that is faster for this example (note that I'm using sympy master):

```
In [3]: from sympy.polys.domainmatrix import DomainMatrix
In [4]: dM = DomainMatrix.from_Matrix(M)
In [5]: %time ok = dM**2
CPU times: user 391 µs, sys: 0 ns, total: 391 µs
Wall time: 398 µs
```

Some work is involved in making that faster class be used in a way that is transparent to users, but it has the potential to significantly improve speed. The flint library (`python-flint`) can be used to do these same computations even faster, e.g.:

```
In [7]: from flint import fmpz_mat
In [8]: fM = fmpz_mat([[int(e) for e in row] for row in M.tolist()])
In [9]: %time fM**2
CPU times: user 36 µs, sys: 33 µs, total: 69 µs
Wall time: 26 µs
```

I'm currently working on making that happen implicitly in sympy for performance. The way this works is that we delegate the computationally intensive tasks to a lower-level library, and the Python code then orchestrates that lower-level library.
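As a minimal sketch of that delegation pattern (the names here are illustrative, not SymPy's internal API): try the fast backend first, and fall back to pure Python when it isn't available:

```
# Hypothetical sketch of the delegation pattern described above.
# python-flint is an optional fast backend; pure Python is the fallback.
try:
    from flint import fmpz_mat
    HAVE_FLINT = True
except ImportError:
    HAVE_FLINT = False

def matmul(A, B):
    """Multiply two integer matrices given as nested lists of ints."""
    if HAVE_FLINT:
        # Fast path: hand the whole computation to flint and convert back.
        C = fmpz_mat(A) * fmpz_mat(B)
        return [[int(C[i, j]) for j in range(C.ncols())]
                for i in range(C.nrows())]
    # Slow pure-Python fallback: same result, computed entry by entry.
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Either path returns the same nested-list result, so callers never need to know which backend ran.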

Repeating the same computation in Julia leads to results that are slower than even the slowest sympy case:

```
N = 10
Ar = reshape(1:N^2, N, N) * BigInt(1)
@time Ar^2
```

Running that gives:

```
$ julia demo.jl
0.559016 seconds (1.04 M allocations: 53.008 MiB, 1.48% gc time)
```

I'm only using this example for illustrative purposes. I'm sure that some of you can show me how to make it faster or explain that I'm not timing it properly. Any feedback on that is welcome because I want to learn more, but please also understand that many of the details are not that important to the point I am making. My intention is for SymPy to reach the speed of flint, e.g.:

```
In [1]: N = 1000
In [2]: from flint import fmpz_mat
In [3]: m = fmpz_mat([[i*N + j for j in range(N)] for i in range(N)])
In [4]: %time ok = m**2
CPU times: user 2.27 s, sys: 54.4 ms, total: 2.32 s
Wall time: 2.37 s
```

How can sympy reach that speed? We can just use flint!
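To illustrate what "just use flint" could look like, here is a hedged sketch (not SymPy's actual code; all names are made up) of a small integer matrix class whose `**` operator delegates to `python-flint` when it is installed, with a pure-Python fallback otherwise:

```
# Illustrative sketch only: wrap an integer matrix so that ** transparently
# delegates to python-flint when available.
try:
    from flint import fmpz_mat
except ImportError:
    fmpz_mat = None

class ZZMatrix:
    """Toy integer matrix; assumes square matrices and exponents n >= 1."""

    def __init__(self, rows):
        self.rows = [[int(e) for e in row] for row in rows]

    def __pow__(self, n):
        if fmpz_mat is not None:
            # Fast path: do the whole power computation in flint.
            M = fmpz_mat(self.rows) ** n
            return ZZMatrix([[int(M[i, j]) for j in range(M.ncols())]
                             for i in range(M.nrows())])
        # Pure-Python fallback: repeated matrix multiplication.
        result = self
        for _ in range(n - 1):
            result = result._mul(self)
        return result

    def _mul(self, other):
        A, B = self.rows, other.rows
        return ZZMatrix([[sum(A[i][t] * B[t][j] for t in range(len(B)))
                          for j in range(len(B[0]))] for i in range(len(A))])

M = ZZMatrix([[1, 2], [3, 4]])
print((M ** 2).rows)  # [[7, 10], [15, 22]]
```

Users only ever see `M**2`; whether flint did the work underneath is an implementation detail.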

I tried this in Julia:

```
N = 1000
Ar = reshape(1:N^2, N, N) * BigInt(1)
@time Ar^2
```

and I got:

```
$ julia demo.jl
801.397699 seconds (5.01 G allocations: 89.637 GiB, 19.98% gc time)
```

The equivalent for sympy’s internal (pure-)Python class is:

```
In [9]: from sympy import ZZ
In [10]: from sympy.polys.domainmatrix import DomainMatrix
In [11]: dM = DomainMatrix([[ZZ(i*N + j) for j in range(N)] for i in range(N)], (N, N), ZZ)
In [12]: %time ok = dM**2
CPU times: user 3min 34s, sys: 1.92 s, total: 3min 36s
Wall time: 3min 43s
```

I'm sure that a lot of work could be done in Julia to improve that performance. The ultimate question from a speed perspective is: can Julia be faster than flint? If not, then we end up in a situation where both sympy and Symbolics.jl use the same underlying library. It is possible (although a huge amount of work) to build a much better CAS than sympy, but aiming for feature parity with sympy with better speed seems like a non-goal to me: it would be much easier to make sympy faster.