What would the equivalent call out from Python to Julia with JuliaCall look like, and how would it perform in terms of both time and numerical accuracy (see the noted issue in the post)?
Is it also possible to now use PyCall from Julia to effectively get to Mojo (assuming there were some advantage to doing so beyond “because we can”)?
The Julia code could look very similar, or be as simple as prod(1:Int(n)). The overflow problem (prod(1:100) == 0 in Julia) has nothing to do with the languages; it's the Mojo implementation explicitly converting the arbitrary precision int input from Python to Mojo's system word-sized Int, which, like Julia's Int alias, is 32 bits on 32-bit systems and 64 bits on 64-bit systems. Simply multiplying a sequence of such integers would probably compile to the same native code, hence the same performance and integer overflow. I don't know how Mojo would opt into arbitrary precision integer arithmetic, but in Julia it's just a matter of working with BigInt.
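For concreteness, here are the overflow and the BigInt opt-in side by side in the Julia REPL (100! has 97 factors of 2, far more than 64, so the wrapped Int result is exactly 0):
julia> prod(1:100) # Int arithmetic silently wraps on overflow
0
julia> prod(1:big(100)) # a BigInt range makes every step arbitrary precision
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000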
Assuming you're asking about PythonCall rather than PyCall, I'd expect it to just work, because Python and Mojo are handling their integration. I haven't gotten into Mojo yet, though, so I wouldn't really know if there are any issues. The blogpost seems to provide an MWE, so anyone with Mojo can give that and PythonCall a shot.
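As a hedged sketch of that (I haven't run this): assuming the blogpost's compiled mojo_module is importable by the Python interpreter PythonCall is configured to use, calling it from Julia should look like importing any other Python module:
julia> using PythonCall
julia> mojo = pyimport("mojo_module");  # assumes the Mojo-built extension is on Python's import path
julia> pyconvert(Int, mojo.factorial(10))
3628800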
The timings in that blogpost aren't reliable. At first glance, it makes no sense that mojo_module.factorial(10) would take longer than mojo_module.factorial(100), and the value recorded for math.factorial(100) is actually that of math.factorial(125):
>>> from math import factorial
>>> factorial(125)
188267717688892609974376770249160085759540364871492425887598231508353156331613598866882932889495923133646405445930057740630161919341380597818883457558547055524326375565007131770880000000000000000000000000000000
Including the print and calling the factorial function only once in the benchmark are the likely reasons for the nonsensical timings. We can account for that with the recommended benchmarking tools. In Julia v1.11.5 and with BenchmarkTools.jl:
julia> using BenchmarkTools
julia> @btime prod(1:$10);
7.000 ns (0 allocations: 0 bytes)
julia> @btime prod(1:$100); # we know the output is wrong, but at least it's taking longer
30.050 ns (0 allocations: 0 bytes)
julia> @btime prod(1:$(big(10))); # arbitrary precision
4.457 μs (148 allocations: 3.18 KiB)
julia> @btime prod(1:$(big(100)));
47.800 μs (1588 allocations: 34.19 KiB)
julia> @btime factorial($(big(100))) # much more efficient than prod!!
318.696 ns (4 allocations: 104 bytes)
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000
There are likely better tools, but I really only use the timeit standard library in CPython. The output is the total time in seconds over all runs, so with one million runs the same number reads directly as microseconds (μs) per call:
>>> from math import factorial
>>> import timeit
>>> timeit.timeit('factorial(100)', number=1000000, globals=globals())
0.9695415999740362
>>> factorial(100)
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000
The takeaways here:
Computing with fixed-size integer instances is much more efficient (~10^3x) than the same number of operations on arbitrary precision instances, though for a factorial method the range of inputs that don't overflow is impractically small. In fact, Julia's factorial implementation for Int just looks up a precomputed table of the mere 20 values that don't overflow, and the table for UInt128 only extends that to 34 values, the 34th being unrepresentable by the signed Int128. Those tables are only 160 and 544 bytes of data.
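For example, on a 64-bit system:
julia> factorial(20) # the largest factorial that fits in Int64
2432902008176640000
julia> factorial(21)
ERROR: OverflowError: 21 is too large to look up in the table; consider using `factorial(big(21))` instead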
Working with separately allocated and semantically immutable BigInt values contributes significant heap allocation overhead, and forgoing algorithmic optimizations with a simple prod doesn't help either. factorial(::BigInt) simply forwards to GMP's fac_ui routine (in C), which uses an optimized algorithm and minimizes allocations for greater efficiency (~10^2x).
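That forwarding can be illustrated directly, though only as a sketch poking at internals: the __gmpz_fac_ui symbol and the Base.GMP.libgmp constant aren't public API, and this leans on Julia's BigInt sharing GMP's mpz_t layout (Base performs essentially the same ccall):
julia> n = big(0);  # preallocated BigInt to receive the result
julia> ccall((:__gmpz_fac_ui, Base.GMP.libgmp), Cvoid, (Ref{BigInt}, Culong), n, 100);
julia> n == factorial(big(100))
true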
CPython's math.factorial (also in C) also uses an optimized algorithm (with some similarities to GMP's) for its own arbitrary precision int, and it's also more efficient than prod(1:big(100)), though it falls a bit short (~3x) of GMP's. A Mojo reimplementation is unlikely to do better than either CPython or GMP.