Python big integer vs. Julia big integer arithmetic

The current state of the art is that it is necessary to use specialized timing utilities to compare timings. Does this make the outcome more useful?
The current outcome is that the results depend … varying from Python being four times faster than Julia to Julia being four times faster than Python.
In other words, even the more appropriate way of benchmarking the code does not provide a generally valid answer about the relative performance of Python and Julia. So there are apparently system parameters in play here whose impact on the result is greater than the impact of the programming language …

I just started an IPython REPL from the command line. The IPython REPL is far and away better than the regular Python REPL.

Yeah, always take benchmarks with at least a grain of salt.
Relative performance can vary quite a bit from system to system.

On that note, “2.6 GHz Intel i5” isn’t enough information to have any idea whether results may be similar. Knowing the generation of the Intel CPU is much more important, e.g. 2nd generation Intel Core lacks FMA instructions, and in general has substantially fewer instructions per clock on most workloads than 14th gen Intel.

Things like having FMA instructions or not can make a bigger difference in Julia than in languages without JITs, because Julia will generally target the host computer, while in other languages you often* get a shared lowest-common-denominator binary that won’t use them.

*But not always. Applications where it matters most are able to select the appropriate routine. There are also things like Intel’s distribution of Python that has some optimized routines.
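For reference, a quick way to check what Julia actually detected and targets on a given machine (a minimal sketch; the output shown is hypothetical and will differ per system):

julia> Sys.CPU_NAME          # LLVM's name for the detected microarchitecture
"znver3"                     # hypothetical output; varies per machine

julia> versioninfo()         # also reports the exact CPU model and word size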

My earlier timing was on an Intel 9940x.
On my M1 mac:

julia> @btime powermod(3,100000000000,102)
  712.216 ns (0 allocations: 0 bytes)
In [1]: %timeit pow(3,100000000000,102)
799 ns ± 2.01 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
2 Likes

In Julia on Linux, it is easy to investigate this sort of thing.
On another Intel CPU:

julia> using LinuxPerf; foreachf(f::F, N, args::Vararg{Any,K}) where {F,K} = foreach(_->Base.donotdelete(f(args...)),1:N)
foreachf (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults)" begin
           foreachf(powermod, 1_000_000, 3, 100000000000, 102)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.85e+09  100.0%  #  2.7 cycles per ns
┌ instructions             4.07e+09  100.0%  #  2.2 insns per cycle
│ branch-instructions      4.99e+08  100.0%  # 12.3% of insns
└ branch-misses            9.81e+04  100.0%  #  0.0% of branch insns
┌ task-clock               6.75e+08  100.0%  # 674.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Executing powermod 1 million times, I get an average of 2.2 IPC.
I can also see that a single powermod requires about 4 thousand instructions (given that a million needed 4.07e9), etc.
This can make it easier to compare implementations.
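Spelling out the arithmetic behind those derived figures (they are just ratios of the counters above):

# instructions per cycle:         4.07e9 / 1.85e9    ≈ 2.2
# instructions per powermod call: 4.07e9 / 1_000_000 ≈ 4.07e3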

For dependency chains, or if you have a particular piece of assembly to look at, tools like uica are great.

5 Likes

Your aversion to language-specific software is misplaced. Putting aside the capabilities, let’s look back at your command line workflow again. You start a timer, launch a language runtime, possibly JIT-compile some code, run the code, close the runtime, then check the timer. Only a fraction of that benchmark involved the big integer arithmetic you are interested in comparing. You had the idea of doing a benchmark without the code and subtracting it. If timings didn’t vary randomly, then that would be a good idea. They do, so you’re getting results with even more variation, to the point of impossible negative values.

Here’s an alternative: what if we start the timer strictly before running the code and check the timer right after? This has to happen while the runtime is active, and it is most naturally done with a wrapper written in that language. Didn’t you opine in another thread a couple days ago that all languages are redundant because they ultimately run machine code? That missed the mark significantly, but this is where you had a point, because the timer wrapper boils down to the same system-specific routine in any language. Repeat the calls to amortize the latency of operating system interrupts, and you’ve got yourself a barebones benchmarking library.
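For concreteness, here is a minimal sketch of such a wrapper in Julia (`bench` is a made-up helper name; real libraries like BenchmarkTools additionally handle calibration, warm-up, and statistics):

# Barebones benchmarking: start the clock inside the running process,
# repeat the call many times to amortize timer resolution and OS jitter,
# and report the per-call average.
function bench(f, args...; evals=1_000_000)
    f(args...)                           # warm up (triggers any JIT compilation)
    t0 = time_ns()
    for _ in 1:evals
        Base.donotdelete(f(args...))     # keep the call from being optimized away
    end
    return (time_ns() - t0) / evals      # average nanoseconds per call
end

# e.g. bench(powermod, 3, 100000000000, 102)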

This doesn’t even preclude a command line interface. The py command allows timeit usage:

>py -m timeit "pow(3,100000000000,102)"
500000 loops, best of 5: 847 nsec per loop

That means that in 5 benchmarks with 500,000 loops each, the fastest benchmark averaged 847 ns per loop. It’s a pretty good way to reduce the influence of OS jitter, though it can’t eliminate it; running it again gave 854 ns, for example.

And for the curious, CPython does powermod ~4x faster than Julia on my system (not really this much; a following comment traces the spurious factor to the laptop running off battery and the persistent runtime):

julia> @benchmark powermod($3, $100000000000, $102)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.362 μs …  12.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.432 μs ± 254.298 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
...
 Memory estimate: 0 bytes, allocs estimate: 0.

I get a similar result if I try to match CPython’s benchmark parameters:

julia> @btime powermod($3, $100000000000, $102) samples=5 evals=500_000
  3.395 μs (0 allocations: 0 bytes)

Correct, welcome to benchmarking. With enough money you can get your hands on a climate-controlled machine running a real-time operating system to minimize the variation, but that’s not useful to the users on a wide variety of systems. Best to include some standard error bars, specify the platform, and move on to optimizing bigger performance discrepancies.
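For instance, one way to pull such error bars out of a BenchmarkTools run (a sketch; the `times` field holds the raw per-evaluation sample times in nanoseconds):

using BenchmarkTools, Statistics

t = @benchmark powermod($3, $100000000000, $102)
println("powermod: ", mean(t.times), " ± ", std(t.times), " ns per call")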

3 Likes
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults)" begin
                  foreachf(powermod, 1_000_000, 3, 100000000000, 102)
              end
ERROR: UndefVarError: `foreachf` not defined

Why did you only copy-paste half of his code?

2 Likes

Good question … Now I can’t find the other half, which defines the function, in the discussion thread … but I remember it was there.
Now I know … I had to run Julia as root … and forgot the first half in the new REPL …

~ $ sudo $(which julia)
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.4 (2024-06-04)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using LinuxPerf

julia> foreachf(f::F, N, args::Vararg{Any,K}) where {F,K} = foreach(_->Base.donotdelete(f(args...)),1:N)
foreachf (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults)" begin
                  foreachf(powermod, 1_000_000, 3, 100000000000, 102)
              end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.01e+09   50.0%  #  2.6 cycles per ns
┌ instructions             4.07e+09   50.0%  #  1.3 insns per cycle
│ branch-instructions      4.99e+08   50.0%  # 12.3% of insns
└ branch-misses            1.34e+04   50.0%  #  0.0% of branch insns
┌ task-clock               1.15e+09  100.0%  #  1.1 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 

13 posts were split to a new topic: Optimizing powermod

Gonna have to take the timeit timings with a grain of salt now. I figured I might as well learn the library, but it appears that the CLI timings don’t match the API timings:

___> py -m timeit "pow(3,100000000000,102)"
500000 loops, best of 5: 840 nsec per loop
___> py
Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit; timeit.main(["pow(3,100000000000,102)"]) # CLI internal
100000 loops, best of 5: 2.93 usec per loop
>>> t=timeit.Timer("pow(3,100000000000,102)"); r=5; n=500_000; [x/n for x in t.repeat(r,n)]
[2.932496200082824e-06, 2.999519200064242e-06, 2.920207800110802e-06, 2.9131800001487138e-06, 2.923075399827212e-06]

I looked into the internal main that the CLI command calls, and I don’t see anything off. The 3-4x differences are too consistent over repeats to be random.
PS: an extreme system variation on the Julia side, a sudden 3.4x speedup:

julia> @benchmark powermod($3, $100000000000, $102) samples=5 evals=500_000
BenchmarkTools.Trial: 5 samples with 500000 evaluations.
 Range (min … max):  1.001 μs … 1.007 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.002 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.003 μs ± 2.348 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
...
 Memory estimate: 0 bytes, allocs estimate: 0.

Update: I plugged my laptop in to charge and almost made a Stack Overflow post, but timeit became a consistent 750–850 ns everywhere, while BenchmarkTools is giving 850–1050 ns. Wish I could say the earlier CLI vs function call discrepancy was unusual, but I wouldn’t put it past some layer of the computer to decide that running off battery with a persistent runtime was worth throttling the CPU for. Should have expected this; below a 10x difference you have to start worrying about looking at the CPU wrong, let alone risking luxuries like background music.

>>> timeit.main(["pow(3,100000000000,102)"]) # 85% unplugged
100000 loops, best of 5: 3.16 usec per loop
>>> timeit.main(["pow(3,100000000000,102)"]) # charging plugged
500000 loops, best of 5: 779 nsec per loop
>>> timeit.main(["pow(3,100000000000,102)"]) # 85% unplugged again
100000 loops, best of 5: 3 usec per loop

In general it is better to benchmark something that returns a result and takes at least a significant amount of time. Then you can use a system-level (or even subjective) performance metric:

% time python3 test.py 
444105939

real	0m12,528s
user	0m12,520s
sys	0m0,008s

% time julia test.jl
444004992

real	0m4,237s
user	0m4,530s
sys	0m0,852s

With:

import random
n = 10_000_000
s = 0
for i in range(n):  # match the 10,000,000 iterations of the Julia loop
    s += pow(3, 100000000000 + random.randint(1,10), 102)
print(s)

and

function main() 
    s = 0
    for i in 1:10_000_000
        s += powermod(3, 100000000000 + rand(1:10), 102)
    end
    return s
end
println(main())

Yet here, the performance of the integer random number generator also affects the test.
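If one wanted to separate the two costs, a sketch would be to time the random draws on their own and run powermod over precomputed exponents (BenchmarkTools assumed; the large @btime run takes a while):

using BenchmarkTools

@btime rand(1:10)                                    # cost of one random draw
exps = [100000000000 + rand(1:10) for _ in 1:10_000_000]
@btime sum(e -> powermod(3, e, 102), $exps)          # powermod-only cost over the same workload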

1 Like
~ $ time julia test.jl
444047862

real	0m12.306s
user	0m12.296s
sys	0m0.132s
~ $ time python3.13 test.py
444026013

real	0m21.877s
user	0m21.872s
sys	0m0.004s