That's exactly what I was looking for.
Did you try launching Julia with the parameter
--gcthreads=8,1
?
Here 8 would be the number of threads for the GC; set it to the number of physical cores of your machine, or one or two fewer.
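A quick way to double-check what you actually got at runtime (a minimal sketch; the 8,1 values above are just an example):

# started e.g. with: julia -t 8 --gcthreads=8,1 main.jl
println("compute threads: ", Threads.nthreads())
println("GC threads:      ", Threads.ngcthreads())  # reports the --gcthreads setting on recent Julia versions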
Your code does have side-effects! Just maybe not in the way you typically think of it. It uses memory, and lots of it. You're doing hundreds of millions of allocations, 8 GB worth of it! Memory (and memory bandwidth) is a shared resource. As is the GC.
It's not just the matrix itself: each and every single BigInt is independently allocated. Worse, every computation with a BigInt throws it away and allocates a new one. They're not the fastest datatype in the world, that's for sure.
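A minimal illustration of that "every operation allocates" point (the values are arbitrary, just to show the allocation counts):

using BenchmarkTools

a = big(2)^200
b = big(3)^150
@btime $a * $b;          # every product is a freshly allocated BigInt
@btime mod($a * $b, $b); # each intermediate result adds another allocation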
Note that the % GC time doesn't count time that the threads wait for a GC to happen. Every time you allocate a new object, the GC will check the "memory pressure" to see if it needs to do a collection. If it does, it'll raise a flag and wait for all the threads to hit a "safe point," at which point marking (multithreaded) and sweeping (single-threaded) is done. That waiting time isn't accounted for in the metrics that are printed out but may be where some of this is going.
In this case, time to safepoint should be a non-issue. All the threads are allocating, so they should all be seeing the GC request almost immediately.
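If you want to see the individual collections behind the % GC time number, one option is the built-in GC logging available in recent Julia versions; a minimal sketch:

GC.enable_logging(true)   # prints a line per collection with its pause time and how much was collected
# ... run the workload here ...
GC.enable_logging(false)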
Here is the best I could get; I hope I didn't change the logic.
module
module Secp256k1
using BitIntegers
const A = uint256"0"
const B = uint256"7"
const P = uint256"115792089237316195423570985008687907853269984665640564039457584007908834671663"
const N = uint256"115792089237316195423570985008687907852837564279074904382605163141518161494337"
const Gx = uint256"55066263022277343669578718895168534326250603453777594175500187360389116729240"
const Gy = uint256"32670510020758816978083085130507043184471273380659243275938904335757337482424"
const G = (Gx, Gy)
function jacobian_double(p)
    # Double a point in elliptic curves
    # :param p: Point you want to double
    # :return: Point that represents 2*p
    # P: Prime modulus of the equation Y^2 = X^3 + A*X + B (mod P)
    # A: Coefficient of the first-order term of the equation (zero for secp256k1)
    p_x, p_y, p_z = p
    if p_y == 0
        return uint256"0", uint256"0", uint256"1"
    end
    ysq = mod((p_y*p_y), Secp256k1.P)
    S = mod((4 * p_x * ysq), Secp256k1.P)
    # M = mod(3 * p_x^2 + A * p_z^4, Secp256k1.P), but A is zero for secp256k1
    M = mod((3 * p_x*p_x), Secp256k1.P)
    nx = mod((M*M - 2 * S), Secp256k1.P)
    ny = mod((M * (S - nx) - 8 * ysq*ysq), Secp256k1.P)
    nz = mod((2 * p_y * p_z), Secp256k1.P)
    return nx, ny, nz
end
function jacobian_add(p, q)
    # Add two points in elliptic curves
    # :param p: First Point you want to add
    # :param q: Second Point you want to add
    # :return: Point that represents the sum of First and Second Point
    p_x, p_y, p_z = p
    q_x, q_y, q_z = q
    if p_y == 0
        return q_x, q_y, q_z
    end
    if q_y == 0
        return p_x, p_y, p_z
    end
    U1 = mod((p_x * q_z*q_z), Secp256k1.P)
    U2 = mod((q_x * p_z*p_z), Secp256k1.P)
    S1 = mod((p_y * q_z*q_z*q_z), Secp256k1.P)
    S2 = mod((q_y * p_z*p_z*p_z), Secp256k1.P)
    if U1 == U2
        if S1 != S2
            return uint256"0", uint256"0", uint256"1"
        end
        return jacobian_double(p)
    end
    H = U2 - U1
    R = S2 - S1
    H2 = mod((H * H), Secp256k1.P)
    H3 = mod((H * H2), Secp256k1.P)
    U1H2 = mod((U1 * H2), Secp256k1.P)
    nx = mod((R * R - H3 - 2 * U1H2), Secp256k1.P)
    ny = mod((R * (U1H2 - nx) - S1 * H3), Secp256k1.P)
    nz = mod((H * p_z * q_z), Secp256k1.P)
    return nx, ny, nz
end
function jacobian_multiply(p, n)
    # Multiply a point by a scalar in elliptic curves
    # :param p: Point to multiply
    # :param n: Scalar to multiply by
    # :return: Point that represents n*p
    p_x, p_y, p_z = p
    if p_y == 0 || n == 0
        return uint256"0", uint256"0", uint256"1"
    end
    if n == 1
        return p
    end
    if n < 0 || n >= Secp256k1.N
        return jacobian_multiply(p, mod(n, Secp256k1.N))
    end
    if mod(n, 2) == 0
        return jacobian_double(jacobian_multiply(p, div(n, 2)))
    end
    pd = jacobian_double(jacobian_multiply(p, div(n, 2)))
    return jacobian_add(pd, p)
end
function to_jacobian(p)
    p_x, p_y = p
    return p_x, p_y, uint256"1"
end
function _inv(x, n)
    # Extended Euclidean Algorithm. It's the 'division' in elliptic curves
    # :param x: Divisor
    # :param n: Mod for division
    # :return: Value representing the division
    if x == 0
        return uint256"0"
    end
    lm = uint256"1"
    hm = uint256"0"
    low = mod(x, n)
    high = n
    r = uint256"0"
    while low > 1
        r = div(high, low)
        nm = hm - lm * r
        nw = high - low * r
        high = low
        hm = lm
        low = nw
        lm = nm
    end
    return mod(lm, n)
end
function from_jacobian(p)
    # Convert point back from Jacobian coordinates
    # :param p: Point in Jacobian coordinates
    # :param P: Prime modulus of the equation Y^2 = X^3 + A*X + B (mod Secp256k1.P)
    # :return: Point in default coordinates
    p_x, p_y, p_z = p
    z = _inv(p_z, Secp256k1.P)
    x = mod((p_x * z^2), Secp256k1.P)
    y = mod((p_y * z^3), Secp256k1.P)
    return x, y
end
function der_keys(n)
    pubj = jacobian_multiply(to_jacobian(Secp256k1.G), n)
    return from_jacobian(pubj)
end
end
code
# using Base.Threads
using Base.Iterators
using BenchmarkTools
using BitIntegers
include("./secp256k1.jl")
function gen_keys(k_start, k_num)
    k_end = k_start + k_num - 1
    k_range = k_start:k_end
    data_size = length(k_range)
    keys = zeros(UInt256, data_size, 2)
    Threads.@threads for i in eachindex(k_range)
        k = k_range[i]
        x, y = Secp256k1.der_keys(k)
        keys[i, 1] = x
        keys[i, 2] = y
    end
    return keys
end
function test_gen_keys_range()
    gen_keys(uint256"123", uint256"100000")
end
test_gen_keys_range()
@time test_gen_keys_range();
benchmark
> julia -t 1 --project main.jl
0.304315 seconds (10 allocations: 6.104 MiB, 11.04% gc time)
> julia -t 20 --project main.jl
0.050477 seconds (105 allocations: 6.117 MiB)
The speedup is insane; BitIntegers is wow.
And for the 10000 case (which was 2.096 s and 705.314 ms before) we now have:
> julia -t 1 --project main.jl
0.026495 seconds (10 allocations: 625.828 KiB)
> julia -t 20 --project main.jl
0.006071 seconds (105 allocations: 639.047 KiB)
Just tested: I get exactly the same time with fetching. It's not linear on my machine, but it's really memory related (log-range plot).
Even if we assume that parallelization is not obstructed by any realistic factors like memory pressure, the theoretical maximum speedup is sub-linear when you consider that only a fraction of the program is accelerated (Amdahl's law). It's shockingly punishing: a 16x speedup of 95% of the original runtime gives an overall speedup of at most 9.143x.
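For reference, with parallel fraction p on N threads, Amdahl's law gives a speedup of 1 / ((1 - p) + p / N); with p = 0.95 and N = 16 that is 1 / (0.05 + 0.95/16) ≈ 9.14, which is where the 9.143x above comes from.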
Is there not a library for arbitrary-precision integers that only allocates if the value exceeds a threshold close to Int (I forgot the terminology for this kind of data structure)? This wouldn't matter if the values are usually larger, though.
To iterate on Benny's comment about Amdahl's law: in your benchmarks for test_gen_keys2, the runtime with 1 thread is roughly 30 seconds with 15% GC time. Assuming the GC is not parallelized at all and the rest is perfectly parallelized, then 4 threads would give a theoretical time of 30 * 0.15 + 30 * 0.85 / 4 = 10.875 seconds, which is not too far off from the 13 seconds you get. Similarly, 16 threads would give a theoretical time of 30 * 0.15 + 30 * 0.85 / 16 = 6.09375 seconds, which again is not too far off from your 9.5 seconds.
One alternative to BitIntegers.jl could be to use BigInt together with MutableArithmetics.jl to reduce the number of allocations; that would, however, likely require rather intrusive changes to the code.
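A minimal sketch of the kind of change that would mean, assuming MutableArithmetics' add!! API (the function and values here are made up for illustration):

using MutableArithmetics
const MA = MutableArithmetics

# Reuse one BigInt accumulator instead of allocating a fresh result per operation.
function accumulate_inplace(xs)
    acc = BigInt(0)
    for x in xs
        acc = MA.add!!(acc, x)   # may mutate acc in place; always use the returned value
    end
    return acc
end

The !! convention means "mutate if possible, otherwise fall back to allocating," which is why the return value has to be reassigned.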
The fmpz type in the FLINT library works like this. I do, however, not think this can be fully leveraged in Julia: it would likely require some sort of finalizer to check if an allocation was made or not, and finalizers can only be attached to mutable types, and those are tracked by the gc (in principle one could possibly insert the finalizers inline with some escape analysis; I think Julia can do something like that, but I don't know how flexible it is).
Looks more or less like how Nemo.jl wraps FLINT's fmpz: Nemo.jl/src/flint/FlintTypes.jl at master · wbhart/Nemo.jl · GitHub
I was imagining something more on the Julia side, which wouldn't need finalizers to manage memory allocated in another language:
struct MaybeBig{I<:Integer}
    inline::I           # except typemax as overflow flag
    overflow::BigInt    # must trail for incomplete new #36789
    MaybeBig{I}(i::I) where I = new(i)
    MaybeBig{I}(i::BigInt) where I = new(typemax(I), i)
end
MaybeBig(i::I) where I = MaybeBig{I}(i)
# then all the Number arithmetic methods
except I'm not actually sure if the incomplete initialization manages zero allocation; hopefully someone can explain if I'm just benchmarking wrong:
julia> @btime MaybeBig($1) # allocates?
5.500 ns (1 allocation: 32 bytes)
MaybeBig{Int64}(1, #undef)
julia> Base.one(::Type{MaybeBig{I}}) where I = MaybeBig{I}(one(I))
julia> @btime ones($(MaybeBig{Int}), $5) # only allocates Array
126.988 ns (2 allocations: 96 bytes)
5-element Vector{MaybeBig{Int64}}:
MaybeBig{Int64}(1, #undef)
MaybeBig{Int64}(1, #undef)
MaybeBig{Int64}(1, #undef)
MaybeBig{Int64}(1, #undef)
MaybeBig{Int64}(1, #undef)
The ones function only allocates one instance of one, so you get one allocation for the array and one for the single allocated element. For example, for BigInt we have
julia> xs = ones(BigInt, 2)
2-element Vector{BigInt}:
1
1
julia> xs[1] === xs[2]
true
I don't know if there is any way around the allocation; I think the gc needs to be involved to keep track of the #undef as well.
Good catch on the fill-ing, because I used 5 assuming each element would add an allocation, but there are actually 2+ allocations for Arrays wrapping Memory buffers in v1.11+, which suggests to me that ones(MaybeBig{Int}, ...) isn't allocating the one instance. It seems to be corroborated by benchmarking BigInt arrays:
julia> @btime BigInt($1)
57.099 ns (2 allocations: 40 bytes)
1
julia> @btime Vector{BigInt}(undef, $2)
23.015 ns (2 allocations: 80 bytes)
2-element Vector{BigInt}:
#undef
#undef
julia> @btime Vector{BigInt}(undef, $100); # still 2 allocations for 100 undef
65.164 ns (2 allocations: 928 bytes)
julia> @btime ones(BigInt, $2) # 1 BigInt filling 1 2-Array adds up
82.070 ns (4 allocations: 120 bytes)
2-element Vector{BigInt}:
1
1
Ah, good point! I forgot about the double allocation from Memory. It seems like the allocation is avoided due to constant folding, so for example
julia> five() = MaybeBig{Int}(5)
five (generic function with 1 method)
julia> @btime five()
0.894 ns (0 allocations: 0 bytes)
MaybeBig{Int64}(5, #undef)
doesn't allocate either. It is still unclear to me if the allocation could be avoided in the general case, or if it relies upon the value being preallocated by the compiler. But at least the situation is better than I thought!
This is a beautiful demonstration of my perspective on optimization hunting in Julia. The satisfaction of taking this (relatively) huge, allocation-heavy, memory-intensive operation and boiling it down to 10 allocations is just :chefs_kiss:
And the 100x speedup is pleasant too.
No difference.
That 4.4x is nice, though not the ideal 20x. But why do you have any allocations at all? And why more with threading? It could be your limiter. I'm not sure, but Bumper.jl could help, though with you switching to a bits type it shouldn't be needed, since values are already deallocated early. With Bumper you can allocate in a hot loop and it will get freed immediately, stack-like, which is likely why Python is fast with its RC-GC, but it should be even faster with Bumper. Bumper seems to support threads, but I've not looked closely at it. BigInt and BigFloat are performance killers; both have better alternatives (in most cases, but since they're there, they're the go-to for many people), as you've discovered, which is why I want them gone from standard Julia...
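For reference, the Bumper.jl pattern being referred to looks roughly like this (a generic sketch, not tailored to the secp256k1 code):

using Bumper

function rowsums(A)
    out = zeros(eltype(A), size(A, 1))
    @no_escape begin
        tmp = @alloc(eltype(A), size(A, 2))   # bump-allocated scratch, reclaimed when the block ends
        for i in axes(A, 1)
            tmp .= @view A[i, :]
            out[i] = sum(tmp)
        end
    end
    return out
end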
@yolhan_mannes, this is a bit strange:
> julia -t 1 --project main.jl
0.304315 seconds (10 allocations: 6.104 MiB, 11.04% gc time)
That GC time doesn't happen with 20 threads. And it also seems to depend on other factors, given that it shows up with just 1 thread and the same constant overhead of 10 allocations.
[A recent change added 1 interactive thread as the new default, but for current versions you may want to get in the habit of -t 20,1 or -t auto,1. I doubt it matters in this case, though.]
Allocations are only coming from the original array created and from the fact that a fetch will always infer to Any, which also explains why they grow with the number of threads. Threads are limited by memory bandwidth; Julia only reports heap allocations, not stack ones, and uint256 takes quite a lot of stack space, which explains why threads start to wait sometimes and the performance stops improving. You can try running the code on a machine with a faster processor than mine (I have 8 GB of RAM); you will get a lot better results.
Bumper may get rid of the Any inference, but it won't make my RAM bigger, since the serial version of the code (without the macro) doesn't allocate at all.
Also, sometimes you actually want a BigInt, I think, because you can use it as a Ref and code with it like it was a 1-element array, which is hard but may be even better than my version with BitIntegers. The heap is not always an enemy, and it can reduce stack pressure a lot when used well.
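On the fetch-infers-to-Any point, a minimal sketch of one workaround, with an assumed concrete result type purely for illustration:

using Base.Iterators: partition

work(r) = (sum(r), Threads.threadid())   # hypothetical worker with a known return type

function run_typed(it)
    chunks = partition(1:it, cld(it, Threads.nthreads()))
    tasks = [Threads.@spawn work(c) for c in chunks]
    # annotating fetch gives a Vector{Tuple{Int,Int}} instead of Vector{Any}
    return [fetch(t)::Tuple{Int,Int} for t in tasks]
end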
Guys. Please. I have no idea why some of you are gaslighting me.
I have been coding for more than 20 years on different platforms with different languages; I know how thread scaling and computation work. And what I am seeing here is not how it works.
You could have just written from the beginning that BigInt in Julia is very slow and that thread scaling for it is broken.
Here is a related issue from 2013 (sic!): BigInt and BigFloat performance is poor · Issue #4176 · JuliaLang/julia · GitHub
And a simple demonstration of how broken it is: BigInt(1) == 0 is orders of magnitude slower than iszero(BigInt(1)).
It's insane to have such an inconsistency and not optimize it.
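For anyone wanting to reproduce that comparison, something along these lines (timings will of course vary by machine):

using BenchmarkTools

x = BigInt(1)
@btime $x == 0       # the comparison reported as slow above
@btime iszero($x)    # the direct zero check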
To summarize this: here is a simple test of simple computations, without recursion, arrays, or anything fancy (the computations are not identical between Int and BigInt; they are not meant to compare relative performance, just to create load).
using Base.Threads
using Base.Iterators
function gen_bigint(k_r)
    a = BigInt(0)
    m = BigInt(1234567)
    for i in k_r
        a = mod(BigInt(i) + a, m) + a
    end
    return a, Threads.threadid()
end
function test_bigint(it)
    nth = Threads.nthreads()
    part_size = div(it, nth)
    tasks = []
    for k_r in partition(1:it, part_size)
        t() = gen_bigint(k_r)
        push!(tasks, Threads.@spawn t())
    end
    results = fetch.(tasks)
    return results
end
function gen_int(k_r)
    a = 0
    for i in k_r
        a = i + a^2 - div(a, 2)
        a = mod(a, 666)
    end
    return a, Threads.threadid()
end
function test_int(it)
    nth = Threads.nthreads()
    part_size = div(it, nth)
    tasks = []
    for k_r in partition(1:it, part_size)
        t() = gen_int(k_r)
        push!(tasks, Threads.@spawn t())
    end
    results = fetch.(tasks)
    return results
end
nth = Threads.nthreads()
ngc = Threads.ngcthreads()
num_samples = 100000000
println("Threads num $nth")
println("GC threads num $ngc")
println("Num samples $num_samples")
println("test_bigint:")
test_bigint(1000)
@time results = test_bigint(num_samples)
println("results: $results")
println("test_int:")
test_int(1000)
@time results = test_int(10000000000)
println("results: $results")
Here are the full results for 1- to 16-thread scaling:
Summary
JULIA_NUM_THREADS=1 julia test_bigint_threads.jl
Threads num 1
GC threads num 1
Num samples 100000000
test_bigint:
28.750614 seconds (800.00 M allocations: 16.391 GiB, 26.91% gc time)
results: Tuple{BigInt, Int64}[(61075029928496, 1)]
test_int:
58.943936 seconds (17 allocations: 1024 bytes)
results: [(118, 1)]
JULIA_NUM_THREADS=2 julia test_bigint_threads.jl
Threads num 2
GC threads num 1
Num samples 100000000
test_bigint:
32.696664 seconds (800.00 M allocations: 16.391 GiB, 26.58% gc time)
results: Tuple{BigInt, Int64}[(30537504084030, 1), (30537506905054, 2)]
test_int:
30.375354 seconds (23 allocations: 1.547 KiB)
results: [(10, 1), (118, 2)]
JULIA_NUM_THREADS=3 julia test_bigint_threads.jl
Threads num 3
GC threads num 1
Num samples 100000000
test_bigint:
27.283528 seconds (800.00 M allocations: 16.391 GiB, 37.36% gc time)
results: Tuple{BigInt, Int64}[(20358321455018, 1), (20242228944234, 2), (20525274413684, 3), (73, 1)]
test_int:
20.396126 seconds (35 allocations: 2.641 KiB)
results: [(291, 1), (251, 2), (104, 3), (10, 2)]
JULIA_NUM_THREADS=4 julia test_bigint_threads.jl
Threads num 4
GC threads num 2
Num samples 100000000
test_bigint:
23.888192 seconds (800.00 M allocations: 16.391 GiB, 42.34% gc time)
results: Tuple{BigInt, Int64}[(15268747548911, 1), (15372187425966, 2), (15268757083666, 3), (15448385793992, 4)]
test_int:
15.385837 seconds (35 allocations: 2.641 KiB)
results: [(655, 1), (10, 2), (618, 3), (118, 4)]
JULIA_NUM_THREADS=5 julia test_bigint_threads.jl
Threads num 5
GC threads num 2
Num samples 100000000
test_bigint:
21.201918 seconds (800.00 M allocations: 16.391 GiB, 45.35% gc time)
results: Tuple{BigInt, Int64}[(12214994302332, 1), (12167108529529, 2), (12167100499553, 3), (12467594719346, 5), (12424022523641, 4)]
test_int:
12.579772 seconds (41 allocations: 3.188 KiB)
results: [(392, 1), (624, 3), (251, 2), (454, 4), (118, 5)]
JULIA_NUM_THREADS=6 julia test_bigint_threads.jl
Threads num 6
GC threads num 3
Num samples 100000000
test_bigint:
20.473163 seconds (800.00 M allocations: 16.391 GiB, 46.51% gc time)
results: Tuple{BigInt, Int64}[(10179169253058, 1), (10302597595199, 2), (10262591488746, 4), (10262649550577, 5), (10280752043680, 3), (10396906317057, 6), (1061, 2)]
test_int:
10.784252 seconds (53 allocations: 4.281 KiB)
results: [(559, 1), (392, 4), (434, 6), (624, 2), (10, 5), (251, 3), (37, 3)]
JULIA_NUM_THREADS=7 julia test_bigint_threads.jl
Threads num 7
GC threads num 3
Num samples 100000000
test_bigint:
20.080860 seconds (800.00 M allocations: 16.391 GiB, 46.96% gc time)
results: Tuple{BigInt, Int64}[(8725000975428, 2), (8752963738503, 3), (8725043826323, 1), (8939618798850, 4), (8482380681554, 6), (8883664159759, 5), (8790325776816, 7), (217, 4)]
test_int:
9.833347 seconds (59 allocations: 4.828 KiB)
results: [(470, 1), (35, 5), (82, 2), (236, 6), (481, 3), (302, 7), (251, 4), (37, 2)]
JULIA_NUM_THREADS=8 julia test_bigint_threads.jl
Threads num 8
GC threads num 4
Num samples 100000000
test_bigint:
20.573446 seconds (800.00 M allocations: 16.391 GiB, 47.62% gc time)
results: Tuple{BigInt, Int64}[(7634363638259, 2), (7735101959209, 3), (7686099648729, 1), (7707850373019, 4), (7634357956960, 5), (7716029625162, 6), (7724218630957, 7), (7729639737538, 8)]
test_int:
8.708045 seconds (59 allocations: 4.828 KiB)
results: [(529, 2), (655, 4), (645, 3), (10, 1), (627, 5), (618, 6), (40, 8), (118, 7)]
JULIA_NUM_THREADS=9 julia test_bigint_threads.jl
Threads num 9
GC threads num 4
Num samples 100000000
test_bigint:
20.085430 seconds (800.00 M allocations: 16.391 GiB, 47.25% gc time)
results: Tuple{BigInt, Int64}[(6786098342200, 4), (6858742337621, 1), (7035301071999, 2), (6747408293575, 3), (6730499267950, 5), (6892572827717, 7), (6841716913293, 6), (6846612810022, 8), (6998996253406, 9), (73, 4)]
test_int:
8.987191 seconds (72 allocations: 6.281 KiB)
results: [(20, 3), (392, 4), (291, 2), (624, 5), (125, 7), (251, 1), (279, 6), (454, 9), (104, 8), (10, 8)]
JULIA_NUM_THREADS=10 julia test_bigint_threads.jl
Threads num 10
GC threads num 5
Num samples 100000000
test_bigint:
19.751144 seconds (800.00 M allocations: 16.391 GiB, 48.99% gc time)
results: Tuple{BigInt, Int64}[(6107484437655, 2), (6068298338367, 4), (6083539359841, 6), (6135824798717, 3), (6083535005291, 5), (6107499477187, 7), (6233789800311, 1), (6090096369906, 8), (6212027436953, 10), (6111842616762, 9)]
test_int:
8.062407 seconds (72 allocations: 6.281 KiB)
results: [(559, 3), (392, 2), (434, 6), (624, 5), (10, 1), (251, 8), (1, 9), (454, 10), (585, 4), (118, 7)]
JULIA_NUM_THREADS=11 julia test_bigint_threads.jl
Threads num 11
GC threads num 5
Num samples 100000000
test_bigint:
19.044380 seconds (800.00 M allocations: 16.391 GiB, 48.80% gc time)
results: Tuple{BigInt, Int64}[(5552264329166, 2), (5599801681625, 4), (5607735359291, 5), (5564132037542, 7), (5556241035402, 3), (5738354832830, 6), (5536444219238, 1), (5581969227487, 8), (5552274537560, 10), (5649247656400, 11), (5601761621436, 9), (73, 6)]
test_int:
7.425482 seconds (84 allocations: 7.375 KiB)
results: [(23, 2), (554, 4), (586, 5), (55, 3), (336, 7), (73, 9), (266, 8), (446, 1), (143, 10), (307, 11), (104, 6), (10, 4)]
JULIA_NUM_THREADS=12 julia test_bigint_threads.jl
Threads num 12
GC threads num 6
Num samples 100000000
test_bigint:
18.786027 seconds (800.00 M allocations: 16.391 GiB, 48.78% gc time)
results: Tuple{BigInt, Int64}[(5089577758368, 2), (5105908164086, 7), (5151311388087, 5), (5089588764364, 3), (5131296929334, 8), (5171266823968, 6), (5131321962125, 4), (5109561037625, 10), (5140344519212, 1), (5095056451393, 9), (5198439879982, 12), (5334533161236, 11), (1061, 8)]
test_int:
6.808045 seconds (90 allocations: 7.922 KiB)
results: [(613, 6), (559, 5), (590, 8), (392, 4), (215, 2), (434, 7), (297, 3), (624, 9), (338, 1), (10, 10), (348, 12), (251, 11), (37, 3)]
JULIA_NUM_THREADS=13 julia test_bigint_threads.jl
Threads num 13
GC threads num 6
Num samples 100000000
test_bigint:
18.543839 seconds (800.00 M allocations: 16.391 GiB, 48.72% gc time)
results: Tuple{BigInt, Int64}[(4698059727770, 3), (4641177758222, 2), (4816983273133, 5), (4631066829879, 8), (4739957578823, 4), (4771811346967, 9), (4731617019558, 7), (4812017665843, 11), (4691349323106, 6), (4746661098817, 10), (4795195101698, 1), (4703103281310, 12), (4703141023297, 13), (33717, 3)]
test_int:
6.434850 seconds (96 allocations: 8.469 KiB)
results: [(550, 2), (652, 7), (148, 6), (378, 3), (84, 8), (3, 9), (447, 4), (492, 5), (477, 10), (595, 11), (268, 1), (251, 12), (279, 13), (75, 13)]
JULIA_NUM_THREADS=14 julia test_bigint_threads.jl
Threads num 14
GC threads num 7
Num samples 100000000
test_bigint:
18.876498 seconds (800.00 M allocations: 16.391 GiB, 48.84% gc time)
results: Tuple{BigInt, Int64}[(4362488084766, 2), (4388959700729, 3), (4376464678471, 4), (4476022660834, 5), (4362525243172, 6), (4491583204000, 7), (4469823990974, 8), (4342271028017, 9), (4241171132169, 11), (4454256192151, 10), (4441848824150, 14), (4384284698990, 1), (4395172375712, 13), (4381167478013, 12), (217, 7)]
test_int:
6.653705 seconds (102 allocations: 9.016 KiB)
results: [(330, 3), (470, 2), (604, 7), (35, 4), (236, 5), (82, 6), (434, 8), (236, 10), (307, 9), (481, 12), (23, 11), (302, 13), (239, 1), (251, 14), (37, 10)]
JULIA_NUM_THREADS=15 julia test_bigint_threads.jl
Threads num 15
GC threads num 7
Num samples 100000000
test_bigint:
19.011303 seconds (800.00 M allocations: 16.391 GiB, 48.50% gc time)
results: Tuple{BigInt, Int64}[(4071653499242, 4), (4022313532934, 6), (4017958718972, 7), (4115187781471, 8), (4112345312084, 5), (4084773231120, 3), (3980240712510, 2), (4123941346023, 9), (4115181597005, 11), (4184914383280, 10), (4128273187734, 12), (3962835775313, 14), (4071671002179, 13), (4090538393487, 1), (4136986011575, 15), (66485, 7)]
test_int:
6.137018 seconds (108 allocations: 9.562 KiB)
results: [(37, 5), (37, 6), (37, 2), (37, 8), (37, 3), (37, 12), (37, 9), (37, 7), (37, 4), (37, 11), (37, 14), (37, 15), (37, 13), (37, 1), (37, 10), (69, 5)]
JULIA_NUM_THREADS=16 julia test_bigint_threads.jl
Threads num 16
GC threads num 8
Num samples 100000000
test_bigint:
19.553159 seconds (800.00 M allocations: 16.391 GiB, 48.80% gc time)
results: Tuple{BigInt, Int64}[(3817175987606, 3), (3857980142929, 2), (3867553456853, 7), (3857999624246, 10), (3843047030148, 5), (3908379562627, 12), (3853920824529, 11), (3882505235686, 15), (3817186011689, 6), (3826734634273, 14), (3858010401490, 13), (3844412127391, 9), (3862081731701, 4), (3770925489528, 8), (3864811087583, 1), (3807658524291, 16)]
test_int:
5.784457 seconds (108 allocations: 9.562 KiB)
results: [(121, 4), (529, 10), (233, 9), (655, 6), (554, 3), (645, 8), (1, 7), (10, 2), (541, 11), (627, 5), (254, 13), (618, 12), (273, 14), (40, 1), (236, 15), (118, 16)]
And here is a visualization:
It also looks like there are no loop-unrolling optimizations.
And as for BigInt, considering the low CPU utilization I see during the BigInt test, it looks like some sort of global lock is happening.
And if access to GMP functions is wrapped in some sort of global synchronization, then it is a very bad design.
Thank you for the help.
I will see if this lib will work for my needs (not only these particular computations).
Just a small remark: we need to use uint512 here, and even more so for the intermediate results. I'll see what I can do here. Thank you one more time.
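A minimal illustration of why the intermediates need the wider type (using BitIntegers' predefined UInt512; the values are just for demonstration):

using BitIntegers

# P is close to 2^256, so a product of two field elements needs up to 512 bits
# before the reduction mod P; doing it in UInt256 silently wraps.
P = uint256"115792089237316195423570985008687907853269984665640564039457584007908834671663"
a = P - uint256"1"

wrapped = a * a                                      # overflows UInt256 and wraps mod 2^256
widened = mod(UInt512(a) * UInt512(a), UInt512(P))   # widen first, then reduce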
And someone got offended.
I am just asking people not to offend and gaslight me. Zero help was provided by the people I am speaking to. So I am asking them not to act in offensive ways and to be polite.