Threads are not speeding up computation much

I was looking for exactly that.

Did you try to launch Julia with the parameter:

--gcthreads=8,1

?

Here, 8 is the number of threads dedicated to the GC; set it to the number of physical cores of your machine, or one or two fewer.
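If it helps, you can confirm what a given launch actually picked up from inside the session (just a quick check on my side, not from the original post):

julia> Threads.nthreads()    # worker threads from -t / JULIA_NUM_THREADS

julia> Threads.ngcthreads()  # GC threads from --gcthreads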

Your code does have side-effects! Just maybe not in the way you typically think of it. It uses memory, and lots of it. You’re doing hundreds of millions of allocations, 8GBs worth of it! Memory — and memory bandwidth — is a shared resource. As is the GC.

It’s not just the matrix itself, but each and every single BigInt is independently allocated. Worse, every computation with a BigInt throws it away and allocates a new one. They’re not the fastest datatype in the world, that’s for sure.

Note that the % GC time doesn’t count time that the threads wait for a GC to happen. Every time you allocate a new object, the GC will check the “memory pressure” to see if it needs to do a collection. If it does, it’ll raise a flag and wait for all the threads to hit a “safe point,” at which point marking (multithreaded) and sweeping (single-threaded) is done. That waiting time isn’t accounted for in the metrics that are printed out but may be where some of this is going.
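To see how quickly this adds up, here is a rough illustration of mine (not the benchmark above): every arithmetic operation on a BigInt heap-allocates a fresh result, so a tight loop allocates on every single iteration.

function sum_bigints(n)
    s = BigInt(0)
    for i in 1:n
        s += BigInt(i)   # BigInt(i) allocates, and so does the new sum
    end
    return s
end

# @time sum_bigints(10^6) should report allocations on the order of millions,
# while the same loop written with plain Int reports essentially none.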

8 Likes

In this case, time to safepoint should be a non-issue. All the threads are allocating, so they should all be seeing the GC request almost immediately.

3 Likes

Here is the best I could get; I hope I didn't change the logic.
module

module Secp256k1
using BitIntegers
const A = uint256"0"
const B = uint256"7"
const P = uint256"115792089237316195423570985008687907853269984665640564039457584007908834671663"
const N = uint256"115792089237316195423570985008687907852837564279074904382605163141518161494337"
const Gx = uint256"55066263022277343669578718895168534326250603453777594175500187360389116729240"
const Gy = uint256"32670510020758816978083085130507043184471273380659243275938904335757337482424"

const G = (Gx, Gy)

function jacobian_double(p)
    # """
    # Double a point in elliptic curves
    # :param p: Point you want to double
    # :return: Point that represents the sum of First and Second Point
    # """
    
    #:param P: Prime number in the module of the equation Y^2 = X^3 + A*X + B (mod )
    #:param A: Coefficient of the first-order term of the equation Y^2 = X^3 + A*X + B (mod Secp256k1.p)
    p_x, p_y, p_z = p


    if p_y == 0
        return uint256"0", uint256"0", uint256"1"
    end

    ysq = mod((p_y*p_y), Secp256k1.P)

    S = mod((4 * p_x * ysq), Secp256k1.P)
    # M = (3 * p_x ** 2 + A * p_z ** 4) % cls.P
    # A is zero
    M = mod((3 * p_x*p_x), Secp256k1.P)
    nx = mod((M*M - 2 * S), Secp256k1.P)
    ny = mod((M * (S - nx) - 8 * ysq*ysq), Secp256k1.P)
    nz = mod((2 * p_y * p_z), Secp256k1.P)

    return nx, ny, nz
end

function jacobian_add(p, q)
    # """
    # Add two points in elliptic curves

    # :param p: First Point you want to add
    # :param q: Second Point you want to add
    # :return: Point that represents the sum of First and Second Point
    # """
    p_x, p_y, p_z = p
    q_x, q_y, q_z = q

    if p_y == 0
        return q_x, q_y, q_z
    end
    if q_y == 0
        return p_x, p_y, p_z
    end

    U1 = mod((p_x * q_z*q_z), Secp256k1.P)
    U2 = mod((q_x * p_z*p_z), Secp256k1.P)
    S1 = mod((p_y * q_z*q_z*q_z), Secp256k1.P)
    S2 = mod((q_y * p_z*p_z*p_z), Secp256k1.P)

    if U1 == U2
        if S1 != S2
            return uint256"0", uint256"0", uint256"1"
        end
        return jacobian_double(p)
    end

    H = U2 - U1
    R = S2 - S1
    H2 = mod((H * H), Secp256k1.P)
    H3 = mod((H * H2), Secp256k1.P)
    U1H2 = mod((U1 * H2), Secp256k1.P)
    nx = mod((R * R - H3 - 2 * U1H2), Secp256k1.P)
    ny = mod((R * (U1H2 - nx) - S1 * H3), Secp256k1.P)
    nz = mod((H * p_z * q_z), Secp256k1.P)

    return nx, ny, nz
end

function jacobian_multiply(p, n)
    # """
    # Multily point and scalar in elliptic curves
    # :param p: First Point to mutiply
    # :param n: Scalar to mutiply
    # :return: Point that represents the sum of First and Second Point
    # """
    p_x, p_y, p_z = p

    if p_y == 0 || n == 0
        return uint256"0", uint256"0", uint256"1"
    end

    if n == 1
        return p
    end

    if n < 0 || n >= Secp256k1.N
        return jacobian_multiply(p, mod(n, Secp256k1.N))
    end

    if mod(n, 2) == 0
        return jacobian_double(jacobian_multiply(p, div(n, 2)))
    end

    pd = jacobian_double(jacobian_multiply(p, div(n, 2)))
    return jacobian_add(pd, p)
end

function to_jacobian(p)
    p_x, p_y = p
    return p_x, p_y, uint256"1"
end

function _inv(x, n)
    # """
    # Extended Euclidean Algorithm. It's the 'division' in elliptic curves

    # :param x: Divisor
    # :param n: Mod for division
    # :return: Value representing the division
    # """
    if x == 0
        return uint256"0"
    end

    lm  = uint256"1"
    hm  = uint256"0"
    low  = mod(x, n)
    high  = n
    r  = uint256"0"
    while low > 1
        r = div(high, low)
        nm = hm - lm * r
        nw = high - low * r
        high = low
        hm = lm
        low = nw
        lm = nm
    end

    return mod(lm, n)
end

function from_jacobian(p)
    # """
    # Convert point back from Jacobian coordinates

    # :param p: Point in Jacobian coordinates to convert
    # :param P: Prime number in the module of the equation Y^2 = X^3 + A*X + B (mod Secp256k1.p)
    # :return: Point in default coordinates
    # """

    p_x, p_y, p_z = p

    z = _inv(p_z, Secp256k1.P)
    x = mod((p_x * z^2), Secp256k1.P)
    y = mod((p_y * z^3), Secp256k1.P)

    return x, y
end


function der_keys(n)
    pubj = jacobian_multiply(to_jacobian(Secp256k1.G), n)
    return from_jacobian(pubj)
end

end

code


# using Base.Threads
using Base.Iterators
using BenchmarkTools
using BitIntegers

include("./secp256k1.jl")

function gen_keys(k_start, k_num)
    k_end = k_start + k_num - 1
    k_range = k_start:k_end
    data_size = length(k_range)
    keys = zeros(UInt256, data_size, 2)
    Threads.@threads for i in eachindex(k_range)
        k = k_range[i]
        x,y = Secp256k1.der_keys(k)
        keys[i,1] = x
        keys[i,2] = y
    end
    return keys
end

function test_gen_keys_range()
    gen_keys(uint256"123",uint256"100000")
end

test_gen_keys_range()
@time test_gen_keys_range();

benchmark

> julia -t 1 --project main.jl
  0.304315 seconds (10 allocations: 6.104 MiB, 11.04% gc time)
> julia -t 20 --project main.jl
  0.050477 seconds (105 allocations: 6.117 MiB)

The speedup is insane; BitIntegers are wow.
And for the 10000 case (which was 2.096 s and 705.314 ms), we now have

> julia -t 1 --project main.jl
  0.026495 seconds (10 allocations: 625.828 KiB)
> julia -t 20 --project main.jl
  0.006071 seconds (105 allocations: 639.047 KiB)

Just tested: I get the exact same time with fetching. The scaling is not linear on my machine, but it's really memory related (log-range plot).

7 Likes

Even if we assume that parallelization is not obstructed by any realistic factors like memory pressure, the theoretical maximum speedup is sub-linear when you consider that only a fraction of the program is accelerated (Amdahl's law). It's shockingly punishing: a 16x speedup of 95% of the original runtime gives an overall speedup of at most 9.143x.
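For reference, the usual form of Amdahl's law fits in one line, with p the parallelizable fraction and s the speedup of that part (my illustration, not from the post above):

amdahl(p, s) = 1 / ((1 - p) + p / s)

amdahl(0.95, 16)   # ≈ 9.14, the ≤9.143x figure above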

Is there not a library for arbitrary-precision integers that only allocates if the value exceeds a threshold close to Int (I forget the terminology for this kind of data structure)? This wouldn't matter if the values are usually larger, though.

6 Likes

To elaborate on Benny's comment about Amdahl's law: in your benchmarks for test_gen_keys2, the runtime with 1 thread is roughly 30 seconds with a 15% gc time. Assuming the gc is not parallelized at all and the rest is perfectly parallelized, 4 threads would give a theoretical time of 30 * 0.15 + 30 * 0.85 / 4 = 10.875 seconds, which is not too far off from the 13 seconds you get. Similarly, 16 threads would give a theoretical time of 30 * 0.15 + 30 * 0.85 / 16 = 6.09375 seconds, which again is not too far off from your 9.5 seconds.
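The same estimate as a tiny helper, under the stated assumption that the GC part stays serial and everything else scales perfectly (my sketch):

predicted(t1, gc_frac, nt) = t1 * gc_frac + t1 * (1 - gc_frac) / nt

predicted(30, 0.15, 4)    # ≈ 10.9 s
predicted(30, 0.15, 16)   # ≈ 6.1 s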

One alternative to BitIntegers.jl could be to use BigInt together with MutableArithmetics.jl to reduce the number of allocations; that would, however, likely require rather intrusive changes to the code.
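As a rough idea of what that would look like (a hedged sketch on my part; I'm assuming MutableArithmetics' maybe-mutating add!! here, and the real code would need far more surgery):

import MutableArithmetics as MA

function sum_inplace(n)
    acc = BigInt(0)
    for i in 1:n
        acc = MA.add!!(acc, i)   # may update acc in place instead of allocating a new BigInt
    end
    return acc
end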

The fmpz type in the FLINT library works like this. I don't think this can be fully leveraged in Julia, however. It would likely require some sort of finalizer to check whether an allocation was made or not, and finalizers can only be attached to mutable types, which are tracked by the gc (in principle one could possibly insert the finalizers inline with some escape analysis; I think Julia can do something like that, but I don't know how flexible it is).
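For what it's worth, the mutable-wrapper-plus-finalizer shape described here would look roughly like this (entirely hypothetical type and field names, just to show where the finalizer attaches):

mutable struct SmallOrBig
    small::Int
    big::Union{Nothing, BigInt}   # only populated once the value overflows Int
    function SmallOrBig(i::Int)
        x = new(i, nothing)
        # stand-in for "release the external bignum if one was ever allocated"
        finalizer(y -> (y.big = nothing), x)
        return x
    end
end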

6 Likes

Looks more or less like how Nemo.jl wraps FLINT’s fmpz: Nemo.jl/src/flint/FlintTypes.jl at master · wbhart/Nemo.jl · GitHub

I was imagining something more on the Julia side, which wouldn’t need finalizers to manage memory allocated in another language:

struct MaybeBig{I<:Integer}
    inline::I # except typemax as overflow flag
    overflow::BigInt # must trail for incomplete new #36789
    MaybeBig{I}(i::I) where I = new(i)
    MaybeBig{I}(i::BigInt) where I = new(typemax(I), i)
end
MaybeBig(i::I) where I = MaybeBig{I}(i)
# then all the Number arithmetic methods

except I’m not actually sure if the incomplete initialization manages zero allocation, hopefully someone can explain if I’m just benchmarking wrong:

julia> @btime MaybeBig($1) # allocates?
  5.500 ns (1 allocation: 32 bytes)
MaybeBig{Int64}(1, #undef)

julia> Base.one(::Type{MaybeBig{I}}) where I = MaybeBig{I}(one(I))

julia> @btime ones($(MaybeBig{Int}), $5) # only allocates Array
  126.988 ns (2 allocations: 96 bytes)
5-element Vector{MaybeBig{Int64}}:
 MaybeBig{Int64}(1, #undef)
 MaybeBig{Int64}(1, #undef)
 MaybeBig{Int64}(1, #undef)
 MaybeBig{Int64}(1, #undef)
 MaybeBig{Int64}(1, #undef)
2 Likes

The ones function only allocates one instance of one, so you get one allocation for the array and one for the single allocated element. For example for BigInt we have

julia> xs = ones(BigInt, 2)
2-element Vector{BigInt}:
 1
 1

julia> xs[1] === xs[2]
true

I don’t know if there is any way around the allocation; I think the gc needs to be involved to keep track of the #undef as well.
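As a quick aside from me (not the point of the example above): if independent elements are wanted, a comprehension allocates one BigInt per slot instead of aliasing a single instance.

julia> ys = [BigInt(1) for _ in 1:2];

julia> ys[1] === ys[2]
false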

2 Likes

Good catch on the fill-ing: I used 5 assuming each element would add an allocation, but there are actually 2+ allocations for Arrays wrapping Memory buffers in v1.11+, which suggests to me that ones(MaybeBig{Int}, ...) isn’t allocating the one instance. It seems to be corroborated by benchmarking BigInt arrays:

julia> @btime BigInt($1)
  57.099 ns (2 allocations: 40 bytes)
1

julia> @btime Vector{BigInt}(undef, $2)
  23.015 ns (2 allocations: 80 bytes)
2-element Vector{BigInt}:
 #undef
 #undef

julia> @btime Vector{BigInt}(undef, $100); # still 2 allocations for 100 undef
  65.164 ns (2 allocations: 928 bytes)

julia> @btime ones(BigInt, $2) # 1 BigInt filling 1 2-Array adds up
  82.070 ns (4 allocations: 120 bytes)
2-element Vector{BigInt}:
 1
 1

Ah, good point! I forgot about the double allocation from Memory. It seems like the allocation is avoided due to constant folding, so for example

julia> five() = MaybeBig{Int}(5)
five (generic function with 1 method)

julia> @btime five()
  0.894 ns (0 allocations: 0 bytes)
MaybeBig{Int64}(5, #undef)

doesn’t allocate either. It is still unclear to me if the allocation could be avoided in the general case, or if it relies upon the value being preallocated by the compiler. But at least the situation is better than I thought!
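One way to rule constant folding in or out (a standard BenchmarkTools trick, just a suggestion) is to route the value through a Ref so the compiler can't see it as a literal:

julia> @btime MaybeBig{Int}($(Ref(5))[])   # forces the argument to be a runtime value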

2 Likes

This is a beautiful demonstration of my perspective on optimization hunting in Julia. The satisfaction of taking this (relatively) huge, allocation-heavy, memory-intensive operation and boiling it down to 10 allocations is just :chefs_kiss:

6 Likes

And the 100X speed up is pleasant too :grin:

2 Likes

No difference.

That 4.4x is nice, though not the ideal 20x. But why do you have any allocations at all? And why more with threading? That could be your limiter. I'm not sure, but Bumper.jl could help; though I think that once you switch to a bits type it wouldn't be needed, since values are already deallocated early. With Bumper you can allocate in a hot loop and it gets freed immediately, stack-like, which is likely why Python is fast with its RC-GC, but this should be even faster. Bumper seems to support threads, but I've not looked closely at it. BigInt and BigFloat are performance killers; both have better alternatives (in most cases, but since they're there, they're the go-to for many people), as you've discovered, which is why I want them gone from standard Julia…
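The reason the bits-type route sidesteps the GC entirely is that BitIntegers' UInt256 is an isbits type, so values live on the stack or in registers and never become individual heap objects. A quick way to check (my aside, not from the post above):

using BitIntegers

isbitstype(UInt256)   # true: invisible to the GC
isbitstype(BigInt)    # false: every BigInt is an individually tracked heap object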

@yolhan_mannes, this is a bit strange:

> julia -t 1 --project main.jl
  0.304315 seconds (10 allocations: 6.104 MiB, 11.04% gc time)

That GC time doesn't happen with 20 threads. It also seems to vary between runs, even with just 1 thread and the same amount of constant overhead (10 allocations).

[A recent change added 1 interactive thread as the new default, but for current versions you may want to get in the habit of -t 20,1 or -t auto,1. I doubt it matters in this case, though.]

2 Likes

The allocations are only coming from the original array that is created and from the fact that a fetch will always infer to Any, which also explains why the count grows with the number of threads. The threads are limited by memory bandwidth; Julia only tells you about heap allocations, not stack ones, and uint256 takes quite a bit of stack space, which explains why threads start to wait sometimes and performance stops improving. You can try running the code on a faster machine than mine (I have 8 GB of RAM) and you will get much better results.

Bumper may get rid of the Any inference, but it won't make my RAM bigger, since the serial version of the code (without the macro) doesn't allocate at all.

Also, sometimes you actually do want a BigInt, I think, because you can use it like a Ref and write code as if it were a 1-element array, which is harder but may be even better than my version with BitIntegers. The heap is not always the enemy; it can reduce stack pressure a lot when used well.
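If the Any coming out of fetch ever becomes the bottleneck, a small tweak (a sketch on my side, not tested against the code above) is to give the task vector a concrete type and assert the fetched result:

tasks = Task[]   # instead of tasks = [], so the vector has a concrete element type
# push!(tasks, Threads.@spawn ...) exactly as before, then:
results = [fetch(t)::Tuple{UInt256, Int} for t in tasks]   # element type here is an example; use whatever the task returns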

Benny
Guys. Please. I have no idea why some of you are gaslighting me.
I have been coding for more than 20 years on different platforms with different languages; I know how thread scaling and computation work. And what I am seeing here is not how it works.

You could have just written from the beginning that BigInt in Julia is very slow and that thread scaling for it is broken.

Here is a related issue from 2013 (sic!): BigInt and BigFloat performance is poor · Issue #4176 · JuliaLang/julia · GitHub

And a simple demonstration of how broken it is: BigInt(1) == 0 is orders of magnitude slower than iszero(BigInt(1)).
It's insane to have such an inconsistency and not optimize it.
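For anyone who wants to reproduce that comparison, it is only two lines with BenchmarkTools (a sketch; absolute timings will of course depend on the machine):

using BenchmarkTools

@btime BigInt(1) == 0
@btime iszero(BigInt(1))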

To summarize: here is a simple test of simple computations, without recursion, arrays, or anything funny (the computations are not equal between Int and BigInt; they are not meant to compare relative performance, only to create load).

using Base.Threads
using Base.Iterators


function gen_bigint(k_r)
    a = BigInt(0)
    m = BigInt(1234567)
    for i in k_r
        a = mod(BigInt(i)+a, m) + a
    end
    return a,Threads.threadid()
end

function test_bigint(it)
    nth = Threads.nthreads()
    part_size = div(it, nth)
    tasks = []
    for k_r in partition(1:it, part_size)
        t() = gen_bigint(k_r)
        push!(tasks, Threads.@spawn t())
    end

    results = fetch.(tasks)
    return results
end


function gen_int(k_r)
    a = 0
    for i in k_r
        a = i+a^2 - div(a,2)
        a = mod(a, 666)
    end
    return a,Threads.threadid()
end

function test_int(it)
    nth = Threads.nthreads()
    part_size = div(it, nth)
    tasks = []
    for k_r in partition(1:it, part_size)
        t() = gen_int(k_r)
        push!(tasks, Threads.@spawn t())
    end

    results = fetch.(tasks)
    return results
end

nth = Threads.nthreads()
ngc = Threads.ngcthreads() 
num_semples = 100000000

println("Threads num $nth")
println("GC threads num $ngc")
println("Num samples $num_semples")


println("test_bigint:")
test_bigint(1000)
@time results = test_bigint(num_semples)
println("results: $results")

println("test_int:")
test_int(1000)
@time results = test_int(10000000000)
println("results: $results")

Here are the full results for 1- to 16-thread scaling:

Summary
JULIA_NUM_THREADS=1 julia test_bigint_threads.jl
Threads num 1
GC threads num 1
Num samples 100000000
test_bigint:
 28.750614 seconds (800.00 M allocations: 16.391 GiB, 26.91% gc time)
results: Tuple{BigInt, Int64}[(61075029928496, 1)]
test_int:
 58.943936 seconds (17 allocations: 1024 bytes)
results: [(118, 1)]
 
JULIA_NUM_THREADS=2 julia test_bigint_threads.jl
Threads num 2
GC threads num 1
Num samples 100000000
test_bigint:
 32.696664 seconds (800.00 M allocations: 16.391 GiB, 26.58% gc time)
results: Tuple{BigInt, Int64}[(30537504084030, 1), (30537506905054, 2)]
test_int:
 30.375354 seconds (23 allocations: 1.547 KiB)
results: [(10, 1), (118, 2)]
 
JULIA_NUM_THREADS=3 julia test_bigint_threads.jl
Threads num 3
GC threads num 1
Num samples 100000000
test_bigint:
 27.283528 seconds (800.00 M allocations: 16.391 GiB, 37.36% gc time)
results: Tuple{BigInt, Int64}[(20358321455018, 1), (20242228944234, 2), (20525274413684, 3), (73, 1)]
test_int:
 20.396126 seconds (35 allocations: 2.641 KiB)
results: [(291, 1), (251, 2), (104, 3), (10, 2)]
 
JULIA_NUM_THREADS=4 julia test_bigint_threads.jl
Threads num 4
GC threads num 2
Num samples 100000000
test_bigint:
 23.888192 seconds (800.00 M allocations: 16.391 GiB, 42.34% gc time)
results: Tuple{BigInt, Int64}[(15268747548911, 1), (15372187425966, 2), (15268757083666, 3), (15448385793992, 4)]
test_int:
 15.385837 seconds (35 allocations: 2.641 KiB)
results: [(655, 1), (10, 2), (618, 3), (118, 4)]
 
JULIA_NUM_THREADS=5 julia test_bigint_threads.jl
Threads num 5
GC threads num 2
Num samples 100000000
test_bigint:
 21.201918 seconds (800.00 M allocations: 16.391 GiB, 45.35% gc time)
results: Tuple{BigInt, Int64}[(12214994302332, 1), (12167108529529, 2), (12167100499553, 3), (12467594719346, 5), (12424022523641, 4)]
test_int:
 12.579772 seconds (41 allocations: 3.188 KiB)
results: [(392, 1), (624, 3), (251, 2), (454, 4), (118, 5)]
 
JULIA_NUM_THREADS=6 julia test_bigint_threads.jl
Threads num 6
GC threads num 3
Num samples 100000000
test_bigint:
 20.473163 seconds (800.00 M allocations: 16.391 GiB, 46.51% gc time)
results: Tuple{BigInt, Int64}[(10179169253058, 1), (10302597595199, 2), (10262591488746, 4), (10262649550577, 5), (10280752043680, 3), (10396906317057, 6), (1061, 2)]
test_int:
 10.784252 seconds (53 allocations: 4.281 KiB)
results: [(559, 1), (392, 4), (434, 6), (624, 2), (10, 5), (251, 3), (37, 3)]
 
JULIA_NUM_THREADS=7 julia test_bigint_threads.jl
Threads num 7
GC threads num 3
Num samples 100000000
test_bigint:
 20.080860 seconds (800.00 M allocations: 16.391 GiB, 46.96% gc time)
results: Tuple{BigInt, Int64}[(8725000975428, 2), (8752963738503, 3), (8725043826323, 1), (8939618798850, 4), (8482380681554, 6), (8883664159759, 5), (8790325776816, 7), (217, 4)]
test_int:
  9.833347 seconds (59 allocations: 4.828 KiB)
results: [(470, 1), (35, 5), (82, 2), (236, 6), (481, 3), (302, 7), (251, 4), (37, 2)]
 
JULIA_NUM_THREADS=8 julia test_bigint_threads.jl
Threads num 8
GC threads num 4
Num samples 100000000
test_bigint:
 20.573446 seconds (800.00 M allocations: 16.391 GiB, 47.62% gc time)
results: Tuple{BigInt, Int64}[(7634363638259, 2), (7735101959209, 3), (7686099648729, 1), (7707850373019, 4), (7634357956960, 5), (7716029625162, 6), (7724218630957, 7), (7729639737538, 8)]
test_int:
  8.708045 seconds (59 allocations: 4.828 KiB)
results: [(529, 2), (655, 4), (645, 3), (10, 1), (627, 5), (618, 6), (40, 8), (118, 7)]
 
JULIA_NUM_THREADS=9 julia test_bigint_threads.jl
Threads num 9
GC threads num 4
Num samples 100000000
test_bigint:
 20.085430 seconds (800.00 M allocations: 16.391 GiB, 47.25% gc time)
results: Tuple{BigInt, Int64}[(6786098342200, 4), (6858742337621, 1), (7035301071999, 2), (6747408293575, 3), (6730499267950, 5), (6892572827717, 7), (6841716913293, 6), (6846612810022, 8), (6998996253406, 9), (73, 4)]
test_int:
  8.987191 seconds (72 allocations: 6.281 KiB)
results: [(20, 3), (392, 4), (291, 2), (624, 5), (125, 7), (251, 1), (279, 6), (454, 9), (104, 8), (10, 8)]
 
JULIA_NUM_THREADS=10 julia test_bigint_threads.jl
Threads num 10
GC threads num 5
Num samples 100000000
test_bigint:
 19.751144 seconds (800.00 M allocations: 16.391 GiB, 48.99% gc time)
results: Tuple{BigInt, Int64}[(6107484437655, 2), (6068298338367, 4), (6083539359841, 6), (6135824798717, 3), (6083535005291, 5), (6107499477187, 7), (6233789800311, 1), (6090096369906, 8), (6212027436953, 10), (6111842616762, 9)]
test_int:
  8.062407 seconds (72 allocations: 6.281 KiB)
results: [(559, 3), (392, 2), (434, 6), (624, 5), (10, 1), (251, 8), (1, 9), (454, 10), (585, 4), (118, 7)]
 
JULIA_NUM_THREADS=11 julia test_bigint_threads.jl
Threads num 11
GC threads num 5
Num samples 100000000
test_bigint:
 19.044380 seconds (800.00 M allocations: 16.391 GiB, 48.80% gc time)
results: Tuple{BigInt, Int64}[(5552264329166, 2), (5599801681625, 4), (5607735359291, 5), (5564132037542, 7), (5556241035402, 3), (5738354832830, 6), (5536444219238, 1), (5581969227487, 8), (5552274537560, 10), (5649247656400, 11), (5601761621436, 9), (73, 6)]
test_int:
  7.425482 seconds (84 allocations: 7.375 KiB)
results: [(23, 2), (554, 4), (586, 5), (55, 3), (336, 7), (73, 9), (266, 8), (446, 1), (143, 10), (307, 11), (104, 6), (10, 4)]
 
JULIA_NUM_THREADS=12 julia test_bigint_threads.jl
Threads num 12
GC threads num 6
Num samples 100000000
test_bigint:
 18.786027 seconds (800.00 M allocations: 16.391 GiB, 48.78% gc time)
results: Tuple{BigInt, Int64}[(5089577758368, 2), (5105908164086, 7), (5151311388087, 5), (5089588764364, 3), (5131296929334, 8), (5171266823968, 6), (5131321962125, 4), (5109561037625, 10), (5140344519212, 1), (5095056451393, 9), (5198439879982, 12), (5334533161236, 11), (1061, 8)]
test_int:
  6.808045 seconds (90 allocations: 7.922 KiB)
results: [(613, 6), (559, 5), (590, 8), (392, 4), (215, 2), (434, 7), (297, 3), (624, 9), (338, 1), (10, 10), (348, 12), (251, 11), (37, 3)]
 
JULIA_NUM_THREADS=13 julia test_bigint_threads.jl
Threads num 13
GC threads num 6
Num samples 100000000
test_bigint:
 18.543839 seconds (800.00 M allocations: 16.391 GiB, 48.72% gc time)
results: Tuple{BigInt, Int64}[(4698059727770, 3), (4641177758222, 2), (4816983273133, 5), (4631066829879, 8), (4739957578823, 4), (4771811346967, 9), (4731617019558, 7), (4812017665843, 11), (4691349323106, 6), (4746661098817, 10), (4795195101698, 1), (4703103281310, 12), (4703141023297, 13), (33717, 3)]
test_int:
  6.434850 seconds (96 allocations: 8.469 KiB)
results: [(550, 2), (652, 7), (148, 6), (378, 3), (84, 8), (3, 9), (447, 4), (492, 5), (477, 10), (595, 11), (268, 1), (251, 12), (279, 13), (75, 13)]
 
JULIA_NUM_THREADS=14 julia test_bigint_threads.jl
Threads num 14
GC threads num 7
Num samples 100000000
test_bigint:
 18.876498 seconds (800.00 M allocations: 16.391 GiB, 48.84% gc time)
results: Tuple{BigInt, Int64}[(4362488084766, 2), (4388959700729, 3), (4376464678471, 4), (4476022660834, 5), (4362525243172, 6), (4491583204000, 7), (4469823990974, 8), (4342271028017, 9), (4241171132169, 11), (4454256192151, 10), (4441848824150, 14), (4384284698990, 1), (4395172375712, 13), (4381167478013, 12), (217, 7)]
test_int:
  6.653705 seconds (102 allocations: 9.016 KiB)
results: [(330, 3), (470, 2), (604, 7), (35, 4), (236, 5), (82, 6), (434, 8), (236, 10), (307, 9), (481, 12), (23, 11), (302, 13), (239, 1), (251, 14), (37, 10)]
 
JULIA_NUM_THREADS=15 julia test_bigint_threads.jl
Threads num 15
GC threads num 7
Num samples 100000000
test_bigint:
 19.011303 seconds (800.00 M allocations: 16.391 GiB, 48.50% gc time)
results: Tuple{BigInt, Int64}[(4071653499242, 4), (4022313532934, 6), (4017958718972, 7), (4115187781471, 8), (4112345312084, 5), (4084773231120, 3), (3980240712510, 2), (4123941346023, 9), (4115181597005, 11), (4184914383280, 10), (4128273187734, 12), (3962835775313, 14), (4071671002179, 13), (4090538393487, 1), (4136986011575, 15), (66485, 7)]
test_int:
  6.137018 seconds (108 allocations: 9.562 KiB)
results: [(37, 5), (37, 6), (37, 2), (37, 8), (37, 3), (37, 12), (37, 9), (37, 7), (37, 4), (37, 11), (37, 14), (37, 15), (37, 13), (37, 1), (37, 10), (69, 5)]
 
JULIA_NUM_THREADS=16 julia test_bigint_threads.jl
Threads num 16
GC threads num 8
Num samples 100000000
test_bigint:
 19.553159 seconds (800.00 M allocations: 16.391 GiB, 48.80% gc time)
results: Tuple{BigInt, Int64}[(3817175987606, 3), (3857980142929, 2), (3867553456853, 7), (3857999624246, 10), (3843047030148, 5), (3908379562627, 12), (3853920824529, 11), (3882505235686, 15), (3817186011689, 6), (3826734634273, 14), (3858010401490, 13), (3844412127391, 9), (3862081731701, 4), (3770925489528, 8), (3864811087583, 1), (3807658524291, 16)]
test_int:
  5.784457 seconds (108 allocations: 9.562 KiB)
results: [(121, 4), (529, 10), (233, 9), (655, 6), (554, 3), (645, 8), (1, 7), (10, 2), (541, 11), (627, 5), (254, 13), (618, 12), (273, 14), (40, 1), (236, 15), (118, 16)]

And here is the visualization:


It also looks like there are no loop-unrolling optimizations.
And as for BigInt, considering the low CPU utilization that I see during the BigInt test, it looks like some sort of global lock is happening.
And if access to the GMP functions is wrapped in some sort of global synchronization, then that is very bad design.

2 Likes

Thank you for the help.
I will see if this lib will work for my needs (not only these particular computations).

Just a small remark: we need to use uint512 here, and even wider types for intermediate results. I will see what I can do here. Thank you one more time.

And someone got offended.
I am just asking people not to offend and gaslight me. Zero help was provided by the people I am speaking to, so I am asking them not to act in offensive ways and to be polite.