Threads are not speeding up computation a lot

Hi. I googled and read many things about this topic, but cannot figure up what is the problem, and why I see no meaningful speedup for my code.

My code:

module Keygen
using Base.Threads
using Base.Iterators
include("./secp256k1.jl")

function gen_keys_range(k_range, out_matrix=nothing)
    if isnothing(out_matrix)
        data_size = length(k_range)
        keys = zeros(BigInt, data_size, 2)
    else
        keys = out_matrix
    end
    for (i, k) in enumerate(k_range)
        x,y = Secp256k1.der_keys(k)
        keys[i,1] = x
        keys[i,2] = y
    end
    return keys, k_range
end

function gen_keys_1(k_start, k_num)
    nth = Threads.nthreads()
    keys = zeros(BigInt, k_num, 2)
    k_end = k_start + k_num - 1
    part_size = div(k_num, nth)
    tasks = []
    
    for k_r in partition(k_start:k_end, part_size)
        push!(tasks, Threads.@spawn gen_keys_range(k_r))
    end

    results = fetch.(tasks)
    for res in results
        k, k_r = res
        v_start = k_r.start - k_start + 1
        v_stop = k_r.stop - k_start + 1
        keys[v_start:v_stop,:] = k
    end

    return keys

end

nth = Threads.nthreads()
println("Threads num $nth")

function test_gen_keys_range()
    gen_keys_1(123,10000)
end


start_t = time();
keys = test_gen_keys_range()
stop_t = time();
elapsed_time = stop_t - start_t;
println("Time taken by test_gen_keys_range: ", elapsed_time, " seconds");

end

Results:

JULIA_NUM_THREADS=1 julia keygen.jl
Threads num 1
Time taken by test_gen_keys_range: 1.7290940284729004 seconds
JULIA_NUM_THREADS=16 julia keygen.jl
Threads num 16
Time taken by test_gen_keys_range: 1.2138419151306152 seconds

I tried different approaches, rewrite code few times. Tried

Threads.@threads

And getting same results every time.
I see constant small speed independent on number threads used after 2 threads and more.

(Small constant speedup)
And at this point I am feeling as I am fighting with windmills.
This things should just works.

Ah. I am using Arch Linux and julia 2:1.11.3-2 from repos

well, that is a speedup!

but note that the way you are timing will include the compilation time of test_gen_keys_range(). you can avoid this by calling it once to trigger compilation. there is also the @time macro to make these things easier

I also don’t see proper CPU cores load. It looks always the same.

Yes. I tried this also.
I removed code duplicates for simplicity.
I see no meaningful change in results.
I also testing with 1000_000 keys so most time is going to actual computing - results the same only 100x longer (130 seconds)

is it possible this module contains some global lock that’s forcing single threading?

if you provide a complete, copy-pasteable example it will be easier to help. you may also be interested in using GitHub - JuliaFolds2/OhMyThreads.jl: Simple multithreading in julia which simplifies many multithreading problems.

Welcome!

Your “inner kernel” — that gen_keys_range function is doing lots and lots of allocations. Both BigInts themselves and the matrix of them. You’re likely limited by memory bandwidth and/or the GC.

Threads aren’t guaranteed to give speedups — it’ll only speed things up if you’re fully CPU-constrained. In many cases there are other things going on — memory, allocations, communication overheads, etc, etc.

15 Likes

here is an attempt :


using Base.Threads
using Base.Iterators
using BenchmarkTools
# include("./secp256k1.jl")

function gen_keys_range(k_range, out_matrix=nothing)
    if isnothing(out_matrix)
        data_size = length(k_range)
        keys = zeros(BigInt, data_size, 2)
    else
        keys = out_matrix
    end
    for (i, k) in enumerate(k_range)
        x,y = big(1), big(1)
        keys[i,1] = x
        keys[i,2] = y
    end
    return keys, k_range
end

function gen_keys_1(k_start, k_num)
    nth = Threads.nthreads()
    keys = zeros(BigInt, k_num, 2)
    k_end = k_start + k_num - 1
    part_size = div(k_num, nth)
    tasks = [Threads.@spawn(gen_keys_range(k_r)) for k_r in partition(k_start:k_end, part_size)]
    results = fetch.(tasks)
    for res in results
        k, k_r = res
        v_start = k_r.start - k_start + 1
        v_stop = k_r.stop - k_start + 1
        @views keys[v_start:v_stop,:] = k
    end

    return keys

end

nth = Threads.nthreads()
println("Threads num $nth")

function test_gen_keys_range()
    gen_keys_1(123,10000)
end


@btime test_gen_keys_range()
""

giving

> julia -t 1 --project main.jl
Threads num 1
  655.200 ÎĽs (40045 allocations: 1.07 MiB)
> julia -t auto --project main.jl
Threads num 20
  256.800 ÎĽs (40649 allocations: 1.09 MiB)

I don’t have your file so I just made two bigint for the test, results are stable after 4 threads because of memory

I tried to pass views of preallocated array - same result. I go to allocate partial result because I thought that problem may be in synchronization of something.

Thank you very much for looking into this.

The secp256k1.jl is pure Julia python code rewrite, its quite simple in nature.
Code under the spoiler:

Summary
module Secp256k1

const A = BigInt(0x0000000000000000000000000000000000000000000000000000000000000000)
const B = BigInt(0x0000000000000000000000000000000000000000000000000000000000000007)
const P = BigInt(0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f)
const N = BigInt(0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141)
const Gx = BigInt(0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798)
const Gy = BigInt(0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8)

const G = (Gx, Gy)

function reminder(a, b) 
    r = a - (a // b) * b
    return r
end

function jacobian_double(p)
    # """
    # Double a point in elliptic curves
    # :param p: Point you want to double
    # :return: Point that represents the sum of First and Second Point
    # """
    
    #:param P: Prime number in the module of the equation Y^2 = X^3 + A*X + B (mod p)
    #:param A: Coefficient of the first-order term of the equation Y^2 = X^3 + A*X + B (mod p)
    p_x, p_y, p_z = p


    if p_y == 0
        return 0, 0, 0
    end

    ysq = mod((p_y^2), Secp256k1.P)

    S = mod((4 * p_x * ysq), Secp256k1.P)
    # M = (3 * p_x ** 2 + A * p_z ** 4) % cls.P
    # A is zero
    M = mod((3 * p_x^2), Secp256k1.P)
    nx = mod((M^2 - 2 * S), Secp256k1.P)
    ny = mod((M * (S - nx) - 8 * ysq^2), Secp256k1.P)
    nz = mod((2 * p_y * p_z), Secp256k1.P)

    return nx, ny, nz
end

function jacobian_add(p, q)
    # """
    # Add two points in elliptic curves

    # :param p: First Point you want to add
    # :param q: Second Point you want to add
    # :return: Point that represents the sum of First and Second Point
    # """
    p_x, p_y, p_z = p
    q_x, q_y, q_z = q

    if p_y == 0
        return q_x, q_y, q_z
    end
    if q_y == 0
        return p_x, p_y, p_z
    end

    U1 = mod((p_x * q_z^2), Secp256k1.P)
    U2 = mod((q_x * p_z^2), Secp256k1.P)
    S1 = mod((p_y * q_z^3), Secp256k1.P)
    S2 = mod((q_y * p_z^3), Secp256k1.P)

    if U1 == U2
        if S1 != S2
            return 0, 0, 1
        end
        return jacobian_double(p)
    end

    H = U2 - U1
    R = S2 - S1
    H2 = mod((H * H), Secp256k1.P)
    H3 = mod((H * H2), Secp256k1.P)
    U1H2 = mod((U1 * H2), Secp256k1.P)
    nx = mod((R^2 - H3 - 2 * U1H2), Secp256k1.P)
    ny = mod((R * (U1H2 - nx) - S1 * H3), Secp256k1.P)
    nz = mod((H * p_z * q_z), Secp256k1.P)

    return nx, ny, nz
end

function jacobian_multiply(p, n)
    # """
    # Multily point and scalar in elliptic curves
    # :param p: First Point to mutiply
    # :param n: Scalar to mutiply
    # :return: Point that represents the sum of First and Second Point
    # """
    p_x, p_y, p_z = p

    if p_y == 0 || n == 0
        return 0, 0, 1
    end

    if n == 1
        return p
    end

    if n < 0 || n >= Secp256k1.N
        return jacobian_multiply(p, mod(n, Secp256k1.N))
    end

    if mod(n, 2) == 0
        return jacobian_double(jacobian_multiply(p, div(n, 2)))
    end

    pd = jacobian_double(jacobian_multiply(p, div(n, 2)))
    return jacobian_add(pd, p)
end

function to_jacobian(p)
    p_x, p_y = p
    return p_x, p_y, 1
end

function _inv(x, n)
    # """
    # Extended Euclidean Algorithm. It's the 'division' in elliptic curves

    # :param x: Divisor
    # :param n: Mod for division
    # :return: Value representing the division
    # """
    if x == 0
        return BigInt(0)
    end

    lm = 1
    hm = 0
    low = mod(x, n)
    high = n

    while low > 1
        r = div(high, low)
        nm = hm - lm * r
        nw = high - low * r
        high = low
        hm = lm
        low = nw
        lm = nm
    end

    return mod(lm, n)
end

function from_jacobian(p)
    # """
    # Convert point back from Jacobian coordinates

    # :param p: First Point you want to add
    # :param P: Prime number in the module of the equation Y^2 = X^3 + A*X + B (mod p)
    # :return: Point in default coordinates
    # """

    p_x, p_y, p_z = p

    z = _inv(p_z, Secp256k1.P)
    x = mod((p_x * z^2), Secp256k1.P)
    y = mod((p_y * z^3), Secp256k1.P)

    return x, y
end


function der_keys(n)
    pubj = jacobian_multiply(to_jacobian(Secp256k1.G), n)
    return from_jacobian(pubj)
end

end

My guess was that maybe variables in module makes it syncronize something, so I made them constants.

the only performance issue i see is r = a - (a // b) * b creating rationals for no reason do r = a - div(a , b) * b. If this is called a lot its bad. with your file and the fix I told you I get with 20 threads
705.314 ms (38421571 allocations: 916.12 MiB) while with 1 thread 2.096 s (38420967 allocations: 916.10 MiB)

1 Like

Oh, I missed this piece of code completely when doing rewrite.
I will check.

(I am in a process of testing your code now)

yes @btime will run the function a couple of time to avoid timing compilation if you want to time it replace it with @time

Ah yes, this function never got used by anything. It can be deleted.

I made some changes to code. New version:

module Keygen
using Base.Threads
using Base.Iterators
include("./secp256k1.jl")

function gen_keys_range(k_range, out_matrix=nothing)
    if isnothing(out_matrix)
        data_size = length(k_range)
        keys = zeros(BigInt, data_size, 2)
    else
        keys = out_matrix
    end
    for (i, k) in enumerate(k_range)
        x,y = Secp256k1.der_keys(k)
        keys[i,1] = x
        keys[i,2] = y
    end
    return keys, k_range
end

function gen_keys_0(k_start, k_num)
    keys = zeros(BigInt, k_num, 2)
    nth = Threads.nthreads()
    k_end = k_start + k_num - 1
    part_size = div(k_num, nth)
    tasks = []
    for k_r in partition(k_start:k_end, part_size)
        v_start = k_r.start - k_start + 1
        v_stop = k_r.stop - k_start + 1
        @views keys_view = keys[v_start:v_stop, :]
        push!(tasks, Threads.@spawn gen_keys_range(k_r, keys_view))
    end

    wait.(tasks)

    return keys

end

function gen_keys_1(k_start, k_num)
    nth = Threads.nthreads()
    keys = zeros(BigInt, k_num, 2)
    k_end = k_start + k_num - 1
    part_size = div(k_num, nth)
    tasks = []

    for k_r in partition(k_start:k_end, part_size)
        push!(tasks, Threads.@spawn gen_keys_range(k_r))
    end

    results = fetch.(tasks)
    for res in results
        k, k_r = res
        v_start = k_r.start - k_start + 1
        v_stop = k_r.stop - k_start + 1
        keys[v_start:v_stop,:] = k
    end

    return keys

end

function gen_keys_2(k_start, k_num)
    nth = Threads.nthreads()
    keys = zeros(BigInt, k_num, 2)
    k_end = k_start + k_num - 1
    part_size = div(k_num, nth)
    tasks = [Threads.@spawn(gen_keys_range(k_r)) for k_r in partition(k_start:k_end, part_size)]
    results = fetch.(tasks)
    for res in results
        k, k_r = res
        v_start = k_r.start - k_start + 1
        v_stop = k_r.stop - k_start + 1
        @views keys[v_start:v_stop,:] = k
    end

    return keys

end

nth = Threads.nthreads()
println("Threads num $nth")


function test_gen_keys0()
    gen_keys_0(123,100000)
end

test_gen_keys0()
@time test_gen_keys0()

function test_gen_keys1()
    gen_keys_1(123,100000)
end

test_gen_keys1()
@time test_gen_keys1()

function test_gen_keys2()
    gen_keys_2(123,100000)
end

test_gen_keys2()
@time test_gen_keys2()

end

Results:

Threads num 1
 19.447613 seconds (337.12 M allocations: 8.414 GiB, 20.14% gc time)
 24.110456 seconds (337.12 M allocations: 8.416 GiB, 17.54% gc time)
 29.227764 seconds (337.12 M allocations: 8.416 GiB, 15.77% gc time)

Threads num 4
 12.691396 seconds (337.12 M allocations: 8.414 GiB, 36.10% gc time)
 11.250642 seconds (337.12 M allocations: 8.416 GiB, 40.23% gc time)
 13.027368 seconds (337.12 M allocations: 8.416 GiB, 37.02% gc time)

Threads num 16
  9.462348 seconds (337.12 M allocations: 8.414 GiB, 40.88% gc time)
  9.219033 seconds (337.12 M allocations: 8.416 GiB, 42.11% gc time)
  9.425437 seconds (337.12 M allocations: 8.416 GiB, 42.99% gc time)

Whatever is happening here, I still expect linear speedup. Even if 42% is GC time, it still must be 8x at least.

Is there a way I can tune GC for such workload?

1 Like

King of aren’t guaranteed, but with code without side effects it usually is.
For some reason I am thinking that equivalent python code will scale in linear manner.

Can you suggest the way to debug this issue and find what is preventing my code from scaling? Any GC tuning?

The issue with gc here is actually that we convert Int to BigInt a little bit everywhere and that’s why its such a mess in allocation, I’m looking on how to avoid that but it’s quite hard, can you make a Int version → paralel → bigint form because we could make the Int version non-allocated any memory, then the paralel form will be sure to only allocated for thread spawn and then only we could look for bigint case. I say that because profiling shows that _inv is the function taking most time and its flamegraph is just convert on convert btime Secp256k1._inv($aa,$aa)
205.812 ns (12 allocations: 240 bytes)

2 Likes

Ok. I will try to make pure int and pure BigInt version.
Will report back when finish.
(but without BigInt its king of useless for me anyway)

i know that don’t worry but I strugle a lot to make it non-allocating with them, and I could see where we should cache BigInts to obtimise (kinda like programing in Vlang)

To speed this up, you might want to try using GitHub - rfourquet/BitIntegers.jl: Fixed-width integers similar to builtin ones. It provides 256 and 512 bit integers which can be significantly faster than arbitrary precision ones.

7 Likes