Threads are not speeding up computation a lot

Benny
Guys. Please. I have no Idea why some of you gazlighting me.
I am coding for more than 20 years on different platforms with different languages, I know how threads scaling and computation work. And what I am seeing here is not how it works.

You could have just written it from the beginning that BigInt in Julia is very slow and thread scaling for them is broken.

Here is related issue from 2013(sic!) BigInt and BigFloat performance is poor · Issue #4176 · JuliaLang/julia · GitHub

And simple demonstration of how broken it is - BigInt(1) == 0 is orders of magnitude slower than iszero(BigInt(1))
Its insane to have such inconsistency and not optimize it.

To summarize this. There is a simple test of simple computations without recursion array or something funny (computations are not equal between int and BigInt, they not to compare relative performance but to create load)

using Base.Threads
using Base.Iterators


function gen_bigint(k_r)
    a = BigInt(0)
    m = BigInt(1234567)
    for i in k_r
        a = mod(BigInt(i)+a, m) + a
    end
    return a,Threads.threadid()
end

function test_bigint(it)
    nth = Threads.nthreads()
    part_size = div(it, nth)
    tasks = []
    for k_r in partition(1:it, part_size)
        t() = gen_bigint(k_r)
        push!(tasks, Threads.@spawn t())
    end

    results = fetch.(tasks)
    return results
end


function gen_int(k_r)
    a = 0
    for i in k_r
        a = i+a^2 - div(a,2)
        a = mod(a, 666)
    end
    return a,Threads.threadid()
end

function test_int(it)
    nth = Threads.nthreads()
    part_size = div(it, nth)
    tasks = []
    for k_r in partition(1:it, part_size)
        t() = gen_int(k_r)
        push!(tasks, Threads.@spawn t())
    end

    results = fetch.(tasks)
    return results
end

nth = Threads.nthreads()
ngc = Threads.ngcthreads() 
num_semples = 100000000

println("Threads num $nth")
println("GC threads num $ngc")
println("Num samples $num_semples")


println("test_bigint:")
test_bigint(1000)
@time results = test_bigint(num_semples)
println("results: $results")

println("test_int:")
test_int(1000)
@time results = test_int(10000000000)
println("results: $results")

Here is full results for 1 to 16 thread scaling:

Summary
JULIA_NUM_THREADS=1 julia test_bigint_threads.jl
Threads num 1
GC threads num 1
Num samples 100000000
test_bigint:
 28.750614 seconds (800.00 M allocations: 16.391 GiB, 26.91% gc time)
results: Tuple{BigInt, Int64}[(61075029928496, 1)]
test_int:
 58.943936 seconds (17 allocations: 1024 bytes)
results: [(118, 1)]
 
JULIA_NUM_THREADS=2 julia test_bigint_threads.jl
Threads num 2
GC threads num 1
Num samples 100000000
test_bigint:
 32.696664 seconds (800.00 M allocations: 16.391 GiB, 26.58% gc time)
results: Tuple{BigInt, Int64}[(30537504084030, 1), (30537506905054, 2)]
test_int:
 30.375354 seconds (23 allocations: 1.547 KiB)
results: [(10, 1), (118, 2)]
 
JULIA_NUM_THREADS=3 julia test_bigint_threads.jl
Threads num 3
GC threads num 1
Num samples 100000000
test_bigint:
 27.283528 seconds (800.00 M allocations: 16.391 GiB, 37.36% gc time)
results: Tuple{BigInt, Int64}[(20358321455018, 1), (20242228944234, 2), (20525274413684, 3), (73, 1)]
test_int:
 20.396126 seconds (35 allocations: 2.641 KiB)
results: [(291, 1), (251, 2), (104, 3), (10, 2)]
 
JULIA_NUM_THREADS=4 julia test_bigint_threads.jl
Threads num 4
GC threads num 2
Num samples 100000000
test_bigint:
 23.888192 seconds (800.00 M allocations: 16.391 GiB, 42.34% gc time)
results: Tuple{BigInt, Int64}[(15268747548911, 1), (15372187425966, 2), (15268757083666, 3), (15448385793992, 4)]
test_int:
 15.385837 seconds (35 allocations: 2.641 KiB)
results: [(655, 1), (10, 2), (618, 3), (118, 4)]
 
JULIA_NUM_THREADS=5 julia test_bigint_threads.jl
Threads num 5
GC threads num 2
Num samples 100000000
test_bigint:
 21.201918 seconds (800.00 M allocations: 16.391 GiB, 45.35% gc time)
results: Tuple{BigInt, Int64}[(12214994302332, 1), (12167108529529, 2), (12167100499553, 3), (12467594719346, 5), (12424022523641, 4)]
test_int:
 12.579772 seconds (41 allocations: 3.188 KiB)
results: [(392, 1), (624, 3), (251, 2), (454, 4), (118, 5)]
 
JULIA_NUM_THREADS=6 julia test_bigint_threads.jl
Threads num 6
GC threads num 3
Num samples 100000000
test_bigint:
 20.473163 seconds (800.00 M allocations: 16.391 GiB, 46.51% gc time)
results: Tuple{BigInt, Int64}[(10179169253058, 1), (10302597595199, 2), (10262591488746, 4), (10262649550577, 5), (10280752043680, 3), (10396906317057, 6), (1061, 2)]
test_int:
 10.784252 seconds (53 allocations: 4.281 KiB)
results: [(559, 1), (392, 4), (434, 6), (624, 2), (10, 5), (251, 3), (37, 3)]
 
JULIA_NUM_THREADS=7 julia test_bigint_threads.jl
Threads num 7
GC threads num 3
Num samples 100000000
test_bigint:
 20.080860 seconds (800.00 M allocations: 16.391 GiB, 46.96% gc time)
results: Tuple{BigInt, Int64}[(8725000975428, 2), (8752963738503, 3), (8725043826323, 1), (8939618798850, 4), (8482380681554, 6), (8883664159759, 5), (8790325776816, 7), (217, 4)]
test_int:
  9.833347 seconds (59 allocations: 4.828 KiB)
results: [(470, 1), (35, 5), (82, 2), (236, 6), (481, 3), (302, 7), (251, 4), (37, 2)]
 
JULIA_NUM_THREADS=8 julia test_bigint_threads.jl
Threads num 8
GC threads num 4
Num samples 100000000
test_bigint:
 20.573446 seconds (800.00 M allocations: 16.391 GiB, 47.62% gc time)
results: Tuple{BigInt, Int64}[(7634363638259, 2), (7735101959209, 3), (7686099648729, 1), (7707850373019, 4), (7634357956960, 5), (7716029625162, 6), (7724218630957, 7), (7729639737538, 8)]
test_int:
  8.708045 seconds (59 allocations: 4.828 KiB)
results: [(529, 2), (655, 4), (645, 3), (10, 1), (627, 5), (618, 6), (40, 8), (118, 7)]
 
JULIA_NUM_THREADS=9 julia test_bigint_threads.jl
Threads num 9
GC threads num 4
Num samples 100000000
test_bigint:
 20.085430 seconds (800.00 M allocations: 16.391 GiB, 47.25% gc time)
results: Tuple{BigInt, Int64}[(6786098342200, 4), (6858742337621, 1), (7035301071999, 2), (6747408293575, 3), (6730499267950, 5), (6892572827717, 7), (6841716913293, 6), (6846612810022, 8), (6998996253406, 9), (73, 4)]
test_int:
  8.987191 seconds (72 allocations: 6.281 KiB)
results: [(20, 3), (392, 4), (291, 2), (624, 5), (125, 7), (251, 1), (279, 6), (454, 9), (104, 8), (10, 8)]
 
JULIA_NUM_THREADS=10 julia test_bigint_threads.jl
Threads num 10
GC threads num 5
Num samples 100000000
test_bigint:
 19.751144 seconds (800.00 M allocations: 16.391 GiB, 48.99% gc time)
results: Tuple{BigInt, Int64}[(6107484437655, 2), (6068298338367, 4), (6083539359841, 6), (6135824798717, 3), (6083535005291, 5), (6107499477187, 7), (6233789800311, 1), (6090096369906, 8), (6212027436953, 10), (6111842616762, 9)]
test_int:
  8.062407 seconds (72 allocations: 6.281 KiB)
results: [(559, 3), (392, 2), (434, 6), (624, 5), (10, 1), (251, 8), (1, 9), (454, 10), (585, 4), (118, 7)]
 
JULIA_NUM_THREADS=11 julia test_bigint_threads.jl
Threads num 11
GC threads num 5
Num samples 100000000
test_bigint:
 19.044380 seconds (800.00 M allocations: 16.391 GiB, 48.80% gc time)
results: Tuple{BigInt, Int64}[(5552264329166, 2), (5599801681625, 4), (5607735359291, 5), (5564132037542, 7), (5556241035402, 3), (5738354832830, 6), (5536444219238, 1), (5581969227487, 8), (5552274537560, 10), (5649247656400, 11), (5601761621436, 9), (73, 6)]
test_int:
  7.425482 seconds (84 allocations: 7.375 KiB)
results: [(23, 2), (554, 4), (586, 5), (55, 3), (336, 7), (73, 9), (266, 8), (446, 1), (143, 10), (307, 11), (104, 6), (10, 4)]
 
JULIA_NUM_THREADS=12 julia test_bigint_threads.jl
Threads num 12
GC threads num 6
Num samples 100000000
test_bigint:
 18.786027 seconds (800.00 M allocations: 16.391 GiB, 48.78% gc time)
results: Tuple{BigInt, Int64}[(5089577758368, 2), (5105908164086, 7), (5151311388087, 5), (5089588764364, 3), (5131296929334, 8), (5171266823968, 6), (5131321962125, 4), (5109561037625, 10), (5140344519212, 1), (5095056451393, 9), (5198439879982, 12), (5334533161236, 11), (1061, 8)]
test_int:
  6.808045 seconds (90 allocations: 7.922 KiB)
results: [(613, 6), (559, 5), (590, 8), (392, 4), (215, 2), (434, 7), (297, 3), (624, 9), (338, 1), (10, 10), (348, 12), (251, 11), (37, 3)]
 
JULIA_NUM_THREADS=13 julia test_bigint_threads.jl
Threads num 13
GC threads num 6
Num samples 100000000
test_bigint:
 18.543839 seconds (800.00 M allocations: 16.391 GiB, 48.72% gc time)
results: Tuple{BigInt, Int64}[(4698059727770, 3), (4641177758222, 2), (4816983273133, 5), (4631066829879, 8), (4739957578823, 4), (4771811346967, 9), (4731617019558, 7), (4812017665843, 11), (4691349323106, 6), (4746661098817, 10), (4795195101698, 1), (4703103281310, 12), (4703141023297, 13), (33717, 3)]
test_int:
  6.434850 seconds (96 allocations: 8.469 KiB)
results: [(550, 2), (652, 7), (148, 6), (378, 3), (84, 8), (3, 9), (447, 4), (492, 5), (477, 10), (595, 11), (268, 1), (251, 12), (279, 13), (75, 13)]
 
JULIA_NUM_THREADS=14 julia test_bigint_threads.jl
Threads num 14
GC threads num 7
Num samples 100000000
test_bigint:
 18.876498 seconds (800.00 M allocations: 16.391 GiB, 48.84% gc time)
results: Tuple{BigInt, Int64}[(4362488084766, 2), (4388959700729, 3), (4376464678471, 4), (4476022660834, 5), (4362525243172, 6), (4491583204000, 7), (4469823990974, 8), (4342271028017, 9), (4241171132169, 11), (4454256192151, 10), (4441848824150, 14), (4384284698990, 1), (4395172375712, 13), (4381167478013, 12), (217, 7)]
test_int:
  6.653705 seconds (102 allocations: 9.016 KiB)
results: [(330, 3), (470, 2), (604, 7), (35, 4), (236, 5), (82, 6), (434, 8), (236, 10), (307, 9), (481, 12), (23, 11), (302, 13), (239, 1), (251, 14), (37, 10)]
 
JULIA_NUM_THREADS=15 julia test_bigint_threads.jl
Threads num 15
GC threads num 7
Num samples 100000000
test_bigint:
 19.011303 seconds (800.00 M allocations: 16.391 GiB, 48.50% gc time)
results: Tuple{BigInt, Int64}[(4071653499242, 4), (4022313532934, 6), (4017958718972, 7), (4115187781471, 8), (4112345312084, 5), (4084773231120, 3), (3980240712510, 2), (4123941346023, 9), (4115181597005, 11), (4184914383280, 10), (4128273187734, 12), (3962835775313, 14), (4071671002179, 13), (4090538393487, 1), (4136986011575, 15), (66485, 7)]
test_int:
  6.137018 seconds (108 allocations: 9.562 KiB)
results: [(37, 5), (37, 6), (37, 2), (37, 8), (37, 3), (37, 12), (37, 9), (37, 7), (37, 4), (37, 11), (37, 14), (37, 15), (37, 13), (37, 1), (37, 10), (69, 5)]
 
JULIA_NUM_THREADS=16 julia test_bigint_threads.jl
Threads num 16
GC threads num 8
Num samples 100000000
test_bigint:
 19.553159 seconds (800.00 M allocations: 16.391 GiB, 48.80% gc time)
results: Tuple{BigInt, Int64}[(3817175987606, 3), (3857980142929, 2), (3867553456853, 7), (3857999624246, 10), (3843047030148, 5), (3908379562627, 12), (3853920824529, 11), (3882505235686, 15), (3817186011689, 6), (3826734634273, 14), (3858010401490, 13), (3844412127391, 9), (3862081731701, 4), (3770925489528, 8), (3864811087583, 1), (3807658524291, 16)]
test_int:
  5.784457 seconds (108 allocations: 9.562 KiB)
results: [(121, 4), (529, 10), (233, 9), (655, 6), (554, 3), (645, 8), (1, 7), (10, 2), (541, 11), (627, 5), (254, 13), (618, 12), (273, 14), (40, 1), (236, 15), (118, 16)]

And here is visualization:


It also looks like there is no loop unroll optimizations.
And as for BigInt, it looks like, considering low CPU utilization that I see during BigInt test, that some sort of global lock is happening.
And if, access to gmp functions is wrapped in some sort of global synchronization, then it is very bad design.

2 Likes