Why does Julia use thousands of CPU-hours to compute 1+2?

I’m fairly new to Julia, and not overly experienced with scaling to huge numbers of processes, so forgive me if I’m missing something obvious.

Alright, to be honest, I actually used a couple thousand CPU-hours to run using DoubleFloats, but it looks like 1+2 would have done the same. From my testing, the time taken to call

@everywhere <basically any fast piece of code>

scales linearly with the number of processes (at least the first time it’s called), and the constant in front of that scaling is ~1 second per process, even for something as simple as

@everywhere 1+2

But that linear wallclock scaling means that the CPU time spent on this call scales quadratically with the number of processes. That ~1 second may not sound like much, but it easily grows to ~1 hour of wallclock time with a few thousand processes, which is thousands of CPU-hours.
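For reference, here is roughly how the scaling can be measured. This is a minimal sketch, not my exact benchmark script; the worker counts are illustrative and timings will vary with hardware:

```julia
# Sketch: time the first @everywhere call for increasing worker counts.
using Distributed

for n in (8, 16, 32, 64)
    addprocs(n)                      # fresh workers, so each call is a "first" call
    t = @elapsed @everywhere 1 + 2   # first call pays the setup cost on every worker
    println("$n workers: $(round(t; digits=1)) s wallclock, ",
            "$(round(t * n; digits=1)) CPU-seconds")
    rmprocs(workers())               # tear everything down before the next size
end
```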

Obviously I mostly run into this with @everywhere using WhateverPackage, where the package can be something as simple as Base. And it makes no difference whether DEPOT_PATH points to a local SSD on each worker or to a slow shared home directory.

I didn’t realize this was going to happen: I saw that my code worked nicely on a few dozen processes, then threw it onto a few thousand (several dozen nodes with 56 cores each, one process per core). It took about an hour to get past the first @everywhere using DoubleFloats, wasting a couple thousand CPU-hours on that single call! I can’t afford to waste this much compute time on every run, so this makes Julia unusable for what I need here. (Making up the difference with threading isn’t really in the cards.)

Am I missing something? Can I prime my workers to handle @everywhere, or distribute this preparation so that the cost doesn’t grow quadratically? Can I somehow use pmap and friends without @everywhere using SomePackage? Or is this just somewhere Julia won’t be able to reach?


See also here and here.

I’ve hacked together a precompilation script here, which produces a sysimage with all the necessary Distributed code precompiled. This drastically reduces the time taken to call @everywhere 1+2: on 100 processes it goes down from ~70 seconds to ~2 seconds. If that scales up to thousands of processes, it’s definitely something I can handle.
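For anyone who wants to try the same idea without digging through the linked script, the core of it can be sketched with PackageCompiler.jl. The file name and workload below are placeholders, not the exact contents of the script:

```julia
# Sketch: build a sysimage with the Distributed machinery precompiled.
# "precompile_distributed.jl" is a hypothetical workload file that
# exercises the code paths we care about, e.g.
#
#     using Distributed
#     addprocs(1)
#     @everywhere 1 + 2
#     rmprocs(workers())
#
using PackageCompiler

create_sysimage([:Distributed];
                sysimage_path = "sys_distributed.so",
                precompile_execution_file = "precompile_distributed.jl")
```

Then start Julia with julia --sysimage sys_distributed.so; the workers need to use it too, which (I believe) can be done with something like addprocs(n; exeflags="--sysimage=sys_distributed.so").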


Can you try reverting RFC: stop generating precompile signatures for Distributed by KristofferC · Pull Request #37816 · JuliaLang/julia · GitHub and see how it changes things?


Hm. Weird results. It is true that the parent (2f6508d) of that PR’s merge commit is significantly faster to call @everywhere 1+2 on 100 processes: 13.7s, as opposed to 52.7s after the merge (a4bfb9ce14).

It’s interesting that the slower time from that old code is still noticeably faster than v1.6.2 without any extra precompilation. But also, if I modify the parent commit to also precompile @everywhere 1+2, it doesn’t improve the time significantly; it’s still nowhere near the ~2 seconds I get from precompiling that on v1.6.2.

However, if I check out current master and try to reverse that patch manually, it doesn’t make any difference at all. I notice that toward the end of the build, it reports

Executing precompile statements... 1209/1242

whether it’s fresh master or I un-comment the code you disabled in that PR. I guess Distributed === nothing at that point in a fresh build nowadays?

It shouldn’t be; did you check it? You can just print it out.

Oh you’re right. It’s definitely something at that point, and precompile_script is definitely getting that new bit of code. Still, it doesn’t change the numbers being reported at the end of the build process.

Thanks for noticing this. Try with ensure that the non-REPL precompile code actually get precompiled by KristofferC · Pull Request #41898 · JuliaLang/julia.



That does seem to do something. With your PR, I see

Executing precompile statements... 1411/1447

which is up from 1368/1404 on the parent commit.

However, if I now re-enable that bit of code you disabled in #37816, it gives me this (complete with ugly formatting):

Generating REPL precompile statements... 31/31
Executing precompile statements... 176/1602┌ Warning: Inactive precompile statement
│   form = Tuple{Distributed.var"#@distributed", LineNumberNode, Module, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any}
└ @ nothing nothing:0
Executing precompile statements... 1564/1602

(If I also add @everywhere 1+2, I get just that same warning, as well as 1588/1626.) I guess that increase means that a lot of what @distributed needs is getting compiled despite the warning. :man_shrugging:

Yeah, warnings like that are not unexpected.
