Very different performance in different computers due to garbage collection?

Again a stupid reply from me… Can you try the same code on your Precision workstation with hyperthreading enabled then disabled?

Thanks! How many threads are you using? Does the problem happen with one thread?
In the desktop profile, we mostly see threads waiting for GC to finish, but it’s not clear whether that’s a cause or an effect. For example, maybe using the CPU pause instruction in jl_safepoint_wait_gc is causing the kernel to de-schedule us for a long time on one system.
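
One way to answer the thread-count question with comparable numbers is to record the thread count and the GC share of the wall time in one go. This is only a sketch: `gc_share` and `alloc_heavy` are hypothetical names, and `alloc_heavy` is a stand-in workload to be replaced with the real function (here `main_mask`). It relies on `@timed` returning the fields `time` and `gctime`.

```julia
# Hypothetical helper: run `f` once and report how much of the wall time was GC.
function gc_share(f)
    stats = @timed f()
    return (threads = Threads.nthreads(),
            seconds = stats.time,
            gc_fraction = stats.gctime / stats.time)
end

# Stand-in workload that allocates many small arrays; replace with main_mask.
alloc_heavy() = sum(sum(rand(100)) for _ in 1:100_000)

gc_share(alloc_heavy)            # warm-up run so compilation is not measured
println(gc_share(alloc_heavy))
```

Running the same script once with `julia -t 1` and once with the usual thread count gives two results that can be compared directly.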


@jeff.bezanson
Hi, so indeed it is a lot faster if I set Julia.NumThreads to 1, and now with the unoptimized code:

@timev(main_mask())
 25.035355 seconds (4.84 M allocations: 6.051 GiB, 73.24% gc time)
elapsed time (ns): 25035355300
gc time (ns):      18335121200
bytes allocated:   6496948256
pool allocs:       4839956
non-pool GC allocs:390
malloc() calls:    392
GC pauses:         56

This is the link to the HTML file with the profiling:

https://drive.google.com/file/d/1-UlvCu-hf8rBQmmemHDZG-DA6hVjgkZH/view?usp=sharing

Now almost all of the time is spent at the function calc_sprawl(...), which is what I would expect, and it’s twice as fast as when using all threads…

@johnh I’m not sure whether I’m allowed to change BIOS settings. I might have to talk to IT about that, and I need the computer for work anyway. But the fact that simply setting Julia to one thread solves the problem suggests that it is indeed a problem with how the OS handles multithreading, right?

You should be able to change BIOS settings by rebooting and pressing F2 during the boot.

On Linux there is a way to simulate HT being off: you take every odd-numbered CPU offline. The command chcpu can do this.
I don’t think there is a method to do this in Windows without a reboot.
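
A rough sketch of how the CPUs to take offline could be picked on Linux. It assumes the kernel exposes `thread_siblings_list` files under `/sys/devices/system/cpu`, and it uses those instead of the odd/even rule because hyperthread siblings are not numbered as adjacent pairs on every machine. The helper names are hypothetical, and the fallback topology is made up for illustration. The script only prints a `chcpu -d` command; it does not run it.

```julia
# Parse a sysfs CPU list such as "0,6" or "2-3" into a vector of CPU ids.
function parse_cpu_list(s::AbstractString)
    ids = Int[]
    for part in split(strip(s), ',')
        isempty(part) && continue
        bounds = parse.(Int, split(part, '-'))
        append!(ids, first(bounds):last(bounds))
    end
    return ids
end

# Keep the lowest id of every sibling group and return the remaining ids.
function smt_siblings_to_disable(sibling_lists)
    extra = Set{Int}()
    for s in sibling_lists
        ids = sort(parse_cpu_list(s))
        union!(extra, ids[2:end])
    end
    return sort(collect(extra))
end

# Read the real topology on Linux; otherwise fall back to a made-up example.
function read_sibling_lists()
    base = "/sys/devices/system/cpu"
    isdir(base) || return ["0,6", "1,7"]   # illustrative fallback only
    lists = String[]
    for d in readdir(base)
        f = joinpath(base, d, "topology", "thread_siblings_list")
        isfile(f) && push!(lists, read(f, String))
    end
    return lists
end

cpus = smt_siblings_to_disable(read_sibling_lists())
if isempty(cpus)
    println("No hyperthread siblings found")
else
    println("sudo chcpu -d ", join(cpus, ','))
end
```

The CPUs can be brought back with `chcpu -e` and the same list.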


Maybe it’s due to heap fragmentation on Windows 10, while on Windows 11 this isn’t happening because the implementation of the Windows allocator has changed? (I don’t know if this is the case.)
I’ve definitely experienced problems with Julia heap fragmentation on Windows 10.

@johnh I deactivated Hyper-Threading, this is the result with the unoptimized code when using 6 cores:

@timev(main_mask())
 23.149365 seconds (4.84 M allocations: 6.051 GiB, 72.40% gc time)
elapsed time (ns): 23149365300
gc time (ns):      16761282800
bytes allocated:   6496948256
pool allocs:       4839956
non-pool GC allocs:390
malloc() calls:    392
GC pauses:         45

It’s not any faster than running on one core, and the GC is still inefficient compared to the other computers I tried, though not nearly as much as when running with HT enabled.
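
For reference, the two @timev reports posted in this thread can be put side by side with a little arithmetic. The figures below are copied from those reports; the names `runs`, `gc_percent` and `ms_per_pause` are only illustrative.

```julia
# Figures copied from the two @timev reports in this thread.
runs = [
    (label = "1 thread",        elapsed_ns = 25035355300, gc_ns = 18335121200, pauses = 56),
    (label = "6 cores, HT off", elapsed_ns = 23149365300, gc_ns = 16761282800, pauses = 45),
]

# Share of the elapsed time spent in GC, in percent.
gc_percent(r) = 100 * r.gc_ns / r.elapsed_ns

# Average length of one GC pause, in milliseconds.
ms_per_pause(r) = r.gc_ns / r.pauses / 1e6

for r in runs
    println(r.label, ": ", round(gc_percent(r); digits = 2), "% GC, ",
            round(ms_per_pause(r); digits = 1), " ms per pause")
end
```

By this arithmetic both runs average roughly a third of a second per pause, so the small difference between them comes from the number of pauses and not from shorter pauses.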

The full profile for this function is here:
https://drive.google.com/file/d/1jXlQ9z-LRqiVWUWV5AUXi5_pxw2DejuT/view?usp=sharing

@steven_sagaert I’m not sure, because (I think I mentioned this at some point) on a third computer, which is running Windows 10 Server (it’s pretty old, but it does have a lot of RAM), the garbage collector behaves more like it does on the laptop.

Hmm, probably not that then, although there might still be a difference between Windows 10 Server and Windows 10 Workstation.

Any update on this?

I experienced a similar problem.

A repository that I run on my computer at home (Intel Core i7-7700 CPU, 4 cores @ 3.6 GHz, 16 GB RAM, Windows 10 Pro) runs 2x-3x faster than on a new computer we got at work (AMD Threadripper 3990X, 16 cores @ 3.5 GHz, 32 GB RAM, also Windows 10).

As in the original post, timing shows major differences in garbage collection - circa 15% gc time on my home computer, compared to 55% on the work computer, despite the work computer having much more RAM.

The repository is very large so I’m working on seeing whether I can come up with a MWE that illustrates the issues.

This problem seems like it might be related to the garbage collection issues discussed in this post: “GC going nuts”

I’m going to see if upgrading to v1.9 and providing a heap size hint that is substantially more than what Julia appears to be using helps with the excess garbage collection time.
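
A minimal sketch of how such an experiment might be scripted, assuming Julia 1.9 or later, where the `--heap-size-hint` command-line flag is available. The helper name `gc_fraction_with_hint`, the `4G` value and the stand-in workload are all placeholders; the real test would load the repository code instead.

```julia
# Hypothetical helper: run a snippet in a fresh Julia process started with a
# heap size hint and return the GC share of its run time as measured by @timed.
function gc_fraction_with_hint(code::AbstractString, hint::AbstractString)
    VERSION >= v"1.9" || error("--heap-size-hint needs Julia 1.9 or later")
    script = "s = @timed begin $code end; print(s.gctime / s.time)"
    cmd = `$(Base.julia_cmd()) --startup-file=no --heap-size-hint=$hint -e $script`
    return parse(Float64, read(cmd, String))
end

# Stand-in allocation-heavy workload; replace with the code under test.
workload = "sum(sum(rand(100)) for _ in 1:100_000)"
println(gc_fraction_with_hint(workload, "4G"))
```

Calling the helper with a few different hint values on the same machine would show whether the GC share reacts to the hint at all. The measured time includes compilation in the child process, so very short workloads will understate the GC share.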