Again a stupid reply from me… Can you try the same code on your Precision workstation with hyperthreading enabled then disabled?
Thanks! How many threads are you using? Does the problem happen with one thread?
In the desktop profile, we mostly see threads waiting for GC to finish, but it’s not clear whether that’s a cause or an effect. For example, maybe using the CPU pause instruction in jl_safepoint_wait_gc is causing the kernel to de-schedule us for a long time on one system.
Hi, so indeed it is a lot faster if I set Julia.NumThreads to 1. Now, with the unoptimized code:
@timev(main_mask())
 25.035355 seconds (4.84 M allocations: 6.051 GiB, 73.24% gc time)
elapsed time (ns):  25035355300
gc time (ns):       18335121200
bytes allocated:    6496948256
pool allocs:        4839956
non-pool GC allocs: 390
malloc() calls:     392
GC pauses:          56
This is the link to the HTML file with the profiling:
Now almost all of the time is spent in the function calc_sprawl(...), which is what I would expect, and it’s twice as fast as when using all threads…
@johnh I’m not sure if I’m allowed to change BIOS settings. I might have to talk with IT for that, and I need the computer for work anyway. But the fact that simply setting Julia to one thread solves the problem suggests that it really is some problem with how the OS is handling multithreading, right?
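To compare thread counts on equal footing, a small script like the one below can be run once per setting, for example with julia --threads 1 and then julia --threads auto. It is only a sketch: alloc_heavy is a hypothetical stand-in for main_mask(), which is not shown in this thread, and the sizes are arbitrary.

```julia
# Print the threading setup, then time an allocation-heavy loop and
# report how much of the wall time was spent in the garbage collector.
println("Julia threads:     ", Threads.nthreads())
println("Logical CPUs seen: ", Sys.CPU_THREADS)

# Hypothetical workload (NOT the main_mask() from this thread):
# allocates many short-lived vectors so that the GC has work to do.
function alloc_heavy(n)
    total = 0.0
    for _ in 1:n
        v = rand(1_000)      # fresh 8 kB allocation on every iteration
        total += sum(v)
    end
    return total
end

alloc_heavy(10)                          # warm-up run so compilation is not timed
stats = @timed alloc_heavy(20_000)       # NamedTuple with time, bytes, gctime, ...

gc_share = 100 * stats.gctime / stats.time
println("elapsed (s):     ", round(stats.time; digits = 3))
println("bytes allocated: ", stats.bytes)
println("gc share (%):    ", round(gc_share; digits = 2))
```

If the GC share rises sharply as the thread count goes up while the workload stays the same, that points towards threads waiting on each other at GC safepoints rather than towards the computation itself.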
You should be able to change BIOS settings by rebooting and pressing F2 during the boot.
On Linux there is a way to simulate HT being off: you set every odd-numbered CPU offline, which the chcpu command can do. This assumes sibling hyperthreads are numbered as adjacent pairs; lscpu -e shows the actual layout.
I don’t think there is a method to do this in Windows without a reboot.
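Since the numbering of sibling hyperthreads differs between machines, here is a rough sketch of how the CPU list for chcpu could be worked out from sysfs instead of assuming "every odd CPU". The helper names are made up for illustration, the script is Linux-only, and it only prints the command rather than running it.

```julia
# Parse a Linux CPU list such as "0,6" or "0-1" into a sorted vector of CPU ids.
function parse_cpu_list(s::AbstractString)
    cpus = Int[]
    for part in split(strip(s), ',')
        isempty(part) && continue
        bounds = split(part, '-')
        lo = parse(Int, bounds[1])
        hi = parse(Int, bounds[end])   # same as lo when there is no range
        append!(cpus, lo:hi)
    end
    return sort!(unique!(cpus))
end

# Keep the lowest-numbered hardware thread of every core and return
# all other siblings, i.e. the CPUs that would be taken offline.
function ht_siblings_to_disable(sibling_lists)
    to_disable = Int[]
    for s in sibling_lists
        sibs = parse_cpu_list(s)
        append!(to_disable, sibs[2:end])
    end
    return sort!(unique!(to_disable))
end

# Collect thread_siblings_list for every cpuN directory under sysfs.
# Returns an empty list when the directory does not exist (e.g. on Windows).
function read_sibling_lists(root = "/sys/devices/system/cpu")
    lists = String[]
    isdir(root) || return lists
    for entry in readdir(root)
        occursin(r"^cpu\d+$", entry) || continue
        path = joinpath(root, entry, "topology", "thread_siblings_list")
        isfile(path) && push!(lists, read(path, String))
    end
    return lists
end

cpus = ht_siblings_to_disable(read_sibling_lists())
if isempty(cpus)
    println("No sibling hyperthreads found (HT already off, or not Linux).")
else
    # Only print the command; running it needs root privileges.
    println("sudo chcpu -d ", join(cpus, ','))
end
```

The same list passed to chcpu -e brings the CPUs back online afterwards.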
Maybe it’s due to heap fragmentation on Windows 10, while on Windows 11 this isn’t happening because the implementation of the Windows allocator has changed? (I don’t know if this is the case.)
I’ve definitely experienced problems with Julia heap fragmentation on Windows 10.
@johnh I deactivated Hyper-Threading, this is the result with the unoptimized code when using 6 cores:
@timev(main_mask())
 23.149365 seconds (4.84 M allocations: 6.051 GiB, 72.40% gc time)
elapsed time (ns):  23149365300
gc time (ns):       16761282800
bytes allocated:    6496948256
pool allocs:        4839956
non-pool GC allocs: 390
malloc() calls:     392
GC pauses:          45
It’s not any faster than running on one core, and the GC is still inefficient compared to the other computers I tried, but not nearly as bad as when running with HT.
The full profile for this function is here:
@steven_sagaert I’m not sure, because (I think I mentioned this at some point) on a third computer, which is running Windows 10 Server (it’s pretty old, though it does have a lot of RAM), the garbage collector behaves more like on the laptop.
Hmm, probably not that then, although there might still be a difference between Windows 10 Server and Windows 10 Workstation.