I have some memory- and CPU-intensive multithreaded image analysis code.
In 1.8.5, on an AWS Linux system with 64 GB of RAM and 8 vCPUs, the code will run in cycles, spiking up to (for one example) 23 GB of allocated RAM and 800% CPU usage, executing in 37 min. Running the GC at the end drops memory usage to 7 GB, and then with malloc_trim, 3.0 GB.
On 1.9-rc1, I observe no more than 600% CPU usage and no more than 7 GB allocated (though a second, non-parallel phase goes up to 14 GB), executing in 33 min. Running the GC at the end drops memory usage to 5.0 GB, and then with malloc_trim, 2.9 GB.
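For reference, the "GC then malloc_trim" measurement above can be reproduced with something like the following sketch (the `malloc_trim` ccall assumes a Linux/glibc system; on other platforms it doesn't exist):

```julia
# Force a full garbage collection, then ask glibc to return any
# freed heap pages back to the OS. Resident memory as reported by
# the kernel only drops after the malloc_trim step.
GC.gc()  # full collection
@static if Sys.islinux()
    # malloc_trim(0): trim as much as possible from the heap top
    ccall(:malloc_trim, Cint, (Cint,), 0)
end
```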
It’s hard to diagnose that remotely without source code. It’s quite possible that a lot of time in 1.8.5 was spent in the kernel for allocations, which is reduced by fewer allocations in 1.9, leading to an overall reduction in CPU usage.
If you were fully utilising your CPUs before and now you're not, then your code is no longer CPU bound - which is good, because it means Julia has generated machine code that uses the CPU more efficiently.
Perhaps it's now memory-bandwidth bound, given how much memory you're using. If you want to go even faster you may need to improve your memory access patterns.
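One common access-pattern fix in Julia (a generic sketch, not specific to your code): Julia arrays are column-major, so the first index should vary fastest. Traversing a matrix with rows in the inner loop walks memory contiguously, which matters a lot once you're bandwidth bound:

```julia
# Column-major-friendly traversal: the inner loop runs over the
# first index, so consecutive iterations touch adjacent memory.
function colmajor_sum(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2)        # columns in the outer loop
        for i in axes(A, 1)    # rows in the inner loop: contiguous
            s += A[i, j]
        end
    end
    return s
end
```

Swapping the two loops gives the same answer but strides through memory, which can be several times slower on large arrays.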
strace doesn't show many significant differences in system calls, except 40% fewer futex calls in 1.9-rc1, 30% of which return an error (the same proportion as in 1.8.5).
You wouldn’t see a difference in syscalls, since that’s the part of the workload that involves interacting with the rest of the world, which presumably hasn’t changed. Julia generating faster machine code could account for work getting done more efficiently and there not being enough of it to saturate more than six cores anymore. How well parallelized is your work?
There's always the option to spin up two tasks that busy-wait if you really want to see those last two cores pegged
Certainly ~25% of the CPU capacity is left on the table, but it may or may not be Julia that's leaving it. The first suspect is the code: if it doesn't expose sufficient parallelism then nothing the language does can fix that. If it does expose sufficient parallelism, then you can start looking at the language.
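To make "expose sufficient parallelism" concrete, here is a hypothetical sketch (the `images` collection and `analyze` function are placeholders): the loop body must consist of independent units of work, and there should be at least `Threads.nthreads()` of them, of roughly similar cost, or some cores will sit idle:

```julia
using Base.Threads

# Each iteration is independent, so @threads can spread them
# across cores. If length(images) < nthreads(), or a few items
# dominate the runtime, CPU usage will stay below 100% * nthreads().
function process_all(images, analyze)
    results = Vector{Any}(undef, length(images))
    @threads for i in eachindex(images)
        results[i] = analyze(images[i])
    end
    return results
end
```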
Any chance you could come up with something that is portable? (I had doubts along similar lines, but failed to create something that others could easily run.)