Memory usage and performance differences in 1.9-rc1 vs. 1.8.5

I have some memory- and CPU-intensive multithreaded image analysis code.

In 1.8.5, on an AWS Linux system with 64 GB of RAM and 8 vCPUs, the code runs in cycles, spiking up to (in one example) 23 GB of allocated RAM and 800% CPU usage, and completes in 37 min. Running the GC at the end drops memory usage to 7 GB, and a subsequent malloc_trim brings it down to 3.0 GB.

On 1.9-rc1, I observe no more than 600% CPU usage and no more than 7 GB allocated (a second, non-parallel phase goes up to 14 GB), and the run completes in 33 min. Running the GC at the end drops memory usage to 5.0 GB, and a subsequent malloc_trim brings it down to 2.9 GB.
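
For reference, the post-run numbers above come from forcing a full garbage collection and then asking the allocator to return freed pages to the OS. A minimal sketch of that sequence (assuming a glibc-based Linux system, which is what malloc_trim requires):

```julia
# Force a full collection, then ask glibc to hand freed pages back to the OS.
GC.gc(true)                                # full GC
ccall(:malloc_trim, Cint, (Csize_t,), 0)   # int malloc_trim(size_t pad)
```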

This may be related to @Paul_Soderlind's observation here: 1.9-rc1 and threads

7 Likes

So by absolutely every metric, Julia 1.9 is better? That’s great.

9 Likes

Why isn’t multithreaded code saturating the CPUs anymore though?

1 Like

I wonder about this as well. While the result is faster overall, it seems that Julia 1.9 still leaves 25% of the CPU capacity on the table in this example.

1 Like

It’s hard to diagnose that remotely without source code. It’s quite possible that in 1.8.5 a lot of time was spent in the kernel on allocations, and that 1.9 allocates less, leading to an overall reduction in CPU usage.
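
A cheap way to test that hypothesis is to time the hot phase on both versions with @time and compare the reported allocation counts and GC percentages. A hypothetical sketch (work is just an allocation-heavy stand-in for the real analysis):

```julia
# Hypothetical allocation-heavy stand-in for the real workload.
work(n) = sum(sum(rand(1024)) for _ in 1:n)

# @time reports total allocations and the fraction of time spent in GC;
# comparing these between 1.8.5 and 1.9-rc1 shows whether reduced
# allocator/GC work accounts for the lower CPU usage.
@time work(100_000)
```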

4 Likes

If you were fully utilising your CPUs before and now you’re not, then your code is no longer CPU bound, which is good: it means Julia has generated machine code that uses the CPU more efficiently.

Perhaps it’s now memory-bandwidth bound, given how much memory you’re using. If you want to go even faster, you may need to improve your memory access patterns.
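
As a generic illustration (not based on the code in question): Julia arrays are column-major, so an inner loop that walks down a column touches memory contiguously, while the reversed loop order strides through memory and becomes bandwidth-limited on large arrays:

```julia
# Inner loop varies the row index, so accesses are contiguous (column-major).
function sum_colmajor(A)
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

# Same arithmetic, but the inner loop strides across a row of the matrix,
# which is much slower once A no longer fits in cache.
function sum_rowmajor(A)
    s = zero(eltype(A))
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end

A = rand(10_000, 10_000)
@time sum_colmajor(A)
@time sum_rowmajor(A)
```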

6 Likes

strace doesn’t show many significant differences in system calls, except for 40% fewer futex calls in 1.9-rc1, 30% of which return errors (the same proportion as in 1.8.5).

What change in Julia could explain a switch from CPU-bound to memory-bound? julia/NEWS.md at v1.9.0-rc1 · JuliaLang/julia · GitHub

You wouldn’t expect to see a difference in syscalls, since those cover the part of the workload that interacts with the rest of the world, which presumably hasn’t changed. Julia generating faster machine code could account for the work getting done more efficiently, so that there is no longer enough of it to saturate more than six cores. How well parallelized is your work?

There’s always the option to spin up two tasks that busy wait if you really want to see those last two cores pegged :grin:
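
(For completeness, and purely tongue-in-cheek, two such busy-wait tasks, assuming Julia was started with spare threads:)

```julia
# Two do-nothing busy loops, spawned on spare threads solely to make the
# CPU meter read 800% again. They never terminate.
busy() = while true end
t1 = Threads.@spawn busy()
t2 = Threads.@spawn busy()
```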

1 Like

Certainly 25% of the CPU is left on the table, but it may or may not be Julia that’s leaving it. The first suspect is the code: if it doesn’t expose sufficient parallelism, then nothing the language does can fix that. If it does expose sufficient parallelism, then you can start looking at the language.

1 Like

Thanks all, I’ll take a look deeper into the code.

Any chance you could come up with a portable reproducer? (I had doubts along similar lines, but failed to create something that others could easily run.)

I’ve been profiling to see what’s going on.

Looks like array copying (inside collect) is faster, sorting (inside median) is faster, imfilter is faster.

However, I now see a substantial amount of time spent in runtime dispatch to the StaticArray constructor.

EDIT: And the allocation profiler shows that StaticArray is allocating Core.SimpleVectors… wat. Issue here: 1.9-rc1 regression - `StaticArray` allocates `Core.SimpleVector` objects, runtime dispatch · Issue #49145 · JuliaLang/julia · GitHub
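
For anyone who wants to reproduce this kind of check, the allocation profiler has been available since Julia 1.8. A minimal sketch, with my_workload standing in for the actual analysis code (results can then be visualized with PProf.jl):

```julia
using Profile

Profile.Allocs.clear()
# Record every allocation (sample_rate=1) while running the workload;
# my_workload() is a hypothetical stand-in for the real code.
Profile.Allocs.@profile sample_rate=1 my_workload()

results = Profile.Allocs.fetch()   # allocation events, including the allocated type
# using PProf; PProf.Allocs.pprof()   # optional: browse allocation sites as a flame graph
```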

4 Likes

It turns out that in 1.9-rc1 there’s some sort of optimization failure when an array length is passed as a type parameter.
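
A hypothetical minimal example of that pattern (not the code from the issue): when the length is only known at runtime, the SVector type parameter can’t be inferred, so the constructor call goes through runtime dispatch:

```julia
using StaticArrays

# `n` is a runtime value, so `SVector{n}` is not a compile-time constant
# and the constructor call is resolved via runtime dispatch.
function to_svector_dynamic(v::Vector{Float64})
    n = length(v)
    return SVector{n}(v...)
end

# Passing the length as a compile-time constant (here via Val) lets the
# compiler specialize and avoids the dynamic dispatch.
to_svector_static(v, ::Val{N}) where {N} = SVector{N}(ntuple(i -> v[i], Val(N)))
```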

2 Likes