I have some memory- and CPU-intensive multithreaded image analysis code.
In 1.8.5, on an AWS Linux system with 64 GB of RAM and 8 vCPUs, the code will run in cycles, spiking up to (for one example) 23 GB of allocated RAM and 800% CPU usage, executing in 37 min. Running the GC at the end drops memory usage to 7 GB, and then with malloc_trim, 3.0 GB.
On 1.9-rc1, I observe no more than 600% CPU usage and no more than 7 GB allocated (though a second, non-parallel phase goes up to 14 GB), executing in 33 min. Running the GC at the end drops memory usage to 5.0 GB, and then with malloc_trim, 2.9 GB.
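For reference, the "GC then malloc_trim" measurement above can be reproduced with something like the following sketch (the `malloc_trim` ccall assumes a Linux/glibc system; on other platforms it doesn't exist):

```julia
# Force a full garbage collection, then ask glibc to return any
# freed heap pages back to the OS. Resident memory as reported by
# the kernel only drops after the malloc_trim step.
GC.gc()  # full collection
@static if Sys.islinux()
    # malloc_trim(0): trim as much as possible from the heap top
    ccall(:malloc_trim, Cint, (Cint,), 0)
end
```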
It’s hard to diagnose that remotely without source code. It’s quite possible that a lot of time in 1.8.5 was spent in the kernel for allocations, which is reduced by fewer allocations in 1.9, leading to an overall reduction in CPU usage.
If you were fully utilising your CPUs before and now you're not, then your code is no longer CPU bound - which is good, because it means Julia has generated machine code that uses the CPU more efficiently.
Perhaps it's now memory-bandwidth bound, given how much memory you're using. If you want to go even faster you may need to improve your memory access patterns.
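One common access-pattern fix in Julia (a generic sketch, not specific to your code): Julia arrays are column-major, so the first index should vary fastest. Traversing a matrix with rows in the inner loop walks memory contiguously, which matters a lot once you're bandwidth bound:

```julia
# Column-major-friendly traversal: the inner loop runs over the
# first index, so consecutive iterations touch adjacent memory.
function colmajor_sum(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2)        # columns in the outer loop
        for i in axes(A, 1)    # rows in the inner loop: contiguous
            s += A[i, j]
        end
    end
    return s
end
```

Swapping the two loops gives the same answer but strides through memory, which can be several times slower on large arrays.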
strace doesn't show many significant differences in system calls, except 40% fewer futex calls in 1.9-rc1, 30% of which return an error (the same proportion as in 1.8.5).
You wouldn’t see a difference in syscalls, since that’s the part of the workload that involves interacting with the rest of the world, which presumably hasn’t changed. Julia generating faster machine code could account for work getting done more efficiently and there not being enough of it to saturate more than six cores anymore. How well parallelized is your work?
There's always the option to spin up two tasks that busy-wait if you really want to see those last two cores pegged
Certainly ~25% of the CPU capacity is left on the table, but it may or may not be Julia that's leaving it. The first suspect is the code: if it doesn't expose sufficient parallelism then nothing the language does can fix that. If it does expose sufficient parallelism, then you can start looking at the language.
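To make "expose sufficient parallelism" concrete, here is a hypothetical sketch (the `images` collection and `analyze` function are placeholders): the loop body must consist of independent units of work, and there should be at least `Threads.nthreads()` of them, of roughly similar cost, or some cores will sit idle:

```julia
using Base.Threads

# Each iteration is independent, so @threads can spread them
# across cores. If length(images) < nthreads(), or a few items
# dominate the runtime, CPU usage will stay below 100% * nthreads().
function process_all(images, analyze)
    results = Vector{Any}(undef, length(images))
    @threads for i in eachindex(images)
        results[i] = analyze(images[i])
    end
    return results
end
```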
Any chance you could come up with something that is portable? (I had doubts along similar lines, but failed to create something that others could easily run.)