It’s great to see those various plots. As far as I can tell from reading the Julia GC runtime, all the large arrays are essentially allocated by a call to the standard system aligned malloc, so we may well be measuring the behavior of the default system malloc: how it ultimately uses syscalls to get memory, how aggressively it frees memory back to the OS vs. keeping it in a pool, and so on.
I assume these performance oddities only happen for newly allocated memory?
That’s interesting; I’ve always wondered why allocation introduces so much variance into timings. What does the distribution of times look like (in terms of min, max, median, and quartiles, for example)? The minimum time could be quite misleading for allocation-heavy workloads.
Comparing the slopes of those two Windows vs. Ubuntu graphs, it seems absurd that Windows takes 0.18 ms/MiB whereas Ubuntu takes 0.07 ms/MiB on the same machine.
Doing a bit more reading, I think we are mostly measuring the cost of OS page-fault handling here. Allocating without touching the memory is fast because the OS doesn’t have to commit physical pages; but the first time the process iterates over a newly allocated array (to read or write it), a burst of page faults is generated — soft faults, since no disk I/O is involved, but each one still traps into the kernel. Whether the memory is truly new to the process is invisible to Julia, because it uses the system malloc, which may choose to reuse recently freed blocks. Here’s an interesting blog post about the hidden costs of memory allocation. Interestingly, they say 1 MB is indeed the threshold at which the Windows allocator decides to allocate directly from the OS via a syscall.
The page-fault mechanism also explains why allocation introduces so much variance into timing measurements: each fault traps into the kernel, giving the OS an excuse for a context switch.