Julia slower on Windows

Hello,

I’m running this simple piece of code on two different computers:

using BenchmarkTools

function testing()
    rand(1000000)
end

@btime testing()
  • HP EliteBook 840 G3. i5-6300U @ 2.4 GHz. 16 GB DDR4. Windows 10.
4.048 ms (2 allocations: 7.63 MiB)
  • MacBook Air, Mid 2013. i5-4250U @ 1.3 GHz. 8 GB DDR. macOS Mojave 10.14.3.
1.973 ms (2 allocations: 7.63 MiB)

I was expecting the HP to be much faster. Can anyone tell me how this is possible? Is there something wrong with Windows performance? How can I fix this?

Thank you,

2 Likes

Memory caching, perhaps? It’s much slower on Windows than on Mac for me too, and it’s not related to rand but rather to memory writes in general. On my Windows laptop, something seems to happen right around 1 MB:

julia> @btime zeros(130_000);
  62.768 μs (2 allocations: 1015.70 KiB)

julia> @btime zeros(131_000);
  341.468 μs (2 allocations: 1023.52 KiB)

On my Mac, I don’t see such sudden degradation of performance.

It’s quite an artificial test though. I normally wouldn’t expect Windows to be any slower than Mac. Do you have a less artificial test that also shows a similar performance difference between Windows and Mac?

3 Likes

This does seem a bit odd at face value (very crudely, userbenchmark suggests a +30% difference in favor of the HP). Are these the exact same versions of Julia on each machine? I assume you have 64-bit Windows?

Keep in mind here that you might be mostly benchmarking the random number library; digging in a bit using Cthulhu.jl and @descend testing(), I see that Julia is currently using DSFMT as the Mersenne Twister implementation in this case. This means the speed will depend on the C compilation flags which were used to build DSFMT, not so much on the Julia JIT. It might be that the DSFMT build is faster on your MacBook Air because the compiler flags were more specialized to the architecture there.
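One way to separate the RNG cost from the allocation cost (just a sketch, not a definitive diagnosis) is to benchmark rand! into a preallocated buffer:

using BenchmarkTools, Random

buf = Vector{Float64}(undef, 1_000_000)
@btime rand!($buf);       # RNG (DSFMT) cost only, buffer already allocated
@btime rand(1_000_000);   # RNG cost plus a fresh 7.63 MiB allocation, as in the original test

If the gap between the two machines persists for rand! alone, the RNG build is a likely suspect; if it only shows up for the allocating version, that points at memory allocation instead.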

Having written the above, I just saw the reply by @bennedich, so perhaps it’s not related to the build at all. That @btime zeros result is very odd though; I’m not sure what to make of it.

1 Like

Is this memory writes or allocation? I do wonder whether the Windows allocator is just falling off a performance cliff above your threshold there. 300 μs seems quite excessive though.

PS: There are definitely some performance cliffs in the Vista allocator; see Memory Allocator Benchmarks, for example.

Yes, I’m using 1.1.0 on both computers, and 64-bit Windows.

@bennedich I observe the same jump!

HP:

julia> @btime zeros(130000)
  102.154 μs (2 allocations: 1015.70 KiB)

julia> @btime zeros(131000)
  457.025 μs (2 allocations: 1023.52 KiB)

MacBook Air:

julia> @btime zeros(130000)
  91.101 μs (2 allocations: 1015.70 KiB)

julia> @btime zeros(131000)
  93.157 μs (2 allocations: 1023.52 KiB)

What would be a less artificial test?

I don’t quite see it for allocations on my Windows laptop:

julia> @btime Vector{Float64}(undef, 130_000);
  4.191 μs (2 allocations: 1015.70 KiB)

julia> @btime Vector{Float64}(undef, 131_000);
  4.671 μs (2 allocations: 1023.52 KiB)

System:

julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)
1 Like

And a bizarre huge jump:

julia> @btime Vector{Float64}(undef, 130_000);
  114.000 ns (2 allocations: 1015.70 KiB)

julia> @btime Vector{Float64}(undef, 131_000);
  3.721 μs (2 allocations: 1023.52 KiB)

System:

Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)

A real computation you want to do in practice is always a good test :wink:

Failing that I guess you could run the official julia microbenchmarks.

2 Likes

I noticed these performance differences when running a set of much more elaborate nested functions. That’s when I decided to test something simpler, like the rand example.

I’ll try microbenchmarks. Thanks!

Oh right, it’s good to have some context. Then I’d suggest profiling the larger computation and pulling out the hot functions for individual attention. Perhaps this is what led you to rand() in the first place? In that case, it’s time to investigate why rand is slow, which might very well be the DSFMT compile options or something weird with memory access.
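Something like the following, where my_computation is just a hypothetical stand-in for your nested functions:

using Profile

my_computation() = sum(sum(rand(10^6)) for _ in 1:100)   # stand-in for the real workload

my_computation()                               # run once so compilation isn't profiled
@profile my_computation()                      # collect samples
Profile.print(format=:flat, sortedby=:count)   # hottest functions first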

It’s worth noting that the seemingly arbitrary numbers 130_000 and 131_000 give arrays whose sizes straddle the 2^20-byte boundary (1 MiB). Though your HP has a 3 MiB L3 cache and 512 KiB of L2, which suggests this isn’t processor related. In addition, bandwidth to main memory for the i5-6300U is supposedly a max of 31 GiB/s, which works out to ~30 μs for 1 MiB (is this symmetrical for read/write?). That suggests something really wrong is happening if it’s taking 300 μs just to allocate an array of zeros of about 1 MiB, even forgetting about the cache.
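(Back-of-the-envelope for that figure:)

2^20 / (31 * 2^30) * 1e6   # μs to stream 1 MiB at 31 GiB/s ≈ 31.5 μs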

If you want to keep digging into the weirdness with zeros, I’d suggest you benchmark fill! separately from the allocation:

@btime fill!($(zeros(131_000)), 0.0);
2 Likes

It seems OS related. Running the script below in Julia 1.1 on the same system booted in Windows 10 and Ubuntu illustrates a performance cliff on Windows right around 1 MB, not present on Ubuntu.

Interestingly, I don’t see a performance cliff when measuring average times; both the script below and @btime measure the minimum time over many invocations (a mean-based variant is sketched just after the script).

using Plots

benchmark_op(n) = zeros(n * 1024)

sizes = []
times = []
for n = round.(Int, 1.1 .^(24:73))'
    t = minimum(1:1000 .|>_-> @elapsed benchmark_op(n))  # best of 1000 runs, in seconds
    append!(sizes, n*sizeof(Float64))                    # n*1024 elements * 8 bytes = n*8 KiB
    append!(times, t * 1e6)                              # convert to microseconds
end

plot(sizes, times; title="Elapsed time for zeros(n)",
    xlabel="vector size (KiB)", ylabel="microseconds", legend=:none)

Windows on the left, Ubuntu on the right:

[plot images]

6 Likes

There are many cases where the OS doesn’t commit the allocation until the space is actually used, so these kinds of artificial tests don’t say much about what would happen in a real situation.

1 Like

I agree with this when it comes to allocating (and that taking the minimum of 1000 invocations is a terrible estimator of actual memory I/O performance), but wouldn’t calling zeros use the space?
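A quick way to check on a given machine (a sketch; whether the cliff shows up may also depend on whether malloc hands back already-committed pages):

using BenchmarkTools

@btime Vector{Float64}(undef, 131_000);              # allocate only, pages may stay uncommitted
@btime fill!(Vector{Float64}(undef, 131_000), 0.0);  # allocate and write every page, as zeros does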

1 Like

Just ran the same test in Java on Windows, and I’m seeing a similar cliff at 1 MB. So thankfully it seems unrelated to Julia.

for (int k = 24; k <= 73; ++k) {
    int n = (int)(Math.pow(1.1, k) + 0.5);
    long min = Long.MAX_VALUE;
    for (int i = 0; i < 1000; ++i) {
        long t = System.nanoTime();
        double[] A = new double[n * 1024];
        for (int e = 0; e < A.length; ++e)
            A[e] = 0.0;
        t = System.nanoTime() - t;
        if (t < min) min = t;
    }
    System.out.printf("%d, %f, ", n*8, min / 1000.0);
}

7 Likes

It’s great to see those various plots. As far as I can tell by looking at the Julia GC runtime, all the large arrays are essentially allocated by a call to the standard system’s aligned version of malloc, so we might well be measuring the behavior of the default system malloc: how it ultimately uses syscalls to get memory, and how aggressively it frees that back to the OS vs. keeping it in a pool, etc.

I assume these performance oddities only happen for newly allocated memory?

That’s interesting; I’ve always wondered why allocation introduces so much variance into timings. What does the distribution of times look like (in terms of min, max, median, and quartiles, for example)? The minimum time could be quite misleading for allocation-heavy workloads.
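(BenchmarkTools keeps the full set of sample times, so something like this would show it:)

using BenchmarkTools, Statistics

b = @benchmark zeros(131_000)
minimum(b), median(b), mean(b), maximum(b)   # summary estimates
quantile(b.times, [0.25, 0.5, 0.75])         # quartiles of the raw times, in ns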

Comparing the slopes of the Windows vs. Ubuntu graphs, it seems absurd that Windows takes 0.18 ms/MiB whereas Ubuntu takes 0.07 ms/MiB on the same machine.

Doing a bit more reading, I think we are mostly measuring the cost of OS page fault handling here. Allocating without touching the memory is fast because the OS doesn’t have to commit physical memory pages, but the first time a process iterates over a newly allocated array (to read or write it), a bunch of major page faults will be generated. Whether the memory is truly new to the process is invisible to Julia, because it uses the system malloc, which may choose to reuse recently freed blocks. Here’s an interesting blog post about the hidden costs of memory allocation. It’s interesting that they say 1 MB is indeed the threshold at which the Windows allocator decides to allocate directly from the OS via a syscall.
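A rough way to see the first-touch cost directly (whether a given run actually hits fresh pages depends on what malloc hands back, so treat this as a sketch):

fill!(zeros(10), 0.0)              # warm up so compilation isn't timed

A = Vector{Float64}(undef, 10^6)   # fresh ~7.6 MiB allocation
t1 = @elapsed fill!(A, 0.0)        # first touch: may include page-fault handling
t2 = @elapsed fill!(A, 0.0)        # second touch: pages already committed
(t1, t2) .* 1e6                    # times in μs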

The page fault mechanism also explains why allocation introduces so much variance to timing measurements: it gives the OS an excuse for a context switch.

4 Likes

We can use the script below to plot creation speed histograms for zeros for a range of array sizes.

using Plots

benchmark_op(m) = zeros(m * 128)

anim = @animate for n = 1:1280
    t = 1e6(1:10000 .|>_-> @elapsed benchmark_op(n)) / n * 1024
    display(histogram(log10.(filter(k -> 88 < k < 640, [t;89;640])),
        nbins=120, xlims=(1.95,2.8), ylims=(0,2000), legend=:none,
        xformatter=x->round(Int, 10^x), title="array = $n KiB",
        xlabel="creation speed (us / MB)", ylabel="count"))
end

gif(anim, "allocations.gif", fps = 30)

Below are the results for a Windows 10 laptop. Note how creation times below ~300 μs/MiB completely disappear once the array is larger than 1024 KiB.

[animation: allocations.gif (Windows 10)]

10 Likes

For completeness, below is the result of the same script run on Ubuntu. A bit more stable.

[animation: allocations-ubuntu2 (Ubuntu)]

6 Likes