What can cause significantly different performance for pisum microbenchmark on different workstations

adityam · April 17, 2019, 11:19pm

I am trying to understand what factors affect the performance of Julia code on different hardware. I have two workstations: WS1 is a three-year-old compute server and WS2 is a two-year-old desktop. Both have comparable configurations, but the performance of the pisum() and pisumvec() microbenchmarks from the Julia Microbenchmarks suite is very different on the two.

Workstation 1

Basic information:

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)

shell> lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               45
Model name:          Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Stepping:            7
CPU MHz:             3160.205
CPU max MHz:         3300.0000
CPU min MHz:         1200.0000
BogoMIPS:            5199.93
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts flush_l1d

shell> uname -vrm
4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64

Microbenchmarks:

julia> function pisum()
           sum = 0.0
           for j = 1:500
               sum = 0.0
               for k = 1:10000
                   sum += 1.0/(k*k)
               end
           end
           sum
       end

julia> function pisumvec()
           s = 0.0
           a = 1:10000
           for j = 1:500
               s = sum(1 ./ (a .^2))
           end
           s
       end

julia> using BenchmarkTools

julia> @benchmark pisum()
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     33.724 ms (0.00% GC)
  median time:      33.788 ms (0.00% GC)
  mean time:        34.040 ms (0.00% GC)
  maximum time:     36.056 ms (0.00% GC)
  --------------
  samples:          147
  evals/sample:     1

julia> @benchmark pisumvec()
BenchmarkTools.Trial: 
  memory estimate:  38.20 MiB
  allocs estimate:  1500
  --------------
  minimum time:     18.673 ms (1.68% GC)
  median time:      19.625 ms (3.27% GC)
  mean time:        19.881 ms (4.00% GC)
  maximum time:     62.710 ms (69.23% GC)
  --------------
  samples:          252
  evals/sample:     1

So pisum() runs in about 33.7 ms, and the vectorized version is faster by almost a factor of 2.
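The 38.20 MiB memory estimate reported for pisumvec() can be roughly accounted for with a little arithmetic. The sketch below assumes that the fused broadcast `1 ./ (a .^ 2)` materializes one 10000-element `Vector{Float64}` per outer iteration. The function name `pisum_mapped` is hypothetical and is not part of the microbenchmark suite; it only illustrates a way to compute the same sum without that temporary array.

```julia
# Rough accounting of the array data allocated by pisumvec().
# Assumption: each outer iteration materializes one 10000-element Vector{Float64}.
n_outer = 500
n_inner = 10_000
bytes_per_iter = n_inner * sizeof(Float64)     # bytes of array data per iteration
total_mib = n_outer * bytes_per_iter / 2^20    # array data only, headers excluded

# Hypothetical variant: pass the per-element function to `sum` directly,
# so no temporary array is needed for the intermediate values.
function pisum_mapped()
    s = 0.0
    for j = 1:500
        s = sum(k -> 1.0 / (k * k), 1:10000)
    end
    s
end

println("array data allocated: ", round(total_mib, digits = 2), " MiB")
println("pisum_mapped() = ", pisum_mapped())
```

The computed figure is slightly below the reported estimate, which would be consistent with the remaining bytes being per-allocation overhead, although that is an assumption rather than something measured here.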

Workstation 2

Basic information:

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-1603 v4 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)

shell> lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-1603 v4 @ 2.80GHz
Stepping:            1
CPU MHz:             1197.235
CPU max MHz:         2800.0000
CPU min MHz:         1200.0000
BogoMIPS:            5589.81
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            10240K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts

shell> uname -vrm
5.0.7-arch1-1-ARCH #1 SMP PREEMPT Mon Apr 8 10:37:08 UTC 2019 x86_64

Microbenchmark:


julia> function pisum()
           sum = 0.0
           for j = 1:500
               sum = 0.0
               for k = 1:10000
                   sum += 1.0/(k*k)
               end
           end
           sum
       end
pisum (generic function with 1 method)

julia> function pisumvec()
           s = 0.0
           a = 1:10000
           for j = 1:500
               s = sum(1 ./ (a .^2))
           end
           s
       end
pisumvec (generic function with 1 method)

julia> using BenchmarkTools

julia> @benchmark pisum()
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.222 ms (0.00% GC)
  median time:      8.226 ms (0.00% GC)
  mean time:        8.243 ms (0.00% GC)
  maximum time:     11.918 ms (0.00% GC)
  --------------
  samples:          607
  evals/sample:     1

julia> @benchmark pisumvec()
BenchmarkTools.Trial: 
  memory estimate:  38.20 MiB
  allocs estimate:  1500
  --------------
  minimum time:     9.080 ms (3.22% GC)
  median time:      9.398 ms (6.13% GC)
  mean time:        9.440 ms (6.40% GC)
  maximum time:     51.545 ms (82.85% GC)
  --------------
  samples:          530
  evals/sample:     1

Here pisum() runs in about 8.2 ms, and vectorizing actually makes it slightly slower.

For other microbenchmarks with for loops (numeric vector sort and Mandelbrot set), the performance on the two workstations is almost the same (5-10% difference). I don’t understand why pisum() performs so poorly on workstation 1 (a factor of 4 difference) and why vectorization is qualitatively so different (on workstation 1 it improves performance by a factor of 2; on workstation 2 it slightly degrades performance). The two workstations use different Linux installations (Ubuntu vs Arch).
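To put the factor of 4 in concrete terms, the minimum times can be converted into a cost per inner-loop iteration. This is only a back-of-the-envelope sketch: it assumes each core ran at the "CPU max MHz" value reported by lscpu for the whole benchmark, which may not have been the case.

```julia
# Per-iteration cost of pisum(), from the minimum times reported above.
iterations = 500 * 10_000                          # total inner-loop iterations

min_time_s   = (ws1 = 33.724e-3, ws2 = 8.222e-3)   # BenchmarkTools minimum times
max_clock_hz = (ws1 = 3.3e9, ws2 = 2.8e9)          # lscpu "CPU max MHz" (assumed sustained)

for ws in (:ws1, :ws2)
    seconds_per_iter = min_time_s[ws] / iterations
    println(ws, ": ",
            round(seconds_per_iter * 1e9, digits = 2), " ns/iteration, about ",
            round(seconds_per_iter * max_clock_hz[ws], digits = 1), " cycles/iteration")
end

println("WS1/WS2 time ratio: ", round(min_time_s.ws1 / min_time_s.ws2, digits = 2))
```

Since pisum() allocates nothing and each iteration does one floating-point division, a gap of this size would be consistent with a difference in per-core arithmetic throughput rather than a memory effect, though this arithmetic alone cannot confirm that.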

Any hints on what might be going on here?

klaff · April 17, 2019, 11:28pm

Have you compared the compiled versions of the functions, with @code_llvm and/or @code_native?
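One way to make that comparison less error-prone than reading two REPL dumps side by side is to capture the IR as text and normalize the parts that are expected to differ. The sketch below uses `code_llvm` from InteractiveUtils with an `IOBuffer`; the helper name `normalized_llvm` is hypothetical, and the regex assumes the generated function is named like `@julia_pisum_12234`, as in typical output.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

function pisum()
    sum = 0.0
    for j = 1:500
        sum = 0.0
        for k = 1:10000
            sum += 1.0 / (k * k)
        end
    end
    sum
end

# Hypothetical helper: return the LLVM IR of `f` as a string, dropping
# full-line comments (debug info) and the numeric suffix of the function
# name, since both can differ between sessions without the code differing.
function normalized_llvm(f, types = Tuple{})
    io = IOBuffer()
    code_llvm(io, f, types)
    text = String(take!(io))
    kept = filter(l -> !startswith(strip(l), ";"), split(text, '\n'))
    replace(join(kept, '\n'), r"@julia_(\w+?)_\d+" => s"@julia_\1")
end

print(normalized_llvm(pisum))
```

Writing the returned string to a file on each machine and running a text diff on the two files would then show whether the generated code differs at all.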

tkoolen · April 18, 2019, 12:59am

I noticed that workstation 2 has a CPU that doesn’t support hyper-threading, while workstation 1 has a CPU that does. For this benchmark, I think hyper-threading could possibly hurt performance. You could try disabling it on WS1 for this benchmark, see https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/linuxtips.md#hyperthreading.

adityam · April 18, 2019, 2:31am

Good point. The LLVM and native code are identical on both workstations:

Workstation 1

julia> @code_llvm pisum()

;  @ REPL[1]:2 within `pisum'
define double @julia_pisum_12234() {
top:
;  @ REPL[1]:3 within `pisum'
  br label %L2

L2:                                               ; preds = %L22, %top
  %value_phi = phi i64 [ 1, %top ], [ %7, %L22 ]
;  @ REPL[1]:5 within `pisum'
  br label %L4

L4:                                               ; preds = %L4, %L2
  %value_phi1 = phi double [ 0.000000e+00, %L2 ], [ %3, %L4 ]
  %value_phi2 = phi i64 [ 1, %L2 ], [ %5, %L4 ]
;  @ REPL[1]:6 within `pisum'
; ┌ @ int.jl:54 within `*'
   %0 = mul i64 %value_phi2, %value_phi2
; └
; ┌ @ promotion.jl:316 within `/'
; │┌ @ promotion.jl:284 within `promote'
; ││┌ @ promotion.jl:261 within `_promote'
; │││┌ @ number.jl:7 within `convert'
; ││││┌ @ float.jl:60 within `Type'
       %1 = sitofp i64 %0 to double
; │└└└└
; │ @ promotion.jl:316 within `/' @ float.jl:401
   %2 = fdiv double 1.000000e+00, %1
; └
; ┌ @ float.jl:395 within `+'
   %3 = fadd double %value_phi1, %2
; └
; ┌ @ range.jl:594 within `iterate'
; │┌ @ promotion.jl:403 within `=='
    %4 = icmp eq i64 %value_phi2, 10000
; │└
; │ @ range.jl:595 within `iterate'
; │┌ @ int.jl:53 within `+'
    %5 = add nuw nsw i64 %value_phi2, 1
; └└
  br i1 %4, label %L22, label %L4

L22:                                              ; preds = %L4
; ┌ @ range.jl:594 within `iterate'
; │┌ @ promotion.jl:403 within `=='
    %6 = icmp eq i64 %value_phi, 500
; │└
; │ @ range.jl:595 within `iterate'
; │┌ @ int.jl:53 within `+'
    %7 = add nuw nsw i64 %value_phi, 1
; └└
  br i1 %6, label %L33, label %L2

L33:                                              ; preds = %L22
;  @ REPL[1]:9 within `pisum'
  ret double %3
}

julia> @code_native pisum()
        .text
; ┌ @ REPL[1]:2 within `pisum'
        movl    $1, %eax
        movabsq $140618474565376, %rcx  # imm = 0x7FE44A39AF00
        vmovsd  (%rcx), %xmm1           # xmm1 = mem[0],zero
        nopw    %cs:(%rax,%rax)
L32:
        vxorpd  %xmm0, %xmm0, %xmm0
        movl    $1, %ecx
        nopl    (%rax)
; │ @ REPL[1]:6 within `pisum'
; │┌ @ int.jl:54 within `*'
L48:
        movq    %rcx, %rdx
        imulq   %rdx, %rdx
; │└
; │┌ @ promotion.jl:316 within `/'
; ││┌ @ promotion.jl:284 within `promote'
; │││┌ @ promotion.jl:261 within `_promote'
; ││││┌ @ number.jl:7 within `convert'
; │││││┌ @ float.jl:60 within `Type'
        vcvtsi2sdq      %rdx, %xmm3, %xmm2
; │└└└└└
; │┌ @ float.jl:401 within `/'
        vdivsd  %xmm2, %xmm1, %xmm2
; │└
; │┌ @ float.jl:395 within `+'
        vaddsd  %xmm2, %xmm0, %xmm0
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ int.jl:53 within `+'
        addq    $1, %rcx
; │└└
; │┌ @ promotion.jl:403 within `iterate'
        cmpq    $10001, %rcx            # imm = 0x2711
; │└
        jne     L48
; │┌ @ range.jl:594 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
        cmpq    $500, %rax              # imm = 0x1F4
; ││└
; ││ @ range.jl:595 within `iterate'
; ││┌ @ int.jl:53 within `+'
        leaq    1(%rax), %rax
; │└└
        jne     L32
; │ @ REPL[1]:9 within `pisum'
        retq
        nop
; └

Workstation 2

julia> @code_llvm pisum()

;  @ REPL[1]:2 within `pisum'
define double @julia_pisum_12229() {
top:
;  @ REPL[1]:3 within `pisum'
  br label %L2

L2:                                               ; preds = %L22, %top
  %value_phi = phi i64 [ 1, %top ], [ %7, %L22 ]
;  @ REPL[1]:5 within `pisum'
  br label %L4

L4:                                               ; preds = %L4, %L2
  %value_phi1 = phi double [ 0.000000e+00, %L2 ], [ %3, %L4 ]
  %value_phi2 = phi i64 [ 1, %L2 ], [ %5, %L4 ]
;  @ REPL[1]:6 within `pisum'
; ┌ @ int.jl:54 within `*'
   %0 = mul i64 %value_phi2, %value_phi2
; └
; ┌ @ promotion.jl:316 within `/'
; │┌ @ promotion.jl:284 within `promote'
; ││┌ @ promotion.jl:261 within `_promote'
; │││┌ @ number.jl:7 within `convert'
; ││││┌ @ float.jl:60 within `Type'
       %1 = sitofp i64 %0 to double
; │└└└└
; │ @ promotion.jl:316 within `/' @ float.jl:401
   %2 = fdiv double 1.000000e+00, %1
; └
; ┌ @ float.jl:395 within `+'
   %3 = fadd double %value_phi1, %2
; └
; ┌ @ range.jl:594 within `iterate'
; │┌ @ promotion.jl:403 within `=='
    %4 = icmp eq i64 %value_phi2, 10000
; │└
; │ @ range.jl:595 within `iterate'
; │┌ @ int.jl:53 within `+'
    %5 = add nuw nsw i64 %value_phi2, 1
; └└
  br i1 %4, label %L22, label %L4

L22:                                              ; preds = %L4
; ┌ @ range.jl:594 within `iterate'
; │┌ @ promotion.jl:403 within `=='
    %6 = icmp eq i64 %value_phi, 500
; │└
; │ @ range.jl:595 within `iterate'
; │┌ @ int.jl:53 within `+'
    %7 = add nuw nsw i64 %value_phi, 1
; └└
  br i1 %6, label %L33, label %L2

L33:                                              ; preds = %L22
;  @ REPL[1]:9 within `pisum'
  ret double %3
}

julia> @code_native pisum()
        .text
; ┌ @ REPL[1]:2 within `pisum'
        movl    $1, %eax
        movabsq $140148462140504, %rcx  # imm = 0x7F76DB4D3C58
        vmovsd  (%rcx), %xmm1           # xmm1 = mem[0],zero
        nopw    %cs:(%rax,%rax)
L32:
        vxorpd  %xmm0, %xmm0, %xmm0
        movl    $1, %ecx
        nopl    (%rax)
; │ @ REPL[1]:6 within `pisum'
; │┌ @ int.jl:54 within `*'
L48:
        movq    %rcx, %rdx
        imulq   %rdx, %rdx
; │└
; │┌ @ promotion.jl:316 within `/'
; ││┌ @ promotion.jl:284 within `promote'
; │││┌ @ promotion.jl:261 within `_promote'
; ││││┌ @ number.jl:7 within `convert'
; │││││┌ @ float.jl:60 within `Type'
        vcvtsi2sdq      %rdx, %xmm3, %xmm2
; │└└└└└
; │┌ @ float.jl:401 within `/'
        vdivsd  %xmm2, %xmm1, %xmm2
; │└
; │┌ @ float.jl:395 within `+'
        vaddsd  %xmm2, %xmm0, %xmm0
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ int.jl:53 within `+'
        addq    $1, %rcx
; │└└
; │┌ @ promotion.jl:403 within `iterate'
        cmpq    $10001, %rcx            # imm = 0x2711
; │└
        jne     L48
; │┌ @ range.jl:594 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
        cmpq    $500, %rax              # imm = 0x1F4
; ││└
; ││ @ range.jl:595 within `iterate'
; ││┌ @ int.jl:53 within `+'
        leaq    1(%rax), %rax
; │└└
        jne     L32
; │ @ REPL[1]:9 within `pisum'
        retq
        nop
; └

adityam · April 18, 2019, 2:33am

Thanks for the suggestion. I don’t have root access on workstation 1. Let me ask my sysadmin to disable hyperthreading and see if that helps.

adityam · May 10, 2019, 9:09pm

It took a while :-), but I finally have hyperthreading disabled on workstation 1. However, it does not make any difference in the runtime of the microbenchmark. What else could be causing such a slowdown?

tkoolen · May 11, 2019, 6:08am

Hmm, I don’t have any other ideas right now, maybe others know more. The CPUs have very similar specs (other than number of cores, which shouldn’t matter), and the computation just has to be CPU-bound for pisum.

adityam · May 11, 2019, 5:48pm

Thanks for your reply.

I tried comparing both systems using MATLAB’s bench command and here are the results:

Workstation 1

MATLAB is selecting SOFTWARE OPENGL rendering.

                               < M A T L A B (R) >
                     Copyright 1984-2018 The MathWorks, Inc.
                     R2018b (9.5.0.944444) 64-bit (glnxa64)
                                 August 28, 2018

 
To get started, type doc.
For product information, visit www.mathworks.com.
 
>> bench

ans =

    0.1082    0.0812    0.0316    0.1303    0.3083    0.2087

Workstation 2

MATLAB is selecting SOFTWARE OPENGL rendering.

                               < M A T L A B (R) >
                     Copyright 1984-2018 The MathWorks, Inc.
                     R2018a (9.4.0.813654) 64-bit (glnxa64)
                                February 23, 2018

 
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
>> format short
>> bench       

ans =

    0.1473    0.1263    0.0176    0.1247    0.2415    0.1759

In Matlab, benchmarks 1 and 2 are LU factorization and FFT, where workstation 1 is faster than workstation 2; benchmarks 3 and 4 are ODE solver and sparse linear equation solver, where workstation 2 is faster.

So, WS1 is faster for MATLAB but slower for Julia.

tkoolen · May 11, 2019, 11:11pm

That could be because of the difference in the number of cores though (as MKL, the linear algebra library used by Matlab, is multi-threaded by default for larger matrices). Do you get the same result with matlab -singleCompThread?

Also, you could see if there’s a difference between the two workstations in the following:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> LinearAlgebra.peakflops()
4.966427863329033e10

adityam · May 11, 2019, 11:43pm

Matlab benchmarks with matlab -singleCompThread

Workstation 1: 0.4526 0.2684 0.0371 0.1244 0.6236 0.2086
Workstation 2: 0.2881 0.2164 0.0186 0.1110 0.2329 0.1714

Indeed, now workstation 1 is about 1.6 times slower than workstation 2 on LU factorization, 1.2 times slower on FFT, 2 times slower on ODE, and 1.1 times slower on sparse linear equations. It appears that the speed-up in Matlab was due to multi-threading.

Here is comparison of peakflops in Julia:

Workstation 1: 2.216693579081599e10
Workstation 2: 2.700957956861434e10

which ranks the two workstations in the same order as LU factorization in Matlab.
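The ratios quoted above can be reproduced directly from the posted numbers. This sketch simply divides the single-threaded results element by element; the variable names are illustrative.

```julia
# Single-threaded MATLAB bench times (lower is better), from the post above.
# The first four entries are LU, FFT, ODE, and Sparse.
ws1 = [0.4526, 0.2684, 0.0371, 0.1244, 0.6236, 0.2086]
ws2 = [0.2881, 0.2164, 0.0186, 0.1110, 0.2329, 0.1714]

slowdown = ws1 ./ ws2          # how many times slower WS1 is on each test
println("WS1/WS2 time ratios: ", round.(slowdown, digits = 2))

# peakflops is a rate (higher is better), so the ratio is taken the other way round.
peak_ws1 = 2.216693579081599e10
peak_ws2 = 2.700957956861434e10
println("WS2/WS1 peakflops ratio: ", round(peak_ws2 / peak_ws1, digits = 2))
```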

Is there something like MATLAB’s bench in Julia that I can use to compare the performance of two different machines?
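In the absence of a built-in answer, a small script that runs the same handful of workloads on each machine is one option. The following is a minimal sketch and not an established equivalent of bench: the workloads and sizes are arbitrary choices, the helper names `min_time` and `run_bench` are hypothetical, and only Base and the LinearAlgebra standard library are used.

```julia
using LinearAlgebra  # for lu, peakflops, and BLAS thread control

# Hypothetical helper: minimum wall-clock time of f() over `reps` runs, in seconds.
function min_time(f; reps = 5)
    f()                                   # warm-up run so compilation is not timed
    minimum(@elapsed(f()) for _ in 1:reps)
end

function pisum()
    sum = 0.0
    for j = 1:500
        sum = 0.0
        for k = 1:10000
            sum += 1.0 / (k * k)
        end
    end
    sum
end

# Hypothetical driver: run a few single-threaded workloads and collect the results.
function run_bench()
    BLAS.set_num_threads(1)               # keep core counts from dominating the result
    x = rand(10^6)
    A = rand(500, 500)
    (pisum_s   = min_time(pisum),
     sort_s    = min_time(() -> sort(x)),
     lu_s      = min_time(() -> lu(A)),
     peakflops = LinearAlgebra.peakflops(1000))
end

results = run_bench()
for (name, value) in pairs(results)
    println(rpad(name, 10), value)
end
```

Running the same script on both machines and dividing the results entry by entry gives a comparison similar in spirit to the MATLAB one; for more careful timing, BenchmarkTools would be the better tool.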

tkoolen · May 12, 2019, 12:18am

Isn’t it the other way around?

adityam · May 12, 2019, 1:37am

D’oh. Sorry, I mixed up my workstations (there is a good reason that servers have names). I edited my previous posts to have the correct numbers.

So, WS1 is indeed slower than WS2 for both Matlab and Julia. I was getting the impression that WS1 is faster because by default Matlab is multi-threaded (and WS1 has more cores). Mystery solved. Thanks a ton @tkoolen!
