Julia performance on the same task on different CPUs and OSs

The workload

  • Input: Arrow files with 250K individuals and 3M education status observations.
  • Read the files into memory as DataFrames and join individual records with their education status observations.
  • Group the resulting data frame by individual and process the (on average) 12 observations per individual.
  • Process each individual's education records to eliminate retrogression in educational attainment, and fill gaps in annual educational attainment using linear interpolation.
  • Write out the resulting cleaned dataset as an Arrow file.
  • No multi-threading is used by the code (for example, to process individuals independently).
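For concreteness, here is a minimal sketch of what such a pipeline can look like in Julia. The file names, the column names (:id, :year, :edu_level), and the exact cleaning rules are illustrative assumptions, not the actual source code linked below; in particular it assumes integer years and at most one observation per individual per year.

```julia
using Arrow, DataFrames

# Clean one individual's series: make attainment non-decreasing over time,
# then fill missing years by linear interpolation between observations.
function clean_individual(years, levels)
    p  = sortperm(years)
    ys = years[p]
    ls = accumulate(max, float.(levels[p]))   # eliminate retrogression
    full   = collect(ys[1]:ys[end])           # every year in the observed span
    filled = Vector{Float64}(undef, length(full))
    j = 1                                     # last observation at or before year y
    for (i, y) in enumerate(full)
        while j < length(ys) && ys[j + 1] <= y
            j += 1
        end
        if ys[j] == y
            filled[i] = ls[j]
        else
            t = (y - ys[j]) / (ys[j + 1] - ys[j])       # linear interpolation
            filled[i] = ls[j] + t * (ls[j + 1] - ls[j])
        end
    end
    return (year = full, edu_level = filled)
end

individuals = DataFrame(Arrow.Table("individuals.arrow"))
education   = DataFrame(Arrow.Table("education_status.arrow"))

df = innerjoin(individuals, education, on = :id)   # join the two tables
cleaned = combine(groupby(df, :id)) do sub         # ~12 rows per individual
    clean_individual(sub.year, sub.edu_level)
end
Arrow.write("education_cleaned.arrow", cleaned)    # write the cleaned data
```

The grouped step is also where multi-threading would slot in naturally (processing groups in parallel), which, as noted above, the code deliberately does not do.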

Julia version: 1.10.4
Packages: DataFrames, Arrow, Dates, StatsBase
Code available here: Source code

Question

Why are there such major differences in performance for the exact same code across the environments in the table below?

Results

| Time (hh:mm:ss) | OS | CPU | Cores | Machine |
|---|---|---|---|---|
| 00:26:42 | Windows 11 | Xeon Gold 6430 | 8 | VMware virtual machine, 128 GB RAM |
| 00:29:17 | Windows Server 2019 | Xeon Cascade Lake | 25 | OpenStack virtual machine, 256 GB RAM |
| 00:12:12 | Windows 11 | i7-13700H | 14 | Dell XPS 17, 64 GB RAM |
| 00:11:47 | Raspberry Pi OS | Cortex-A76 | 4 | Raspberry Pi 5, 8 GB RAM |
| 00:07:53 | Windows 11 | Ryzen 7 3800 | 8 | Desktop, 128 GB RAM |
| 00:01:51 | macOS 14.5 | M3 Max | 14 | MacBook Pro, 36 GB RAM |

I realise that the virtual machines may have different workloads running in the background, but the results were fairly consistent over several runs at different times, and the differences are large! 99% of the execution time is spent in code that contains no file reads or writes.

Thank you in advance for pointing me to ways I can improve performance in the virtual environment, because that is the common resource in my setting.
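A coarse per-phase timing along these lines is enough to establish the I/O-versus-compute split; `clean_education` here is a hypothetical placeholder for the processing step, not the actual function name:

```julia
using Arrow, DataFrames

# Time each phase separately to confirm where the time goes
t_read = @elapsed begin
    individuals = DataFrame(Arrow.Table("individuals.arrow"))
    education   = DataFrame(Arrow.Table("education_status.arrow"))
end
t_clean = @elapsed cleaned = clean_education(individuals, education)  # hypothetical entry point
t_write = @elapsed Arrow.write("education_cleaned.arrow", cleaned)
@info "Timings (s)" t_read t_clean t_write
```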

This is likely a combination of Windows having a slow filesystem, the servers having slow drives, and antivirus software slowing down the Windows machines.


The Raspberry Pi is using an SD card for storage!

Well, Linux is faster than Windows… If you use virtual machines anyway, why not use a virtual machine with Linux?

I did test a Linux virtual machine:

| Time (hh:mm:ss) | OS | CPU | Cores | Machine |
|---|---|---|---|---|
| 00:25:54 | Ubuntu | Xeon Gold 6430 | 8 | VMware virtual machine |

Perhaps there is something wrong with the configuration of your virtual machines?

  • how much RAM do you assign to the VM?
  • how many cores do you assign to the VM?
  • with how many threads do you start Julia?

And then there is the option “Virtualize Intel VT-x/EPT or AMD-V/RVI” which you can check or not.
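For the last point, the relevant counts can be checked from within Julia itself (note that the BLAS thread pool is separate from Julia's):

```julia
using LinearAlgebra

println("Julia threads: ", Threads.nthreads())      # set via `julia --threads=N`
println("BLAS threads:  ", BLAS.get_num_threads())  # separate pool used by linear algebra
println("CPU threads:   ", Sys.CPU_THREADS)         # logical CPUs visible to the process
```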

I ran the application on a bare-metal server running Ubuntu 22.04 LTS, using Wine as a compatibility layer.

Started cleaning education status
=== Finished cleaning education status after 46 minutes, 48 seconds, 524 milliseconds

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
Stepping: 4
BogoMIPS: 5200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 768 KiB (24 instances)
L1i: 768 KiB (24 instances)
L2: 24 MiB (24 instances)
L3: 38.5 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47

Machine RAM: 315 GB

If you have the latest Windows 11, you can activate a Dev Drive.
It creates a working volume that handles I/O tasks on the disk much more efficiently.
It is based on ReFS.
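If you experiment with this from Julia, one related knob is where the package depot (packages, precompile caches, artifacts) lives; it can be inspected from within Julia and relocated to the faster volume with an environment variable. The drive letter below is only an assumption:

```julia
# Where Julia currently keeps packages, precompile caches, and artifacts
println(first(DEPOT_PATH))

# To relocate the depot onto a Dev Drive (hypothetical drive letter D:),
# set JULIA_DEPOT_PATH in the environment *before* launching Julia, e.g.:
#   JULIA_DEPOT_PATH = "D:\\JuliaDepot"
```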

By the way, the reason Windows is slower with disk I/O calls is a design choice that allows third-party code to hook into those calls.
It is a feature of the OS and its modular structure; the price is higher overhead.
A good example of the benefits of such an architecture is voidtools' Everything, which locates files and folders by name instantly.

Although the test workload used here is not I/O constrained, I do have other real-world workloads that are much more disk-intensive. I will give this a try.