Julia performance on the same task on different CPUs and OSs

The workload

  • Input: Arrow files with 250K individuals and 3M education status observations.
  • Read the files into memory as DataFrames and join individual records with their education status observations.
  • Group the resulting data frame by individual and process the (on average) 12 observations per individual.
  • Process each individual's education records to eliminate retrogression in educational attainment, and fill gaps in annual educational attainment using linear interpolation.
  • Write out the resulting cleaned dataset as an Arrow file.
  • No multi-threading is used by the code (for example, to process individuals independently).
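For concreteness, here is a minimal sketch of what such a pipeline can look like in Julia. The file names, the column names (:id, :year, :edu_level), and the exact cleaning rules are illustrative assumptions, not the actual source code linked below; in particular it assumes integer years and at most one observation per individual per year.

```julia
using Arrow, DataFrames

# Clean one individual's series: make attainment non-decreasing over time,
# then fill missing years by linear interpolation between observations.
function clean_individual(years, levels)
    p  = sortperm(years)
    ys = years[p]
    ls = accumulate(max, float.(levels[p]))   # eliminate retrogression
    full   = collect(ys[1]:ys[end])           # every year in the observed span
    filled = Vector{Float64}(undef, length(full))
    j = 1                                     # last observation at or before year y
    for (i, y) in enumerate(full)
        while j < length(ys) && ys[j + 1] <= y
            j += 1
        end
        if ys[j] == y
            filled[i] = ls[j]
        else
            t = (y - ys[j]) / (ys[j + 1] - ys[j])       # linear interpolation
            filled[i] = ls[j] + t * (ls[j + 1] - ls[j])
        end
    end
    return (year = full, edu_level = filled)
end

individuals = DataFrame(Arrow.Table("individuals.arrow"))
education   = DataFrame(Arrow.Table("education_status.arrow"))

df = innerjoin(individuals, education, on = :id)   # join the two tables
cleaned = combine(groupby(df, :id)) do sub         # ~12 rows per individual
    clean_individual(sub.year, sub.edu_level)
end
Arrow.write("education_cleaned.arrow", cleaned)    # write the cleaned data
```

The grouped step is also where multi-threading would slot in naturally (processing groups in parallel), which, as noted above, the code deliberately does not do.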

Julia version: 1.10.4
Packages: DataFrames, Arrow, Dates, StatsBase
Code available here: Source code

Question

Why are there such major differences in performance for the exact same code across the environments in the table below?

Results

| Time (hh:mm:ss) | OS | CPU | Cores | Machine |
|---|---|---|---|---|
| 00:26:42 | Windows 11 | Xeon Gold 6430 | 8 | VMware virtual machine, 128 GB RAM |
| 00:29:17 | Windows Server 2019 | Xeon Cascade Lake | 25 | OpenStack virtual machine, 256 GB RAM |
| 00:12:12 | Windows 11 | i7-13700H | 14 | Dell XPS 17, 64 GB RAM |
| 00:11:47 | Raspberry Pi OS | Cortex-A76 | 4 | Raspberry Pi 5, 8 GB RAM |
| 00:07:53 | Windows 11 | Ryzen 7 3800 | 8 | Desktop, 128 GB RAM |
| 00:01:51 | macOS 14.5 | M3 Max | 14 | MacBook Pro, 36 GB RAM |

I realise that the virtual machines may have different workloads running in the background, but the results were fairly consistent over several runs at different times, and the differences are large! 99% of the execution time is spent in code that contains no file reads or writes.

Thank you in advance for pointing me to ways I can improve performance in the virtual environment, because that is the common resource in my setting.
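A coarse per-phase timing along these lines is enough to establish the I/O-versus-compute split; `clean_education` here is a hypothetical placeholder for the processing step, not the actual function name:

```julia
using Arrow, DataFrames

# Time each phase separately to confirm where the time goes
t_read = @elapsed begin
    individuals = DataFrame(Arrow.Table("individuals.arrow"))
    education   = DataFrame(Arrow.Table("education_status.arrow"))
end
t_clean = @elapsed cleaned = clean_education(individuals, education)  # hypothetical entry point
t_write = @elapsed Arrow.write("education_cleaned.arrow", cleaned)
@info "Timings (s)" t_read t_clean t_write
```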

This is likely a combination of Windows having a slow filesystem, the servers having slow drives, and antivirus software slowing down the Windows machines.


The Raspberry Pi is using an SD card for storage!

Well, Linux is faster than Windows… If you use virtual machines anyway, why not use a virtual machine with Linux?

I did test a Linux virtual machine:

| Time (hh:mm:ss) | OS | CPU | Cores | Machine |
|---|---|---|---|---|
| 00:25:54 | Ubuntu | Xeon Gold 6430 | 8 | VMware virtual machine |

Perhaps there is something wrong with the configuration of your virtual machines?

  • how much RAM do you assign to the VM?
  • how many cores do you assign to the VM?
  • with how many threads do you start Julia?

And then there is the option “Virtualize Intel VT-x/EPT or AMD-V/RVI” which you can check or not.
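For the last point, the relevant counts can be checked from within Julia itself (note that the BLAS thread pool is separate from Julia's):

```julia
using LinearAlgebra

println("Julia threads: ", Threads.nthreads())      # set via `julia --threads=N`
println("BLAS threads:  ", BLAS.get_num_threads())  # separate pool used by linear algebra
println("CPU threads:   ", Sys.CPU_THREADS)         # logical CPUs visible to the process
```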

I ran the application on a bare-metal server running Ubuntu 22.04 LTS, using Wine as a compatibility layer.

Started cleaning education status
=== Finished cleaning education status after 46 minutes, 48 seconds, 524 milliseconds

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
Stepping: 4
BogoMIPS: 5200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 768 KiB (24 instances)
L1i: 768 KiB (24 instances)
L2: 24 MiB (24 instances)
L3: 38.5 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47

Machine RAM: 315 GB

If you have the latest Windows 11, you can activate a Dev Drive.
It creates a working volume that handles I/O tasks on the disk much more efficiently.
It is based on ReFS.
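If you experiment with this from Julia, one related knob is where the package depot (packages, precompile caches, artifacts) lives; it can be inspected from within Julia and relocated to the faster volume with an environment variable. The drive letter below is only an assumption:

```julia
# Where Julia currently keeps packages, precompile caches, and artifacts
println(first(DEPOT_PATH))

# To relocate the depot onto a Dev Drive (hypothetical drive letter D:),
# set JULIA_DEPOT_PATH in the environment *before* launching Julia, e.g.:
#   JULIA_DEPOT_PATH = "D:\\JuliaDepot"
```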

By the way, the reason Windows is slower with disk I/O calls is a design choice that allows third-party code to hook into those calls.
It is a feature of the OS and its modular structure; the price is higher overhead.
A good example of the benefits of such an architecture is voidtools' Everything, which locates files and folders by name instantly.

Although the test workload used here is not I/O constrained, I do have other real-world workloads that are much more disk-intensive. I will give this a try.