The workload
- Arrow files with 250K individuals and 3M education status observations.
- Read the files into memory as DataFrames and join the individual records with the education status observations.
- Group the resulting data frame by individual and process the observations, on average 12 per individual.
- Process each individual's education records to eliminate retrogression in educational attainment, and fill gaps in annual educational attainment using linear interpolation (a rough sketch of these steps in code is given below).
- Write out the resulting cleaned dataset as an Arrow file.
- The code uses no multi-threading, for example to process individuals independently.
- Julia version: 1.10.4
- Packages: DataFrames, Arrow, Dates, StatsBase
- Code available here: Source code
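For illustration, here is a minimal sketch of the pipeline above using Arrow and DataFrames. The column names (`:individual_id`, `:year`, `:education_level`), file names, and the integer coding of attainment are assumptions made for this sketch, not the actual schema from the linked source code.

```julia
# Minimal sketch of the described pipeline. Column and file names are
# assumptions; the real schema is in the linked source code.
using Arrow, DataFrames

# Linear interpolation between observed (year, level) points, assuming
# integer years sorted in increasing order.
function interp_level(years::AbstractVector, levels::AbstractVector, y)
    i = searchsortedlast(years, y)
    (i == length(years) || y == years[i]) && return float(levels[i])
    t = (y - years[i]) / (years[i+1] - years[i])
    return levels[i] + t * (levels[i+1] - levels[i])
end

function clean_education(individuals_file, education_file, out_file)
    # Read the Arrow files fully into memory as DataFrames.
    individuals = DataFrame(Arrow.Table(individuals_file))
    education   = DataFrame(Arrow.Table(education_file))

    # Join the ~3M education observations onto the 250K individuals.
    joined = innerjoin(individuals, education; on = :individual_id)

    # Process each individual's ~12 observations independently (single-threaded).
    cleaned = combine(groupby(joined, :individual_id)) do grp
        g = sort(grp, :year)
        # Eliminate retrogression: attainment must never decrease over time.
        level = accumulate(max, g.education_level)
        # Fill gaps in the annual series by linear interpolation.
        years = first(g.year):last(g.year)
        DataFrame(year = collect(years),
                  education_level = [round(Int, interp_level(g.year, level, y)) for y in years])
    end

    Arrow.write(out_file, cleaned)
    return cleaned
end
```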
Question
Why are there such major differences in performance for the exact same code across the environments in the table below?
Results
| Time (hh:mm:ss) | OS | CPU | Cores | Machine |
|---|---|---|---|---|
| 00:26:42 | Windows 11 | Xeon Gold 6430 | 8 | VMware virtual machine, 128 GB RAM |
| 00:29:17 | Windows Server 2019 | Xeon (Cascade Lake) | 25 | OpenStack virtual machine, 256 GB RAM |
| 00:12:12 | Windows 11 | i7-13700H | 14 | Dell XPS 17, 64 GB RAM |
| 00:11:47 | Raspberry Pi OS | Cortex-A76 | 4 | Raspberry Pi 5, 8 GB RAM |
| 00:07:53 | Windows 11 | Ryzen 7 3800 | 8 | Desktop, 128 GB RAM |
| 00:01:51 | macOS 14.5 | M3 Max | 14 | MacBook Pro, 36 GB RAM |
I realise that the virtual environments may have other workloads running in the background, but the results were fairly consistent over several runs at different times, and the differences are large! About 99% of the execution time is spent in code that performs no file reads or writes.
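For reference, one way to confirm where the time goes is to time and profile a run with the Profile standard library. The sketch below refers to the hypothetical `clean_education` function and file names from the sketch above, not the actual source code.

```julia
# Sketch: profile a run to see which functions dominate the execution time.
using Profile

# The first call includes compilation; time a second run for a fairer picture.
clean_education("individuals.arrow", "education.arrow", "cleaned.arrow")
@time clean_education("individuals.arrow", "education.arrow", "cleaned.arrow")

Profile.clear()
@profile clean_education("individuals.arrow", "education.arrow", "cleaned.arrow")
Profile.print(format = :flat, sortedby = :count)
```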
Thank you in advance for any pointers to ways I can improve performance in the virtual machines, because they are the shared resource in my environment.