I am not getting the same benchmark results, especially since I have AVX-512 on my machine. Also, you are not using multithreading. On my machine, Intel's sort is about 6x faster than Julia's sort, and about 20x faster with multithreading.
Also look at your post:
I think you should compile with the -mavx512f etc. flags. You also don't need to build the library; you can use the templates in src directly.
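
To check that the flags actually took effect in your build, something like the sketch below might help. It only reads the compiler's predefined macros (`__AVX512F__` and friends, as set by GCC and Clang), so it does not touch the sort library at all. Which extensions beyond AVX512F matter for the benchmark is my assumption (I picked DQ, BW and VL), and the names `kHasAvx512f` and `avx512_build_report` are made up for this example.

```cpp
#include <string>

// Each constant is true only if the compiler was invoked with the matching
// -mavx512* flag (or a -march= value that implies it).
#ifdef __AVX512F__
constexpr bool kHasAvx512f = true;
#else
constexpr bool kHasAvx512f = false;
#endif

// The three extensions below are an assumed set, not a list taken from the
// library's build instructions.
#ifdef __AVX512DQ__
constexpr bool kHasAvx512dq = true;
#else
constexpr bool kHasAvx512dq = false;
#endif

#ifdef __AVX512BW__
constexpr bool kHasAvx512bw = true;
#else
constexpr bool kHasAvx512bw = false;
#endif

#ifdef __AVX512VL__
constexpr bool kHasAvx512vl = true;
#else
constexpr bool kHasAvx512vl = false;
#endif

// Builds a one-line summary of which AVX-512 extensions this translation
// unit was compiled with (hypothetical helper, for illustration only).
std::string avx512_build_report() {
    std::string out;
    out += std::string("AVX512F=") + (kHasAvx512f ? "on" : "off");
    out += std::string(" AVX512DQ=") + (kHasAvx512dq ? "on" : "off");
    out += std::string(" AVX512BW=") + (kHasAvx512bw ? "on" : "off");
    out += std::string(" AVX512VL=") + (kHasAvx512vl ? "on" : "off");
    return out;
}
```

Print the result of `avx512_build_report()` from your benchmark's `main`. If AVX512F shows as off, the binary was built without the flag and the comparison is probably not exercising the AVX-512 code path.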