I just decided to migrate from Python+Fortran to Julia, as Julia came out faster in my tests.

The closest to geopandas we currently have is this: GitHub - evetion/GeoDataFrames.jl: Simple geographical vector interaction built on top of ArchGDAL. I haven’t really played around with multithreading yet, but I don’t see why it wouldn’t work.
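For what it's worth, here is a minimal sketch of what threaded per-row processing could look like (the file name, the geometry column name, and the process function are all hypothetical, not from this thread):

using GeoDataFrames, DataFrames

df = GeoDataFrames.read("parcels.gpkg")       # vector data comes back as a DataFrame
results = Vector{Float64}(undef, nrow(df))
Threads.@threads for i in 1:nrow(df)          # rows are independent, so threading is straightforward
    results[i] = process(df.geometry[i])      # `process` stands in for your per-geometry computation
end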

Wow, I’ll definitely look into LoopVectorization!

I just compared the “threading efficiency”. I’m getting an improvement factor of 5 in Julia (at least for this simple code), whereas with OpenMP (which I have used extensively) I remember that neither I nor anyone I know could ever get past 3-ish on a quad-core PC.
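(For anyone reproducing this: the thread count is fixed when Julia starts, e.g. julia -t 4 or the JULIA_NUM_THREADS environment variable, and can be checked from the REPL; the 4 below just assumes a quad-core machine.)

julia> Threads.nthreads()
4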

I’m shocked. This is definitely surpassing my expectations!

4 Likes

Part of the efficiency gain here is that LoopVectorization can automatically switch the special functions from their regular implementations to SIMD versions.
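As a toy illustration (not the benchmark code from the post): under @turbo or @tturbo the exp call below is replaced with a vectorized implementation rather than the scalar one from Base.

using LoopVectorization

function exp_all!(out, x)
    @turbo for i in eachindex(x)   # exp is swapped for a SIMD version inside the @turbo loop
        out[i] = exp(x[i])
    end
    return out
end

x = rand(10^6)
exp_all!(similar(x), x)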

2 Likes

Although NumPy defaults to row-major, it also supports column-major (Fortran-ordered) arrays, so no conversion overhead should be necessary here.

1 Like

Oh, that’s right. Actually in my case M.flags returns F_CONTIGUOUS = True.

I guess there is no overhead whatsoever, as I’ve checked out some tests of F2PY vs standalone Fortran and the reported timings were indistinguishable.

Using GPUs is not that hard either :slight_smile:

With GeoData.jl you can operate on rasters on the GPU, and even pass them as GPU kernel arguments (using Adapt.jl’s adapt instead of modify); this example just uses broadcasting:

using GeoData, CUDA, Plots
A = geoarray("somefile.tif")
cuA = modify(CuArray, A) # Copy inner data to GPU
cuA .*= 2 # Broadcast on your gpu
A = modify(Array, cuA) # Copy back to CPU
plot(A) # Still spatial and can be plotted, subsetted etc
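And the kernel-argument route mentioned above would look something like this (a sketch only, relying on the Adapt support described):

using GeoData, CUDA, Adapt
A = geoarray("somefile.tif")
Akern = Adapt.adapt(CuArray, A) # Convert the wrapped data so it can be passed as a GPU kernel argument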
3 Likes

I’d like to follow your articles, but I don’t see an RSS feed or any other way to keep up with them on your site. Did I miss it?

My 10980XE seems to choke on memory bandwidth.
It would be great to try it on an A64FX, an ARM CPU with 512-bit vectors and 32 GB of HBM2 memory, giving many times the memory bandwidth of my 10980XE.

Supposedly, Intel’s Sapphire Rapids will bring up to 64 GB of HBM2 memory to at least some models. Hopefully at least one of them will be affordable.

Meanwhile, future Ryzen CPUs will use 3D stacking for extremely large L3 caches. The announcement mentioned 192 MiB of L3, which would be enough for a single

julia> sqrt(192*(1<<20) ÷ sizeof(Complex{Float64}))
3547.2400539010605

3500 x 3500 matrix of Complex{Float64}, meaning that benchmarks at least up to that size should be extremely fast on chips featuring it.
Maybe we’ll get Zen4 with both AVX512 and 3d-stacking for a massive L3…
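As a cross-check, the footprint of a 3500 x 3500 Complex{Float64} matrix comes in just under those 192 MiB:

julia> 3500^2 * sizeof(Complex{Float64}) / (1 << 20)
186.920166015625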

Here is what I got on my 10980XE:

julia> for N in 1_000:1_000:10_000
           @show N
           A = @btime eval_exp($N)
           B = @btime eval_exp_tvectorize($N)
           @assert A ≈ B
           println()
       end
N = 1000
  1.115 ms (183 allocations: 15.27 MiB)
  212.872 μs (2 allocations: 15.26 MiB)

N = 2000
  5.471 ms (184 allocations: 61.05 MiB)
  2.967 ms (2 allocations: 61.04 MiB)

N = 3000
  10.877 ms (184 allocations: 137.35 MiB)
  6.953 ms (2 allocations: 137.33 MiB)

N = 4000
  18.455 ms (184 allocations: 244.16 MiB)
  12.774 ms (2 allocations: 244.14 MiB)

N = 5000
  28.005 ms (183 allocations: 381.49 MiB)
  19.700 ms (2 allocations: 381.47 MiB)

N = 6000
  39.801 ms (183 allocations: 549.33 MiB)
  29.753 ms (2 allocations: 549.32 MiB)

N = 7000
  53.669 ms (183 allocations: 747.70 MiB)
  41.015 ms (2 allocations: 747.68 MiB)

N = 8000
  69.922 ms (184 allocations: 976.58 MiB)
  53.899 ms (2 allocations: 976.56 MiB)

N = 9000
  88.136 ms (184 allocations: 1.21 GiB)
  68.288 ms (2 allocations: 1.21 GiB)

N = 10000
  108.854 ms (185 allocations: 1.49 GiB)
  85.592 ms (2 allocations: 1.49 GiB)

julia> versioninfo()
Julia Version 1.7.0-DEV.1208
Commit 0a133ff22b* (2021-05-30 11:28 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

Notice that at 1000x1000, the @tturbo version was around 5x faster, but the advantage degraded to around 1.25x as the sizes increased. 2000x2000 is already too large for the 10980XE’s caches.
A 5950X has 64 MiB of L3 cache, meaning it should still do well at 2000x2000 – perhaps beating out my 10980XE despite only having AVX2 and being down 2 cores.
Anyone with a 5950X willing to run the benchmarks?
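(For reference, a 2000 x 2000 Complex{Float64} matrix is just under 64 MiB, matching the 61.04 MiB allocations reported above:)

julia> 2000^2 * sizeof(Complex{Float64}) / (1 << 20)
61.03515625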

(Just replace @tvectorize with @tturbo, or define const var"@tvectorize" = var"@tturbo".)

EDIT: comparing our 1000x1000 times for the @tturbo version (2600X time / 10980XE time):

julia> 1.332 / 0.213
6.253521126760564

so the 10980XE is around 6x faster. Given 3x the cores and AVX512 vs. half-rate AVX2 (naively closer to a 12x advantage), the 2600X is already doing better than expected.

3 Likes

This post is currently being discussed on the front page of Hacker News under the title

Julia: Faster than Fortran, cleaner than Numpy

It’s been pointed out, @martin.d.maas, that the Python version is missing a square root, and that the Julia version has been changed to have 100 + im in the exponent while the Python and Fortran versions have not.

2 Likes

And check this out (Apple M1 in a Mac Mini)…

julia> for N in 1_000:1_000:10_000
           @show N
           A = @btime eval_exp($N)
           B = @btime eval_exp_tvectorize($N)
           @assert A ≈ B
           println()
       end
N = 1000
  5.005 ms (24 allocations: 15.26 MiB)
  1.515 ms (2 allocations: 15.26 MiB)

N = 2000
  19.180 ms (24 allocations: 61.04 MiB)
  6.081 ms (2 allocations: 61.04 MiB)

N = 3000
  43.262 ms (23 allocations: 137.33 MiB)
  15.970 ms (2 allocations: 137.33 MiB)

N = 4000
  77.505 ms (23 allocations: 244.14 MiB)
  26.313 ms (2 allocations: 244.14 MiB)

N = 5000
  119.022 ms (23 allocations: 381.47 MiB)
  40.525 ms (2 allocations: 381.47 MiB)

N = 6000
  169.070 ms (23 allocations: 549.32 MiB)
  58.247 ms (2 allocations: 549.32 MiB)

N = 7000
  228.990 ms (23 allocations: 747.68 MiB)
  78.732 ms (2 allocations: 747.68 MiB)

N = 8000
  296.763 ms (23 allocations: 976.56 MiB)
  103.725 ms (5 allocations: 976.56 MiB)

N = 9000
  375.306 ms (23 allocations: 1.21 GiB)
  130.583 ms (5 allocations: 1.21 GiB)

N = 10000
  459.623 ms (23 allocations: 1.49 GiB)
  159.329 ms (5 allocations: 1.49 GiB)


julia> versioninfo()
Julia Version 1.8.0-DEV.54
Commit 6d2c0a7766* (2021-06-19 00:28 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.5.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4

The N = 10_000 case requires 100x the work of the N = 1_000 case, and the M1 takes about 100x longer.
Meanwhile, the 2600X took about 180x longer, and the 10980XE about 400x longer.
That is good scaling with increasing problem size on the M1.

We have a lot of stencil benchmarks in this thread, which (because we’re evaluating them for a single time point, meaning there is no temporal locality we can take advantage of) are memory bound. There, the M1 was around 2x faster for solving the ODEs.
It seems to do well on these memory-intensive tasks.

3 Likes

Hi Lee,

I’ll definitely look into that! This site is about one month old, and I’m still tweaking many things. For the moment, I’ll try to keep you posted about any Julia content here on the Julia forum. Regards!

Oh wow, I thought nobody was reading this, so I uploaded some partial edits, as eigenspace pointed out. I’ll try to incorporate all of the great feedback into an updated version of the post!

3 Likes

One point that was raised in the Fortran forums, which I think is fair, is that apparently you called Fortran from Python, and that has a significant overhead; the issue is the Python-Fortran interface, not the Fortran code itself.

ps: call it from Julia :joy:

6 Likes

It showed up in HN recently, so not such a big surprise.

@martin.d.maas welcome!
Building on what others above have mentioned, I think it’s important to be careful and avoid overstating the performance of Julia.
It’s become a sport for some to show that Julia is not always the fastest, etc.
Overstatements motivate these types of people…

7 Likes

I agree. I was just sharing my experience. I’ve just learned a lesson in online etiquette for bloggers, especially when making this kind of comparison, as some people will only read the title of a post.

By the way, a standalone gfortran version linked to MKL is slightly faster (20% or so) than the optimized LoopVectorization version now.

The trade-off is that the Fortran version is 3x the number of lines of code of the Julia version (75 vs. 25), and one has to give up on Python integration for this level of performance, as F2PY currently introduces considerable overhead.

2 Likes

FWIW, the Fortran code was calling allocate and deallocate outside of the timing, while the Julia code spent 13.5% of its time in GC. The Julia code was also about 13% slower.

This is an obvious candidate for the cause of the performance difference.

Also, the problem is memory bound rather than compute bound, so this memory handling entirely determines performance, not things like the relative speed of the sincos implementations.
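To make that comparison apples-to-apples, the allocation can be hoisted out of the timed region on the Julia side too. A minimal sketch of the pattern (with a made-up, real-valued kernel; the point is only where the allocation happens):

using BenchmarkTools

function fill_field!(A)                 # writes into A, so the N x N output is not allocated in the timed region
    N = size(A, 1)
    a = range(0, 2pi, length = N)
    Threads.@threads for j in 1:N
        for i in 1:N
            A[i, j] = exp(-(a[i]^2 + a[j]^2))
        end
    end
    return A
end

A = Matrix{Float64}(undef, 1000, 1000)  # allocated once, outside the benchmark, like Fortran's allocate
@btime fill_field!($A);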

6 Likes

Corrected that. The performance is exactly the same now!

The Julia code using LoopVectorization.jl is just as good as a standalone gfortran program + Intel’s MKL!!

14 Likes