The closest to geopandas we currently have is this: GitHub - evetion/GeoDataFrames.jl: Simple geographical vector interaction built on top of ArchGDAL. I haven’t really played around with multithreading yet, but I don’t see why it wouldn’t work.
Wow, I’ll definitely look into LoopVectorization!
I just compared the “threading efficiency”: I’m getting an improvement factor of 5 in Julia (at least for this simple code), whereas with OpenMP (which I have used extensively) I remember that neither I nor anyone I know could ever get past 3-ish on a quad-core PC.
I’m shocked. This is definitely surpassing my expectations!
Part of the efficiency gain here is that LoopVectorization can automatically switch the special functions from their regular implementations to SIMD versions.
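For context, here is a minimal sketch of the kind of function pair benchmarked later in this thread. The exact kernel (the exponent and the grid) comes from the blog post and is only an assumption here; the point is the structure: a plain Threads.@threads loop versus an @tturbo loop where exp, sin and cos get swapped for SIMD implementations. Since LoopVectorization doesn’t handle Complex elements directly, the @tturbo version writes through a reinterpret(reshape, ...) view of the real and imaginary parts.

using LoopVectorization

# Plain multithreaded version (hypothetical kernel, see note above).
function eval_exp(N)
    a = range(0, stop = 2pi, length = N)
    A = Matrix{ComplexF64}(undef, N, N)
    Threads.@threads for j in 1:N
        for i in 1:N
            r = sqrt(a[i]^2 + a[j]^2)
            A[i, j] = exp(-r) * cis(100 * r)
        end
    end
    return A
end

# Same computation with @tturbo; exp/sin/cos are replaced by SIMD versions.
function eval_exp_tvectorize(N)
    a = range(0, stop = 2pi, length = N)
    A = Matrix{ComplexF64}(undef, N, N)
    Ar = reinterpret(reshape, Float64, A)  # 2×N×N view: real and imaginary parts
    @tturbo for j in 1:N, i in 1:N
        r = sqrt(a[i]^2 + a[j]^2)
        e = exp(-r)
        Ar[1, i, j] = e * cos(100 * r)  # real part
        Ar[2, i, j] = e * sin(100 * r)  # imaginary part
    end
    return A
end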
Although NumPy defaults to row-major, it also supports column-major arrays, so no conversion overhead should be needed here.
Oh, that’s right. Actually in my case M.flags returns F_CONTIGUOUS = True.
I guess there is no overhead whatsoever, as I’ve checked out some tests of F2PY vs standalone Fortran and the reported timings were indistinguishable.
Using GPUs is not that hard either. With GeoData.jl you can operate on rasters on the GPU, and even pass them as GPU kernel arguments (using Adapt.jl’s adapt instead of modify here); this example just uses broadcast:
using GeoData, CUDA, Plots
A = geoarray("somefile.tif")
cuA = modify(CuArray, A) # Copy inner data to GPU
cuA .*= 2 # Broadcast on your gpu
A = modify(Array, cuA) # Copy back to CPU
plot(A) # Still spatial and can be plotted, subsetted etc
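For the kernel-argument case mentioned above, a rough sketch of what that could look like (assuming the Adapt.jl hooks the post refers to; the double! kernel here is just an illustration, not part of any package):

using Adapt, CUDA

function double!(x)  # trivial kernel that doubles every element
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        @inbounds x[i] *= 2
    end
    return nothing
end

gA = adapt(CuArray, A)  # adapt instead of modify, as noted above
@cuda threads=256 blocks=cld(length(gA), 256) double!(gA)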
I’d like to follow your articles, but I don’t see an RSS feed or any other way to keep up with them on your site. Did I miss it?
My 10980XE seems to choke on memory bandwidth.
Would be great to try it on an A64FX, which is an ARM CPU with 512-bit vectors and 32 GB of HBM2 memory, for many times the memory bandwidth of my 10980XE.
Supposedly, Intel’s Sapphire Rapids will bring up to 64 GB of HBM2 memory to at least some models. Hopefully at least one of them will be affordable.
Meanwhile, future Ryzen CPUs will utilize stacking for extremely large L3 caches. The announcement mentioned 192 MiB of L3, which would be enough for a single 3500 x 3500 matrix of Complex{Float64}:

julia> sqrt(192*(1<<20) ÷ sizeof(Complex{Float64}))
3547.2400539010605

meaning that at least up to that size, the benchmarks should be extremely fast on chips featuring it.
Maybe we’ll get Zen4 with both AVX512 and 3d-stacking for a massive L3…
Here is what I got on my 10980XE:
julia> for N in 1_000:1_000:10_000
           @show N
           A = @btime eval_exp($N)
           B = @btime eval_exp_tvectorize($N)
           @assert A ≈ B
           println()
       end
N = 1000
1.115 ms (183 allocations: 15.27 MiB)
212.872 μs (2 allocations: 15.26 MiB)
N = 2000
5.471 ms (184 allocations: 61.05 MiB)
2.967 ms (2 allocations: 61.04 MiB)
N = 3000
10.877 ms (184 allocations: 137.35 MiB)
6.953 ms (2 allocations: 137.33 MiB)
N = 4000
18.455 ms (184 allocations: 244.16 MiB)
12.774 ms (2 allocations: 244.14 MiB)
N = 5000
28.005 ms (183 allocations: 381.49 MiB)
19.700 ms (2 allocations: 381.47 MiB)
N = 6000
39.801 ms (183 allocations: 549.33 MiB)
29.753 ms (2 allocations: 549.32 MiB)
N = 7000
53.669 ms (183 allocations: 747.70 MiB)
41.015 ms (2 allocations: 747.68 MiB)
N = 8000
69.922 ms (184 allocations: 976.58 MiB)
53.899 ms (2 allocations: 976.56 MiB)
N = 9000
88.136 ms (184 allocations: 1.21 GiB)
68.288 ms (2 allocations: 1.21 GiB)
N = 10000
108.854 ms (185 allocations: 1.49 GiB)
85.592 ms (2 allocations: 1.49 GiB)
julia> versioninfo()
Julia Version 1.7.0-DEV.1208
Commit 0a133ff22b* (2021-05-30 11:28 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Notice that at 1000x1000, the @tturbo version was around 5x faster, but that degraded to around 1.25x as the sizes increased. 2000x2000 is already too large for the 10980XE’s caches.
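For scale, the output matrix alone at N = 2000 is already well past the 10980XE’s 24.75 MiB of L3:

julia> 2000^2 * sizeof(ComplexF64) / 2^20  # working set in MiB
61.03515625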
A 5950X has 64 MiB of L3 cache, meaning it should still do well at 2000x2000 – perhaps beating out my 10980XE despite only having AVX2 and being down 2 cores.
Anyone with a 5950X willing to run the benchmarks?
(Just replace @tvectorize with @tturbo, or define const var"@tvectorize" = var"@tturbo".)
EDIT: comparing our 1000x1000 times for the @tturbo version (2600X time / 10980XE time):

julia> 1.332 / 0.213
6.253521126760564

so the 10980XE is around 6x faster. Given 3x the cores and AVX512 vs half-rate AVX2 (naively something like a 3 × 4 = 12x advantage), the 2600X is already doing better than expected.
This post is currently being discussed on the front page of Hacker News under the title
Julia: Faster than Fortran, cleaner than Numpy
It’s been pointed out, @martin.d.maas, that the Python version is missing a square root, and that the Julia version has been changed to have the 100 + im in the exponent, but the Python and Fortran versions have not.
And check this out (Apple M1 in a Mac Mini)…
julia> for N in 1_000:1_000:10_000
           @show N
           A = @btime eval_exp($N)
           B = @btime eval_exp_tvectorize($N)
           @assert A ≈ B
           println()
       end
N = 1000
5.005 ms (24 allocations: 15.26 MiB)
1.515 ms (2 allocations: 15.26 MiB)
N = 2000
19.180 ms (24 allocations: 61.04 MiB)
6.081 ms (2 allocations: 61.04 MiB)
N = 3000
43.262 ms (23 allocations: 137.33 MiB)
15.970 ms (2 allocations: 137.33 MiB)
N = 4000
77.505 ms (23 allocations: 244.14 MiB)
26.313 ms (2 allocations: 244.14 MiB)
N = 5000
119.022 ms (23 allocations: 381.47 MiB)
40.525 ms (2 allocations: 381.47 MiB)
N = 6000
169.070 ms (23 allocations: 549.32 MiB)
58.247 ms (2 allocations: 549.32 MiB)
N = 7000
228.990 ms (23 allocations: 747.68 MiB)
78.732 ms (2 allocations: 747.68 MiB)
N = 8000
296.763 ms (23 allocations: 976.56 MiB)
103.725 ms (5 allocations: 976.56 MiB)
N = 9000
375.306 ms (23 allocations: 1.21 GiB)
130.583 ms (5 allocations: 1.21 GiB)
N = 10000
459.623 ms (23 allocations: 1.49 GiB)
159.329 ms (5 allocations: 1.49 GiB)
julia> versioninfo()
Julia Version 1.8.0-DEV.54
Commit 6d2c0a7766* (2021-06-19 00:28 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin20.5.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.0 (ORCJIT, cyclone)
Environment:
JULIA_NUM_THREADS = 4
The N = 10_000 case requires 100x as much work as the N = 1_000 case, and the M1 takes about 100x longer. Meanwhile, the 2600X took about 180x longer, and the 10980XE took 400x longer. Good scaling with increased problem size on the M1.
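For reference, those ratios from the @tturbo timings above (N = 10_000 time over N = 1_000 time):

julia> round(159.329 / 1.515, digits = 1)    # M1
105.2

julia> round(85.592 / 0.212872, digits = 1)  # 10980XE
402.1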
We have a lot of stencil benchmarks in this thread, which (because we’re evaluating them for a single time point, meaning there is no temporal locality we can take advantage of) are memory bound. There, the M1 was around 2x faster for solving the ODEs.
It seems to do well on these memory-intensive tasks.
Hi Lee,
I’ll definitely look into that! This site is about one month old, and I’m still tweaking many things. I’ll try to keep you posted about any Julia content here on the Julia forum for the moment. Regards!
Oh wow, I thought nobody was reading this, so I uploaded some partial edits, as eigenspace pointed out. I’ll try to incorporate all of the great feedback into an updated version of the post!
One point that was raised on the Fortran forums, which I think is fair, is that apparently you called Fortran from Python, and that has a significant overhead, so the issue is the Python-Fortran interface, not the Fortran code itself.
PS: call it from Julia
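A rough sketch of what that could look like with ccall; the library name, symbol, and signature are assumptions here (say, a Fortran subroutine eval_exp(N, A) filling a complex(8) array, built into a shared library):

# gfortran -O3 -shared -fPIC eval_exp.f90 -o libevalexp.so   (hypothetical build)
N = 1000
A = Matrix{ComplexF64}(undef, N, N)
ccall((:eval_exp_, "./libevalexp.so"), Cvoid,
      (Ref{Cint}, Ptr{ComplexF64}),
      N, A)   # gfortran appends a trailing underscore; arguments pass by reference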
It showed up in HN recently, so not such a big surprise.
@martin.d.maas welcome!
Building on what others above have mentioned, I think it’s important to try to be careful & avoid overstating the performance of Julia.
It’s become a sport for some to show that Julia is not always the fastest, etc. Overstatements only motivate these types of people…
I agree. I was just sharing my experience. I’ve just learned a lesson in online etiquette for bloggers, especially when making this kind of comparison, as some people will only read the title of a post.
By the way, a standalone gfortran version linked to MKL is slightly faster (20% or so) than the optimized LoopVectorization version now.
The trade-off is that the Fortran version is 3x the number of lines of code of the Julia version (75 vs 25), and one has to give up on Python integration for this sort of performance, as F2PY currently introduces considerable overhead.
FWIW, the Fortran code was calling allocate and deallocate outside of the timing, while the Julia code spent 13.5% of its time in GC. The Julia code was also about 13% slower. This is an obvious candidate for the performance difference.
Also, the problem is memory rather than compute bound, so this memory handling is entirely what determines performance, and not things like the relative performance of the sincos implementations.
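For what it’s worth, one way to keep the allocation outside the timed region on the Julia side is to benchmark an in-place variant (reusing the hypothetical kernel sketched earlier in the thread):

using LoopVectorization, BenchmarkTools

function eval_exp_tvectorize!(A, N)
    a = range(0, stop = 2pi, length = N)
    Ar = reinterpret(reshape, Float64, A)  # real/imag view of the preallocated output
    @tturbo for j in 1:N, i in 1:N
        r = sqrt(a[i]^2 + a[j]^2)
        e = exp(-r)
        Ar[1, i, j] = e * cos(100 * r)
        Ar[2, i, j] = e * sin(100 * r)
    end
    return A
end

N = 1_000
A = Matrix{ComplexF64}(undef, N, N)   # allocated once, outside the benchmark
@btime eval_exp_tvectorize!($A, $N)   # no allocations or GC inside the timed region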
Corrected that. The performance is exactly the same now!
The Julia code using LoopVectorization.jl is just as good as a standalone gfortran program + Intel’s MKL!!