Improving an algorithm that computes GPS distances

That’s not great, I wonder why? Which versions of Julia and LoopVectorization was this run on? (If you have a minute, a run without loading LoopVectorization would test whether that is the issue; I guess it could be fussy about hardware.)

Thanks for pulling them together. Interesting that Jax uses so little memory; I guess it doesn’t actually materialise diff_lat the way numpy does.

Without LoopVectorization, the result is much more reasonable:

julia> @btime distances_tullio($a, $b);
551.643 ms (648 allocations: 190.91 MiB)

LoopVectorization was v0.7.2

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, ivybridge)
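
For reference, distances_tullio is defined in an earlier post; what follows is a minimal sketch of the pattern, assuming haversine distances over (lat, lon) columns in radians and a 6371 km Earth radius (the exact definition in the thread may differ).

using Tullio, LoopVectorization  # Tullio emits @avx kernels when LoopVectorization is loaded

# Sketch only: the column layout of `a` and `b` and the 6371 km radius are assumptions.
function distances_tullio(a, b)
    lat1, lon1 = view(a, :, 1), view(a, :, 2)
    lat2, lon2 = view(b, :, 1), view(b, :, 2)
    @tullio d[i, j] := 2 * 6371 * asin(sqrt(
        sin((lat2[j] - lat1[i]) / 2)^2 +
        cos(lat1[i]) * cos(lat2[j]) * sin((lon2[j] - lon1[i]) / 2)^2))
end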

Jax’s memory footprint is the same as the others’; it just works with Float32 arrays while the others use Float64.
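
One way to check this like for like is to convert the Julia inputs to Float32 before benchmarking; a small sketch, where a and b are the coordinate matrices from the @btime calls above:

using BenchmarkTools

a32, b32 = Float32.(a), Float32.(b)  # same data, half the bytes per element
@btime distances_tullio($a32, $b32)  # the MiB figures should drop roughly in half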

Thanks, that’s better. Looking things up, the i7-3770 does not have AVX2; perhaps that’s the issue.

I meant that, accounting for 32/64, it’s like the efficient Julia algorithms, and unlike the first one / the numpy version, which make some large intermediate arrays. Perhaps this would not surprise someone who knew more about Jax.
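
To illustrate (a sketch with hypothetical names): the numpy-style version materialises n×n intermediates before the result, while a fully fused Julia broadcast allocates only the result.

# Hypothetical setup: n points per set, coordinates in radians.
n = 5_000
lat1, lon1, lat2, lon2 = rand(n), rand(n), rand(n), rand(n)

# numpy-style: each line below allocates an n×n intermediate matrix.
diff_lat = lat1 .- lat2'
diff_lon = lon1 .- lon2'

# Fused broadcast: the differences fold into one kernel, so the only n×n
# allocation is the result itself (6371 km is roughly Earth's radius).
d = 2 .* 6371 .* asin.(sqrt.(
        sin.((lat1 .- lat2') ./ 2) .^ 2 .+
        cos.(lat1) .* cos.(lat2') .* sin.((lon1 .- lon2') ./ 2) .^ 2))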

But it’s written exactly this way in the first post:

Yes, you are right, my fault.

@mcabbott @Vasily_Pisarev I would love to update my test with your new code, but I don’t see most of the functions you are using in this thread. Are they in a gist, or can you post the final version? Thanks!

This is my initial investment -_- Also Julia has this on its main page:

Easy to use
Julia has high-level syntax, making it an accessible language for programmers from any background or experience level. Browse the Julia microbenchmarks to get a feel for the language.

So is it a complex tool or is it easy to use?

Thank you all! @cgarciae and I were trying to get our hands on Julia for a problem we encountered, benchmarking it against numpy and JAX; hopefully we can learn a lot from this thread.

Thanks

Hmm. I’d have to argue with this: I took existing Numpy code and, just by adding the jax.jit decorator, I got all this speedup, and it also runs on GPU if you install Jax with CUDA support. I found it really easy to use.

Vasily collected lots of them in this post above.

BTW, I managed to dig up an old computer with an i5-3427U, on which distances_threaded_simd seems to benefit a little from @avx, but distances_tullio is a disaster. So perhaps this can be narrowed down… LoopVectorization uses CpuId.jl and should ideally detect these things.
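
For anyone who wants to check their own machine, CpuId.jl can report what it detects; a small sketch (which feature flags LoopVectorization actually keys off is an implementation detail):

using CpuId  # the package LoopVectorization consults for hardware detection

cpuinfo()    # prints a summary table, including supported SIMD extensions
simdbytes()  # SIMD register width in bytes: 32 with AVX, 16 without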

It was claimed upthread that Jax only computed the results lazily, which would mean that the comparison wasn’t very relevant. Can you confirm this?

Yes. Both. It’s super easy to use, as evidenced by the fact that I’m a biologist with no CS training and I’m able to be incredibly productive with it. But eking out every ounce of performance is a different matter. The fact that both simplicity and complexity can exist in the same language is its strength.

“Easy to use” usually refers to the learning curve. I don’t think Julia is hard; I am just surprised newcomers like me get “attacked” for not knowing everything from the beginning or not reading the whole manual. That’s just not how you learn these days.

The code is under the “Function definitions” spoiler in the post above.

That’s fine. Just next time, please share your benchmarks here when you ask for help and get it. As we see, it benefits everyone that way.

I think the amount of patience depends a bit on how cocksure the newcomer appears to be. I notice that I sometimes get a bit snarky when someone new to the language shows off their (quite understandably) flawed Julia code vs some other language, and proceeds to loudly proclaim how far behind Julia is :wink:

And this happens quite frequently.

I think most people who ask for help get very friendly treatment, but “with great confidence comes sharper feedback.”

I don’t think people are attacking you, but you did put a bunch of benchmark posts on social media with some pretty inefficient code…

Julia can require some digging to really tune things, but the end result is typically faster than JAX, sometimes rivaling FORTRAN. So when you run benchmarks while still brand new to the language, it’s probably best to say so: “I don’t really know Julia well, but JAX seems fast!” That’s not how your tweets read, though.

There are a lot of myths the Julia community has to continuously battle from the blowhards in the Python community (not everyone is like that, but there’s a lot of reinforcement bias going around). Posts like this just make our lives harder, so that’s probably why you’ve met some friction…

JAX is fast, though, and yeah, it is pretty easy to use. But even though the code is written one way, JAX will be doing a lot of optimizations behind your back. Just make sure you compare apples to apples when sharing research publicly, or people might say “hmmm”…

@Vasily_Pisarev You should use np.asarray at the end, since np.array forces a copy of the data. I made these changes and got these numbers using 8 cores:

distances  1.744 s (40 allocations: 286.18 MiB)
distances_bcast  1.464 s (30 allocations: 95.44 MiB)
distances_threaded  330.340 ms (105 allocations: 190.82 MiB)
distances_threaded_simd  150.763 ms (104 allocations: 190.82 MiB)
dist_np_test  1.413 s (39 allocations: 95.37 MiB)
dist_jax_test 259.303 ms (8 allocations: 320 bytes)

code: final.jl · GitHub
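
For readers who don’t follow the link, here is a hedged sketch of the pattern behind distances_threaded_simd; the real version uses @avx from LoopVectorization (Base’s @simd stands in here), and the haversine details are assumptions:

using Base.Threads

function distances_threaded_simd(lat1, lon1, lat2, lon2)
    d = Matrix{Float64}(undef, length(lat1), length(lat2))
    @threads for j in eachindex(lat2)            # one chunk of columns per thread
        @inbounds @simd for i in eachindex(lat1)
            s = sin((lat2[j] - lat1[i]) / 2)^2 +
                cos(lat1[i]) * cos(lat2[j]) * sin((lon2[j] - lon1[i]) / 2)^2
            d[i, j] = 2 * 6371 * asin(sqrt(s))   # 6371 km ≈ Earth's radius
        end
    end
    return d
end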

Wow! Julia + SIMD is amazing. I am guessing Jax doesn’t use SIMD. As noted on Twitter, the allocation numbers for Jax don’t mean anything.

Thanks all! I’ve learned a lot today :smiley:

If you make the arrays CuArrays then the broadcasted form will act on the GPU. ]add CuArrays will even install CUDA for you. I’m curious about the GPU timings.
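
A sketch of that suggestion, assuming a and b are the coordinate matrices from the CPU benchmarks above:

using CuArrays  # `]add CuArrays` also fetches the CUDA toolkit artifacts

a_gpu, b_gpu = CuArray(a), CuArray(b)  # move the inputs to the GPU once
d_gpu = distances_bcast(a_gpu, b_gpu)  # every dotted op now runs as a CUDA kernel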

Some further tweaking over here gets distances_threaded_simd down to 65 ms (with Float32, on a 6-core CPU) and distances_tullio down to 28 ms.

distances_bcast also works on the GPU (31 ms, on an ancient one), as does distances_tullio via KernelAbstractions, though the latter is very slow right now; not sure why.

In the broadcasted version, won’t cos.(lat1) .* cos.(lat2') cause these cosines to be calculated n^2 times instead of n times, due to the transpose?
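
For what it’s worth, dot fusion does turn that whole expression into a single kernel, evaluating each cosine once per (i, j) pair; hoisting the cosines avoids this (a sketch, names matching the question):

coslat1 = cos.(lat1)      # n cos evaluations
coslat2 = cos.(lat2)      # n cos evaluations
c = coslat1 .* coslat2'   # n^2 multiplications, but no further cos calls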
