Questions on a number of code acceleration techniques

performance
hpc
parallel

#1

I have been trying to step up my game in Julia by thinking more about parallelism and speed so I can write the fastest version of my code, but I find myself faced with a number of not-so-thoroughly-documented options, so I have a few questions:

  1. I understand that @inbounds drops bounds checking when accessing an array element, but why would you ever not use it? I assume no functioning piece of code should ever index an array out of bounds. So after testing my code and making sure no out-of-bounds error can happen, isn’t it a no-brainer to just add @inbounds to speed things up a little?

  2. When would @fastmath be a good or a bad idea?

  3. If I understand correctly, @simd attempts to vectorize a loop when each iteration is independent of the rest. So how does this accelerate things, given that Julia’s cooperative threading (as far as I understand) only switches to another task when the active one is blocked or waiting? And how similar or different is this to the Julia backend of GPUArrays and to the @async-@sync pattern?

  4. @spawnat vs Channel. The first is about calling a function on another process, and channels manage communication between processes at a more fine-grained level, right? Any nice comparison of the two would be appreciated.

  5. MPI.jl is the standard tool for distributed programming in Julia, right? Is there any other, perhaps easier, tool for managing distributed programs, either on a cluster or on AWS?

As I am sure you can tell, my experience in parallel and distributed programming is somewhere between none and limited. Any other ideas, tools or docs that I missed and that might be useful here would be nice to add. I am sure this post will show up in a Google search by anyone in my current shoes in the future, so kindly be thorough. Thanks a lot!


#2

The difference is usually imperceptible (single-digit percentages). And adding @inbounds would be pretty much like claiming you’re a better programmer than every other programmer in existence (who don’t add it, because we – yes, including myself – know that our code is riddled with latent bugs).

It depends. Primarily, this just lets the compiler reorder expressions. So if 1 * x * 2 is too slow, you could add @fastmath, or just simplify the expression yourself.
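A minimal sketch of the kind of reordering this enables (function names are my own, purely illustrative):

```julia
# Without @fastmath the compiler must preserve the exact
# left-to-right evaluation order of floating-point additions.
function plain_sum(x)
    s = 0.0
    for v in x
        s += v
    end
    return s
end

# With @fastmath the compiler may reassociate the additions,
# e.g. into several independent partial sums. This can change
# the result in the last bits but enables more optimization.
function fastmath_sum(x)
    s = 0.0
    @fastmath for v in x
        s += v
    end
    return s
end
```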

Vectorization happens at the instruction level. For example, instead of computing x[1] = y[1] + z[1]; x[2] = y[2] + z[2], the processor can essentially compute x[1:2] = y[1:2] + z[1:2] and double the data throughput. Actually, this is where @fastmath can come in handy too (although @simd will already enable it to a limited extent), since there are other cases where reordering the computation could change the answer slightly. Thread-level parallelism (including GPUs and @async tasks) lets processors execute different instructions doing different things. Vectorization can only do the same thing to lots of data.
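The element-wise addition above can be sketched as a @simd loop (the function name is my own):

```julia
# Each iteration is independent, so the compiler may process
# several elements per machine instruction (e.g. 2 or 4 doubles
# at a time, depending on the hardware). @simd asserts that
# reordering the iterations is allowed.
function vadd!(x, y, z)
    @inbounds @simd for i in eachindex(x)
        x[i] = y[i] + z[i]
    end
    return x
end
```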

I’ll let others chime in to answer the rest.


#3

I think the problem with these is you’re looking at them from the point of view of a numerical scientist looking for speed, and are not seeing that for each of these the default is set so that way people don’t shoot themselves in the foot, or they exist for other applications.

If you @inbounds something that actually indexes outside the bounds and, for example, mutate that value, Julia can instantly segfault. That’s a good reason to have bounds checking on by default, especially since in many instances it doesn’t make a difference. You should only add @inbounds in instances where you’re certain you cannot go out of bounds.
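A sketch of the safe pattern (function names are illustrative): derive the loop range from the array itself, so the elided checks provably cannot fail.

```julia
# Safe: eachindex(x) can never produce an out-of-bounds index
# for x, so dropping the checks cannot segfault here.
function sum_inbounds(x)
    s = zero(eltype(x))
    @inbounds for i in eachindex(x)
        s += x[i]
    end
    return s
end

# Dangerous: nothing ties n to length(x). If n > length(x),
# @inbounds turns a clean BoundsError into undefined behavior
# (possibly a segfault, possibly silently wrong results).
function sum_first_n(x, n)
    s = zero(eltype(x))
    @inbounds for i in 1:n
        s += x[i]
    end
    return s
end
```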

It’s a good idea if you want fast math, but it might drop some accuracy (around 3 ulp, if I’m not mistaken) with some operations. If you’re working on something like differential equations, that’s far below the discretization error, so it’s a good thing to do. If you’re working on some algorithm which is supposed to be exact in some sense, it’s an awful thing to do. Since Julia cannot know in advance what your application is, it makes sense for it to default to accuracy and let you turn on fastmath as needed.

@jameson explained it well. It has nothing to do with the other forms of parallelism. In fact, it’s common for things like GPUs and Xeon Phis to exploit lots of SIMD vectorization in their operations. But in general, don’t use @simd unless you really know what you’re doing. I find that the default -O3 setup auto-vectorizes better than most people do.

Channel works with green threads, so it’s asynchronous but not “parallel”. @spawnat is actual multiprocessing. The first is good for things like IO-bound tasks, while the second is good for compute-bound tasks. Numerical computing is normally compute-bound, so it usually doesn’t use green threads, but other applications like web servers do.
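A rough sketch of the distinction (on Julia versions after 0.6 the multiprocessing primitives live in the Distributed standard library; the multiprocessing part is commented out because it needs worker processes):

```julia
# Green threads: a producer Task and the consumer share one
# process; execution interleaves cooperatively. Good for IO.
ch = Channel{Int}(4)
@async for i in 1:4
    put!(ch, i^2)      # producer yields to the consumer as needed
end
squares = [take!(ch) for _ in 1:4]   # [1, 4, 9, 16]

# Multiprocessing: ship compute-bound work to another OS process.
# using Distributed; addprocs(1)
# fut = @spawnat 2 sum(abs2, 1:10^6)   # runs on worker 2
# fetch(fut)                           # blocks until the result arrives
```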

Julia’s built-in multiprocessing is probably the standard tool: pmap, @spawnat, @parallel. MPI.jl is standard for distributed programming in general, so feel free to use it if you’re used to it. Quite frankly, I have written enough MPI in my life to know I don’t want to use it. For most parallel tasks it’s too low-level a form of control: using things like futures is much easier and won’t hurt runtimes, especially in code where data transfer is minimal (which I find to be almost every application that is not low-level parallelization of numerical linear algebra).
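A minimal sketch of the futures-based style being recommended here (on Julia versions after 0.6 these live in the Distributed standard library; earlier versions had them in Base):

```julia
using Distributed
addprocs(2)                  # two local worker processes

@everywhere f(x) = x^2       # define f on every worker

# A future: remotecall returns immediately, fetch blocks for the value.
fut = remotecall(f, workers()[1], 10)
println(fetch(fut))          # 100

# Higher level: pmap distributes the map over all workers.
println(pmap(f, 1:5))        # [1, 4, 9, 16, 25]
```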


#4

I think Stefan posted an example a few months ago showing how you could drop all of the ulp. Of course, it also drops NaN, Inf, signed zero, and denormals, so if your algorithm accidentally strays into those areas, the answers are undefined.


#5

@jameson thanks for your answers.

But doesn’t this increase the allocations? I used to think that x[1] = y[1] + z[1]; x[2] = y[2] + z[2] would be faster because there are fewer allocations. Besides, this should only lead to an acceleration if there were significant overhead in starting a new instruction in Julia, so that increasing the throughput by repeating the same instruction over all the data is somewhat faster than doing, say, I1 then I2 on the first data item, followed by I1 and I2 on the second, etc., where I1 and I2 are two instructions to be applied to all the data. Is this line of thought correct?


#6

No, what @jameson wrote is different. Essentially what he’s saying is that SIMD means the processor performs multiple mathematical operations at the same time. He’s using Julia shorthand for it.


#7

@ChrisRackauckas thank you for your comments.

So it’s basically the way to go for any low-accuracy application, though then using a lower-precision number type like Float16 or something would probably be faster still. But I get it.

I saw the conditions that a loop must satisfy before using @simd in the docs, so I guess that’s what you mean by “really know what you’re doing”, isn’t it? Also what is that -O3 setup?

I see, so that’s mostly for generators, I assume, which provide some asynchrony but not real parallelism: generating data on demand, that kind of thing. I think I get it.

I see, so these can also be used to manage data transfer and function calls to remote computers, with the help of distributed arrays and network IO, I assume. Is there any nice example in the Julia modules you know of that I can read for demos on using these tools for distributed programming? I am not exactly a networks guy!


#8

OK, I see, but presumably the overhead argument still applies, just for processor instructions rather than Julia instructions.


#9

Lower precision is different. @fastmath can drop a few digits of accuracy in something like ^ and, as @jameson pointed out, it can discard the handling of tough edge cases like subnormal numbers, but I would say that in standard applications it’s still “safe” in many cases where Float32 isn’t “safe”. Though you can get more SIMD parallelism out of smaller floating-point numbers, so yes, those are much more efficient if you can use them.

No. “Really know what you’re doing” means you know a lot more than the manual. Try it yourself. In many cases @simd will slow code down. -O3 is the compiler optimization level which will add SIMD automatically when it makes sense. Just check: it’ll usually beat you.
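One way to check whether the compiler already auto-vectorized a loop for you (function name is my own):

```julia
# A simple loop the optimizer can usually vectorize on its own.
function axpy!(y, a, x)
    @inbounds for i in eachindex(y)
        y[i] += a * x[i]
    end
    return y
end

# Look for vector types like `<4 x double>` in the output, which
# indicate the loop was vectorized without any @simd hint.
# @code_llvm axpy!(zeros(8), 2.0, ones(8))
```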

This is not exactly networking, it’s just multiprocessing. However, different processes can be on different computers, so this is how you parallelize across different nodes of a cluster. Just see the demos on pmap and @parallel for getting started. You literally just add processes and it “just works” (make sure your algorithm is properly parallelized, though). I have a blog post for using the --machinefile setup:
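The setups discussed in this thread boil down to how you launch Julia (hostnames below are placeholders):

```shell
# Start Julia with 4 local worker processes:
julia -p 4 myscript.jl

# Or list one host per line in a machine file and let Julia
# ssh to each node of the cluster, e.g. a file containing
#   node001
#   node002
julia --machinefile machines.txt myscript.jl
```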


#10

I see, thanks a lot. I will read it and try to test it using Amazon Web Services. I hope it all “just works”!


#11

The --machinefile setup is for clusters. For AWS just use the -p setup or just use addprocs() when in Julia.


#12

One quite neat example of @fastmath is in LossFunctions.jl.

Long story short, in the package we define a generic P-th power absolute distance loss, which is basically just the very simple family of functions L(r) = |r|^P. So far there is no problem, because in Julia we can implement such an abstraction quite easily in a general way. However, a quite common convention for the squared distance loss (P=2) is to halve the result: L(r) = 1/2 |r|^2. From the perspective of a software engineer this little deviation is quite annoying, because something like this usually puts one in the position of having to choose between a nice, coherent, simple design, or a more “useful” but hackish implementation that breaks the design in various places.

With @fastmath we were able to avoid that problem by introducing a decorator type called ScaledLoss that just prefixes the computation with a constant scale factor:

@fastmath deriv(l::ScaledLoss{<:Loss,K}, r::Number) where {K} = K * deriv(l.loss, r)

So why fastmath? Well the derivative of |r|^2 will be 2*r, so if we scale by 1/2 we would get 1/2 * (2 * r), which qualitatively should just be r. Yet, because floating point multiplication is not necessarily associative, our ScaledLoss abstraction layer would cause two unnecessary multiplications.

With @fastmath and @inline, however, the two constants fold away. This means that despite these layers of abstraction we suffer no performance penalty, as the following code shows.

julia> const LeastSquaresLoss = LossFunctions.ScaledDistanceLoss{L2DistLoss,0.5}
LossFunctions.ScaledDistanceLoss{LossFunctions.LPDistLoss{2},0.5}

julia> @code_llvm deriv(L2DistLoss(), 0.0)
define double @julia_deriv_60852(double) #0 !dbg !5 {
top:
  %1 = fmul double %0, 2.000000e+00
  ret double %1
}

julia> @code_llvm deriv(LeastSquaresLoss(), 0.0)
define double @julia_deriv_60857(double) #0 !dbg !5 {
top:
  ret double %0
}

To me this is quite impressive about Julia, because consider how simple the functions are that we are reasoning about here. In most languages I have worked with, any kind of abstraction around a function that simple would cause noticeable overhead, such as a non-inlined function call or, worse, a chain of virtual method lookups.

tl;dr: in certain scenarios, @fastmath allows very deep abstraction layers without any performance penalty.