Example C vs Julia vs Python

You can always start up Julia with julia --check-bounds=no to globally turn off bounds checking. I usually prefer to do things like

@inbounds begin
    for pp = 1:SizeZ-1, nn = 1:SizeY-1, mm = 1:SizeX
        Hx[mm,nn,pp] = (Chxh[mm,nn,pp] * Hx[mm,nn,pp] + 
                        Chxe[mm,nn,pp] * ((Ey[mm,nn,pp+1] - Ey[mm,nn,pp]) - (Ez[mm,nn+1,pp] - Ez[mm,nn,pp])))
    end

    for pp = 1:SizeZ-1, nn = 1:SizeY, mm = 1:SizeX-1
        Hy[mm,nn,pp] = (Chyh[mm,nn,pp] * Hy[mm,nn,pp] + 
                        Chye[mm,nn,pp] * ((Ez[mm+1,nn,pp] - Ez[mm,nn,pp]) - (Ex[mm,nn,pp+1] - Ex[mm,nn,pp])))
    end
    
    for pp = 1:SizeZ, nn = 1:SizeY-1, mm = 1:SizeX-1
        Hz[mm,nn,pp] = (Chzh[mm,nn,pp] * Hz[mm,nn,pp] + 
                        Chze[mm,nn,pp] * ((Ex[mm,nn+1,pp] - Ex[mm,nn,pp]) - (Ey[mm+1,nn,pp] - Ey[mm,nn,pp])))
    end
end

if I’m going to be turning off bounds checks on a big chunk of code.

6 Likes

Summary thus far:
Thanks all for the input. I did some experimentation with the various suggestions and my results are posted below. Further input is welcome. However, this progress got me more firmly in the Julia camp. I was beginning to doubt the “runs like C” bit and was starting to develop a C program based on my Julia prototype for another project - glad I can give up on that approach. I suspect my Python code in the original comparison is probably in need of optimization as well, but no time for that.

FDTD Code CPU Time (seconds)

            Original  Case A  Case B  Case C  Case D  Case E  Case F  Case G
Compile         43.6    33.9    24.0     3.7     2.4    1.31    1.20   0.828
Run 1           40.9    33.2    21.8     2.9     2.1    1.11    1.09   0.609
Run 2           41.7    32.9    23.3     2.8     2.2    1.14    1.11   0.595
Run 3           42.6    32.3    22.3     2.5     2.0    1.13    1.08   0.657
Run 4           40.1    32.2    23.7     2.8     2.1    1.16    1.11   0.609
Run 5           40.5    33.1    22.7     2.9     2.1    1.14    1.12   0.625
Avg (1-5)       41.2    32.7    22.8     2.8     2.1    1.14    1.10   0.620
Ratio to C      96.6    76.9    53.4     6.5     4.9    2.7     2.6    1.5

  • Original: Written by an EE relatively new to Julia.
  • Case A: Nested for-loops reordered to process array columns before rows.
  • Case B: Case A plus global constants declared with “const”.
  • Case C: Case B plus the Ce… and Ch… global fill terms moved into their respective functions.
  • Case D: Case C plus the entire block (other than ‘using…’) wrapped in ‘let…end’. Removed ‘const’ from global constants. Ce… and Ch… terms still in their respective functions.
  • Case E: Case A plus ‘let…end’ wrapping all but the ‘using…’ line. Ce… and Ch… terms are not in their respective functions.
  • Case F: Case E plus ‘@inbounds begin…end’ wrapping all but ‘using…’.
  • Case G: Case E plus ‘@inbounds’ individually on the for-loops used to operate on 3D arrays.

The full code block in the first post corresponds to Case B.
The full code block in the post by Mason corresponds to Case G.

17 Likes

I tried wrapping the whole code block (minus the package import) in ‘@inbounds begin…end’ and didn’t see the same improvement as was evident with application of the macro to individual for-loops. See Cases E, F, and G in the summary post.

I’m not sure if the @inbounds macro handles top-level blocks like that. It may need to operate on a function body.

4 Likes

So the pattern I usually use when writing stuff like this is somewhat like this.

function user_facing_function(args...)
    checkbounds(args...)
    return _nobody_call_this_function_but_me(args...)
end
@inbounds function _nobody_call_this_function_but_me(args...)
   ...
end

But I’m sure there are other (and better) ways of doing this.

3 Likes

Have you tried with the @avx macro? [ANN] LoopVectorization

@inbounds doesn’t work on a function definition. It needs to be inside the function, as in

function  _nobody_call_this_function_but_me(args...)
    @inbounds begin
        ...
    end
end
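A hypothetical, runnable version of the pattern suggested above (the function names here are illustrative, not from the thread): validate indices once in the public function, then call an internal kernel whose body, not its definition, is wrapped in @inbounds.

```julia
# Public function: validate once up front with checkbounds, which
# throws a BoundsError if any index is invalid.
function sum_at(x::AbstractVector, idxs)
    checkbounds(x, idxs)
    return _sum_at_unchecked(x, idxs)
end

# Internal kernel: @inbounds goes inside the body, not on the
# function definition itself.
function _sum_at_unchecked(x, idxs)
    s = zero(eltype(x))
    @inbounds begin
        for i in idxs
            s += x[i]
        end
    end
    return s
end

sum_at([10.0, 20.0, 30.0], [1, 3])  # 40.0
```

An out-of-range call such as sum_at([10.0], [5]) fails in the checked public function, so the unchecked kernel never runs on bad indices.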

What about this?

Sorry, I meant in a function body.

Would it be relatively feasible (and relatively easy) to have a macro that goes through and does standard tricks? e.g.

@makefaster function f(x)
  #   ... macro adds inbounds to all loops
  #  ... also does standard passes that are weakly better
end

And it could potentially have some extra arguments for various passes. For example

@makefaster function f(x), settings = (inbounds=true, avx=true, views=true)
  #   ... macro adds inbounds to all loops
  #  ... also does standard passes that are weakly better
end

Or whatever, which would add in @avx macros, @view, etc. everywhere. Of course, these things are not always faster so being able to toggle is worth the trouble.

This sort of thing would not be a replacement for doing things right, but might be a good way to tell people with minimal Julia experience how to get started - and a decent heuristic? It can’t fix issues such as bad type inference, but it could add in the mindless annotations.

As discussed with the global scoping, it seems like a way to declare a “script” that tells the compiler it can compile the whole file as a big let would alleviate this common performance issue. Then the heuristic solution to the problem could be to tell people when running as a .jl file to flag it as a script… Remembering to write scripts without globals is pretty tough for people coming from scripting languages.
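A minimal sketch of the “compile the whole script as a big let” idea described above (the variable names are illustrative): wrapping the working variables in ‘let…end’ makes them local, so the compiler can infer concrete types instead of treating them as untyped globals.

```julia
# Script body wrapped in `let ... end`: everything inside is local,
# so x is a concretely typed Vector{Float64} rather than an
# untyped global.
result = let
    n = 1000
    x = collect(1.0:n)
    s = 0.0
    for i in eachindex(x)
        s += x[i]
    end
    s          # the let block evaluates to this value
end
# result == 500500.0
```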

2 Likes

If such a thing could exist trivially, and those optimizations were always doable, don’t you think they would be done already and such a macro wouldn’t be necessary?

The “ugly” truth of @inbounds and friends is that they are not trivial and always allowed - in fact, the docs even discourage it since you’re basically telling the compiler “don’t check for safety here, I know better than you”. The toggle you’re asking for is here, in the form of asking for the dragons and not in the form of making them vanish.

3 Likes

Exactly, the name of such a macro might as well be @makecrashy.

3 Likes

Even in its present form, macros like @inbounds are convenient for the inexperienced like myself as they provide a quick way to get some performance benefits from scrappy code. I suppose the trick is to avoid the temptation of letting that be a crutch rather than learning the proper approach.

You might have been joking, but I think that is a great name for that kind of macro. Makes people think twice before trying it. The old @fastmath gave the wrong message entirely. Or @makeunsafe.

No, I wouldn’t think that at all. The compiler could never know when to take off the safety wheels on its own. Those sorts of decisions can never be done automatically.

In effect, though, people are comparing code with safety wheels in Julia with code without them in other languages (either directly in C, or sometimes using unsafe C behind a Python interface) … We tell people to look at the performance guide, but scripters have trouble doing it for some reason.

So is it really that bad to have a macro called @makecrashy which we can tell people to use to get a sense of whether the safety wheels are slowing them down? Potentially trying a few toggles to see what helps? The end users are lower-skilled programmers with big scripts (rarely organized into functions), who are unlikely to be able to scan the code looking for appropriate function and vector annotations.

2 Likes

There already are flags to turn off various things globally, --check-bounds for example.

However, lower skilled programmers don’t need these macros. Their performance problems will come from type instabilities, unnecessary allocations, bad usage of CPU cache, suboptimal algorithms etc. When they are at the point where they write code where @inbounds would matter (basically only when a bounds check would prevent SIMD) they are at the level where they can be properly taught about these macros.
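As an illustration of the situation described above (the function names here are hypothetical, not from the thread): the same reduction written with and without @inbounds. On a Vector, the checked version may carry a per-iteration bounds check that can block SIMD; @inbounds removes the check, which is the narrow case where the macro matters.

```julia
# Checked version: each x[i] access may include a bounds check.
function sum_checked(x)
    s = zero(eltype(x))
    for i in 1:length(x)
        s += x[i]
    end
    return s
end

# Unchecked version: @inbounds asserts the accesses are in range,
# allowing the compiler to drop the checks (and often vectorize).
function sum_unchecked(x)
    s = zero(eltype(x))
    @inbounds for i in 1:length(x)
        s += x[i]
    end
    return s
end

xs = collect(1.0:100.0)
sum_checked(xs) == sum_unchecked(xs)  # both 5050.0
```

Note that whether @inbounds actually changes the generated code depends on the Julia version and whether the compiler could already prove the accesses safe.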

There is no programmer level where a @random_unsafe_operations is valuable. If anything, such a macro would give a wrong impression on how to program.

5 Likes

I fail to see why this couldn’t be done in cases where you’re iterating over the indices of a collection that is not modified at all during the loop, e.g.

function compileme(x)
    q = 0
    for i in eachindex(x)
        q += x[i]
    end
    return q
end

compileme( (1,2,3,4) )

Since x is never reassigned in the loop and has a fixed length, the compiler should know that x[i] will always be in bounds. I’m pretty sure it already does this optimization.

In the table above, note that he gets a significant speedup between Cases E and G. But I agree that the others are more important.

Beginners don’t know this (e.g. I am not even sure how to do that with Jupyter), but more generally, just because I am willing to believe a function is safe doesn’t mean I want everything to run unsafe.

For sure. That stuff can never be automatic, and it is tough for non-programmers to handle. But if you can still get a 2x speedup with a simple macro on some functions, it is worth it.

That could be true, but most people in my field (economics) don’t want to learn to program. So it is one thing if such a macro (giving ways to blindly flip annotations inside a code block) wouldn’t be helpful; from what you say about SIMD, it may not be.

But the argument that it would end up as a crutch, preventing people from learning to program properly, only makes sense if your intended user wants to learn to program properly. I don’t think purposeful inconvenience is a good strategy for teaching better patterns… But all of that is moot if blindly adding in the inbounds, avx, view, etc. annotations is rarely helpful.

1 Like

I am not sure about this. Most economists I know who are into computational work actually want to learn to program very well, or already have. Papers that use numerical techniques (for nonlinear solutions, estimation, etc.) can easily span years and comprise many tens of thousands of lines of code, and become unmanageable without at least intermediate coding skills. Consequently, most projects have at least one coauthor who programs rather well.

Instead, I wish there was less emphasis on the micro-optimizations like @simd, @inbounds, and friends on this forum. Seeing discussions about optimizing code as a newbie, it may be very easy to get the impression that this is the magic where performance comes from, because people participating in these already instinctively apply all the usual performance tips to their code so they are wringing out the last 20%.

But for most users, ignoring @simd etc first and just writing compiler-friendly code with reasonable memory traversal and allocation patterns is best: it will get you very far, with robust performance across Julia versions and the underlying hardware.

18 Likes

Sure, they are there, but not too many of them, and between the two of us we probably know most of them personally. They largely prefer Fortran and C because it is easy and requires no software training to write reasonably fast Fortran code, due to its simplicity and aliasing rules for arrays.

Furthermore, most of the computational code is written by RAs who are mostly self-taught - especially the big projects. The researchers seem to appreciate that if they use languages other than Fortran, they can more easily experiment with better algorithms (which is where the real benefits appear) but have trouble moving past the transition of everything suddenly getting slower.

Sure, some are that size, but software engineering and training is rare. This is a tiny proportion of the number of people I wish would use Julia… and most of them are on Fortran (and occasionally a largely C subset of C++). Since these people are using Fortran/C for speed, it is tough to convince them to switch to Julia if they port something and it is orders of magnitude slower than Matlab (let alone Fortran or C). Helping them past that more easily will help them get addicted to Julia.

We 100% agree on this. In fact, my proposal of @makecrashy has this as its goal. i.e. instead of making people think that they need to learn a whole bunch of new rules for when to annotate with various macros, we train them to: (1) get everything out of globals; (2) apply some basic heuristics for ensuring type stability; and (3) if it is still “slow” relative to non-Julia, try @makecrashy to see if micro-optimizations might help in their circumstances… and only learn more or tweak if they need to. The global flag for skipping bounds checks isn’t pervasive…

But, as I said before, if a @makecrashy can’t be written to actually help, then this is all moot.

Or, is there a way for @inbounds to extend to a whole function and apply recursively to inner blocks? That would go a long way…

Checking @code_llvm, there seem to be no bounds checks in your example and SIMD is used.
However, when x is a Vector instead of a tuple, bounds checks are still done (preventing SIMD), even though they are in principle not needed as far as I can see. Adding @inbounds gives a significant speedup (and SIMD back).
Disclaimer: I tested on Julia 1.2 (the Jupyter Datascience Docker image has not been updated yet :frowning:)
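For anyone wanting to reproduce this comparison, here is one way to check it (the function name is illustrative; the exact LLVM IR depends on the Julia version and hardware):

```julia
# Same reduction as the earlier compileme example, but returning
# the result so the two argument types can be compared.
function sumidx(x)
    q = zero(eltype(x))
    for i in eachindex(x)
        q += x[i]
    end
    return q
end

sumidx((1, 2, 3, 4))  # tuple: length is part of the type
sumidx([1, 2, 3, 4])  # Vector: length known only at run time

# In the REPL, inspect the generated code for each argument type
# and look for calls to the bounds-error branch:
# @code_llvm sumidx((1, 2, 3, 4))
# @code_llvm sumidx([1, 2, 3, 4])
```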