How to tell VS Code to use the fast math option?


#1

I’ve never used it before but I’ve read we can start Julia with the command line option:

julia --math-mode=fast

How should we do this if we use Julia from VS Code or Juno?
Is it enough to add it to the “executable path” option in the VS Code settings?
Can we change the mode from within a Julia session without restarting it?
Is it still unsafe to use fastmath? For example, with the MixedModels package.
Will it just give slightly different results, or can it go completely wrong?

@fastmath 1+2

It’s supposed to use fastmath, but what if I want a package to use fastmath? Can I force it, or does it depend on how the package has been coded?
What performance increase can we expect?
Is it better to use the fastmath option on a case-by-case basis?


#2

You really don’t want to use --math-mode=fast. You’re telling the compiler to do something that’s fundamentally broken, e.g. https://github.com/JuliaLang/julia/issues/30073. The global flag is much worse than the occasional use of the @fastmath macro, which is far less broken, but still may cause incorrect results. You shouldn’t use --math-mode=fast unless you’re willing to take the risk that all of your results might be wrong.


#3

OK, thanks.

Will it always be broken (so we’d better forget about it), or is there work underway to fix it?
If it ever gets fixed, will it be a good idea to use it?

PS: I’ve tried the example

@fastmath cospi.(a)

And it produces the right output.


#4

Yes, but with --math-mode=fast it won’t.


#5

In short, --math-mode=fast tells LLVM it’s allowed to do non-IEEE transformations of code. However, many functions in Julia’s stdlib are implemented in Julia and depend crucially on IEEE semantics. So this option might as well be called --math-mode=broken, since you’re likely to get total nonsense answers from many functions. In that issue I’ve proposed one approach that might make things better, but it hasn’t gotten much traction and, to be honest, I’m not sure --math-mode=fast is worth saving; it’s a pretty ill-conceived feature. In general, you should either do the relevant algebraic simplifications yourself instead of relying on the compiler to do them for you, or use much more pointed compiler hints like @simd, which allows re-association of floating-point operations in loops so that they can be vectorized. What is the specific problem you’re trying to speed up?


#6

A reason to use --math-mode=fast is that StaticArrays otherwise generates a lot of separate multiplies and adds. Is it possible to have a flag that just says it is okay to fuse additions and multiplications, without opting into the less precise special functions?
Maybe a --math-mode=muladd, analogous to how the @muladd macro applies only that subset of what @fastmath allows.
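At the function level, Base already exposes exactly this subset: muladd(a, b, c) computes a*b + c and gives the compiler permission (but not an obligation) to fuse it into a single fma instruction, while fma(a, b, c) always computes the fused result. A minimal sketch:

```julia
a, b, c = 2.0, 3.0, 4.0

# muladd may or may not fuse, depending on the target hardware;
# either way the result here is exact.
muladd(a, b, c)   # 10.0

# fma always performs the fused operation with a single rounding.
fma(a, b, c)      # 10.0
```

So a package can opt into fusion per operation without any global flag (and, if I recall correctly, the @muladd macro from MuladdMacro.jl just rewrites expressions into such muladd calls).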

A motivation:

julia> using StaticArrays

julia> A = @SMatrix randn(4,4);

julia> B = @SMatrix randn(4,4);

julia> @code_native A * B
	.text
; ┌ @ matrix_multiply.jl:9 within `*'
; │┌ @ matrix_multiply.jl:75 within `_mul'
; ││┌ @ matrix_multiply.jl:78 within `macro expansion'
; │││┌ @ matrix_multiply.jl:123 within `mul_unrolled'
; ││││┌ @ matrix_multiply.jl:138 within `macro expansion'
; │││││┌ @ matrix_multiply.jl:9 within `*'
	vbroadcastsd	(%rdx), %ymm4
	vmovupd	(%rsi), %ymm3
	vmovupd	32(%rsi), %ymm2
	vmovupd	64(%rsi), %ymm1
	vmovupd	96(%rsi), %ymm0
	vmulpd	%ymm4, %ymm3, %ymm4
	vbroadcastsd	8(%rdx), %ymm5
	vmulpd	%ymm5, %ymm2, %ymm5
; ││││└└
; ││││┌ @ float.jl:395 within `macro expansion'
	vaddpd	%ymm5, %ymm4, %ymm4
; ││││└
; ││││┌ @ matrix_multiply.jl:138 within `macro expansion'
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	16(%rdx), %ymm5
	vmulpd	%ymm5, %ymm1, %ymm5
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm5, %ymm4, %ymm4
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	24(%rdx), %ymm5
	vmulpd	%ymm5, %ymm0, %ymm5
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm5, %ymm4, %ymm4
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	32(%rdx), %ymm5
	vmulpd	%ymm5, %ymm3, %ymm5
	vbroadcastsd	40(%rdx), %ymm6
	vmulpd	%ymm6, %ymm2, %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm6, %ymm5, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	48(%rdx), %ymm6
	vmulpd	%ymm6, %ymm1, %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm6, %ymm5, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	56(%rdx), %ymm6
	vmulpd	%ymm6, %ymm0, %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm6, %ymm5, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	64(%rdx), %ymm6
	vmulpd	%ymm6, %ymm3, %ymm6
	vbroadcastsd	72(%rdx), %ymm7
	vmulpd	%ymm7, %ymm2, %ymm7
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm7, %ymm6, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	80(%rdx), %ymm7
	vmulpd	%ymm7, %ymm1, %ymm7
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm7, %ymm6, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	88(%rdx), %ymm7
	vmulpd	%ymm7, %ymm0, %ymm7
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm7, %ymm6, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	96(%rdx), %ymm7
	vmulpd	%ymm7, %ymm3, %ymm3
	vbroadcastsd	104(%rdx), %ymm7
	vmulpd	%ymm7, %ymm2, %ymm2
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm2, %ymm3, %ymm2
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	112(%rdx), %ymm3
	vmulpd	%ymm3, %ymm1, %ymm1
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm1, %ymm2, %ymm1
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	120(%rdx), %ymm2
	vmulpd	%ymm2, %ymm0, %ymm0
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vaddpd	%ymm0, %ymm1, %ymm0
; │└└└└└
	vmovupd	%ymm4, (%rdi)
	vmovupd	%ymm5, 32(%rdi)
	vmovupd	%ymm6, 64(%rdi)
	vmovupd	%ymm0, 96(%rdi)
	movq	%rdi, %rax
	vzeroupper
	retq
	nopl	(%rax)
;

vs starting with --math-mode=fast:

julia> using StaticArrays

julia> A = @SMatrix randn(4,4);

julia> B = @SMatrix randn(4,4);

julia> @code_native A * B
	.text
; ┌ @ matrix_multiply.jl:9 within `*'
; │┌ @ matrix_multiply.jl:75 within `_mul'
; ││┌ @ matrix_multiply.jl:78 within `macro expansion'
; │││┌ @ matrix_multiply.jl:123 within `mul_unrolled'
; ││││┌ @ matrix_multiply.jl:138 within `macro expansion'
; │││││┌ @ matrix_multiply.jl:9 within `*'
	vbroadcastsd	(%rdx), %ymm4
	vmovupd	(%rsi), %ymm3
	vmovupd	32(%rsi), %ymm2
	vmovupd	64(%rsi), %ymm1
	vmovupd	96(%rsi), %ymm0
	vmulpd	%ymm3, %ymm4, %ymm4
	vbroadcastsd	8(%rdx), %ymm5
; ││││└└
; ││││┌ @ float.jl:395 within `macro expansion'
	vfmadd213pd	%ymm4, %ymm2, %ymm5
; ││││└
; ││││┌ @ matrix_multiply.jl:138 within `macro expansion'
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	16(%rdx), %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm5, %ymm1, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	24(%rdx), %ymm4
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm6, %ymm0, %ymm4
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	32(%rdx), %ymm5
	vmulpd	%ymm3, %ymm5, %ymm5
	vbroadcastsd	40(%rdx), %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm5, %ymm2, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	48(%rdx), %ymm5
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm6, %ymm1, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	56(%rdx), %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm5, %ymm0, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	64(%rdx), %ymm5
	vmulpd	%ymm3, %ymm5, %ymm5
	vbroadcastsd	72(%rdx), %ymm7
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm5, %ymm2, %ymm7
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	80(%rdx), %ymm5
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm7, %ymm1, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	88(%rdx), %ymm7
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm5, %ymm0, %ymm7
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	96(%rdx), %ymm5
	vmulpd	%ymm3, %ymm5, %ymm3
	vbroadcastsd	104(%rdx), %ymm5
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm3, %ymm2, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	112(%rdx), %ymm2
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm5, %ymm1, %ymm2
; │││││└
; │││││┌ @ float.jl:399 within `*'
	vbroadcastsd	120(%rdx), %ymm1
; │││││└
; │││││┌ @ float.jl:395 within `+'
	vfmadd213pd	%ymm2, %ymm0, %ymm1
; │└└└└└
	vmovupd	%ymm4, (%rdi)
	vmovupd	%ymm6, 32(%rdi)
	vmovupd	%ymm7, 64(%rdi)
	vmovupd	%ymm1, 96(%rdi)
	movq	%rdi, %rax
	vzeroupper
	retq
	nopl	(%rax,%rax)
; └

What does IEEE say about fma instructions? GCC 7.3 seemed willing to use vfmadd with only -O3 -march=haswell (though, oddly, it only used SSE, no AVX). That is no longer the case with GCC 8.2, however.
I had thought the fused operation should be at least as accurate.
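For a single operation it is: fma rounds only once, after an exact product, so it is at least as accurate as a separate multiply and add. What it is not is bit-identical to the unfused sequence, which is what some compensated algorithms depend on. A small illustration:

```julia
# 0.1 is not exactly representable; 0.1 * 10.0 rounds back to 1.0,
# so the unfused expression gives 0.0:
0.1 * 10.0 - 1.0       # 0.0

# fma subtracts from the *exact* product, exposing the representation
# error of Float64(0.1):
fma(0.1, 10.0, -1.0)   # ≈ 5.55e-17
```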

In general, you should either do the relevant algebraic simplifications yourself instead of relying on the compiler to do them for you, or use much more pointed compiler hints like @simd, which allows re-association of floating-point operations in loops so that they can be vectorized.

It is interesting that both can have about the same effect on a dot product.

julia> using BenchmarkTools

julia> x = randn(1024); y = randn(1024);

julia> function dot_product(x, y)
           out = zero(promote_type(eltype(x), eltype(y)))
           @inbounds for i ∈ eachindex(x, y)
               out += x[i] * y[i]
           end
           out
       end
dot_product (generic function with 1 method)

julia> @benchmark dot_product($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.799 μs (0.00% GC)
  median time:      1.801 μs (0.00% GC)
  mean time:        1.805 μs (0.00% GC)
  maximum time:     4.146 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> function dot_product(x, y)
           out = zero(promote_type(eltype(x),eltype(y)))
           @inbounds @simd for i ∈  eachindex(x,y)
               out += x[i] * y[i]
           end
           out
       end
dot_product (generic function with 1 method)

julia> @benchmark dot_product($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     194.005 ns (0.00% GC)
  median time:      194.193 ns (0.00% GC)
  mean time:        196.170 ns (0.00% GC)
  maximum time:     394.351 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     626

julia> function dot_product(x, y)
           out = zero(promote_type(eltype(x), eltype(y)))
           @fastmath @inbounds for i ∈ eachindex(x, y)
               out += x[i] * y[i]
           end
           out
       end
dot_product (generic function with 1 method)

julia> @benchmark dot_product($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     194.174 ns (0.00% GC)
  median time:      194.230 ns (0.00% GC)
  mean time:        196.558 ns (0.00% GC)
  maximum time:     335.214 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     626

julia> @benchmark $x' * $y
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     187.077 ns (0.00% GC)
  median time:      187.396 ns (0.00% GC)
  mean time:        189.133 ns (0.00% GC)
  maximum time:     355.159 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     666

Aren’t these vectorizing loop transformations likely to make the answer more numerically accurate than explicitly following the serial definition, because they accrue partial sums in parallel?
Or is it folly to try to divide these transformations into the groups “won’t cause catastrophic problems” vs. “might cause catastrophic problems”?
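For summation in particular, yes: splitting the reduction into partial sums usually helps, which is why Base’s sum uses pairwise summation rather than a strict left-to-right loop. A rough sketch of the effect (exact error magnitudes depend on the data and hardware):

```julia
# Naive sequential accumulation grows rounding error roughly linearly
# in the number of terms; pairwise summation grows it only
# logarithmically.
function naive_sum(x)
    s = 0.0
    for xi in x
        s += xi
    end
    s
end

x = fill(0.1, 10_000_000)
exact = length(x) * big(0.1)          # high-precision reference

err_naive    = abs(naive_sum(x) - exact)
err_pairwise = abs(sum(x) - exact)    # Base.sum is pairwise
# err_pairwise should be no larger (and typically far smaller) than err_naive
```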


#7

Could you also add the result for @fastmath @inbounds @simd altogether, please?


#8

I am surprised it has not been retired with 1.0, given #25028. With it cropping up all the time (e.g. #30073), it looks like the ultimate footgun.


#9

I have opened another thread to discuss when we shouldn’t use @simd:


#10

It seems to me that “most” functions are OK with being fastmath-ed, and only a subset of specialized functions that do things like

rx = abs(ax-((ax+s)-s))

(taken from the issue above) will break. It seems suboptimal to ask users either to forgo fastmath performance improvements or to pepper their code with annotations (adding @inbounds everywhere is already pretty annoying). A blacklist approach (@nofastmath) on specific functions that are known to be sensitive to the kinds of transformations fastmath enables would seem preferable. Of course it’s an invitation to bugs, a pain to maintain, pretty ugly, and therefore probably not a high priority, but the convenience of just adding a CLI toggle and not having to modify your (or somebody else’s) code in many places should not be underestimated.
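For context, the quoted line is a round-off extraction: under IEEE rounding, (ax + s) - s differs from ax by exactly the bits lost in the addition, and rx recovers them; a fast-math compiler is free to simplify (ax + s) - s to ax and collapse rx to zero. The same pattern in its simplest form (hypothetical variable names, the fast-two-sum trick):

```julia
# Under IEEE semantics, err recovers the rounding error of a + b
# (valid here because |a| >= |b|). Fast math may fold
# b - ((a + b) - a) to 0.0 and silently destroy the compensation.
a, b = 1.0, 1e-17
s = a + b          # rounds to exactly 1.0; b is lost
err = b - (s - a)  # 1.0e-17 under IEEE; a fast-math compiler may give 0.0
```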


#11

As another example, someone using NaNs to signal something will just get the wrong answer:

julia> _isnan(x) = @fastmath isnan(x)
_isnan (generic function with 1 method)

julia> _isnan(NaN)
false

#12

But it’s not just stdlib code—any usage of floating point that depends on IEEE standard behavior (intentionally or not) in any package or project then would need to defensively use @nofastmath just to guarantee that some caller who has used this global flag doesn’t break their code. That seems entirely perverse. Worse still, since there’s no standard for fast math mode, LLVM could change what it does at any point and there are no guarantees that code that worked with one version of LLVM will continue to work in the future.

In the other direction, things are much saner: code by default can rely on IEEE standard behavior, which, since it is standardized, cannot change willy nilly when you upgrade something—the standard is respected by both compilers and hardware. In cases where some high-performance code needs to indicate that it wants to bend a particular rule, we have ways to let the author indicate that in a well-scoped way. You get both speed and sanity. This is how we’ve been doing things for years now and it works well.

Bottom line: --math-mode=fast should probably be deprecated or at least print a dire warning when you start Julia. The @fastmath macro is fine because it is limited in scope to the code where you explicitly use it.


#13

Of course what you’re saying is correct, and it’s infinitely simpler from the perspective of Base to just remove --math-mode=fast. However, this means that to get fastmath improvements you need to decide whether you want them when writing the code, not when using it. If I’m writing a library for optimization (say), should I use @fastmath to get faster but non-IEEE-compliant code? I shouldn’t have to decide that; the user should. --math-mode=fast allows the user to make that choice globally, as people do in typical compiled languages. Is it pretty? No. Does it work? In theory, no. In practice, mostly, except for a few things that look like they can be taken care of by adding two or three annotations.

Here’s a motivating example:

using BenchmarkTools
function mydot(x,y)
    s = 0.0
    @inbounds for i = 1:length(x)
        s += x[i]*y[i]
    end
    s
end
@fastmath function mydot_fastmath(x,y)
    s = 0.0
    @inbounds for i = 1:length(x)
        s += x[i]*y[i]
    end
    s
end
@fastmath function user_dot(x,y)
    mydot(x,y)
end

x = randn(100000)
y = randn(100000)
@btime mydot($x,$y)
@btime mydot_fastmath($x,$y)
@btime user_dot($x,$y)

user_dot is only fast under --math-mode=fast. To make it fast otherwise without the flag, I have to modify mydot (which might be library code I don’t want to touch).

Bottom line: not that much code needs to be IEEE-correct (although @kristoffer.carlsson does have a point with NaN handling). A lot of code should be fast. Making that code fast without requiring libraries to decide whether they should use fastmath or not is definitely a hack that has caused and will cause bugs for people using --math-mode=fast, but I’m not sure I see the point in forbidding it. Why not just add a @nofastmath macro, document clearly that people who use --math-mode=fast are on their own and should not expect Julia devs to fix the bugs that will inevitably appear, and let them submit PRs adding @nofastmath to Base and libraries if they find code that behaves strangely under fastmath?


#14

No, 100% disagree. Only the library author is in a position to know whether their algorithm can safely allow reassociation and other non-IEEE simplifications. If it can, they can either do the transformations manually or use the appropriate flag to give the compiler permission. The user is in no position to have any idea whether the code they are calling is safe to apply non-IEEE transformations to.

Your example only shows that the author of mydot probably wants a @simd annotation. The fact that the user can’t override that is not a problem, it’s a good thing. If the user notices this and wants mydot to be faster, they can dev and edit the package and make a PR, which, if it’s considered correct by someone who understands the intention of the code, will be merged and then everyone benefits instead of just those people who are willing to risk getting total garbage answers for the sake of code running a few percent faster.
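Concretely, the author-side fix for mydot is a one-line change; a sketch of what such a PR would look like:

```julia
# The @simd hint gives the compiler permission to reassociate this
# one reduction so it can vectorize; every caller gets the speedup,
# and no global flag is involved.
function mydot(x, y)
    s = 0.0
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    s
end
```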


#15
 ________________________________________
/ You have enabled fast math             \
| optimizations globally. If you know    |
| what you are doing, you probably would |
\ not have done it.                      /
 ----------------------------------------
        \   ^__^
         \  (@@)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

#16

I’m not sure I agree, because users typically use a debug/release workflow. On numerical code I would prototype the algorithm, in which case I want to minimize sources of confusion such as non-IEEE arithmetic and NaN handling; then, once I’m convinced it works correctly, run it on a larger case, where I don’t care about correct NaN handling (because I’ve made sure that I don’t produce NaNs) and where I would like to run a few percent faster. People have been doing this for decades in compiled languages (you debug with -O0, and run with -O3 -ffast-math). Sane library authors, forced to choose between nothing and @fastmath module MyLib (because, let’s be honest, nobody is going to bother writing and maintaining more fine-grained annotations), and potentially risking tricky bugs and bad error reporting if a user function somewhere returns a NaN or something, will err on the side of caution and never enable fastmath, and as a result speed improvements will be left on the table. I’m not saying that a global toggle is the best way to achieve this (possibly a debug/release mechanism would be appropriate?), but at least it’s there.


#17

Here’s the thing, though: compiled languages with separate compilation allow compiling some libraries without fast math and others with it. That lets you use fast math for just your code while calling libm, FFTW, or whatever, compiled with IEEE math. In Julia, that’s not how things work, since we don’t do separate compilation and generally can’t, because we regularly inline code from one place into another. It might be possible to make it work, but it would be a lot of work for a pretty questionable feature. Also note that it would not give you what you want anyway, which is being able to force the compilation of library functions with fast math. Of course, you probably aren’t getting that in C or Fortran either.


#18

Brilliant!


#19

This strikes me as a wildly bad idea if you care about your code working correctly. Do you at least have a comprehensive test suite and run it in -O3 -ffast-math mode to see if anything breaks?

Sane library authors, forced to choose between nothing and @fastmath module MyLib (because, let’s be honest, nobody is going to bother writing and maintaining more fine-grained annotations), potentially risking tricky bugs and bad error reporting if a user function somewhere returns a NaN or something, will err on the side of caution and never enable fastmath, and as a result speed improvements will be left on the table.

Except that’s not the choice. Julia gives you all the targeted compiler hints you need to manually accomplish what fast math does without giving up on predictability and correctness. Package authors can and do use explicit algebraic simplification, @simd, code-generation macros, and the occasional local @fastmath to make sure libraries run at optimal speed. All you’re accomplishing by turning on a global fast-math flag is pulling the rug out from under those carefully tuned libraries, which often already live on the fine edge of correctness guaranteed by IEEE. See all the examples on the Julia repo of global fast-math mode breaking already-well-optimized algorithms.


#20

It seems to me (but I could be wrong) that there’s not that much code that breaks under fastmath. As an example, I just tried running Julia under --math-mode=fast and running the tests for Optim.jl (including the runtests.jl file directly); the failures seemed relatively minor (results differed by a small error and were counted as wrong).

I think we are working with a very different mental model of users. You seem to be thinking about large, established codes developed by people with intimate knowledge of IEEE standards. I’m talking about small to mid-sized codes (from month-long exploratory projects to that code that survives a grad student and gets passed around in a lab for a few years), that are designed by application scientists with hazy (at best) knowledge of the effects of floating point errors. Talking about “at least having a comprehensive test suite” in this context is a non-starter. That has never stopped anybody from including a -ffast-math in their makefiles, usually without any problems.

Again, whether a package needs @fastmath or not may depend not on the package intrinsically, but on its uses. When I use an optimization package to develop an algorithm, I want it to behave as nicely as possible (e.g. because I may mess up my objective function and return a NaN, which I want caught as fast as possible). Then, when I do a “production” run, I’ve already made sure that no NaNs will be produced, and I just want it to be fast.

I probably won’t be able to convince you of the validity of this use case. I do think the option should be left there (with a severe warning that pretty much nothing is guaranteed), and that people interested in using it should be allowed to try to fix the resulting problems by adding @nofastmath to the Base and library code that needs it. If in two years nobody has bothered to, then that’s the answer: nobody is using the feature and it can be removed. But removing it at this stage of the language’s development seems premature.