Julia's loop scope behavior differs from Fortran?

@lungben’s solution does what you want then, no?

julia> function TestLogic(Nloops)
           local DataVar
           for i in 1:Nloops
               @show i
               (i==1) && (DataVar = 333)
               @show DataVar
               DataVar += 10^i
               @show DataVar
           end
           nothing
       end
TestLogic (generic function with 1 method)

julia> TestLogic(3)
i = 1
DataVar = 333
DataVar = 343
i = 2
DataVar = 343
DataVar = 443
i = 3
DataVar = 443
DataVar = 1443
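For contrast, here is a sketch (the function name is made up for illustration) of what happens without the `local` declaration: DataVar becomes local to each loop iteration, so it no longer exists when the second iteration tries to read it.

```julia
function TestLogicNoLocal(Nloops)  # hypothetical variant without `local`
    for i in 1:Nloops
        (i == 1) && (DataVar = 333)  # without `local`, DataVar is a fresh loop-local each iteration
        DataVar += 10^i              # at i == 2 this read throws UndefVarError
    end
    nothing
end
```

Calling TestLogicNoLocal(3) gets through i == 1 but throws an UndefVarError at i == 2, which is exactly what the `local DataVar` declaration above avoids.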

Please ignore this post, the benchmark was wrong: I likely forgot to add -O3 when compiling the Fortran code.

I will add some information here for Fortran programmers and, if anyone with a deeper knowledge of the inner workings of the compilers can correct something, please do.

Let us consider this simple function:

function loop(x,n)
  s = 0.
  for i in 1:n
    r = sqrt(i)
    s = s + x*r
  end
  s
end

Which would have the equivalent Fortran code:

double precision function loop(x,n)
  implicit none
  integer :: i, n
  double precision :: x, r
  loop = 0.
  do i = 1, n
    r = dsqrt(dble(i))
    loop = loop + x*r
  end do
end function loop

From the perspective of this thread, the difference is that the variable r in the Julia code is only known locally within each iteration of the loop, while, at least from a syntactic point of view (I won’t speculate on what the compiler actually does), the variable is visible everywhere in the Fortran function.
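One way to see this difference directly (a minimal sketch; the function name afterloop is made up for illustration): trying to read r after the loop fails, because r only ever existed inside the loop body.

```julia
function afterloop(n)  # hypothetical name, for illustration only
    for i in 1:n
        r = sqrt(i)    # r is local to the loop body
    end
    return r           # r is not defined here: this throws UndefVarError
end
```

afterloop(3) throws an UndefVarError, whereas in the Fortran version r is still a valid variable after the do loop ends.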

The fact that the r variable is visible only locally in the Julia loop apparently allows further compiler optimizations than in the Fortran code. Effectively, the benchmark of these functions is very favorable to the Julia version:

julia> x = rand() ; n = 10_000_000;

julia> @btime loop($x,$n)
  15.039 ms (0 allocations: 0 bytes)
6.425913494015874e9

julia> @btime ccall((:loop_,"loop.so"),Float64,(Ref{Float64},Ref{Int64}),$x,$n)
  87.733 ms (0 allocations: 0 bytes)
6.425913494015874e9

(the overhead of calling is negligible, and that is easy to test using a small n)

The Fortran version was compiled with:

gfortran -O3 loop.f90 -shared -o loop.so

Evidently, a different compiler could perhaps figure out that r does not need to be visible outside the loop and do a better job compiling the Fortran code. But, in general, the more restricted scope of variables in Julia allows better performance than naive translations of the equivalent Fortran code.

At least this has been my experience, and these scoping differences are the only reason I can imagine to explain those performance advantages of Julia.

2 Likes

The later compiler stages are smart enough to understand that there is no difference between

function loop(x,n)
  s = 0.
  for i in 1:n
    r = sqrt(i)
    s = s + x*r
  end
  s
end

function loop2(x,n)
  s = 0.
  r = -1.0
  for i in 1:n
    r = sqrt(i)
    s = s + x*r
  end
  s
end

LLVM mem2reg munches that for breakfast. Heck, scoping info is thrown away in @code_lowered already! So, what is the reason for restricting the scope of loop-local variables?

As @StefanKarpinski recently quipped, most scoping related language design decisions in julia are for closures. My addition to this would be “…because compiler support for closures is suboptimal, for very technical implementation-specific reasons” [*].

The code is transformed from expression-tree into SSA-form during lowering. However, closures are emitted and partially optimized in a piece of femtolisp before this happens, operating on the expression-tree. The optimizations in this phase are very brittle and must operate with very impoverished information (e.g. no types, no dom-tree).

So I’d guess that this scoping decision was made in order to simplify that code’s job.

[*] I’m not trying to throw shade here. Jointly optimizing closure layout and code operating on the closures is super hard! And we need to do it in an open-world context, where users can later define new functions operating on the closures. Most compiler techniques optimize code only, and require users to manually optimize data structures.
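To make the closure connection concrete, here is a small sketch (my own example, not from the posts above): because the loop body is a fresh scope each iteration, each closure created inside the loop captures its own i rather than one shared binding.

```julia
function make_closures()  # illustrative name
    fs = Function[]
    for i in 1:3
        push!(fs, () -> i)  # each iteration has a fresh i, so each closure keeps its own value
    end
    return [f() for f in fs]
end
```

make_closures() returns [1, 2, 3]. If the loop reused a single binding for i, as some languages do, all three closures would observe the final value instead.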

3 Likes

Yes, that is clear. Does that mean that gfortran is just too bad?

1 Like

It’s documented by the phrase “as if the loop body were surrounded by a let block”, but the examples don’t make it as crystal clear as this simplified version of your method:

function f()
    for i in 1:3 # new i every iteration
        if i == 2
            x = 3 # x will only exist at i == 2
        end
        println(@isdefined x) # nifty macro to see if variable exists
    end
end
f() # false true false

Like the others said, if you only want to do the computation in one of the iterations but want the variable to survive its iteration, you can make the variable before the loop without doing the computation yet. But mind that this works in a function body and the REPL’s global scope but not a file’s global scope.
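In code, the “make the variable before the loop” workaround for the f above looks like this (a sketch; f2 is a made-up name, and it collects the results into a vector instead of printing):

```julia
function f2()  # variant of f with x declared in the enclosing scope
    local x            # x now belongs to the function scope, not the loop
    results = Bool[]
    for i in 1:3
        if i == 2
            x = 3      # assigns the outer x, so it survives later iterations
        end
        push!(results, @isdefined(x))
    end
    return results
end
```

f2() returns [false, true, true]: x still does not exist at i == 1, but once assigned at i == 2 it survives into i == 3, unlike in the original f.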

1 Like

I need to take some time to check whether the documentation would not benefit from a gigantic red box pointing out this for loop scope property. It is not the first time this question comes up. And it is also important if you are dealing with closures inside a for loop. There is a section of the manual that points out this scope property, and also that you may reuse an existing variable as the iterator with the keyword outer, but maybe it is not very noticeable.
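For reference, the outer keyword mentioned above works like this (a minimal sketch; the function name is made up):

```julia
function lastiter(n)    # illustrative name
    i = 0
    for outer i in 1:n  # `outer` makes the loop reuse the existing local i
    end
    return i            # i keeps the value from the last iteration
end
```

lastiter(5) returns 5, and lastiter(0) returns 0, since with an empty range the loop body never runs and i keeps its initial value.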

The fundamental problem is that every red box will just increase the information load of the reader, at the expense of other parts. Just like you cannot make a lot of things bold in some text.

I think that having something in the manual is enough, and the implicit expectation is that the whole manual should just be read at some point by a user investing into learning Julia.

2 Likes

I think we have already had this discussion before, Tamas. Every time I suggest emphasizing/explaining anything, you come with the argument that this will be overdone and therefore negate any beneficial effects. To which I answer that I trust the Julia maintainers to decide whether or not to accept such changes, avoiding overdoing it while keeping the cases in which it is beneficial. At this point we could save some effort and agree on having generic replies we copy into each other’s posts every time this happens.

3 Likes

I agree that readability suffers from bolding and redboxing every little gotcha someone can have coming from another language, but I would say that “the whole manual should just be read” isn’t actually enough. It definitely should be done, but the manual is kinda big. People who have read the manual may fail to pick up or remember some details, and it’s nice to look up a small section quickly.

Fortunately for this topic, you can. Type “loop” into the documentation page’s search bar and the top result IS the section on loop scope rules (oddly the section introducing loops is 3rd). Even the section introducing loops notes the scoping rule and links the scope rules page. So there has been a sufficient effort to let people reach this particular topic in the manual with minimal effort. Personally I think one of the section’s examples can be tweaked to hammer home this topic’s point, but the preceding let section is clear enough.

Really, I suspect a common problem is that people don’t know they can often use the documentation’s search bar to find what they need. Google is often no help because you reach discourse and stackoverflow questions instead of the right manual section, and if the questions are too specific to their poster’s use cases, it easily causes people to believe they need to post their own question.

3 Likes

Scope behavior decisions are made entirely for the sake of user-facing semantics, not at all for the sake of the compiler. If we wanted something that was easy on the compiler, we wouldn’t have closures at all or would have closures that capture by value.

5 Likes

What I take issue with is the following:

  1. someone complaining that X is not documented,

  2. then, when they learn that it is, converting the complaint into it not being prominent enough in the manual.

I think the fundamental problem is people not reading the manual (multiple times, coming back to parts, as they would read a textbook on some nontrivial subject).

Making parts of it more prominent is unlikely to help this; the manual is not for linear reading like a paperback novel. Yet it is suggested like some panacea every time.

OTOH, what the manual could benefit from greatly is rewriting chapters that are very old and had a lot of parts tacked on as the language grew. This often requires nothing more than careful editing and reviewing the issues people run into — I have done this in #38271 and plan to do more.

5 Likes

Exactly. Put simply, no edit to the manual will help if people don’t know how to look up sections, just like with real books. You can put flashing neon lights on the section and it wouldn’t help if the reader hasn’t reached the page.
That’s not to say this can’t ever be improved. I just remembered this topic where it turned out typing == and === in the search bar didn’t return the right sections in Base or the Manual.

I think a big issue is that the manual is not at all easy to search. Something simple that might help is to make it easier for people to restrict their searches – for example, when I want to find something in the manual, half the time it’s buried under hundreds of function or type definitions from the base and standard library.

Something as simple as letting me “just” restrict my search to the manual contents would be helpful. (and if that functionality exists, I haven’t been able to find it. Usually, I just rely on Google searches, which invariably take me to a version of the manual years out of date. Then I have to use that as a baseline for looking in the current manual, and hope that the particular section hasn’t been moved or reorganized.)

The manual has a lot of great content, but (IMHO) it’s too hard to find any of it unless you already know it inside and out.

3 Likes

EDIT: I get the exact same performance between them (and -march=native doesn’t seem to matter).
Compiling with

gfortran -O3 -march=native rootloop.f90 -shared -fPIC -o libloop.so
gfortran -Ofast -march=native rootloop.f90 -shared -fPIC -o libloopfast.so

Yields:

julia> using BenchmarkTools

julia> function loop(x,n)
         s = 0.
         for i in 1:n
           r = sqrt(i)
           s = s + x*r
         end
         s
       end
loop (generic function with 1 method)

julia> function loopfast(x,n)
         s = 0.
         @fastmath for i in 1:n
           r = sqrt(i)
           s = s + x*r
         end
         s
       end
loopfast (generic function with 1 method)

julia> floop(x,n) = ccall((:loop_,"libloop.so"),Float64,(Ref{Float64},Ref{Int64}),x,n)
floop (generic function with 1 method)

julia> floopfast(x,n) = ccall((:loop_,"libloopfast.so"),Float64,(Ref{Float64},Ref{Int64}),x,n)
floopfast (generic function with 1 method)

julia> x = rand() ; n = 10_000_000;

julia> @btime loop($x, $n)
  13.054 ms (0 allocations: 0 bytes)
7.112252395049944e9

julia> @btime floop($x, $n)
  13.054 ms (0 allocations: 0 bytes)
7.112252395049944e9

julia> @btime loopfast($x, $n)
  6.982 ms (0 allocations: 0 bytes)
7.112252395049599e9

julia> @btime floopfast($x, $n)
  6.982 ms (0 allocations: 0 bytes)
7.1122523950496235e9
1 Like

Thanks for the tip, yet now I am trying to reproduce exactly what I did before and I do not see that difference in performance anymore… :neutral_face: … Now I see that -O3 is enough to match the performances; maybe I forgot to add it when I compiled the code the first time. How unfair to good old Fortran I have been :slight_smile:.

1 Like

Ha, I admit I didn’t try without -march=native to confirm. I just did and got the exact same performance as with it when using -O3. When using -Ofast, it was faster without -march=native. lol

The problem is that sqrt limits the performance of this function, and that SIMD square roots with 4 or 8 elements aren’t any faster per element than SIMD square roots with 2 elements in terms of number of clock cycles.
That is, on Skylake/Cascadelake-X, sqrt on 1 or 2 Float64 is equally fast, but on 4 it takes twice as long, and on 8 twice as long again. Note that this is architecture specific. Agner Fog’s instruction tables suggest that sqrtpd on Zen2 CPUs is equally fast independent of vector size, meaning that using -Ofast -march=native should be faster on them than just -Ofast. Similarly, starting Julia normally and using @fastmath should be faster than starting Julia with julia -Cgeneric.

julia> @btime floopfastgeneric($x, $n)
  6.527 ms (0 allocations: 0 bytes)
9.735648448452713e9

julia> @btime floopfast($x, $n)
  6.982 ms (0 allocations: 0 bytes)
9.735648448452318e9

julia> 6.982 * 4.3/4.6
6.526652173913044

My CPU runs AVX(2) and light AVX512 code at 4.3 GHz, and SSE code at 4.6 GHz.
The ratio 4.3/4.6 equals the ratio in relative times.

So, equivalent speed in terms of number of clock cycles needed, but faster clock speed when not using -march=native gives the generic version the edge.

2 Likes

Interesting! So… I must ask: Why do we have scope below function-level, then?

The way it looks like to me is: If we removed scoping of let/for/while within functions, then some code that currently throws UndefVarError would instead do something else. And that’s it – code that doesn’t throw UndefVarError would not change behavior. Or am I missing something?

5 Likes

I don’t know why but a lot of people including me were asking about scope rules in the last few days, maybe astrology is real.

3 Likes

welcome to my life

10 Likes