JIT compilation does not avoid useless computations

Hello,

I had a naïve idea about the JIT compiler: that when I define a function with independent return cases, it will know which computations to perform. That is:

function test(z::Bool=true, array::Vector=rand(1000))
    local x = sum([y*y for y in array])
    if z
        return x
    else
        return 10.0
    end
end

I would expect the compiled code for test(false) to be something like return 10.0, with the compiler simply skipping the array summation because its result is unused in that branch and gets discarded afterwards.
But it does not: test(false) and test(true) are almost equivalent both in the LLVM code and in computation time.

I know that I can optimize the function myself, for example by putting the summation inside the if, or even by defining a subroutine that performs the summation. But still, I would have liked the compiler to achieve this kind of inference.
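For reference, the hand-optimized version I have in mind would be something like this (test_manual and sumsq are just names I picked for the sketch):

```julia
# Summation factored into its own subroutine, called only
# in the branch that actually uses the result.
sumsq(array) = sum(y * y for y in array)

function test_manual(z::Bool=true, array::Vector=rand(1000))
    return z ? sumsq(array) : 10.0
end
```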

Can you make it a bit clearer for me?
Thanks

The first thing to note is that the compiler chooses method implementations based on types, not values. So test(false) and test(true) call the same method of the test function. You could get around this with Val types, which encode a value in a type, though as far as I know it would still not work quite as you described.

Here is an example of how it could look with Val. I know this is not what you were after, but I wanted to show how you can propagate values through the types. Note that this can cause problems of its own and shouldn’t be overused.

julia> _test(::Val{true}, array::Vector) = sum(abs2, array)
_test (generic function with 1 method)

julia> _test(::Val{false}, array::Vector) = 10.0
_test (generic function with 2 methods)

julia> test(z::Bool=true, array::Vector=rand(1000)) = _test(Val(z), array)
test (generic function with 3 methods)

julia> a = rand(1000);

julia> @btime test(true, a) setup=(a=rand(1000))
  364.286 ns (0 allocations: 0 bytes)
330.6842655375433

julia> @btime test(false, a) setup=(a=rand(1000))
  236.866 ns (0 allocations: 0 bytes)
10.0

julia> @code_lowered test(true, rand(1000))
CodeInfo(
1 ─ %1 = Main.Val(z)
│   %2 = Main._test(%1, array)
└──      return %2
)

julia> @code_lowered _test(Val(true), rand(1000))
CodeInfo(
1 ─ %1 = Main.sum(Main.abs2, array)
└──      return %1
)

julia> @code_lowered _test(Val(false), rand(1000))
CodeInfo(
1 ─     return 10.0
)

As for why the compiler can’t deduce what you want and fix it within the boundaries of the language (i.e. still no dispatch on values, just removing the calculation when it is not needed), I can’t really answer. It would probably be possible for the compiler to figure out that no mutations happen in the shared block of code, and that none of the variables from the shared block are used when z == false, and then deduce that the shared block only needs to run when z == true and move it into that branch.
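In other words, the transformation would effectively have to rewrite your function into something like this (test_sunk is a made-up name, just to sketch what the compiler would have to produce):

```julia
# Sketch of the transformation: the shared summation is moved ("sunk")
# into the only branch that uses its result, so the false case does no work.
function test_sunk(z::Bool=true, array::Vector=rand(1000))
    if z
        x = sum(y * y for y in array)
        return x
    else
        return 10.0
    end
end
```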

But to me this feels like the compiler doing things behind my back that I’m not in control of. What if I add a second if-else inside the false branch that depends on another Bool parameter?

function test(z::Bool=true, y::Bool=false, array::Vector=rand(1000))
    local x = sum([a*a for a in array])  # `a` here, to avoid shadowing the parameter `y`
    if z
        return x
    else
        if y
            return 10.0
        else
            return x
        end
    end
end

Should the calculation of x stay at the start now, or should it be copied into both the z == true branch and the z == false && y == false branch? Suddenly we have more code, which may make us more susceptible to instruction-cache misses; that probably doesn’t matter in this case, but it might in bigger applications.

It quickly becomes complex, and it may be unclear which choice is “best”, so my guess is that this is why these choices are left to the programmer. If you want to duplicate the code because you think that is faster or better for your case, do that; if running the shared code is somehow cheaper than keeping duplicates, you can choose that instead.

Not sure if I made it clearer, and again, this is not my field, so if anything sounds wrong it very well might be :slight_smile:


I’m curious, I can’t test right now. What happens with the original function if you use Vals for dispatch?

And, additionally, if it is written without generating the mutable temporary array (which is unnecessary):

struct True end
struct False end

function test(z, array::Vector=rand(1000))
    local x = sum(y*y for y in array)  # generator instead of comprehension: no temporary array
    if z isa True
        return x
    else
        return 10.0
    end
end

test(True(), rand(10))

The Julia compiler is (currently) not very good at escape analysis of mutable arrays.
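To make the temporary array concrete, here is a small sketch (the function names are mine, just for illustration) comparing the bracketed comprehension against the generator:

```julia
# The comprehension materializes a full temporary Vector before summing;
# the generator sums lazily, with no intermediate array.
sumsq_array(v) = sum([y * y for y in v])  # allocates a temporary Vector
sumsq_gen(v)   = sum(y * y for y in v)    # no intermediate array

const data = rand(1000)
sumsq_array(data); sumsq_gen(data)        # warm up (compile) both first

println(@allocated sumsq_array(data))     # pays for the temporary array
println(@allocated sumsq_gen(data))       # far less (typically zero)
```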

The kind of optimization you are looking for is called code sinking (see the Wikipedia article on code motion). One would have to check the compiler passes that Julia uses to see whether we try to do that; I don’t think we currently do. To make it work, the compiler would have to prove that the functions called to compute x are pure, and that nothing escapes the scope.


I played around a little. No solution for the LLVM side of things, but to slim down the false case it is also useful to avoid the expensive rand() in the default value.

function test3(z::Bool=true, array::Vector{Float64}=Vector{Float64}())
    if z
        isempty(array) && (array = rand(1000))
        return sum([y*y for y in array])
    else
        return 10.0
    end
end

Thanks for giving a name to it.

Indeed, I thought that code sinking was part of the Julia compiler, and that writing in a pure-function style would have made it possible.

Best

Oftentimes using such a style will enable this kind of optimization, but here I believe the summation itself is not considered “pure” from the compiler’s perspective. If we use a slightly tweaked version of your function:

function test2(z::Bool, array::AbstractVector)
  x = sum(abs2, array)  
  if z
    return x
  else
    return 10.0
  end
end

The sum (or, in the original function, the array comprehension) will eventually call jl_alloc_array_1d in the C(++) runtime. To my knowledge, the compiler has very little knowledge of what this function does side-effect-wise, or of whether it is safe to move into, say, a conditional. Indeed, the generated native code unconditionally calculates the sum before evaluating the conditional:

julia> @code_native test2(true, rand(2))
	.text
; ┌ @ REPL[19]:1 within `test2`
	pushq	%rbx
	movl	%edi, %ebx
; │ @ REPL[19]:2 within `test2`
; │┌ @ reducedim.jl:890 within `sum`
; ││┌ @ reducedim.jl:890 within `#sum#733`
; │││┌ @ reducedim.jl:894 within `_sum`
; ││││┌ @ reducedim.jl:894 within `#_sum#735`
; │││││┌ @ reducedim.jl:322 within `mapreduce`
; ││││││┌ @ reducedim.jl:322 within `#mapreduce#725`
; │││││││┌ @ reducedim.jl:330 within `_mapreduce_dim`
	movabsq	$_mapreduce, %rax  # <-- eventually calls into C to allocate memory
	movq	%rsi, %rdi
	callq	*%rax
; │└└└└└└└
; │ @ REPL[19]:3 within `test2`
	testb	$1, %bl
	jne	L37
; │ @ REPL[19] within `test2`
	movabsq	$.rodata.cst8, %rax
	vmovsd	(%rax), %xmm0                   # xmm0 = mem[0],zero
; │ @ REPL[19]:4 within `test2`
L37:
	popq	%rbx
	retq
	nopw	(%rax,%rax)
; └

But if you try with a type that doesn’t require any heap allocation or other black-box operations, the compiler can guard the sum calculation behind the conditional:

julia> using StaticArrays

julia> @code_native test2(false, @SArray rand(2))
	.text
; ┌ @ REPL[19]:3 within `test2`
	testb	$1, %dil
	je	L25
; │ @ REPL[19] within `test2`
	vmovupd	(%rsi), %xmm0
	vmulpd	%xmm0, %xmm0, %xmm0
	vpermilpd	$1, %xmm0, %xmm1        # xmm1 = xmm0[1,0]
	vaddsd	%xmm1, %xmm0, %xmm0
; │ @ REPL[19]:4 within `test2`
	retq
; │ @ REPL[19] within `test2`
L25:
	movabsq	$.rodata.cst8, %rax  # <-- $.rodata.cst8 contains the value 10.0
	vmovsd	(%rax), %xmm0                   # xmm0 = mem[0],zero
; │ @ REPL[19]:4 within `test2`
	retq
	nopl	(%rax,%rax)
; └