Huge performance improvement by separating function?

function testThread00()
	a = zeros(Int, 1000)

	for i=1:length(a)
		for j = 1:1000000
			a[i] = j+i
		end
	end

	println(sum(a))
end

function testThreada0()
	a = zeros(Int, 1000)
	testThread_10(a)

	println(sum(a))
end

function testThread_10(a::Array{Int})
	for i=1:length(a)
		for j = 1:1000000
			a[i] = j+i
		end
	end
	nothing
end
$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.7.0 (2018-08-08 06:46 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> include("testThreads.jl")
testThread_1 (generic function with 1 method)

julia> @timev testThread00()
1000500500
  0.463337 seconds (69.65 k allocations: 3.470 MiB)
elapsed time (ns): 463337432
bytes allocated:   3638054
pool allocs:       69635
non-pool GC allocs:13

julia> @timev testThread00()
1000500500
  0.370684 seconds (13 allocations: 8.297 KiB)
elapsed time (ns): 370683786
bytes allocated:   8496
pool allocs:       12
non-pool GC allocs:1

julia> @timev testThreada0()
1000500500
  0.031791 seconds (28.52 k allocations: 1.440 MiB)
elapsed time (ns): 31790655
bytes allocated:   1509658
pool allocs:       28522
non-pool GC allocs:2

julia> @timev testThreada0()
1000500500
  0.000027 seconds (13 allocations: 8.297 KiB)
elapsed time (ns): 27118
bytes allocated:   8496
pool allocs:       12
non-pool GC allocs:1

I know the documentation recommends putting core computations in separate functions, and it sounds like that has more to do with type stability. But in my test there is no type uncertainty involved, so why is there such a huge difference in performance?
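For context, the pattern the performance tips describe is a "function barrier" around type-unstable code; a minimal sketch (with illustrative names, not taken from the docs) might look like:

```julia
# Minimal function-barrier sketch (illustrative names, not from the docs).
# `data` has a type the compiler cannot predict, so the outer function is
# type-unstable; passing it to a separate kernel lets Julia compile a
# specialized, fast method for whichever concrete type shows up.
function unstable_outer(flag::Bool)
    data = flag ? rand(Int, 100) : rand(100)   # Vector{Int} or Vector{Float64}
    return kernel_sum(data)                    # function barrier
end

kernel_sum(x) = sum(x)   # specialized per concrete element type
```

In my test above there is no such instability: `a` is always a `Vector{Int}`.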


Interesting. I can reproduce.
Modifying the functions to return the value instead of print,

julia> using BenchmarkTools

julia> @btime testThreada0b()
  1.175 μs (1 allocation: 7.94 KiB)
1000500500

julia> @btime testThread00b()
  253.125 ms (1 allocation: 7.94 KiB)
1000500500

I think the separated version (testThreada0b) manages to elide bounds checks, while the single-function version does not. I don’t know why that is, but:

julia> function testThread00ib()
               a = zeros(Int, 1000)
       
               @inbounds for i=1:length(a)
                       for j = 1:1000000
                               a[i] = j+i
                       end
               end
       
               sum(a)
       end
testThread00ib (generic function with 1 method)


julia> function testThread_a0ib()
               a = zeros(Int, 1000)
               testThread_10ib(a)
       
               sum(a)
       end
testThread_a0ib (generic function with 1 method)

julia> function testThread_10ib(a::Array{Int})
               @inbounds for i=1:length(a)
                       for j = 1:1000000
                               a[i] = j+i
                       end
               end
       end
testThread_10ib (generic function with 1 method)

julia> @btime testThread00ib()
  946.156 ns (1 allocation: 7.94 KiB)
1000500500


julia> @btime testThread_a0ib()
  947.273 ns (1 allocation: 7.94 KiB)
1000500500

Now they’re equally fast (and faster than either version before).

The @code_native and @code_typed output are very different for the two of them. I think @Elrod is right: it does seem to be avoiding many bounds checks.

The @code_typed for testThread00() is full of

%88  = (Base.slt_int)(%58, 0)::Bool
   │           (Base.ifelse)(%88, 0, %58)::Int64

which is checking whether the value is < 0.

Thanks for the analysis.

I was actually testing the speedup from using threads. With these results, it appears threads are irrelevant here; it actually hurts to use multiple threads. See the two additional tests:

function testThread()
	# a = SharedArray{Int}(1000)
	a = zeros(Int, 1000)

	for i=1:length(a)
		Threads.@threads for j = 1:1000000
			a[i] = j+i
		end
	end

	println(sum(a))
end

function testThread1()
	# a = SharedArray{Int}(1000)
	a = zeros(Int, 1000)

	Threads.@threads for i=1:length(a)
		for j = 1:1000000
			a[i] = j+i
		end
	end

	println(sum(a))
end
$ export JULIA_NUM_THREADS=2
$ julia --check-bounds=no --math-mode=fast

julia> Threads.nthreads()
2

julia> include("testThreads.jl")
testThread_1 (generic function with 1 method)

julia> @timev testThread00()
1000500500
  0.072504 seconds (69.64 k allocations: 3.428 MiB)
elapsed time (ns): 72503629
bytes allocated:   3594534
pool allocs:       69635
non-pool GC allocs:10

julia> @timev testThread00()
1000500500
  0.000039 seconds (13 allocations: 8.297 KiB)
elapsed time (ns): 38903
bytes allocated:   8496
pool allocs:       12
non-pool GC allocs:1

julia> @timev testThread()
999000500
  0.047719 seconds (65.30 k allocations: 3.341 MiB)
elapsed time (ns): 47718671
bytes allocated:   3502871
pool allocs:       65288
non-pool GC allocs:15

julia> @timev testThread()
1000500500
  0.000498 seconds (1.01 k allocations: 55.172 KiB)
elapsed time (ns): 498414
bytes allocated:   56496
pool allocs:       1012
non-pool GC allocs:1

julia> @timev testThread1()
1000500500
  0.058109 seconds (61.24 k allocations: 3.134 MiB)
elapsed time (ns): 58108797
bytes allocated:   3286419
pool allocs:       61232
non-pool GC allocs:9

julia> @timev testThread1()
1000500500
  0.011405 seconds (14 allocations: 8.328 KiB)
elapsed time (ns): 11405298
bytes allocated:   8528
pool allocs:       13
non-pool GC allocs:1

Using a SharedArray doesn’t make much difference.

It seems like in some cases the compiler is able to remove the inner loop, the one over j:

function foo()
	a = zeros(Int, 1000)
	for i in 1:length(a)
		for j in 1:1_000_000
			a[i] = j+i
		end
	end
	return sum(a)
end

function foo2()
	a = zeros(Int, 1000)
	for i in 1:length(a)
		for j in 1:1_000
			a[i] = j+i
		end
	end
	return sum(a)
end


function bar(N)
	a = zeros(Int, 1000)
	for i in 1:length(a)
		for j in 1:N
			a[i] = j+i
		end
	end
	return sum(a)
end
julia> @btime foo()
  290.887 ms (1 allocation: 7.94 KiB) # note ms
1000500500

julia> @btime foo2()
  299.027 μs (1 allocation: 7.94 KiB)  # note μs
1500500

julia> @btime bar(1_000_000)
  1.047 μs (1 allocation: 7.94 KiB)
1000500500

The first N-1 iterations of the inner loop are irrelevant, so you can skip straight to the last one. I don’t get why this happens in one case but not the other.
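In other words, the transformation the compiler can make (sketched by hand here; this is my reconstruction, not compiler output) is roughly:

```julia
# Hand-written version of the reduction the compiler can apply to `bar`:
# the inner loop only keeps its last write, so for N >= 1 the whole loop
# collapses to a single store of N + i (for N < 1 the loop body never runs
# and `a` stays zeroed).
function bar_collapsed(N)
    a = zeros(Int, 1000)
    if N >= 1
        for i in 1:length(a)
            a[i] = N + i
        end
    end
    return sum(a)
end
```

This gives the same results as `bar(N)` (e.g. 1000500500 for N = 1_000_000, matching the timings above) without ever running the million-iteration inner loop.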

And it gets worse if you try it in Julia 1.0.

Julia Version 0.7.0
Commit a4cb80f3ed (2018-08-08 06:46 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

julia> @btime foo()
  264.405 ms (1 allocation: 7.94 KiB)
1000500500

julia> @btime foo2()
  264.273 μs (1 allocation: 7.94 KiB)
1500500

julia> @btime bar(1_000_000)
  1.029 μs (1 allocation: 7.94 KiB)
1000500500
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

julia> @btime foo()
  11.564 s (1 allocation: 7.94 KiB)
1000500500

julia> @btime foo2()
  10.779 ms (1 allocation: 7.94 KiB)
1500500

julia> @btime bar(1_000_000)
  11.184 s (1 allocation: 7.94 KiB)
1000500500

That’s odd, I get these times on 1.0:

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

julia> @btime foo()
  324.058 ms (1 allocation: 7.94 KiB)
1000500500

julia> @btime foo2()
  654.000 μs (1 allocation: 7.94 KiB)
1500500

julia> @btime bar(1_000_000)
  1.500 μs (1 allocation: 7.94 KiB)
1000500500

So from that it seems like the optimisation is still there. 0.7 and 1.0 should also both have the same times, since the only difference between them should be the removed deprecations.

My timings were from 1.0, so you have something else going on there.

Unfortunately not in my case.

That is indeed the weirdest part…
I would guess macOS Mojave has something to do with it, but what? Is there something more in the 1.0 version than the removal of deprecations?
I’ve tried removing 1.0 completely and doing a clean install, but the issue persists.

On my Xeon I get results like DNF’s and Sukera’s.

julia> versioninfo()
Julia Version 1.1.0-DEV.74
Commit f6c48ebf83 (2018-08-20 09:16 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> using BenchmarkTools

julia> @btime foo()
  267.387 ms (1 allocation: 7.94 KiB)
1000500500

julia> @btime foo2()
  269.914 μs (1 allocation: 7.94 KiB)
1500500

julia> @btime bar(1_000_000)
  994.300 ns (1 allocation: 7.94 KiB)
1000500500

I just tried Julia 0.6.3 and Julia 1.0 on my MacBook with macOS Mojave and cannot confirm slowbrain’s findings:

Julia 0.6.3 on macOS Mojave

julia> versioninfo()
Julia Version 0.6.3
Commit d55cadc350 (2018-05-28 20:20 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-5650U CPU @ 2.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

julia> @btime foo()
  329.445 ms (1 allocation: 7.94 KiB)
1000500500

julia> @btime foo2()
  324.058 μs (1 allocation: 7.94 KiB)
1500500

julia> @btime bar(1_000_000)
  326.214 ms (1 allocation: 7.94 KiB)
1000500500

Julia 1.0.0 on macOS Mojave

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-5650U CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)

julia> @btime foo()
  315.616 ms (1 allocation: 7.94 KiB)
1000500500

julia> @btime foo2()
  321.118 μs (1 allocation: 7.94 KiB)
1500500

julia> @btime bar(1_000_000)
  1.334 μs (1 allocation: 7.94 KiB)
1000500500

Still, I’m very confused why the loop optimisation only works for bar :confused:

For me it is reproducible on two macOS Mojave (18A371a) installations.
I’ve checked on 1.0.1-pre.0 Commit 1dd2f8b397 and 1.1.0-DEV.126 Commit 5013cfa897… No problem with performance there.

Never mind… let’s focus on the loop optimisation problem.

edit: the whole problem was not having -O3 on

FWIW, this is exactly what is described in the performance tips: https://docs.julialang.org/en/v1/manual/performance-tips/#kernal-functions-1


No, I don’t think so. That example is about using a function barrier to mitigate a type instability. That’s not the issue here.


I get totally different timings (first is fastest):

julia> @btime foo()
  589.359 ns (1 allocation: 7.94 KiB)
1000500500

julia> @btime foo2()
  753.897 ns (1 allocation: 7.94 KiB)
1500500

julia> @btime bar(1_000_000)
  718.400 ns (1 allocation: 7.94 KiB)
1000500500

What are your system specs?

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

I use -O3 --check-bounds=no --math-mode=fast; these usually enable more optimizations than -O3 alone.