Vector of indices vs. for loop to assign values in an array

Let the following initialisation be.

A = zeros(3,3) # 3x3 array of 0
cartId = CartesianIndices(A)[[1 2 5 7]] # cartesian indices at which some values have to be assigned in A
value = rand(length(cartId)) # the values to be assigned in A at location stored in cartId
  • Which one of the following code (#1 or #2) is faster/less ressource consuming ?

  • How to know it ?

Code #1:

# 1
for i = 1:length(cartId)
A[cartId[i]] = value[i]
end

Code #2:

# 2
A[cartId] = value

To find out you typically benchmark functions which do the operations. I have modified your example a bit, so that benchmarking is better to interprete:
using BenchmarkTools

n=100
A = zeros(n,n)
cartId = CartesianIndices(A)[rand(1:n*n,n)]
value = rand(length(cartId)) 

function fill1(A,cartId,value)
	for i = 1:length(cartId)
		A[cartId[i]] = value[i]
	end
end

function fill2(A,cartId,value)
	A[cartId] = value
end

@benchmark fill1($A,$cartId,$value)

@benchmark fill2($A,$cartId,$value)

The results look like:
for fill1:

BechmarkTools.Trial: 10000 samples with 961 evaluations.
 Range (min … max):  86.368 ns … 118.002 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     92.560 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   92.904 ns Β±   2.752 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                       β–…β–ˆβ–‚β–„
  β–β–β–β–β–β–β–β–ƒβ–β–‚β–„β–ƒβ–ƒβ–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–ˆβ–ˆβ–ˆβ–ˆβ–ƒβ–ƒβ–‚β–‚β–‚β–β–‚β–β–…β–‡β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  86.4 ns         Histogram: frequency by time          102 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

for fill2:

BechmarkTools.Trial: 10000 samples with 867 evaluations.
 Range (min … max):  139.677 ns … 212.803 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     142.330 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   143.855 ns Β±   4.422 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–ˆ
  β–β–‚β–‚β–ƒβ–ˆβ–„β–„β–ƒβ–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–β–β–ƒβ–ƒβ–‚β–‚β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  140 ns           Histogram: frequency by time          157 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

So, the loop is quite a bit faster.

2 Likes

Thank you.

How to understand that loops are faster here ? How the process differs from using fill1 to fill2 ?

The Why is not so easy to explain, for me it’s always some lengthy analysis of the lowered code.

In general loops in Julia are fast. Both your variants will boil down at some point to a loop, but the difference is how much the compiler can create fast code from the given Julia code. (My understanding here is not good enough to be more than vague).

What I do when I want to know what happens I use @code_lowered:

julia> @code_lowered fill1(A,cartId,value)
CodeInfo(
1 ─ %1  = Main.length(cartId)
β”‚   %2  = 1:%1
β”‚         @_5 = Base.iterate(%2)
β”‚   %4  = @_5 === nothing
β”‚   %5  = Base.not_int(%4)
└──       goto #4 if not %5
2 β”„ %7  = @_5
β”‚         i = Core.getfield(%7, 1)
β”‚   %9  = Core.getfield(%7, 2)
β”‚   %10 = Base.getindex(value, i)
β”‚   %11 = Base.getindex(cartId, i)
β”‚         Base.setindex!(A, %10, %11)
β”‚         @_5 = Base.iterate(%2, %9)
β”‚   %14 = @_5 === nothing
β”‚   %15 = Base.not_int(%14)
└──       goto #4 if not %15
3 ─       goto #2
4 β”„       return nothing
)

julia> @code_lowered fill2(A,cartId,value)
CodeInfo(
1 ─     Base.setindex!(A, value, cartId)
└──     return value
)

Well, fill2 looks much shorter, but:

julia> @code_lowered Base.setindex!(A, value, cartId)

CodeInfo(
1 ─      nothing
β”‚   %2 = Base.IndexStyle(A)
β”‚   %3 = Core.tuple(%2, A)
β”‚        Core._apply_iterate(Base.iterate, Base.error_if_canonical_setindex, %3, I)
β”‚   %5 = Base.IndexStyle(A)
β”‚   %6 = Core.tuple(%5, A, v)
β”‚   %7 = Base.to_indices(A, I)
β”‚   %8 = Core._apply_iterate(Base.iterate, Base._setindex!, %6, %7)
└──      return %8
)

At least , we can see, that it is a loop, because of

β”‚   %8 = Core._apply_iterate(Base.iterate, Base._setindex!, %6, %7)

Sounds like a loop, and is a loop.
You can go further by setting some temporary variables in the REPL according to this and see to what @code_lowered Core._apply_iterate ... will be lowered.

It’s a bit tedious, so I skip this. But what fill1 tells you is, that it checks for indices to be correct. As we already know this we can switch this off to squeeze a bit more of performance from it:

function fill1b(A,cartId,value)
	@inbounds for i = 1:length(cartId)
		A[cartId[i]] = value[i]
	end
end

julia> @benchmark fill1($A,$cartId,$value)
BechmarkTools.Trial: 10000 samples with 972 evaluations.
 Range (min … max):  87.140 ns … 125.926 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     91.770 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   91.517 ns Β±   2.401 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

         β–… ▅▆▁▄ β–‚β–„ β–ƒβ–‚β–‚β–ƒβ–‚β–ˆβ–„β–‚β–‚β–β–‚β–… ▁▁▃▃▃▂▂▄                       β–‚
  β–†β–β–β–β–„β–‡β–β–ˆβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–†β–†β–†β–‡β–ˆβ–†β–…β–‡β–…β–†β–‡β–†β–†β–‡β–‡ β–ˆ
  87.1 ns       Histogram: log(frequency) by time      99.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark fill1b($A,$cartId,$value)
BechmarkTools.Trial: 10000 samples with 987 evaluations.
 Range (min … max):  60.689 ns … 101.013 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     66.160 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   66.188 ns Β±   2.332 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                            β–ˆ        β–‚
  β–‚β–β–‚β–‚β–‚β–‚β–‚β–ƒβ–‚β–…β–ƒβ–ƒβ–ƒβ–„β–„β–„β–…β–†β–…β–„β–„β–ƒβ–…β–„β–ˆβ–†β–ˆβ–…β–…β–…β–ƒβ–ƒβ–„β–„β–„β–ˆβ–ƒβ–„β–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  60.7 ns         Histogram: frequency by time         73.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

This only to distract you from the question why fill1 is faster than fill2.

Performance questions typically attract many people, so lets just wait for the experts.

3 Likes

What you are seeing is the fact that the with the loop the compiler is being able to prove that it won’t access out of bounds areas, while the assignment is doing bounds checks. I was able to see that by checking the assembly code.
By adding an @inbounds in fill2 you can get better performance.

julia> function fill3(A,cartId,value)
               @inbounds A[cartId] = value
       end
julia> @benchmark fill3($A,$cartId,$value)
BenchmarkTools.Trial: 10000 samples with 978 evaluations.
 Range (min … max):  67.740 ns … 108.896 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     68.124 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   68.275 ns Β±   1.277 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–ˆβ–ˆ
  β–‚β–ƒβ–†β–†β–ˆβ–ˆβ–†β–†β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–‚
  67.7 ns         Histogram: frequency by time         72.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark fill1($A,$cartId,$value)
BenchmarkTools.Trial: 10000 samples with 972 evaluations.
 Range (min … max):  76.175 ns … 123.414 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     76.903 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   77.101 ns Β±   1.600 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–β–…β–‡β–‡β–‡β–‡β–β–ˆβ–‡β–‡β–ˆβ–†β–„β–
  β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–„β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–„
  76.2 ns         Histogram: frequency by time           81 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
4 Likes