# Substituting elements of array is very slow

I am new to Julia and I have problems with speed. What is the fastest way to replace rows in an array? Currently I iterate over every entry of a matrix, and when the result of the associated calculation is nonzero, I store the position (row and column) and the value (`aval`) in an additional array that will serve for sparse-matrix allocation at a later stage of the code.

The code below is a nonsensical, simplified example to get to the point. On my computer, the replacements inside the if-condition triple the computation time from around 0.04 seconds to 0.12 seconds. What am I doing wrong here?

```julia
function testfun02(M)
    nrow, ncol = size(M);
    spidx = Array{Float64, 2}(undef, nrow, 3);
    numit = 1;

    for j02 = 1:ncol, j01 in 1:nrow
        aval = M[j01, j02] * (j01 + j02);
        if aval > 0
            spidx[numit, 1] = j01;
            spidx[numit, 2] = j02;
            spidx[numit, 3] = aval;
            numit += 1;
        end
    end
    return spidx;
end

B = Matrix{Float64}(I, 10000, 10000);
@time testfun02(B);
```
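As a side note on the eventual goal (sparse-matrix allocation), a common pattern is to `push!` the (row, column, value) triplets into growing vectors and hand them to `SparseArrays.sparse` directly. A minimal sketch, with `tosparse` as a hypothetical name and the `aval > 0` rule taken from the code above:

```julia
using SparseArrays

# Collect (row, col, value) triplets in growing vectors and build the
# sparse matrix directly, instead of staging them in a dense array.
function tosparse(M)
    Is = Int[]; Js = Int[]; Vs = Float64[]
    nrow, ncol = size(M)
    for j in 1:ncol, i in 1:nrow   # column-major iteration order
        v = M[i, j] * (i + j)
        if v > 0
            push!(Is, i); push!(Js, j); push!(Vs, v)
        end
    end
    return sparse(Is, Js, Vs, nrow, ncol)
end

S = tosparse([1.0 0.0; 0.0 1.0])
```

This also sidesteps having to guess the number of nonzeros in advance, since `push!` grows the vectors as needed.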

Do you mean if the function is written like this:

```julia
function testfun02(M)
    nrow, ncol = size(M);
    spidx = Array{Float64, 2}(undef, nrow, 3);
    numit = 1;

    for j02 = 1:ncol, j01 in 1:nrow
        aval = M[j01, j02] * (j01 + j02)
    end

    return spidx
end
```

It takes just 0.04 seconds? If I were you, I'd be surprised it took that long in the first place - after all, that function isn't doing any real calculation whose result can be observed after the call.

Other than that, I think your iteration order is wonky - check "Access arrays in memory order, along columns" in the performance tips for more info.
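For reference, Julia stores arrays column-major, so the first index should vary fastest in the inner loop. A minimal sketch (the function name is illustrative):

```julia
# Column-major access: the inner loop runs over rows (first index),
# so consecutive iterations touch contiguous memory.
function colmajor_sum(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)  # j outer (columns), i inner (rows)
        s += A[i, j]
    end
    return s
end
```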


Exactly. Of course, I know that it does not return anything interesting; I am more interested in the original function, but I wanted to dissect the problem and find out where it takes time. And for this purpose, the function seems reasonable. I checked the iteration order, but I think it is fine.

I donβt think so:

```julia
julia> function testfun04(M)
           nrow, ncol = size(M)
           spidx = Array{Float64, 2}(undef, nrow, 3)
           numit = 1

           for j01 = 1:ncol, j02 in 1:nrow
               aval = M[j02, j01] * (j01 + j02)
           end

           return spidx
       end
testfun04 (generic function with 1 method)

julia> @time testfun04(B);
  0.025862 seconds (25.43 k allocations: 1.635 MiB, 99.55% compilation time)

julia> @time testfun04(B);
  0.000051 seconds (2 allocations: 234.422 KiB)
```

Ok, crazy. Thanks. That solves quite a lot.

Sorry, but what changed there? Seemed to me that the iteration order was correct initially already.

Something else that could improve the performance is to define `spidx` as `(3,nrow)`, so the same column is accessed at each iteration for filling the arrays. But in that particular case, with those dimensions, the difference is small.
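A sketch of what that layout change would look like, assuming the same `aval > 0` rule as in the original function (`testfun_cols` is a hypothetical name):

```julia
# Store the triplets in a 3×nrow array so that each loop iteration writes
# one column, which is contiguous in Julia's column-major layout.
function testfun_cols(M)
    nrow, ncol = size(M)
    spidx = Array{Float64, 2}(undef, 3, nrow)  # transposed: 3 rows
    numit = 1
    for j in 1:ncol, i in 1:nrow
        aval = M[i, j] * (i + j)
        if aval > 0
            spidx[1, numit] = i
            spidx[2, numit] = j
            spidx[3, numit] = aval
            numit += 1
        end
    end
    return spidx
end
```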


Youβre right - seems like there was some erronous testing on my part, I apologize!

I now also get the very fast speed with the first version (without the branch, of course). Not sure where the slowdown would come from, though I'd expect the version with the conditional to be somewhat slower, as the branch and the `numit` variable prevent SIMD (`numit` introduces a loop-carried dependency on previous iterations).
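To illustrate the SIMD point with a toy example (not the original function): a branch-free reduction using `ifelse` carries no loop dependency and can vectorize, unlike the store-and-increment pattern above.

```julia
# Counts entries where M[i,j]*(i+j) > 0 without a branch or a running
# write index; `ifelse` evaluates both arms, keeping the loop SIMD-friendly.
function count_positive(M)
    c = 0
    for j in axes(M, 2)
        @simd for i in axes(M, 1)
            c += ifelse(M[i, j] * (i + j) > 0, 1, 0)
        end
    end
    return c
end
```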


I just checked, and it looked like Julia did not compile the function, so there was no substantial speed gain. After restarting my computer, this now works. With the stripped-down version from Sukera or mine, I get the quick results.

However, there seems to be a substantial slowdown when I run the full if-condition. And this is what I still don't understand (something cannot be efficiently compiled…)

```julia
@time testfun02(B);
  0.116541 seconds (98.29 k allocations: 4.410 MiB, 18.54% compilation time)
@time testfun02(B);
  0.095457 seconds (68.98 k allocations: 2.655 MiB)
```

There has to be something else going on here. ~70k allocations for a function that (from what you've posted) doesn't allocate other than the output array seems very weird to me. Can you post the full code you're running? I've had to do e.g. `using LinearAlgebra` earlier because of the `I` you're using.

Also, which version are you running? Please post the output of `versioninfo()`:

```julia
julia> versioninfo()
Julia Version 1.7.0-beta3
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, skylake)
Environment:
  JULIA_PKG_SERVER =
```
Also, here are some benchmarks using `BenchmarkTools`
```julia
julia> using LinearAlgebra

julia> using BenchmarkTools

julia> function testfun02(M)
           nrow, ncol = size(M);
           spidx = Array{Float64, 2}(undef, nrow, 3);
           numit = 1;

           for j02 = 1:ncol, j01 in 1:nrow
               aval = M[j01, j02] * (j01 + j02)
           end

           return spidx
       end
testfun02 (generic function with 1 method)

julia> function testfun03(M)
           nrow, ncol = size(M);
           spidx = Array{Float64, 2}(undef, nrow, 3);
           numit = 1;

           for j02 = 1:ncol, j01 in 1:nrow
               aval = M[j01, j02] * (j01 + j02);
               if aval > 0
                   spidx[numit, 1] = j01;
                   spidx[numit, 2] = j02;
                   spidx[numit, 3] = aval;
                   numit += 1;
               end
           end
           return spidx;
       end
testfun03 (generic function with 1 method)

julia> @benchmark testfun02(B) setup=(B=Matrix{Float64}(I, 10000, 10000)) evals=1
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):   6.300 μs … 41.900 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.880 μs ± 14.626 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  6.3 μs          Histogram: frequency by time        41.9 μs <

 Memory estimate: 234.42 KiB, allocs estimate: 2.

julia> @benchmark testfun03(B) setup=(B=Matrix{Float64}(I, 10000, 10000)) evals=1
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  105.353 ms … 108.477 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     107.536 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   107.225 ms ±   1.344 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  105 ms           Histogram: frequency by time          108 ms <

 Memory estimate: 234.42 KiB, allocs estimate: 2.
```

Thanks. Here is the complete code I am running. It was actually just the `LinearAlgebra` and `BenchmarkTools` imports that were missing.

```julia
using LinearAlgebra
using BenchmarkTools

function testfun02(M)
    nrow, ncol = size(M);
    spidx = Array{Float64, 2}(undef, nrow, 3)
    numit = 1

    for j01 = 1:ncol, j02 = 1:nrow
        aval = M[j02, j01] * (j01 + j02)
        if aval > 0
            spidx[numit, 1] = j01
            spidx[numit, 2] = j02
            spidx[numit, 3] = aval
            numit += 1
        end
    end
    return spidx
end

B = Array{Float64,2}(I, 10000, 10000);
benchout = @benchmark testfun02(B);
@time testfun02(B);
@time testfun02(B);
```

The Output is

```julia
@time testfun02(B);
  0.111525 seconds (27.23 k allocations: 1.895 MiB, 17.53% compilation time)

@time testfun02(B);
  0.096936 seconds (2 allocations: 234.453 KiB)
```

The version info is

```julia
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
```

and that looks indeed strange.

Interesting - can you try with a 1.7 beta?

(Aside, you donβt need `;` at the end of every line - that only suppresses automatic output in the REPL, it has no other effect on code and does nothing when the code is run from a file.)


What exactly is strange here? Seems reasonable to me. Are you comparing this to some other implementation?

I compared to an older setup where the number of threads was already set to 4. But this is not the issue here. I can easily adjust the number of threads.

Since I get a much faster execution time than you do and my CPU is older & has less cache than yours does, I suspect the julia or LLVM version makes a difference. Please test with one of the binaries from the 1.7 beta, since that could be a major difference in the code generated.

I think there is some confusion here. Seems to me that the benchmarks reported here are all similar, and similar to what I obtain.

I believe that the execution speed here is memory bounded, and I do not find the execution time particularly surprising.

You apparently ran a benchmark with `BenchmarkTools.@benchmark`, but you don't report its output, only the output of the `@time` macro, which is not part of BenchmarkTools.

Can you report the output from the `@benchmark` or `@btime` macro, and remember to interpolate the input argument, like this

```julia
@benchmark testfun02($B)
@btime testfun02($B)
```

This is the output with @benchmark

```julia
BenchmarkTools.Trial: 53 samples with 1 evaluation.
 Range (min … max):  86.669 ms … 136.134 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     91.699 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   95.738 ms ±  12.138 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  86.7 ms         Histogram: frequency by time          133 ms <

 Memory estimate: 2.35 MiB, allocs estimate: 68980.
```

Why are the numbers so different from `@time`? Sorry, this is probably a stupid question, but I am new to Julia and this confuses me. Using `@benchmark`, I would suspect that there are no issues at all, right?

Sorry, the numbers are of the same order of magnitude… 100 ms is 0.1 seconds. But most of the timing problem seems to come from the substitution of the elements in the array inside the if-condition. And this, I find, is surprising.

I think the point is that if your function does work that is not observable from outside the function, then the compiler may decide to skip those processing steps, and finish much faster than you expect.

Therefore itβs important, even in benchmarks that arenβt supposed to do anything useful, that the work of the function is observable from the outside.