Substituting elements of array is very slow

I am new to Julia and I am having problems with speed. What is the fastest way to replace rows in an array? Currently I iterate over every entry of a matrix, and when the result of the associated calculation is nonzero, I store the position (row and col) and the value (aval) in an additional array that will later serve for sparse matrix allocation.

The code below is a nonsensical, simplified example that gets to the point. On my computer, the assignments inside the if condition triple the computation time, from around 0.04 seconds to 0.12 seconds. What am I doing wrong here?

Thanks in advance.

function testfun02(M)
    nrow, ncol = size(M);
    spidx = Array{Float64, 2}(undef, nrow, 3);
    numit = 1;

    for j02 = 1:ncol, j01 in 1:nrow
        aval = M[j01,j02]*(j01+j02);
        if aval > 0;
            spidx[numit,1] = j01;
            spidx[numit,2] = j02;
            spidx[numit,3] = aval;
            numit +=1;
        end
    end
    return(spidx);
end


B = Matrix{Float64}(I, 10000, 10000);
@time testfun02(B);

Do you mean if the function is written like this:

function testfun02(M)
    nrow, ncol = size(M);
    spidx = Array{Float64, 2}(undef, nrow, 3);
    numit = 1;

    for j02 = 1:ncol, j01 in 1:nrow
        aval = M[j01,j02]*(j01+j02)
    end

    return spidx
end

It takes just 0.04 seconds? If I were you, I’d be surprised it took that long in the first place - after all, that function isn’t doing any real calculation that can be observed after the call.

Other than that, I think your iteration order is wonky - check Access arrays in memory order, along columns for more info.
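For reference, Julia arrays are stored column-major, so the inner loop should run over the first (row) index. A toy sketch of the recommended traversal (colmajor_sum is just an illustrative name, not from the thread):

```julia
# Toy sketch: column-major traversal touches memory contiguously.
function colmajor_sum(M)
    nrow, ncol = size(M)
    s = 0.0
    for j = 1:ncol      # outer loop over columns
        for i = 1:nrow  # inner loop over rows: consecutive addresses
            s += M[i, j]
        end
    end
    return s
end
```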


Exactly. Of course, I know that it does not return anything interesting; I am more interested in the original function, but I wanted to dissect the problem and find out where the time goes. For this purpose, the function seems reasonable. I checked the iteration order, but I think it is fine.

I don’t think so:

julia> function testfun04(M)                                                
           nrow, ncol = size(M)                                             
           spidx = Array{Float64, 2}(undef, nrow, 3)                        
           numit = 1                                                        
                                                                            
           for j01 = 1:ncol, j02 in 1:nrow                                  
               aval = M[j02,j01] * (j01+j02)                                              
           end                                          
                    
           return spidx                                
       end                                                                  
testfun04 (generic function with 1 method)                                  
                                                                            
julia> @time testfun04(B);                                                  
  0.025862 seconds (25.43 k allocations: 1.635 MiB, 99.55% compilation time)
                                                                            
julia> @time testfun04(B);                                                  
  0.000051 seconds (2 allocations: 234.422 KiB)                             

Ok. crazy. Thanks. That solves quite a lot.

Sorry, but what changed there? Seemed to me that the iteration order was correct initially already.

Something else that could improve the performance is to define spidx as (3,nrow), so the same column is accessed at each iteration for filling the arrays. But in that particular case, with those dimensions, the difference is small.
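A minimal sketch of that layout (a hypothetical testfun05, not from the thread; same logic, but spidx transposed to 3×nrow so each hit writes one contiguous column):

```julia
# Sketch: store each (row, col, value) triplet as one column of a 3×nrow array,
# so every iteration writes three consecutive memory locations.
function testfun05(M)
    nrow, ncol = size(M)
    spidx = Array{Float64, 2}(undef, 3, nrow)
    numit = 1
    for j02 = 1:ncol, j01 = 1:nrow
        aval = M[j01, j02] * (j01 + j02)
        if aval > 0
            spidx[1, numit] = j01
            spidx[2, numit] = j02
            spidx[3, numit] = aval
            numit += 1
        end
    end
    return spidx
end
```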


You’re right - seems like there was some erroneous testing on my part, I apologize!

I now also get the very fast speed with the first version (without the branch, of course). I’m not sure where the slowdown would come from, though I’d expect the version with the conditional to be somewhat slower, as the branch and the numit variable prevent SIMD (numit introduces a dependency on previous loop iterations).
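If the end goal is sparse matrix assembly, one way to sidestep the numit bookkeeping entirely is to push the triplets into growing vectors and hand them to sparse. A sketch (collect_triplets is a hypothetical name) - not necessarily faster here, but it also avoids having to guess the preallocation size:

```julia
using SparseArrays

# Sketch: accumulate (row, col, value) triplets, then build the sparse matrix.
function collect_triplets(M)
    nrow, ncol = size(M)
    Is = Int[]; Js = Int[]; Vs = Float64[]
    for j = 1:ncol, i = 1:nrow          # column-major order
        aval = M[i, j] * (i + j)
        if aval > 0
            push!(Is, i); push!(Js, j); push!(Vs, aval)
        end
    end
    return sparse(Is, Js, Vs, nrow, ncol)
end
```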


I just checked, and it looked like Julia had not compiled the function, so there was no substantial speed gain. After restarting my computer, this now works. With the stripped-down version of Sukera’s or mine, I get the quick results.

However, there still seems to be a substantial slowdown when I run the version with the full if condition. And this is what I still don’t understand (something cannot be efficiently compiled …)

@time testfun02(B);
  0.116541 seconds (98.29 k allocations: 4.410 MiB, 18.54% compilation time)
@time testfun02(B);
  0.095457 seconds (68.98 k allocations: 2.655 MiB)

There has to be something else going on here. ~70k allocations for a function that (from what you’ve posted) doesn’t allocate other than the output array seems very weird to me. Can you post the full code you’re running? I’ve had to do e.g. using LinearAlgebra earlier because of the I you’re using.

Also, which version are you running? Please post the output of versioninfo():

julia> versioninfo()                           
Julia Version 1.7.0-beta3                      
Commit e76c9dad42 (2021-07-07 08:12 UTC)       
Platform Info:                                 
  OS: Linux (x86_64-linux-gnu)                 
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64                                
  LIBM: libopenlibm                            
  LLVM: libLLVM-12.0.0 (ORCJIT, skylake)       
Environment:                                   
  JULIA_PKG_SERVER =                           
  JULIA_NUM_THREADS = 4                        

Also, here are some benchmarks using BenchmarkTools:
julia> using LinearAlgebra                                                      
                                                                                
julia> using BenchmarkTools                                                     
                                                                                
julia> function testfun02(M)                                                    
           nrow, ncol = size(M);                                                
           spidx = Array{Float64, 2}(undef, nrow, 3);                           
           numit = 1;                                                           
                                                                                
           for j02 = 1:ncol, j01 in 1:nrow                                      
               aval = M[j01,j02]*(j01+j02)                                      
           end                                                                  
                                                                                
           return spidx                                                         
       end                                                                      
testfun02 (generic function with 1 method)                                      
                                                                                
julia> function testfun03(M)                                                    
           nrow, ncol = size(M);                                                
           spidx = Array{Float64, 2}(undef, nrow, 3);                           
           numit = 1;                                                           
                                                                                
           for j02 = 1:ncol, j01 in 1:nrow                                      
               aval = M[j01,j02]*(j01+j02);                                     
               if aval > 0;                                                     
                   spidx[numit,1] = j01;                                        
                   spidx[numit,2] = j02;                                        
                   spidx[numit,3] = aval;                                       
                   numit +=1;                                                   
               end                                                              
           end                                                                  
           return(spidx);                                                       
       end                                                                      
testfun03 (generic function with 1 method)                                      
                                                                                
julia> @benchmark testfun02(B) setup=(B=Matrix{Float64}(I, 10000,10000)) evals=1
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):   6.300 ΞΌs … 41.900 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%     
 Time  (median):     13.500 ΞΌs              β”Š GC (median):    0.00%             
 Time  (mean Β± Οƒ):   16.880 ΞΌs Β± 14.626 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%     
                                                                                
  β–ˆ          ▁    ▁                                         ▁▁                  
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁                 
  6.3 ΞΌs          Histogram: frequency by time        41.9 ΞΌs <                 
                                                                                
 Memory estimate: 234.42 KiB, allocs estimate: 2.                               
                                                                                
julia> @benchmark testfun03(B) setup=(B=Matrix{Float64}(I, 10000,10000)) evals=1
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  105.353 ms … 108.477 ms  β”Š GC (min … max): 0.00% … 0.00%   
 Time  (median):     107.536 ms               β”Š GC (median):    0.00%           
 Time  (mean Β± Οƒ):   107.225 ms Β±   1.344 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%   
                                                                                
  β–ˆ                                    β–ˆ          β–ˆ           β–ˆβ–ˆ                
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁               
  105 ms           Histogram: frequency by time          108 ms <               
                                                                                
 Memory estimate: 234.42 KiB, allocs estimate: 2.                               

Thanks. Here is the complete code I am running. It was actually just the LinearAlgebra and BenchmarkTools imports that were missing.

using LinearAlgebra
using BenchmarkTools


function testfun02(M)
    nrow, ncol = size(M);
    spidx = Array{Float64, 2}(undef, nrow, 3)
    numit = 1
    
    for j01 = 1:ncol, j02 = 1:nrow
        aval = M[j02,j01]*(j01+j02)
        if aval > 0
            spidx[numit,1] = j01
            spidx[numit,2] = j02
            spidx[numit,3] = aval
            numit += 1
        end
    end
    return spidx 
end


B = Array{Float64,2}(I, 10000, 10000);
benchout = @benchmark testfun02(B);
@time testfun02(B);
@time testfun02(B);

The output is

@time testfun02(B);
  0.111525 seconds (27.23 k allocations: 1.895 MiB, 17.53% compilation time)

@time testfun02(B);
  0.096936 seconds (2 allocations: 234.453 KiB)

The version info is

Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

and that does indeed look strange.

Interesting - can you try with a 1.7 beta?

(As an aside, you don’t need ; at the end of every line - it only suppresses automatic output in the REPL, has no other effect on code, and does nothing when the code is run from a file.)


What exactly is strange here? Seems reasonable to me. Are you comparing this to some other implementation?

I compared to an older setup where the number of threads was already set to 4. But that is not the issue here; I can easily adjust the number of threads.

Since I get a much faster execution time than you do, and my CPU is older and has less cache than yours, I suspect the Julia or LLVM version makes a difference. Please test with one of the 1.7 beta binaries, since the generated code could differ substantially.

I think there is some confusion here. Seems to me that the benchmarks reported here are all similar, and similar to what I obtain.

I believe that the execution speed here is memory bounded, and I do not find the execution time particularly surprising.

You apparently run a benchmark with BenchmarkTools.@benchmark, but you don’t report its output, only the output of the @time macro, which is not part of BenchmarkTools.

Can you report the output from the @benchmark or @btime macro, and remember to interpolate the input argument, like this:

@benchmark testfun02($B)
@btime testfun02($B)

This is the output with @benchmark

BenchmarkTools.Trial: 53 samples with 1 evaluation.
 Range (min … max):  86.669 ms … 136.134 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     91.699 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   95.738 ms Β±  12.138 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ƒ    β–ˆ      
  β–†β–ˆβ–†β–†β–†β–†β–ˆβ–†β–ƒβ–ƒβ–ƒβ–ƒβ–β–ƒβ–β–…β–ƒβ–β–…β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–… ▁
  86.7 ms         Histogram: frequency by time          133 ms <

 Memory estimate: 2.35 MiB, allocs estimate: 68980.

Why are the numbers so different from @time? Sorry, this is probably a stupid question, but I am new to Julia and this confuses me. Using @benchmark, I would suspect that there are no issues at all, right?

Sorry, the numbers are on the same order of magnitude… 100 ms is 0.1 seconds. But most of the timing cost seems to come from the substitution of the elements in the array inside the if condition. And that is what I find surprising.

I think the point is that if your function does work that is not observable from the outside, the compiler may decide to skip those processing steps and finish much faster than you expect.

Therefore it’s important, even in benchmarks that aren’t supposed to do anything useful, that the work of the function is observable from the outside.
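A simple way to make the work observable is to return something that depends on every iteration, e.g. a running sum. A sketch (testsum is just an illustrative name):

```julia
# Sketch: the loop body feeds the return value,
# so the compiler cannot delete it as dead code.
function testsum(M)
    nrow, ncol = size(M)
    s = 0.0
    for j = 1:ncol, i = 1:nrow
        s += M[i, j] * (i + j)
    end
    return s
end
```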