Substituting elements of array is very slow

I am new to Julia and I am having problems with speed. What is the fastest way to replace rows in an array? Currently I iterate over every entry of a matrix, and when the result of the associated calculation is nonzero, I store the position (row and col) and the value (aval) in an additional array that will later serve for sparse matrix allocation.

The code below is a nonsensical, simplified example that gets to the point. On my computer, the assignments inside the if condition triple the computation time, from around 0.04 seconds to 0.12 seconds. What am I doing wrong here?

Thanks in advance.

function testfun02(M)
    nrow, ncol = size(M);
    spidx = Array{Float64, 2}(undef, nrow, 3);
    numit = 1;

    for j02 = 1:ncol, j01 in 1:nrow
        aval = M[j01,j02]*(j01+j02);
        if aval > 0;
            spidx[numit,1] = j01;
            spidx[numit,2] = j02;
            spidx[numit,3] = aval;
            numit +=1;
        end
    end
    return(spidx);
end


B = Matrix{Float64}(I, 10000, 10000);
@time testfun02(B);

Do you mean if the function is written like this:

function testfun02(M)
    nrow, ncol = size(M);
    spidx = Array{Float64, 2}(undef, nrow, 3);
    numit = 1;

    for j02 = 1:ncol, j01 in 1:nrow
        aval = M[j01,j02]*(j01+j02)
    end

    return spidx
end

It takes just 0.04 seconds? If I were you, I’d be surprised it took that long in the first place - after all, that function isn’t doing any real calculation that can be observed after the call.

Other than that, I think your iteration order is wonky - check Access arrays in memory order, along columns for more info.
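For reference, Julia arrays are stored column-major, so the inner loop should run over the first (row) index. A toy sketch of the recommended traversal (colmajor_sum is just an illustrative name, not from the thread):

```julia
# Toy sketch: column-major traversal touches memory contiguously.
function colmajor_sum(M)
    nrow, ncol = size(M)
    s = 0.0
    for j = 1:ncol      # outer loop over columns
        for i = 1:nrow  # inner loop over rows: consecutive addresses
            s += M[i, j]
        end
    end
    return s
end
```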


Exactly. Of course, I know that it does not return anything interesting; I am more interested in the original function, but I wanted to dissect the problem and find out where the time goes. For this purpose, the function seems reasonable. I checked the iteration order, but I think it is fine.

I don’t think so:

julia> function testfun04(M)                                                
           nrow, ncol = size(M)                                             
           spidx = Array{Float64, 2}(undef, nrow, 3)                        
           numit = 1                                                        
                                                                            
           for j01 = 1:ncol, j02 in 1:nrow                                  
               aval = M[j02,j01] * (j01+j02)                                              
           end                                          
                    
           return spidx                                
       end                                                                  
testfun04 (generic function with 1 method)                                  
                                                                            
julia> @time testfun04(B);                                                  
  0.025862 seconds (25.43 k allocations: 1.635 MiB, 99.55% compilation time)
                                                                            
julia> @time testfun04(B);                                                  
  0.000051 seconds (2 allocations: 234.422 KiB)                             

Ok. crazy. Thanks. That solves quite a lot.

Sorry, but what changed there? Seemed to me that the iteration order was correct initially already.

Something else that could improve the performance is to define spidx as (3,nrow), so the same column is accessed at each iteration for filling the arrays. But in that particular case, with those dimensions, the difference is small.
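A minimal sketch of that layout (a hypothetical testfun05, not from the thread; same logic, but spidx transposed to 3×nrow so each hit writes one contiguous column):

```julia
# Sketch: store each (row, col, value) triplet as one column of a 3×nrow array,
# so every iteration writes three consecutive memory locations.
function testfun05(M)
    nrow, ncol = size(M)
    spidx = Array{Float64, 2}(undef, 3, nrow)
    numit = 1
    for j02 = 1:ncol, j01 = 1:nrow
        aval = M[j01, j02] * (j01 + j02)
        if aval > 0
            spidx[1, numit] = j01
            spidx[2, numit] = j02
            spidx[3, numit] = aval
            numit += 1
        end
    end
    return spidx
end
```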


You’re right - seems like there was some erroneous testing on my part, I apologize!

I now also get the very fast speed with the first version (without the branch, of course). I’m not sure where the slowdown would come from, though I’d expect the version with the conditional to be somewhat slower, as the branch and the numit variable prevent SIMD (numit introduces a dependency on previous loop iterations).
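If the end goal is sparse matrix assembly, one way to sidestep the numit bookkeeping entirely is to push the triplets into growing vectors and hand them to sparse. A sketch (collect_triplets is a hypothetical name) - not necessarily faster here, but it also avoids having to guess the preallocation size:

```julia
using SparseArrays

# Sketch: accumulate (row, col, value) triplets, then build the sparse matrix.
function collect_triplets(M)
    nrow, ncol = size(M)
    Is = Int[]; Js = Int[]; Vs = Float64[]
    for j = 1:ncol, i = 1:nrow          # column-major order
        aval = M[i, j] * (i + j)
        if aval > 0
            push!(Is, i); push!(Js, j); push!(Vs, aval)
        end
    end
    return sparse(Is, Js, Vs, nrow, ncol)
end
```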


I just checked, and it looked like Julia had not compiled the function, so there was no substantial speed gain. After restarting my computer, this now works. With the stripped-down version of Sukera’s or mine, I get the quick results.

However, there still seems to be a substantial slowdown when I run the version with the full if condition. And this is what I still don’t understand (something cannot be efficiently compiled …)

@time testfun02(B);
  0.116541 seconds (98.29 k allocations: 4.410 MiB, 18.54% compilation time)
@time testfun02(B);
  0.095457 seconds (68.98 k allocations: 2.655 MiB)

There has to be something else going on here. ~70k allocations for a function that (from what you’ve posted) doesn’t allocate other than the output array seems very weird to me. Can you post the full code you’re running? I’ve had to do e.g. using LinearAlgebra earlier because of the I you’re using.

Also, which version are you running? Please post the output of versioninfo():

julia> versioninfo()                           
Julia Version 1.7.0-beta3                      
Commit e76c9dad42 (2021-07-07 08:12 UTC)       
Platform Info:                                 
  OS: Linux (x86_64-linux-gnu)                 
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64                                
  LIBM: libopenlibm                            
  LLVM: libLLVM-12.0.0 (ORCJIT, skylake)       
Environment:                                   
  JULIA_PKG_SERVER =                           
  JULIA_NUM_THREADS = 4                        

Also, here are some benchmarks using BenchmarkTools:
julia> using LinearAlgebra                                                      
                                                                                
julia> using BenchmarkTools                                                     
                                                                                
julia> function testfun02(M)                                                    
           nrow, ncol = size(M);                                                
           spidx = Array{Float64, 2}(undef, nrow, 3);                           
           numit = 1;                                                           
                                                                                
           for j02 = 1:ncol, j01 in 1:nrow                                      
               aval = M[j01,j02]*(j01+j02)                                      
           end                                                                  
                                                                                
           return spidx                                                         
       end                                                                      
testfun02 (generic function with 1 method)                                      
                                                                                
julia> function testfun03(M)                                                    
           nrow, ncol = size(M);                                                
           spidx = Array{Float64, 2}(undef, nrow, 3);                           
           numit = 1;                                                           
                                                                                
           for j02 = 1:ncol, j01 in 1:nrow                                      
               aval = M[j01,j02]*(j01+j02);                                     
               if aval > 0;                                                     
                   spidx[numit,1] = j01;                                        
                   spidx[numit,2] = j02;                                        
                   spidx[numit,3] = aval;                                       
                   numit +=1;                                                   
               end                                                              
           end                                                                  
           return(spidx);                                                       
       end                                                                      
testfun03 (generic function with 1 method)                                      
                                                                                
julia> @benchmark testfun02(B) setup=(B=Matrix{Float64}(I, 10000,10000)) evals=1
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):   6.300 ΞΌs … 41.900 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%     
 Time  (median):     13.500 ΞΌs              β”Š GC (median):    0.00%             
 Time  (mean Β± Οƒ):   16.880 ΞΌs Β± 14.626 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%     
                                                                                
  β–ˆ          ▁    ▁                                         ▁▁                  
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁                 
  6.3 ΞΌs          Histogram: frequency by time        41.9 ΞΌs <                 
                                                                                
 Memory estimate: 234.42 KiB, allocs estimate: 2.                               
                                                                                
julia> @benchmark testfun03(B) setup=(B=Matrix{Float64}(I, 10000,10000)) evals=1
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  105.353 ms … 108.477 ms  β”Š GC (min … max): 0.00% … 0.00%   
 Time  (median):     107.536 ms               β”Š GC (median):    0.00%           
 Time  (mean Β± Οƒ):   107.225 ms Β±   1.344 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%   
                                                                                
  β–ˆ                                    β–ˆ          β–ˆ           β–ˆβ–ˆ                
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁               
  105 ms           Histogram: frequency by time          108 ms <               
                                                                                
 Memory estimate: 234.42 KiB, allocs estimate: 2.                               

Thanks. Here is the complete code I am running. It was actually just the LinearAlgebra and BenchmarkTools imports that were missing.

using LinearAlgebra
using BenchmarkTools


function testfun02(M)
    nrow, ncol = size(M);
    spidx = Array{Float64, 2}(undef, nrow, 3)
    numit = 1
    
    for j01 = 1:ncol, j02 = 1:nrow
        aval = M[j02,j01]*(j01+j02)
        if aval > 0
            spidx[numit,1] = j01
            spidx[numit,2] = j02
            spidx[numit,3] = aval
            numit += 1
        end
    end
    return spidx 
end


B = Array{Float64,2}(I, 10000, 10000);
benchout = @benchmark testfun02(B);
@time testfun02(B);
@time testfun02(B);

The output is

@time testfun02(B);
  0.111525 seconds (27.23 k allocations: 1.895 MiB, 17.53% compilation time)

@time testfun02(B);
  0.096936 seconds (2 allocations: 234.453 KiB)

The version info is

Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

and that does indeed look strange.

Interesting - can you try with a 1.7 beta?

(As an aside, you don’t need ; at the end of every line - it only suppresses automatic output in the REPL, has no other effect on code, and does nothing when the code is run from a file.)


What exactly is strange here? Seems reasonable to me. Are you comparing this to some other implementation?

I compared to an older setup where the number of threads was already set to 4. But that is not the issue here; I can easily adjust the number of threads.

Since I get a much faster execution time than you do, and my CPU is older and has less cache than yours, I suspect the Julia or LLVM version makes a difference. Please test with one of the 1.7 beta binaries, since the generated code could differ substantially.

I think there is some confusion here. Seems to me that the benchmarks reported here are all similar, and similar to what I obtain.

I believe that the execution speed here is memory bounded, and I do not find the execution time particularly surprising.

You apparently run a benchmark with BenchmarkTools.@benchmark, but you don’t report its output, only the output of the @time macro, which is not part of BenchmarkTools.

Can you report the output from the @benchmark or @btime macro, and remember to interpolate the input argument, like this:

@benchmark testfun02($B)
@btime testfun02($B)

This is the output with @benchmark

BenchmarkTools.Trial: 53 samples with 1 evaluation.
 Range (min … max):  86.669 ms … 136.134 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     91.699 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   95.738 ms Β±  12.138 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ƒ    β–ˆ      
  β–†β–ˆβ–†β–†β–†β–†β–ˆβ–†β–ƒβ–ƒβ–ƒβ–ƒβ–β–ƒβ–β–…β–ƒβ–β–…β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–… ▁
  86.7 ms         Histogram: frequency by time          133 ms <

 Memory estimate: 2.35 MiB, allocs estimate: 68980.

Why are the numbers so different from @time? Sorry, this is probably a stupid question, but I am new to Julia and this confuses me. Using @benchmark, I would suspect that there are no issues at all, right?

Sorry, the numbers are on the same order of magnitude… 100 ms is 0.1 seconds. But most of the timing cost seems to come from the substitution of the elements in the array inside the if condition. And that is what I find surprising.

I think the point is that if your function does work that is not observable from the outside, the compiler may decide to skip those processing steps and finish much faster than you expect.

Therefore it’s important, even in benchmarks that aren’t supposed to do anything useful, that the work of the function is observable from the outside.
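A simple way to make the work observable is to return something that depends on every iteration, e.g. a running sum. A sketch (testsum is just an illustrative name):

```julia
# Sketch: the loop body feeds the return value,
# so the compiler cannot delete it as dead code.
function testsum(M)
    nrow, ncol = size(M)
    s = 0.0
    for j = 1:ncol, i = 1:nrow
        s += M[i, j] * (i + j)
    end
    return s
end
```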