Matrix Element-wise operations slow down in a certain size interval

usefulhyun · August 16, 2017, 3:40am

Hi, there.
Let’s see my code below.
BLAS.set_num_threads(1) G = 10^9 M = 10^6


@printf(“sz\tbin add(Mflops) bin mul(Mflops) sca add(Mflops) sca mul(Mflops)\n”)
for sz=[100, 200, 500, 1000, 1500, 2000, 2500, 3000, 4000, 8000, 12000]

times1 = Vector{Float64}()

times2 = Vector{Float64}()

times3 = Vector{Float64}()

times4 = Vector{Float64}()

for c=1:20

mat1 = ones(sz, sz)

mat2 = ones(sz, sz)

mat3 = ones(sz, sz)

mat4 = ones(sz, sz)

local r1, r2, r3, r4
s1 = @elapsed r1 = mat1 .+ mat2

s2 = @elapsed r2 = mat3 .* mat4

s3 = @elapsed r3 = mat1 + 11110

s4 = @elapsed r4 = mat2 * 22222
@assert r1[1,1] == 2

@assert r2[1,1] == 1

@assert r3[1,1] == 11111

@assert r4[1,1] == 22222
push!(times1, s1)

push!(times2, s2)

push!(times3, s3)

push!(times4, s4)
end # c
median_time1 = median(times1)

min_time1 = minimum(times1)

median_time2 = median(times2)

min_time2 = minimum(times2)

median_time3 = median(times3)

min_time3 = minimum(times3)

median_time4 = median(times4)

min_time4 = minimum(times4)
ops = szsz

median_Mflops1 = ops/(median_time1M)

max_Mflops1 = ops/(min_time1M)

median_Mflops2 = ops/(median_time2M)

max_Mflops2 = ops/(min_time2M)

median_Mflops3 = ops/(median_time3M)

max_Mflops3 = ops/(min_time3M)

median_Mflops4 = ops/(median_time4M)

max_Mflops4 = ops/(min_time4*M)
@printf(“%d\t”, sz)

@printf(“%.2f\t%.2f\t”, median_Mflops1, max_Mflops1)

@printf(“%.2f\t%.2f\t”, median_Mflops2, max_Mflops2)

@printf(“%.2f\t%.2f\t”, median_Mflops3, max_Mflops3)

@printf(“%.2f\t%.2f\n”, median_Mflops4, max_Mflops4)

end # sz
To sum up, operations slow down in the size from 2000 ~ 4000 for each dimension.
And it recovers its speed after 4000~.
As far as I know, operations is fast in the small size due to cache.
But in big size, I think speed should be fixed.
My computer has 32 G RAM, so I think this is not a problem with memory.
Because 3000 by 3000 float64 matrix is only 72 Mbytes.
Does anyone know this problem?

usefulhyun · August 16, 2017, 4:01am

I partly solved this problem by assigning useless variable to 0 and gc() before the computation for time measure.
But the size range from 2500 to 4000 still slower than other range even a range is bigger size than 4000.

Topic		Replies	Views
Performance problem with Partitioned Matrix Multiplication by Matrix{Matrix} type Numerics	0	572	August 15, 2017
Blas dot doing weird things depending on size Performance question	3	539	October 29, 2019
Operations on small matrices and BLAS in 1.6.0 RC1 New to Julia	2	417	March 2, 2021
Why would A \ b grow with O(n^2)? Numerics question , linearalgebra	5	487	November 12, 2023
Fluctuations when measuring execution time of linear algebra code Performance	12	1023	July 19, 2019

Matrix Element-wise operations slow down in a certain size interval

Related topics