Scalar multiplication makes array reallocation

Hi everyone,

While testing the fastest approach to multiply a matrix to a vector a discovered a nasty fact about reallocation.
Multiplying a matrix for a constant-defined value (namely a scalar) makes Julia reallocate the array, even within the @views scope. If the multiplication does not happen, the memory is never reallocated.

Could you please explain to me why?

The code below shows what I mean

function network_test()
	n =1000
	w = rand(3,n,n) .- 0.5
	r = 1:n
    state = falses(n)
	out = ones(Float64,n)
    for tt in 1:2000
		state = rand([false,true],n)
		@views out = w[3,:,:]*state
	end
    return  w
end

function network_scalar()
	n =1000
	w = rand(3,n,n) .- 0.5
	r = 1:n
    state = falses(n)
	out = ones(Float64,n)
    for tt in 1:2000
		#

		state = rand([false,true],n)
		@views out = 0.1*w[3,:,:]*state
	end
    return  w
end

using BenchmarkTools
@btime network_test()
#  2.618 s (12006 allocations: 79.20 MiB)
@btime network_scalar()
#  7.480 s (14006 allocations: 14.98 GiB)


I think what you want is to create the view, then broadcast setindex!

@views out = w[3,:,:]
out .*= 0.1 .* state

That shouldn’t allocate. But you are intending to modify the original array?

Maybe use . to broadcast

julia> w = rand(1_000_000);

julia> @time w = w + w;
  0.004330 seconds (2 allocations: 7.629 MiB)

julia> @time w = w + w;
  0.004495 seconds (2 allocations: 7.629 MiB)

julia> @time w .= w .+ w;
  0.000981 seconds (2 allocations: 48 bytes)

julia> @time w .= w .+ w;
  0.001092 seconds (2 allocations: 48 bytes)

It would be interesting to understand, what is going on behind the scene, but one way to solve it is to use parenthesis:

@views out .= 0.1 .* (w[3,:,:] * state)

On a side note, you can also reduce number of allocations by changing state generation to

state = rand((false,true),n)

Please tell us about the speed up after fixing

Maybe the first thing to know is that * proceeds left to right:

julia> @which 0.1*w[3,:,:]*state  # @less will show the code
*(a, b, c, xs...) in Base at operators.jl:538

and so this takes the view, multiplies it by a scalar to make a matrix, and then multiplies that by the vector. It would be more efficient from this perspective to write 0.1 * (w[3,:,:]*state) which is the same as 0.1 .* w[3,:,:]*state, or better lmul!(0.1, w[3,:,:]*state). (Soon, perhaps, https://github.com/JuliaLang/julia/pull/37898 may automate such things.) If out was pre-allocated, you could also write mul!(out, w[3,:,:], state, 0.1, 0).

julia> using LinearAlgebra

julia> out = rand(1000); w = rand(3,1000,1000) .- 0.5; state = rand(Bool, 1000);

julia> @btime (0.1 * (@view $w[3,:,:])) * $state; # allocates a new matrix, then a vector
  1.853 ms (4 allocations: 7.64 MiB)

julia> @btime 0.1 * ((@view $w[3,:,:]) * $state); # allocates two vectors
  1.193 ms (5 allocations: 23.89 KiB)

julia> @btime mul!($out, $(@view w[3,:,:]), $state, 0.1, 0.0); # no allocation
  1.189 ms (0 allocations: 0 bytes)

But the other thing to know is that you aren’t hitting the fast * here at all, because the element types don’t match, and because the view of w isn’t a happy one. You might do better to re-arrange the dimensions of w, and to take state = rand((1.0, 0.0),n) instead.

julia> @btime mul!($out, $(@view w[3,:,:]), $(state .+ 0.0)); # both Float64
  1.186 ms (2 allocations: 80 bytes)

julia> @btime mul!($out, $(w[3,:,:]), $(state .+ 0.0)); # BLAS
  151.750 ÎĽs (0 allocations: 0 bytes)

julia> w2 = permutedims(w,(2,3,1)); size(w2)
(1000, 1000, 3)

julia> @btime mul!($out, $(@view w2[:,:,3]), $(state .+ 0.0)); # safe view
  151.850 ÎĽs (0 allocations: 0 bytes)

julia> strides(@view w[3,:,:]) # problem
(3, 3000)

julia> strides(@view w2[:,:,3]) # no problem
(1, 1000)
4 Likes

@views only affect indexing expressions. So the above is equivalent to

out = 0.1 * view(w, 3, :, :) * state

so you don’t allocate anything in the slicing operation. But both view(w,3,:,:) * state will allocate a completely new array, and multiplication with 0.1 will also allocate a new array. There is also no in-place copying into out going on here.

You have to make your operations in-place, the view only helps with the slicing.

A couple of other remarks:

These pre-allocations have no effect:

state = falses(n)
out = ones(Float64,n)

because you over-write them later (not in-place, just new assignment), so they are wasted. In other words, you re-use the labels state and out, but you don’t re-use the arrays that those labels were applied to.

Also, false(n) is very different from rand([false, true], n). The former creates a BitVector, while the latter makes a vector of Bools.

For a vector of random true/false use either state = rand(Bool, n) or state = Random.bitrand(n). Then, in order to update this vector in-place, write rand!(state).

3 Likes

Hi DNF,
Thanks for the reply.

How to make the operations in place?
I thought that pre-defining the variable out of the loop-cycle was enough!
Also ´cause this loop is much more inefficient (3x) although the allocations are less

function network_scalar()
	n =1000
	w = rand(3,n,n) .- 0.5
	r = 1:n
    state = falses(n)
	out = ones(Float64,n)
    for tt in 1:2000
		#

		for cc in r
			for dd in r
				@views out[cc] =w[3,cc,dd]*state[dd]
			end
		end
	end
    return  w
end

## with matrix multiplication
  2.357 s (12006 allocations: 79.20 MiB)

## with indexing
  12.710 s (7 allocations: 45.78 MiB)

I was writing an answer, but realized I don’t understand what you are trying to do. Why do you return w from your function? Nothing has happened to that.

The answer from @mcabbott shows how you can do this using mul!. Here’s an example without the benchmarking code:

function network_scalar()
	n = 1000
	w = rand(n, n, 3) .- 0.5  # this is better than rand(3, n, n)
    state = rand((0.0, 1.0), n)
	out = ones(Float64, n)
    for tt in 1:2000
        state .= (rand.() .< 0.5)
        wv = @view w[:, :, 3]
        mul!(out, wv, state, 0.1, 0.0)
	end
    return out
end

I changed w from rand(3, n, n) to rand(n, n, 3), that makes a huge difference in performance.

1 Like

Yes, that’s right.
Itis because this is the minimal working example of 600 lines code… Which also have other trade-offs (like have a boolean state array)

So a meaningful example would be

function network_test()
	n =1000
	w = rand(3,n,n) .- 0.5
	r = 1:n
    state = falses(n)
	out = zeros(Float64,n)
    for tt in 1:2000
		for cc in n
			state[cc] = rand((0.,1.))
		end
		@views out = w[3,:,:]*state
	end
    return  out,w
end

Ok, that’s fine.

Can you explain to me why the (n,n,3) is better than (n,n,3) ?
I thought that being Julia column-major you always want to set the accessed dimensions at the outmost of the matrix.

Thanks

It’s because Julia arrays are column-major that (n,n,3) is better than (3,n,n). You want to slice along the innermost dimensions of the array, not the outermost.

It’s the same reason that it is faster to loop like this

for j in 1:n
    for i in 1:m
        x[i, j] = foo()
    end
end

than like this

for i in 1:m
    for j in 1:n
        x[i, j] = foo()
    end
end

Ok, I misunderstood it :confused:
Thanks

Here you can see the difference between row- and column-major, and why it is better to move down along the columns for column major:

1 Like