Unexpected broadcasting behavior involving eachrow()

If you take that a step further, you realize that this would mean x .* 2 modifies x in-place.

I’m sorry, but you should realize that Julia is no longer “at that stage of development”. Completely changing core features like this, throwing the language and ecosystem into disarray, would be very harmful. The better approach is to ask questions and discuss how to get used to the current behavior, and perhaps suggest improvements to the documentation.

1 Like

Answering my own question:

julia> A = [1.0 2; 3 4; 5 6];

julia> broadcast!(/, A, A, Base.broadcasted(norm, eachrow(A)));

julia> norm.(eachrow(A))
3-element Vector{Float64}:
 1.0734900802433864
 1.1567173855460329
 1.1826156918749102

But that doesn’t really provide much intuition for the behavior of A ./= norm.(eachrow(A)).

1 Like

You forgot to fuse that loop: broadcast!(function (a,b) a/norm(b) end, x, x, eachrow(x) )

Apart from the fact that the same variable appears on both sides of .=, I think one issue here is that all objects appearing in the broadcasted expression don’t have the same shape: if the shape of x is (m,n), then the shape of norm.(eachrow(x)) is (m,). Under broadcasting rules, the singleton dimension is expanded, which leads to the norm of each row being computed multiple times.

When all objects have the same shape, broadcasting is much more similar to map or map! (and much simpler to understand IMO; I actually tend to restrict myself to such use cases).
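The shape mismatch is easy to see in the REPL. A small sketch (using the same 3×2 matrix that appears later in this thread):

```julia
using LinearAlgebra

x = [3.0 4; 6 8; 9 12]        # shape (3, 2)
size(norm.(eachrow(x)))       # (3,) -- a vector, one norm per row

# Under broadcasting rules, the length-3 vector is virtually expanded
# along a second (singleton) dimension to match the (3, 2) shape of x:
size(x ./ norm.(eachrow(x)))  # (3, 2)
```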

2 Likes

I used Base.broadcasted for the inner call, rather than broadcast, so I believe yours and mine are the same:

julia> A = [3 4; 6 8; 9 12.0];

julia> broadcast!(/, A, A, Base.broadcasted(norm, eachrow(A)))
3×2 Matrix{Float64}:
 0.6  0.988936
 0.6  0.997199
 0.6  0.998752

julia> norm.(eachrow(A))
3-element Vector{Float64}:
 1.1567173855460329
 1.1637896990616534
 1.165120695213146

julia> B = [3 4; 6 8; 9 12.0];

julia> broadcast!(function (a,b) a/norm(b) end, B, B, eachrow(B))
3×2 Matrix{Float64}:
 0.6  0.988936
 0.6  0.997199
 0.6  0.998752

julia> norm.(eachrow(B))
3-element Vector{Float64}:
 1.1567173855460329
 1.1637896990616534
 1.165120695213146

Correct, the results are the same. However, it is also true that your line reflects the part of the broadcasting process before the Broadcasted instances’ operations are fused by materialize!.

With help from @ffevotte and @Benny, I think I finally understand how to describe A ./ norm.(eachrow(A)) conceptually. Consider the matrix A = [3 4; 6 8; 9 12]. At first I imagined that, conceptually, norm is broadcast over eachrow(A) to produce [5.0, 10.0, 15.0], and then that vector is expanded to the matrix

[ 5.0   5.0
 10.0  10.0
 15.0  15.0]

And then we have element-wise division between A and the above matrix of norms. With this conceptual model, norm is only called 3 times.

(When I say “conceptually”, I mean conceptually. Obviously there are performance optimizations involved, so this is not precisely what happens under the hood.)

However, as @pdeffebach showed above, for this simple example (even without the in-place assignment to A), the norm function gets called 6 times. So the correct conceptual model is that eachrow(A) gets expanded before norm is applied. In detail, we have

1. eachrow(A) is expanded to

# `TEMP` is merely a conceptual variable
TEMP = [
  [3, 4]   [3, 4]
  [6, 8]   [6, 8]
  [9, 12]  [9, 12]
]

2. The following function is applied element-wise to the matrices A and TEMP:

function f(a, temp)
    a / norm(temp)
end

so the result is

[0.6  0.8
 0.6  0.8
 0.6  0.8]

and norm gets called 6 times, because f was applied element-wise, i.e. to the six corresponding elements of A and TEMP.
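The call count is easy to verify with a counting wrapper (counted_norm and NCALLS are hypothetical names for this sketch):

```julia
using LinearAlgebra

const NCALLS = Ref(0)                      # hypothetical call counter
counted_norm(v) = (NCALLS[] += 1; norm(v)) # wraps norm, counting calls

A = [3.0 4; 6 8; 9 12]
A ./ counted_norm.(eachrow(A))  # fused: the inner call runs per output element
NCALLS[]                        # 6, not 3
```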

Is there a missing performance optimization here? It seems like we ought to be able to tell that we can apply norm to eachrow(A) and then expand the resulting vector into the second dimension, rather than expanding eachrow(A) first and then applying norm to the expanded matrix.

Unless I’m missing something, in order to do that you’d need to materialize norm.(eachrow(A)).

You can do it, but it allocates:

julia> A = Float64[3 4; 6 8; 9 12]
3×2 Matrix{Float64}:
 3.0   4.0
 6.0   8.0
 9.0  12.0

julia> A ./= collect(norm.(eachrow(A)))
3×2 Matrix{Float64}:
 0.6  0.8
 0.6  0.8
 0.6  0.8

julia> norm.(eachrow(A))
3-element Vector{Float64}:
 1.0
 1.0
 1.0

Well, I feel like there ought to be a lazy way of implementing it. If A is a matrix and x is a vector, and we consider the expression A .+ f.(x), then it will always be cheaper to conceptually apply f element-wise to x and then expand, rather than expanding x and then applying f element-wise.

My mental model for

A ./= norm.(eachrow(A))

is along these lines:

  1. evaluate eachrow(A) because it’s not broadcasted at all: this produces

    rows = eachrow(A)
    A ./= norm.(rows)
    
  2. apply broadcasting rules to transform this into a loop over the longest axis in each dimension:

    rows = eachrow(A)
    for j in axes(A,2)
        for (i, row) in zip(axes(A,1), rows)
            A[i,j] = A[i,j] / norm(row)
        end
    end
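A quick sanity check of this mental model (a sketch, assuming broadcast! iterates a Matrix in column-major order): the hand-written loop should reproduce the fused, aliased broadcast exactly.

```julia
using LinearAlgebra

A = [3.0 4; 6 8; 9 12]
B = copy(A)

B ./= norm.(eachrow(B))   # the fused, aliased broadcast

rows = eachrow(A)         # the hand-written equivalent
for j in axes(A, 2)
    for (i, row) in zip(axes(A, 1), rows)
        A[i, j] = A[i, j] / norm(row)
    end
end

A == B                    # both show the same aliasing artifacts
```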
    
2 Likes

Not always. The cost of f has to outweigh the cost of allocating some space.

And you have to be sure that f is pure. At present this is not assumed, and so this idiom gets you 5 different random numbers:

julia> [10 10 10 10 10] .+ rand.()
1×5 Matrix{Float64}:
 10.1643  10.9285  10.7375  10.4303  10.0446

In some broadcasts, you could potentially solve the first problem by re-using part of the eventual output array, rather than allocating a new one. But you’d still have to check purity (and lack of aliasing, the original problem here), and think about whether the resulting access pattern is a good one.

1 Like

I meant if we had a smart, lazy implementation that avoided the temporary array. :man_facepalming:


I think @Benny is right. The correct way to write A ./ norm.(eachrow(A)) as a broadcast call is this:

broadcast(A, eachrow(A)) do a, row
    a / norm(row)
end

Thus, the performance optimization that I was hoping for would actually break the contract of broadcast, because the contract of broadcast(f, As...) is that the input collections As... are virtually extended so that they match in size/shape, and then the function f is applied element-wise over the virtually extended collections. So in my A ./ norm.(eachrow(A)) example, where A is a 3×2 matrix, f should be called 6 times. This does have a visible effect on the API because, as @mcabbott mentioned, f could have side effects.

By the way, if anybody wants the @unfuse macro proposed earlier by @Benny, try

macro unfuse(expr)
    function traverse(node)
        if isa(node, Expr)
            node.args = [traverse(arg) for arg in node.args]
            if node.head === :.
                _node = Expr(:call)
                _node.args = [:identity, node]      
                node = _node          
            end
        end
        return node
    end
    return esc(traverse(expr))
end


##
julia> @macroexpand @unfuse x ./= norm.(eachrow(x))
:(x ./= identity(norm.(eachrow(x))))

I think it will work, but I didn’t test all edge cases. If there is a good “misc. macros” package to submit a PR to, let me know.

2 Likes

That’s interesting, but the @macroexpand output would be clearer if it were split into multiple lines. Imagine a chain of >5 dots; talk about an identity crisis.

It also misses all dotted binary operators, because those have their own symbols, like :(.+), instead of a :(.)-headed expression; e.g. @unfuse x ./= a .+ b doesn’t work. It should skip only assignment-like symbols containing a = character.
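The parse difference is visible directly (a quick sketch):

```julia
# Broadcast calls like norm.(x) parse with head :., which the macro matches...
Meta.parse("norm.(x)").head    # Symbol(".")

# ...but dotted operators parse as ordinary :call expressions whose function
# symbol carries the dot, so the macro's :. check never fires on them.
Meta.parse("a .+ b").head      # :call
Meta.parse("a .+ b").args[1]   # Symbol(".+")
```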

Well, any lazy data structure has to re-compute an element on every access, because it doesn’t have the memory to store results for lookup. Space-time tradeoff~

Yes. The other options are to store it in a register, or in a part of the output array you don’t yet need. Writing out the loops, sometimes you can compute f for each column of the output as you go. But if you have to do it for each row, as in the example A .+ f.(x), then the result isn’t cache-friendly.

julia> let A = rand(10^6, 2)
           x = rand(10^6)
           @btime $A ./ ($x .^ 2)        # does too many ^ calls
           @btime $A ./ map(z->z^2, $x)  # eager, does not pay
           println()
           @btime dumb($A, $x)    # same as broadcasting, to check
           @btime smart1($A, $x)  # no storage, but cache-unfriendly
           @btime smart2($A, $x)  # re-uses output as storage, two passes
       end;
  min 868.500 μs, mean 2.149 ms (2 allocations, 15.26 MiB)
  min 2.664 ms, mean 3.584 ms (4 allocations, 22.89 MiB)

  min 869.041 μs, mean 2.208 ms (2 allocations, 15.26 MiB)
  min 3.233 ms, mean 3.639 ms (2 allocations, 15.26 MiB)
  min 2.546 ms, mean 3.142 ms (2 allocations, 15.26 MiB)

julia> function smart1(A, x)
         B = similar(A)
         @inbounds for i in axes(A,1)
           z = x[i]^2  # compute this once & re-use it
           for j in axes(A,2)
             B[i,j] = A[i,j] / z
           end
         end
         B
       end;

julia> function smart2(A, x)
         B = similar(A)
         @simd for i in axes(A,1)
           @inbounds B[i,end] = x[i]^2  # use part of B as storage
         end
         for j in axes(A,2) 
           @simd for i in axes(A,1)  # inner loop on 1st axis
             @inbounds z = B[i,end]
             @inbounds B[i,j] = A[i,j] / z
           end
         end
         B
       end;

julia> function dumb(A, x)  # same loops as broadcasting, to check
         B = similar(A)
         for j in axes(A,2)
           @simd for i in axes(A,1)
             @inbounds B[i,j] = A[i,j] / x[i]^2
           end
         end
         B
       end;
2 Likes

This is a reasonable preference or expectation, but calling it a convention is pushing it. Not many languages have multidimensional arrays in their base library, let alone broadcasting, and I don’t know of a language off the top of my head that can broadcast arbitrary functions (MATLAB’s bsxfun comes closest, but it only takes 2 input arrays). Having no convention to follow, Julia decided that loop fusion was good and that parentheses determine the order of operations per iteration of a fused loop. I agree that people coming from different languages won’t expect this, and that’s exactly why Julia documents it very thoroughly: .* isn’t an operator or some different version of *; rather, a sequence of adjacent dotted calls is transformed into a single broadcast call. It acts like a macro, but, being a fundamental feature like if and for, it doesn’t need to be one.

I would also argue that it’s preferable the way it is. Your proposal to remove loop fusion requires allocating and discarding temporary arrays, and if you’re going to allocate anyway, why not explicitly allocate 2 separate copies, 1 for the unchanged input and 1 for the changed output? You had x already in your original example, so y2 .*= (1 ./ LinearAlgebra.norm.(eachrow(x))) would’ve done what you wanted. If you didn’t want to repeat norm operations, you could have done them on an earlier line, as with y1, or broken the sequence of dots with identity. You have a lot of control over allocations with loop fusion as it is. Without it, you’re stuck with as many allocations as the number of operations, unless you use some @fusedots macro to get the current behavior.

You cannot save allocations like that in general. Say x is a Vector{Int}, and you do (x .+ 1) .* 2im. x .+ 1 would allocate a Vector{Int}, and you can’t safely write Complex{Int} to that. The only way to know to allocate a Vector{Complex{Int}} in advance is to fuse the sequence of operations, again violating the boundaries of the parentheses.
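A small sketch of that element-type problem (Int is Int64 on a 64-bit machine):

```julia
x = [1, 2, 3]
eltype(x .+ 1)             # Int: without fusion, the intermediate buffer holds Ints
eltype((x .+ 1) .* 2im)    # Complex{Int}: fusion sees the whole expression at once
# The Vector{Int} buffer from x .+ 1 could never be reused to store
# Complex{Int} results, so the "reuse the temporary" trick can't work here.
```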

Secondary parentheses-like syntax would be terrible to read and is even harder for people coming from other languages to learn than “order of operations per iteration of a fused loop”. Let’s say I want to do a .+ b .+ (c .+ d) but only want to fuse the last two additions, so a .+ b happens first in its own loop. Naively I might write a .+ {{b .+ (c .+ d)}}, but that looks like it forces {{b .+ (c .+ d)}} to happen first. But you might say {{}} doesn’t do order of operations like (), so you can write (a .+ {{b) .+ (c .+ d)}}. That doesn’t look very conventional. And if I wanted a couple of fused loops whose results feed into another fused loop, a one-liner would look like {{...{{...}}...{{...}}...}}. I think many people would prefer to write separate lines than try to figure out which bracket ends which.

1 Like

I use ( ) extensively. That means, instead of remembering (and trusting my memory of) operator precedence, I put a lot of ( ) in explicitly.

For example I would never write !a && b. Instead I would code either (!a) && b or !(a && b).

Not sure if it’s good coding practice (up for debate, I guess), but it works well for me: reading the code is 100% clear now, and I don’t need to remember anything about precedence.

So, the result is that I rely strongly on the use of ( ). For decades this practice has worked perfectly well, until now.

Anyway, this extensive reliance on ( ) doesn’t seem to be the majority practice…

That {{ }} is simply an idea, not a formal proposal. The point is that I want something to distinguish an “ok-to-fuse-into” parenthesis from a “conventional” one.

For the special case where a variable occurs on both the LHS and RHS of a broadcasting statement, at least some kind of warning should be shown.

1 Like

It is not. You do not have to rely on your memory either: you can keep a tab with the operator precedence table open while you are learning (or run a snippet in the REPL to check). Writing and reading code without unnecessary parentheses gets you acquainted with operator precedence, instead of leaving you permanently stuck with this gap.

It really is not.

There are many uses of x .= f.(x) which are intended and perfectly correct (i.e., you want to transform every element and write back to the same structure), so it makes no sense to me to give a warning for such a common pattern. Maybe a linter could have an option for it, but many codebases would trigger a lot of messages, with no error or even a code smell behind them.
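For instance (a minimal sketch), this aliased pattern is safe and idiomatic, because each output element depends only on the corresponding input element:

```julia
x = [-1.0, 2.0, -3.0]
x .= abs.(x)   # element-wise, in-place: no cross-element aliasing
x              # [1.0, 2.0, 3.0]
```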

Referential transparency, however, is certainly a strong convention, and it is broken by automatic loop fusion. Any other place (that I can think of) in Julia where this convention is broken sits inside a macro clearly marked by @; this is why I would have chosen to have users explicitly opt in to the behavior with @fuse rather than having to opt out.

Anyway, it’s clear that @tomtom and I are in pretty sparse company on this opinion, and the design is not going to change anytime soon, so rather than continuing to argue, I’ll just propose a warning in the docs (I can write this later today).

1 Like