Kron vs scalar product speed difference. Python code faster?

You can look at what is happening using expand.

In your version there is no syntax-level broadcast fusion happening, which means that essentially every operator creates a new temporary array:

julia> expand(:(Delta_W .+= lr * ( x * ehp' - xneg * ehn')'))
:((Base.broadcast!)(+,Delta_W,Delta_W,A_mul_Bc(lr,A_mul_Bc(x,ehp) - A_mul_Bc(xneg,ehn))))
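To see what those temporaries cost, here is a minimal sketch you can run yourself (the array sizes and the helper step_unfused! are made up for illustration, they are not from the original code):

x, xneg = rand(500), rand(500)
ehp, ehn = rand(300), rand(300)
Delta_W  = zeros(300, 500)
lr = 0.1

function step_unfused!(Delta_W, lr, x, ehp, xneg, ehn)
    # the two outer products, the subtraction and the scaled transpose each
    # allocate a full temporary matrix before the in-place .+= runs
    Delta_W .+= lr * (x * ehp' - xneg * ehn')'
end

step_unfused!(Delta_W, lr, x, ehp, xneg, ehn)              # warm up so compilation is not counted
@allocated step_unfused!(Delta_W, lr, x, ehp, xneg, ehn)   # bytes spent on the temporaries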

In the optimized version only broadcasted operators are used. These dots . are syntactic sugar for broadcast, and on master something called broadcast fusion happens: it merges all the dotted operators/functions into one inner function that is broadcast over the individual arrays. Thus no temporary memory needs to be allocated for the in-between computations. Take a look:

julia> expand(:(Delta_W .+= lr .* (ehp .* x' .- ehn .* xneg')))
:($(Expr(:thunk, CodeInfo(:(begin 
        $(Expr(:thunk, CodeInfo(:(begin 
        global ##3#4
        const ##3#4
        $(Expr(:composite_type, Symbol("##3#4"), :((Core.svec)()), :((Core.svec)()), :(Core.Function), :((Core.svec)()), false, 0))
        return
    end))))
        $(Expr(:method, false, :((Core.svec)((Core.svec)(##3#4,Any,Any,Any,Any,Any,Any),(Core.svec)())), CodeInfo(:(begin 
        #temp#@_9 = #temp#@_4 * #temp#@_5
        #temp#@_8 = #temp#@_6 * #temp#@_7
        #temp#@_10 = #temp#@_9 - #temp#@_8
        #temp#@_11 = #temp#@_3 * #temp#@_10
        return #temp#@_2 + #temp#@_11
    end)), false))
        #3 = $(Expr(:new, Symbol("##3#4")))
        SSAValue(0) = #3
        SSAValue(1) = ctranspose(x)
        SSAValue(2) = ctranspose(xneg)
        return (Base.broadcast!)(SSAValue(0),Delta_W,Delta_W,lr,ehp,SSAValue(1),ehn,SSAValue(2))
    end)))))
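Stripped of the lowering noise, that is just a single broadcast! call with one inner function. A rough hand-written equivalent (reusing the arrays from the sketch above; this is not the exact code the compiler generates) looks like this:

# the elementwise kernel that fusion builds from the dotted expression
fused_step(w, lr, p, xi, n, xni) = w + lr * (p * xi - n * xni)

# one pass over Delta_W, written in place, with no intermediate arrays
broadcast!(fused_step, Delta_W, Delta_W, lr, ehp, x', ehn, xneg')

# which is what the sugared form lowers to:
# Delta_W .+= lr .* (ehp .* x' .- ehn .* xneg')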

EDIT: Don’t be fooled by the syntax highlighting; there are no comments here. Somehow I am unable to turn off the syntax highlighting.

EDIT: Notice the two lines near the bottom that say ctranspose; this is where the new RowVector addition mentioned before comes into play.
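For context, transposing a vector does not copy it; it just wraps it in a lazy row type (RowVector on 0.6, an Adjoint wrapper in later versions, the exact name depending on your Julia version), so those ctranspose calls in the lowered code are cheap. A tiny sketch:

x   = rand(4)
ehp = rand(3)
xr  = x'            # lazy row wrapper, no data is copied
size(xr)            # (1, 4)
ehp .* xr           # column .* row broadcasts to a 3×4 matrix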
