Blog post: Loop fusion and vectorization in Julia 0.6



Some of you may be interested in my new post on the Julia blog, discussing the new features for efficient “vectorized” code in Julia 0.6:


Great article! How does BLAS fit in to the story? I.e., you make it clear that fusing in

y .= 2 .* x .+ 3 .* y

Has the benfit that it only has to load entries from memory once. How would this compare to the BLAS version


I suppose the BLAS version has to load entries from memory twice, but does it benefit from optimization or multithreading? Which of the two is recommended if performance is the top concern?


This is great! Good emphasis on the fact that Julia’s advance is in allowing vectorisation/fusion for arbitrary, user-defined functions.

If you define your own container type that is not a subtype of AbstractArray, you can tell broadcast to treat it as a container to be iterated over by overloading Base.Broadcast.containertype and a couple of other functions.

Is there an easy-to-read DIY-manual for this somewhere?

broadcast versus map

This section is good :slight_smile: I was actually wondering about the differences much earlier already, starting in the section “Which functions are vectorized”, and was worried this broadcast-versus-map wouldn’t be answered at all – it might be worth including a crosslink in the post?

Also, I was a bit confused by the combination of statements ‘map handles “shapeless”’ and ‘map requires all array-like arguments to have the same shape’. I see the ‘array-like’ qualification, but these statements about broadcast and map got a bit too abstract for me to follow, I would need maybe an example (or to see the general notation with different arguments where this applies) to follow this :confused:


@dlfivefifty, as a rule of thumb, you don’t get much benefit from using BLAS for “level-1” BLAS operations that do O(n) work on O(n) data. It uses SIMD, but LLVM is pretty decent at that nowadays. It takes a very large array to benefit from multiple threads since there is so little work per thread (hopefully broadcast will also be able to use threads someday:

For your example of y .= 2 .* x .+ 3 .* y, on my machine it is 7× faster than BLAS for 1000-element arrays, and about the same speed as the BLAS calls for much larger arrays. (For 10⁵ elements, it is almost exactly the same speed as BLAS, while for 10⁷ elements it is about 30% slower than the BLAS calls, presumably because the latter benefits from multi-threading for such a large array.)


No. Right now it is a little bit tricky to get a custom container type (one that is not a AbstractArray or is an array but behaves very differently than a dense array) to work with broadcast (one of the most complicated functions in Base), at least if you want it to be easily combinable with scalars and other containers of different types. Hopefully, in the future we can make this easier.


@felix, I’ve added a couple of forward references and clarifications to the broadcast-vs-map section at the end.


Thanks for this blog post, it was very informative. I have a question about the return size of broadcast in 0.6: does it always expand to the size that contains all the arguments? More formally, when

y = broadcast(f, xs...)

does the following invariant hold?

function unisize(size1, size2)
    len1 = length(size1)
    len2 = length(size2)
    ntuple(max(len1,len2)) do i
        s1 = i ≤ len1 ? size1[i] : 1
        s2 = i ≤ len2 ? size2[i] : 1
        max(s1, s2)

reduce(unisize, map(size, xs)) == size(y)

I am asking because I would rely on this property in library code. I am asking about this invariant as an informal requirement (ie should third-party definitions of broadcast respect it as “good manners”, not something formally enforced by the language).


I’m loving this feature! It just became my favorite feature of Julia. I think this really should be emphasized in the documentation as the central feature of Julia. For me it puts Julia in a league of its own in the data science languages space.

Kudos to the Julia team.


I agree with @Steven_Sagaert! I was just at a conference trying to get other ecologists to make the switch to Julia, and this feature really had them picking up their ears. E.g. the ability (which is not fusion and already exists in 0.5) to generate all pairwise results of a binary function over an array by f.(a, a') is very intuitive and really illustrates the power that julia gives you over array operations.


Great post Steven. And thanks for pushing this feature forward.

Loop fusion is, by the way, also a much discussed thing in all those C++ template metaprogramming libraries.


Yes. However, as documented in the manual, it only expands singleton dimensions (with missing dimensions treated as singletons). If you have two arrays with a non-matching non-singleton dimension, it throws an error. (This is the same as the behavior of broadcasting in NumPy and Matlab.)

Also, in 0.6, except for a few exceptions (AbstractArray, Tuple, RefValue, …), objects are treated as 0-dimensional arrays (“scalars”) by default. (They do not need to support a size method.)


Yes, in fact this kind of thing is mentioned in the blog post, in the halfway solutions section — it’s the technique pioneered in the Veldhuizen (1995) reference cited there. The main problem is that this technique only works if you use a specific container template class, and only for specific functions/operations provided by the author of that container class.


The fact that this approach to vectorization extends to both user-defined operations and user-defined types is crucial and hard to express the importance of sufficiently. This problem is like the “expression problem” but for vectorization. We’ve considered various approaches to “user extensible vectorization” since the inception of the project but until now we never had a good solution. This is, imo, a perfect example of why, when you don’t have a design that really clicks, the best approach is often to wait until something better becomes clear – language design takes patience. Thanks for the blog post and championing the feature and its implementation, @stevengj!

Another key aspect of this feature is that vectorization is explicitly expressed by the programmer, meaning that fused, vectorized behavior and performance can be relied on no matter what the compiler may optimize or not in the future. I can now imagine autovectorization as a lint pass – instead of or in addition to some amount of autovectorization: instead of trying to prove that this is a correct optimization, a linter can detect places where fused vectorization seems likely to be correct and more efficient and suggest the change to the programmer. If they accept the transformation as valid and desirable, it becomes explicit and permanent.

One additional feature that seems desirable is the introduction of a @. macro which would vectorize every call in an expression since that’s pretty often what one wants and gets a bit tedious to write with dots on every call and operator.


Yes. That is a very good idea. Even if it’s just for a single line, it could help readability a lot for codes which ". everything"


The main reason it’s not a single-line change is that @. is currently a syntax error :grin:


One difficulty with the @. idea is that you often don’t want to add . to everything. e.g. you don’t want it for sum. One option is something like @devec: you have a list of functions and operators that it converts to broadcast calls, while leaving everything else alone. (Just adding dots to operators like * and + would be a win for clarity.)


Let’s see how it plays out, but I’ve seen a lot of examples (but they’re just examples!) where literally everything gets dotted. In cases where some things get dotted and others don’t I’m perfectly happy with spelling it out. Of course, there could be some syntax in @. for undotted calls, but that certainly seems like getting ahead of ourselves.


After reading Steven’s interesting write-up, I tried to use this new feature in order to implement y = A * x, where A is a sparse matrix and x is a dense vector, by just y .= A * x without any memory allocation. My approach almost worked… but one thing went wrong described below.

My approach: I created a new type T that is a wrapper of SparseMatrixCSC{Float64,Int}. Creating an instance of this wrapper stores a copy A' (transpose). So my code first says A_lazy = T(A), and then the actual call is y .= A_lazy * x. I defined multiplication between A_lazy of type T and a vector x to do nothing except store a copy of A_lazy and x in an object of type SparseMVMult, where SparseMVMult <: AbstractArray{Float64,1}. Finally, I define getindex for SparseMVMult to actually carry out the multiplication to produce a single entry of the product vector Ax. The idea is that pushing all the work of matrix-vector multiplication to getindex means that no memory needs to be allocated for the computation. The reason for storing A' rather than A in T is that the CSC format is by columns, whereas to generate a single entry of Ax efficiently requires rows of A to be collected.

It almost worked, except the object of type SparseMVMult itself gets allocated on the heap rather than the stack, so this implementation of y=A*x still needs to allocate memory on the heap.

Possibly there is a workaround?

If not, I should mention that the Julia core developers have already put some effort into allocating small non-bits types on the stack rather than the heap whenever the compiler can prove that they never “escape”. This use-case of creating lazy objects that have a getindex method for use in the new broadcast-vectorization in 0.6 may deserve additional attention.


Maybe it could live in a library under a different name, eg @broadcast, so that users can play around with it?


The cost of allocating a tiny object on the heap should be negligible compared to the cost of A*x for matrices of any significant size, so I’m not sure why you are concerned?