100X tensor algebra speedup!?!



Found a very interesting article:
System for performing ‘tensor algebra’ offers 100-fold speedups over previous software packages

I am a newbie in linear algebra, but this sounds like something that would be nice to implement in Julia. Just curious to hear others’ thoughts on this.


Here’s a search that may get you started.


Very interesting! Is there any difference in what is being implemented here:

…from what’s reportedly being done in the article I mentioned? My knowledge is very limited, but the only difference I can see is nonzero values being “compiled” vs. being iterated. Is that correct? Would there be a significant speed difference for compiled vs. iterated?

Thanks for your thoughts.


Julia already compiles specialized implementations for its arguments. The only “trick” here is switching the algorithm from an O(all_elements) to O(nonzeros) implementation — that’s precisely what eachstored (or some such) could do. The 100x claim is meaningless — it all depends upon how many elements you’re working with since you’re changing the complexity.
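To make the complexity switch concrete, here’s a minimal sketch using the standard SparseArrays library (the two function names are mine, purely for illustration):

```julia
using SparseArrays

# Summing the entries of a sparse matrix two ways: visiting every
# element is O(m*n); visiting only the stored entries is O(nnz).
# Same answer, far less work when the matrix is mostly zeros.
A = sprand(1000, 1000, 0.01)  # ~1% of entries are stored

# O(all_elements): touches every (i, j), stored or not
dense_sum(A) = sum(A[i, j] for i in 1:size(A, 1), j in 1:size(A, 2))

# O(nonzeros): walks only the stored values
sparse_sum(A) = sum(nonzeros(A))
```

The speedup ratio is roughly `m*n / nnz`, which is why a single headline number like “100x” depends entirely on the sparsity of your data.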


Yes, numbers like that tend to be (for several reasons) broad generalizations despite looking precise. I was more interested in whether there is anything in the idea that could have an application in Julia. Any speedup is good, but from appearances we already have something similar.

Thanks for explaining it to me :grin: Understanding this may help me to write better code when I do work with sparse data.


Ah, glad to see taco getting a mention here!

The taco team is actually right down the hall from the Julia Lab at MIT. They’re a great bunch of engineers and, from what I’ve seen, taco is a really cool bit of tech. Folks in the Julia Lab are actively experimenting with integration/use of taco with/from Julia, and I’m personally quite excited to see what will come out of it.


Very cool! I should have clicked through the “amazing, switching algorithmic complexities results in huge speedups!” to see the actual work they’re doing. Can’t you get them to start writing it in Julia directly? :slight_smile:



Hi there! I’m Peter, a graduate student down the hall from the TACO team quietly working on integrating TACO into Julia. Indeed, TACO does a few things which we can’t do with Julia’s sparse matrix implementations. TACO would allow us to unify the CSR, CSC, DCSR, and DCSC sparse matrix formats and the sparse vector implementation into a generalized representation of sparse arrays which extends to higher-order tensors. (Also, of course, we can use TACO to write very cool optimized sparse tensor kernels.)
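For readers following along: Julia’s built-in `SparseMatrixCSC` is one fixed point in that design space — a dense outer (column) level over a compressed inner (row) level — while the taco-style generalization lets each dimension independently be dense or compressed. Peeking at the CSC fields makes the “compressed level” structure visible (a quick illustration, not taco’s actual representation):

```julia
using SparseArrays

# CSC stores a dense level (one pointer slot per column) over a
# compressed level (row indices + values for stored entries only).
A = sparse([1, 3, 2], [1, 1, 2], [10.0, 20.0, 30.0], 3, 2)

A.colptr     # [1, 3, 4]: column j's entries live at colptr[j]:colptr[j+1]-1
rowvals(A)   # [1, 3, 2]: row index of each stored entry
nonzeros(A)  # [10.0, 20.0, 30.0]: the stored values
```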

It would be really fun to write TACO in Julia, but since I am only a mortal graduate student with a lot on my plate, the implementation of TACO will remain in C++ (so that the original TACO team can continue to maintain it). My plan for integrating TACO into Julia has two phases. Phase 1 is implementing an interface which calls TACO through Cxx.jl and executes the C++ kernels. Phase 2 is extending TACO to use a Julia backend, so that we can use the C++ tensor compiler to write Julia code which gets returned as the thunk of a generated function. Phase 2 will allow us to use custom datatypes and reductions in our generated kernels.
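As a rough sketch of the Phase 2 mechanism (a toy stand-in of mine, not TACO output): a `@generated` function returns an expression — the thunk — which Julia then specializes and compiles for the argument types. An external code generator like TACO could construct that expression instead of writing it by hand:

```julia
# Toy stand-in for a compiler-produced kernel: the @generated body
# returns an Expr (the thunk), which Julia compiles per argument type.
# TACO's Julia backend would build this Expr programmatically.
@generated function scale_all(A::AbstractVector{T}, c::Number) where {T}
    quote
        out = similar(A, promote_type(T, typeof(c)))
        @inbounds for i in eachindex(A)
            out[i] = A[i] * c
        end
        out
    end
end
```

Because the thunk is ordinary Julia code, it specializes over custom element types and reductions for free — the point of Phase 2.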


Hi! I’m Fred, one of the taco developers. Also glad taco is mentioned here! I love Julia, and am really excited by the work Peter is doing.

The 100x number is taken from our academic paper and is the speedup we got over the MATLAB Tensor Toolbox for several kernels; as far as we know, it is the only other system that supports the full tensor algebra with sparse tensors. We also compare against many kernels that have been hand-optimized, e.g. in Intel MKL, and for those our performance is comparable to the hand-optimized versions.

Our edge is that we compile down each expression to a custom kernel optimized for that particular expression. We also do this for compound expressions with many sub-expressions. So for the expressions nobody has hand-coded yet, you can still get hand-coded performance using taco.
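To illustrate what such a compound-expression kernel looks like (my own hand-written example, not taco output): for `y = A*x .+ z`, a generic library computes `A*x` into a temporary and then adds `z`, while a fused kernel does both in one pass over the stored values:

```julia
using SparseArrays

# One fused pass for y = A*x .+ z: no temporary for A*x,
# and only the stored entries of A are ever touched.
function fused_spmv_add!(y, A::SparseMatrixCSC, x, z)
    copyto!(y, z)
    rows, vals = rowvals(A), nonzeros(A)
    for j in 1:size(A, 2)
        @inbounds for k in nzrange(A, j)
            y[rows[k]] += vals[k] * x[j]
        end
    end
    return y
end
```

Writing such a fusion by hand for every expression shape is what a tensor compiler automates.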

We think Julia and taco are a great fit. Programmers can develop with Julia, which is really nice, and taco can run in the background, transparently generating custom kernels to compute linear and tensor expressions.


Are you thinking about a lazy wrapper for arrays that just builds up the taco format and lazily computes the result only when requested (by generating code and dispatching on the built-up expression type)?
Alternatively in the future you could exploit Cassette.jl functionality.
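Something like this minimal sketch of the lazy-wrapper idea (all names hypothetical): operations build an expression tree instead of computing, and `materialize` dispatches on the tree’s type to pick — or generate — a kernel:

```julia
# Lazy wrapper sketch: `*` builds an expression tree; nothing is
# computed until materialize() walks it. A real version would hand
# the tree's type to a kernel generator instead of calling Base.
struct LazyTensor{T}
    data::T
end

struct LazyMul{A,B}
    a::A
    b::B
end

Base.:*(a::LazyTensor, b::LazyTensor) = LazyMul(a, b)

materialize(t::LazyTensor) = t.data
materialize(t::LazyMul) = materialize(t.a) * materialize(t.b)
```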


I’m a little unsure about what you’re suggesting, but I think it will help if I clarify the design. I have created a sparse tensor datatype in Julia which mirrors the taco implementation. I have not settled on the user interface, but it will essentially be something which exposes index notation (similar to TensorOperations.jl, Tokamak.jl, ArrayMeta.jl, or others) and is able to pass the index expression to TACO. When it comes time to compute, the Julia tensor is converted to a C++ TACO tensor, the TACO-generated kernel is called on that tensor, and the result is passed back to Julia. In Phase 2, we won’t need to pass the tensor to C++ because TACO will generate a Julia kernel.
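For a sense of what a higher-order sparse tensor datatype can look like, here is a deliberately naive coordinate (COO) sketch — my toy, not the actual datatype, which mirrors taco’s per-dimension level formats:

```julia
# Naive N-dimensional sparse tensor in coordinate (COO) form:
# parallel lists of coordinates and values; everything else is zero.
struct COOTensor{T,N}
    dims::NTuple{N,Int}
    coords::Vector{NTuple{N,Int}}
    vals::Vector{T}
end

function Base.getindex(t::COOTensor{T,N}, idx::Vararg{Int,N}) where {T,N}
    k = findfirst(==(idx), t.coords)
    return k === nothing ? zero(T) : t.vals[k]
end
```

Lookups here are O(nnz); taco’s compressed level formats are what make iteration and contraction over such tensors fast, which is exactly what the generated kernels exploit.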