@Stephen_Vavasis, the point of ArrayIteration.jl is to be able to write a single generic multiplication algorithm and have it perform well for Dense*Dense, Dense*Sparse, Sparse*Dense, and Sparse*Sparse. The problem you describe is not limited to SparseMatrixCSC: we have other sparse matrix types too (UniformScaling, Diagonal, Bidiagonal, Tridiagonal, and many others in packages), and so supporting efficient implementations of every possible combination is mildly insane—using ordinary dispatch you’d have to implement N^2 methods for N types. It gets even worse when you consider various types of views (view, reshape, PermutedDimsArray, etc) of each of the different sparse types. The only hope is to turn an O(N^2) problem into an O(N) problem by defining iterators for each type and then writing algorithms that leverage these iterators.
The only downside is that ArrayIteration isn’t finished. Now that 0.7 has better inlining, the things that were blocking previously (heap allocation of immutables that contain references to arrays) may not be a problem anymore, so it may just take someone with a bit of spare time (currently, not me ) to finish it off.
I also think it’s far more important to have things work than it is to catch performance bugs, so I too disagree with the proposal here. @Stephen_Vavasis, in the short term is there any chance that you could catch your issues locally by doing the following?
julia> eval(Base, quote
function getindex(A::SparseMatrixCSC{T}, i0::Integer, i1::Integer) where T
error("using the generic fallback, fix the caller")
end
end)
Now any method that uses S[i, j] for a SparseMatrixCSC S will throw an error. You can define this locally until you’re confident that you’ve cornered every last performance problem.
I am also quite excited about Cassette, but I don’t think it will help here. Based on my experience with IProfile (which is admittedly very crude), that one minute of setup time could become 10 minutes. The sampling profiler is much easier to make “invisible” in terms of performance cost.
Instrumentation profilers are very popular when profiling for example games and they can commonly profile in real time while still allowing the game to run at 60 fps (see Telemetry Performance Visualization System). Why do you assume that similar things could not be done for Julia code?
It would be great if this can be pulled off. I don’t know how those profilers work, but to me the bottom line is that if you add instrumentation—e.g., push!ing data to an array—even to fast, cheap functions like v[i] where v::Vector{T}, then there’s no hope of avoiding a performance hit. If those fast, cheap, inlineable functions dominated your runtime even without instrumentation, you could easily see a 10x hit with the instrumented version.
But if we can impose some kind of cutoff (e.g., avoid adding instrumentation to functions whose estimated runtime is less than 60 CPU cycles), then it might be more viable. EDIT: perhaps an easy option would be to add instrumentation only to :invoke expressions in CodeInfo objects.
I think the problem here is not that SparseMatrixCSC <: AbstractArray, its that AbstractArray is used far too much. If you consider
Dense matrices
Sparse matrices
Shared memory matrices
Distributed memory matrices
Non-stored matrices (for example, Ranges)
they have very little in common. Even get/setindex is not defined in all cases! For Ranges, setindex! isn’t defined at all, for distributed memory matrices, getindex may not be defined for the non-local part of the matrix. For shared memory matrices, setindex! might need locks, making it (potentially) very inefficient. The point I’m making is that this issue of fallbacks isn’t really specific to sparse matrices, its a problem with anything that isn’t a dense, single-process matrix. I think a lot of the AbstractArray interface should really be the DenseMatrix interface, and when there is commonality between DenseMatrix and one of the other matrix categories, they should present the same interface.
edit: thanks to @ChrisRackauckas for pointing out Ranges have getindex but not setindex!
I do think there’s something to be said here, but it really needs traits. Being dense is one trait of an array. There are a lot of algorithms which do at a very course level “work with AbstractArrays”, and the purpose of <:AbstractArray is more than just fallbacks. So <:AbstractArray still makes sense, but can be refined in multiple (orthogonal) ways via a set of traits. And there have already been some proposed traits for things like “can setindex!” (which is different than being mutable since an immutable AbstractArray type can hold a mutable array! See wrappers like GPUArrays). We just need an enlarged vocabulary to handle this appropriately.
+1 to the idea of using traits for this, but I think a prerequisite (and something we can do now) is to define a formal API for arrays: that is, what are the methods one should expect to be able to call on all subtypes of AbstractArray (or, going down one level, DenseArray / SparseArray)? That will give developers a sense of what they should make performant. I have implemented my own “Array-like” datatypes but still am not sure I’ve implemented all the methods that “arrayness” requires.
I go back to my own experience with this: there are some graph types that, no matter what you do, will never have good performance on some functions. In these cases we opted to keep the methods available, but warn the user that what they want to do is not performant. I think this is the best we can do.
I think this is the right approach as well. GPUArrays.jl is doing this by allowing indexing in a GPUArray. Looking through a linear index on a GPUArray is a horrible idea. However, it allows you to do things like get the first value. In OrdinaryDiffEq.jl (the ODE solvers for DifferentialEquations.jl) we need this exactly once for the calculation of the initial dt, and the rest of the code all uses broadcast (with the appropriate solver algorithms). Thus in these cases using DifferentialEquations.jl with a GPUArray actually works, and it’s fine that first is used since it’s only used exactly once in what is normally a very long computation, whereas every other interaction with the GPUArray is broadcasted. If this fallback didn’t exist, we would still have some incompatibilities with GPUArrays even though the fallback doesn’t really have an effect on our actual runtime in any real-world case.
I think this is a good example of “allowing users to shoot themselves in the foot” is actually allowing users to do anything that’s remotely possible, and letting them decide the manner that is appropriate (and over time optimizing more and more ways of working). While I wouldn’t use these sparse matrix fallbacks in a tight loop, in a lot of cases they will work in a pinch and will get an implementation off the ground and working. Then in the future it can be specialized and optimized more, but maybe the “bad call” only happens once, in which case spending time to optimize it is most likely just a waste of time anyways.
The one thing I think can be made better is the communication that something is a bad fallback. I do wish we had a way of marking “only do this if necessary”. A warning is too much if it can’t be turned off. Requiring someone to profile it is a little too silent. There must be something in the middle.
What about something like performant_types(func) to give the set of type tuples that func is performant for? This can then be recursively analyzed to figure out something like is_performant(foo, Array, SparseMatrixCSC). Analysis can be enforced by running in something like analysis mode whose macro can be removed once the user is happy or ok with being unhappy with the performance of the functions. And of course the word performant here is about as trustworthy as the person claiming it.
I don’t think that a solution involving warnings or a new function performant addresses my original concern. The fallback of sparse-matrix methods to dense-matrix loops is because of bugs in 0.6.0, not because the Julia developers didn’t get around to writing the methods I need. (The methods I need are present but aren’t getting called. The reason for the bugs is that the dispatch in abstractarray.jl and related base files is very complicated, so it’s easy to see how something could slip through the cracks.) It is unreasonable to expect the Julia developers to implement warnings for bugs.
However, there is a solution akin to warnings that would address my concerns. Consider, for example, repmat that appears in abstractarraymath.jl:
function repmat(a::AbstractVecOrMat, m::Int, n::Int=1)
o, p = size(a,1), size(a,2)
b = similar(a, o*m, p*n)
for j=1:n
d = (j-1)*p+1
R = d:d+p-1
for i=1:m
c = (i-1)*o+1
b[c:c+o-1, R] = a
end
end
return b
end
I suggest that b in this function could be initialized to Array{eltype(a),2}(o*m,p*n) rather than using similar. In this way, if a sparse matrix routine accidentally falls back to the AbstractArray implementation, then a power-user will know right away because a dense matrix is returned. This approach has the benefit that no functionality is lost for the beginning user who just wants everything to work.
I must have missed this (sorry): what methods specifically are converting to dense arrays? repmat doesn’t at any point, as far as I can see: similar() on a sparse matrix returns a sparse matrix.
If you’re saying that the indexing being performed in repmat is not performant in sparse matrices, then a new method that dispatches on a sparse matrix that utilizes sparse-performant indexing would solve this problem, no?
The second bug is currently present in 0.6.0-release but has already been fixed in a PR. I don’t know whether the PR will go into 0.6.1 or 0.7. It is described here:
I don’t know whether repmat is performant or not for sparse matrices-- I didn’t check. The point of my previous post is that the widespread use of similar to construct return arguments in abstractarray.jl and abstractarraymath.jl creates very fertile ground for bugs of the kind I am complaining about. I am complaining about bugs in the dispatch that cause functions to mistakenly and silently fall back to a dense routine. My latest proposal, namely, replacing all uses of similar with dense-arrays constructors would simultaneously keep functionality for beginning users while providing a strong clue to power-users that a non-performant operation is present in their codes.
I see your point. So if the problem is with the loops, I guess making more extensive use of ArrayIteration.jl in Base may cut it, which if I understand correctly would separate the semantics of iterating over an array or parts of an array from the detailed implementation that may be optimal for some array type and terrible for others. This “sparsity-aware programming” is something that probably needs to be more prevalent in Base and other AbstractArray based functions. I am not sure if that is addressing your issue here or not. I suppose there could be other temporary measures as well like the one you mentioned which just explicitly changes to a dense matrix before diving in a loop that assumes a dense matrix anyways. That would be like screaming: “I am not performant for sparse matrices.”
Couldn’t you then redefine the type to subtype as is now done (or vice versa drop the subtype in your code)? You would need to make sure to only use when testing.
Counterpoint: Those who have your point of view would do it from the start in their code(?), but that seems bad if you use such a package, and it would ruin for everybody.
[I’m assuming type relationship is global; can’t be changed locally…]
Blockquote
I guess making more extensive use of ArrayIteration.jl in Base may cut it,
Being able to iterate over only the stored values in a sparse matrix may help some cases, but I don’t think its a general solution. For example, I don’t think its possible to implemented any kind of supernodal algorithm (for example, the supernodal Cholesky algorithm in SuiteSparse) using this approach.
One issue unique to spare arrays (I mean SparseMatrixCSC: not e.g. Diagonal, UniformScaling etc.) is they are not fully IEEE compliant. There’s a Github issue on it; on -0.0 a special case for them: