I recently did a fairly extensive study of sparse matrix-vector (SpMV) product performance for several different storage formats. Although I started benchmarking in Julia, as usual, I eventually moved to C. What I found is that SpMV is very straightforward to parallelize when using the CSR format and, even though SpMV is memory-bound, a sizable speedup can be expected from multiple threads. On the other hand, the CSC format (which is Julia's default) and the COO format make SpMV much harder to parallelize. One can still resort to atomic operations or array reductions and hope for some speedup, but my experiments suggest that, even then, no speedup materializes and a significant slowdown occurs instead.
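For context, here is a minimal sketch of why CSR parallelizes so naturally: each outer iteration writes to its own entry of y, so the loop can be split across threads with no synchronization at all. The SparseMatrixCSR struct below is a hypothetical mirror of the CSC struct shown later in this post, not code from my benchmark.

#include <omp.h>

// Hypothetical CSR analogue of the CSC struct defined below.
typedef struct SparseMatrixCSR {
  int m;          // Number of rows
  int n;          // Number of columns
  int nnz;        // Number of stored values
  double *val;    // Stored values
  int *col_idx;   // Column indices of stored values
  int *row_start; // Row i contains values with indices in row_start[i]:(row_start[i+1]-1)
} SparseMatrixCSR;

// Each iteration owns a distinct y[i], so no atomics or reductions are needed.
void dcsrmv_parallel(SparseMatrixCSR *A, double *x, double *y) {
  #pragma omp parallel for
  for (int i = 0; i < A->m; i++) {
    double acc = 0.;
    for (int ii = A->row_start[i]; ii < A->row_start[i+1]; ii++) {
      acc += A->val[ii] * x[A->col_idx[ii]];
    }
    y[i] = acc;
  }
}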
Let me document my attempts to parallelize SpMV for the CSC format.
First, I created a struct for the CSC format:
typedef struct SparseMatrixCSC {
  int m;          // Number of rows
  int n;          // Number of columns
  int nnz;        // Number of stored values
  double *val;    // Stored values
  int *row_idx;   // Row indices of stored values
  int *col_start; // Column j contains values with indices in col_start[j]:(col_start[j+1]-1)
} SparseMatrixCSC;
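As a quick illustration (not part of the benchmark, and with made-up values), here is how one might populate this struct for a small 3x3 matrix:

// Example: the 3x3 matrix
//     [ 10  0  2 ]
// A = [  0  3  0 ]
//     [  1  0  4 ]
// stored column by column (CSC order).
double val[]       = {10., 1., 3., 2., 4.};
int    row_idx[]   = { 0,  2,  1,  0,  2};
int    col_start[] = {0, 2, 3, 5}; // column j spans col_start[j]:(col_start[j+1]-1)
SparseMatrixCSC A  = {3, 3, 5, val, row_idx, col_start};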
Then, the serial implementation of the corresponding SpMV is straightforward:
void dcscmv(SparseMatrixCSC *A, double *x, double *y) {
  // Zero the output, then accumulate column j of A scaled by x[j] into y.
  for (int i = 0; i < A->m; i++) y[i] = 0.;
  for (int j = 0; j < A->n; j++) {
    for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}
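With the small example matrix above, a call would look like this (a minimal usage sketch, not from the benchmark):

double x[] = {1., 1., 1.};
double y[3];
dcscmv(&A, x, y); // y should come out as {12., 3., 5.}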
To parallelize SpMV over the columns, one must avoid race conditions on y: two threads working on different columns may update the same entry y[i]. A straightforward attempt is to enforce synchronization with atomic operations as follows:
void dcscmv_atomic(SparseMatrixCSC *A, double *x, double *y) {
  for (int i = 0; i < A->m; i++) y[i] = 0.;
  #pragma omp parallel for
  for (int j = 0; j < A->n; j++) {
    for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
      // Each update must be atomic, since several columns can hit the same row of y.
      #pragma omp atomic
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}
Another way to proceed is an array reduction: each thread accumulates into its own private copy of y, and the copies are summed at the end. Since OpenMP 4.5, array sections are allowed in reduction clauses, so this can be written as follows:
void dcscmv_builtin_array_reduction(SparseMatrixCSC *A, double *x, double *y) {
  for (int i = 0; i < A->m; i++) y[i] = 0.;
  // Each thread gets a private copy of y[0:m]; OpenMP sums the copies at the end.
  #pragma omp parallel for reduction(+:y[0:A->m])
  for (int j = 0; j < A->n; j++) {
    for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}
Unfortunately, I had problems running my code with dcscmv_builtin_array_reduction, so I also implemented an array reduction from scratch as follows:
// Requires <omp.h> and <mkl.h> (for mkl_malloc/mkl_free).
void dcscmv_array_reduction_from_scratch(SparseMatrixCSC *A, double *x, double *y) {
  for (int i = 0; i < A->m; i++) y[i] = 0.;
  double *YP; // one private copy of y per thread, stored contiguously
  #pragma omp parallel
  {
    int P = omp_get_num_threads();
    int p = omp_get_thread_num();
    // One thread allocates and zeroes the buffer; the implicit barrier at the
    // end of the single construct makes YP visible to all threads.
    #pragma omp single
    {
      YP = (double*)mkl_malloc(A->m * P * sizeof(double), sizeof(double));
      for (int i = 0; i < A->m * P; i++) YP[i] = 0.;
    }
    // Each thread accumulates its share of the columns into its own copy of y.
    #pragma omp for
    for (int j = 0; j < A->n; j++) {
      for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
        YP[p * A->m + A->row_idx[ii]] += A->val[ii] * x[j];
      }
    }
    // Reduce the per-thread copies into y.
    #pragma omp for
    for (int i = 0; i < A->m; i++) {
      for (int p = 0; p < P; p++) {
        y[i] += YP[A->m * p + i];
      }
    }
  }
  mkl_free(YP);
}
My numerical experiments with a couple of sparse matrices show that both dcscmv_atomic and dcscmv_array_reduction_from_scratch lead to significant slowdowns compared to dcscmv, irrespective of the number of threads.
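For completeness, the timings were gathered with a simple harness along the following lines (a sketch of the approach, not my exact benchmark; the warm-up call and n_rep are arbitrary choices):

#include <omp.h>

// Hypothetical timing helper: average wall time of an SpMV kernel over n_rep
// repetitions, after one warm-up call.
double time_spmv(void (*spmv)(SparseMatrixCSC*, double*, double*),
                 SparseMatrixCSC *A, double *x, double *y, int n_rep) {
  spmv(A, x, y); // warm-up
  double t0 = omp_get_wtime();
  for (int r = 0; r < n_rep; r++) spmv(A, x, y);
  return (omp_get_wtime() - t0) / n_rep;
}

// e.g. compare time_spmv(dcscmv, &A, x, y, 100) with time_spmv(dcscmv_atomic, &A, x, y, 100)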
Hence, it is rather clear that parallelizing SpMV with the CSC format is not useful, contrary to the CSR case. Interestingly, when experimenting with the mkl_sparse_d_mv function from MKL, one can see that the MKL developers did parallelize the function for CSR and BSR, but not for CSC and COO.
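For anyone who wants to reproduce that comparison, this is roughly how mkl_sparse_d_mv is driven through MKL's inspector-executor interface (a minimal sketch with error checking omitted; the wrapper function below is mine, only the MKL calls come from the library):

#include <mkl.h>

// Sketch: wrap raw CSR arrays in an MKL handle and compute y = A*x.
// mkl_sparse_d_create_csc takes the analogous CSC arrays instead.
void mkl_spmv_csr(MKL_INT m, MKL_INT n, MKL_INT *row_start, MKL_INT *col_idx,
                  double *val, double *x, double *y) {
  sparse_matrix_t A;
  struct matrix_descr descr;
  descr.type = SPARSE_MATRIX_TYPE_GENERAL;
  mkl_sparse_d_create_csr(&A, SPARSE_INDEX_BASE_ZERO, m, n,
                          row_start, row_start + 1, col_idx, val);
  mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1., A, descr, x, 0., y);
  mkl_sparse_destroy(A);
}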
Finally, my question is about the choice of CSC as the default, and only natively supported, sparse matrix format in Julia. Given that Julia is designed for performance and that its BLAS makes extensive use of multithreading, why was CSC chosen over CSR, when the latter would allow for so much better multithreaded SpMV performance?
I can foresee some users telling me that Julia is column-major. I believe this is not a convincing argument, since under the hood the data structures for both CSC and CSR matrices are one-dimensional arrays. For example, MKL and SciPy both support CSC and CSR representations, and the performance of serial SpMV is nearly the same irrespective of the format. The only major difference is that CSR allows for a speedup in multithreaded applications.