I recently did a fairly extensive study of sparse matrix-vector (SpMV) product performance for several different storage formats. Although I started benchmarking in Julia, as usual, I eventually moved to C. What I found is that SpMV is very straightforward to parallelize when using the CSR format and, even though SpMV is memory-bound, a sizable speedup can be expected from multiple threads. On the other hand, the CSC format (which is Julia's default) and the COO format make SpMV much harder to parallelize. One can still resort to atomic operations or array reductions and hope for some speedup, but my experiments suggest that, even then, no speedup materializes and a significant slowdown occurs instead.
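For context, here is a minimal sketch of why CSR parallelizes so naturally: each outer iteration writes to its own entry of y, so the loop can be split across threads with no synchronization at all. The SparseMatrixCSR struct below is a hypothetical mirror of the CSC struct shown later in this post, not code from my benchmark.

#include <omp.h>

// Hypothetical CSR analogue of the CSC struct defined below.
typedef struct SparseMatrixCSR {
  int m;          // Number of rows
  int n;          // Number of columns
  int nnz;        // Number of stored values
  double *val;    // Stored values
  int *col_idx;   // Column indices of stored values
  int *row_start; // Row i contains values with indices in row_start[i]:(row_start[i+1]-1)
} SparseMatrixCSR;

// Each iteration owns a distinct y[i], so no atomics or reductions are needed.
void dcsrmv_parallel(SparseMatrixCSR *A, double *x, double *y) {
  #pragma omp parallel for
  for (int i = 0; i < A->m; i++) {
    double acc = 0.;
    for (int ii = A->row_start[i]; ii < A->row_start[i+1]; ii++) {
      acc += A->val[ii] * x[A->col_idx[ii]];
    }
    y[i] = acc;
  }
}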
Let me document my attempts to parallelize SpMV for the CSC format.
First, I created a struct for the CSC format:
typedef struct SparseMatrixCSC {
  int m;          // Number of rows
  int n;          // Number of columns
  int nnz;        // Number of stored values
  double *val;    // Stored values
  int *row_idx;   // Row indices of stored values
  int *col_start; // Column j contains values with indices in col_start[j]:(col_start[j+1]-1)
} SparseMatrixCSC;
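As a quick illustration (not part of the benchmark, and with made-up values), here is how one might populate this struct for a small 3x3 matrix:

// Example: the 3x3 matrix
//     [ 10  0  2 ]
// A = [  0  3  0 ]
//     [  1  0  4 ]
// stored column by column (CSC order).
double val[]       = {10., 1., 3., 2., 4.};
int    row_idx[]   = { 0,  2,  1,  0,  2};
int    col_start[] = {0, 2, 3, 5}; // column j spans col_start[j]:(col_start[j+1]-1)
SparseMatrixCSC A  = {3, 3, 5, val, row_idx, col_start};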
Then, the serial implementation of the corresponding SpMV is straightforward:
void dcscmv(SparseMatrixCSC *A, double *x, double *y) {
  // Zero the output, then accumulate column j of A scaled by x[j] into y.
  for (int i = 0; i < A->m; i++) y[i] = 0.;
  for (int j = 0; j < A->n; j++) {
    for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}
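With the small example matrix above, a call would look like this (a minimal usage sketch, not from the benchmark):

double x[] = {1., 1., 1.};
double y[3];
dcscmv(&A, x, y); // y should come out as {12., 3., 5.}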
To parallelize SpMV over the columns, one must avoid race conditions on y: two threads working on different columns may update the same entry y[i]. A straightforward attempt is to enforce synchronization with atomic operations as follows:
void dcscmv_atomic(SparseMatrixCSC *A, double *x, double *y) {
  for (int i = 0; i < A->m; i++) y[i] = 0.;
  #pragma omp parallel for
  for (int j = 0; j < A->n; j++) {
    for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
      // Each update must be atomic, since several columns can hit the same row of y.
      #pragma omp atomic
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}
Another way to proceed is an array reduction: each thread accumulates into its own private copy of y, and the copies are summed at the end. Since OpenMP 4.5, array sections are allowed in reduction clauses, so this can be written as follows:
void dcscmv_builtin_array_reduction(SparseMatrixCSC *A, double *x, double *y) {
  for (int i = 0; i < A->m; i++) y[i] = 0.;
  // Each thread gets a private copy of y[0:m]; OpenMP sums the copies at the end.
  #pragma omp parallel for reduction(+:y[0:A->m])
  for (int j = 0; j < A->n; j++) {
    for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}
Unfortunately, I had problems running my code with dcscmv_builtin_array_reduction, so I also implemented an array reduction from scratch as follows:
// Requires <omp.h> and <mkl.h> (for mkl_malloc/mkl_free).
void dcscmv_array_reduction_from_scratch(SparseMatrixCSC *A, double *x, double *y) {
  for (int i = 0; i < A->m; i++) y[i] = 0.;
  double *YP; // one private copy of y per thread, stored contiguously
  #pragma omp parallel
  {
    int P = omp_get_num_threads();
    int p = omp_get_thread_num();
    // One thread allocates and zeroes the buffer; the implicit barrier at the
    // end of the single construct makes YP visible to all threads.
    #pragma omp single
    {
      YP = (double*)mkl_malloc(A->m * P * sizeof(double), sizeof(double));
      for (int i = 0; i < A->m * P; i++) YP[i] = 0.;
    }
    // Each thread accumulates its share of the columns into its own copy of y.
    #pragma omp for
    for (int j = 0; j < A->n; j++) {
      for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
        YP[p * A->m + A->row_idx[ii]] += A->val[ii] * x[j];
      }
    }
    // Reduce the per-thread copies into y.
    #pragma omp for
    for (int i = 0; i < A->m; i++) {
      for (int p = 0; p < P; p++) {
        y[i] += YP[A->m * p + i];
      }
    }
  }
  mkl_free(YP);
}
My numerical experiments with a couple of sparse matrices show that both dcscmv_atomic and dcscmv_array_reduction_from_scratch lead to significant slowdowns compared to dcscmv, irrespective of the number of threads.
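For completeness, the timings were gathered with a simple harness along the following lines (a sketch of the approach, not my exact benchmark; the warm-up call and n_rep are arbitrary choices):

#include <omp.h>

// Hypothetical timing helper: average wall time of an SpMV kernel over n_rep
// repetitions, after one warm-up call.
double time_spmv(void (*spmv)(SparseMatrixCSC*, double*, double*),
                 SparseMatrixCSC *A, double *x, double *y, int n_rep) {
  spmv(A, x, y); // warm-up
  double t0 = omp_get_wtime();
  for (int r = 0; r < n_rep; r++) spmv(A, x, y);
  return (omp_get_wtime() - t0) / n_rep;
}

// e.g. compare time_spmv(dcscmv, &A, x, y, 100) with time_spmv(dcscmv_atomic, &A, x, y, 100)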
Hence, it is rather clear that parallelizing SpMV with the CSC format is not useful, contrary to the CSR case. Interestingly, when experimenting with the mkl_sparse_d_mv function from MKL, one can see that the MKL developers did parallelize the function for CSR and BSR, but not for CSC and COO.
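For anyone who wants to reproduce that comparison, this is roughly how mkl_sparse_d_mv is driven through MKL's inspector-executor interface (a minimal sketch with error checking omitted; the wrapper function below is mine, only the MKL calls come from the library):

#include <mkl.h>

// Sketch: wrap raw CSR arrays in an MKL handle and compute y = A*x.
// mkl_sparse_d_create_csc takes the analogous CSC arrays instead.
void mkl_spmv_csr(MKL_INT m, MKL_INT n, MKL_INT *row_start, MKL_INT *col_idx,
                  double *val, double *x, double *y) {
  sparse_matrix_t A;
  struct matrix_descr descr;
  descr.type = SPARSE_MATRIX_TYPE_GENERAL;
  mkl_sparse_d_create_csr(&A, SPARSE_INDEX_BASE_ZERO, m, n,
                          row_start, row_start + 1, col_idx, val);
  mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1., A, descr, x, 0., y);
  mkl_sparse_destroy(A);
}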
Finally, my question is about the choice of CSC as the default, and only natively supported, sparse matrix format in Julia. Given that Julia is designed for performance and that its BLAS makes extensive use of multithreading, why was CSC chosen over CSR, when the latter would allow for so much better multithreaded SpMV performance?
I can foresee some users telling me that Julia is column-major. I believe this is not a convincing argument, since under the hood the data structures for both CSC and CSR matrices are one-dimensional arrays. For example, MKL and SciPy both support CSC and CSR representations, and the performance of serial SpMV is nearly the same irrespective of the format. The only major difference is that CSR allows for a speedup in multithreaded applications.