I am capturing this conversation from slack here as we integrate GraphBLAS into Julia with permission from @DrTimothyAldenDavis.
-viral
–
I hope people would not mind if I ask a naive question. If GraphBlas is a generalization of multiplication by sparse matrices to other reduction operators, how does it differs (or corresponds to) scatter / gather operation used in GeometricFlux. I am asking as we have implemented similar operations in Mill.jl (sum, mean, max).
GraphBLAS can easily do those kinds of computations. For example, say A is a sparse matrix, and you want to apply a monoid (min, max, plus, times, logical or, etc) to each row. Then you can use GrB_reduce to do that, or you can also (equivalently) compute r = A*x where x = ones(n,1), using the MIN_FIRST semiring if you want the MIN of each row. Currently, constructing x = ones(n,1) takes O(n) time and memory in v5.0.6 of GraphBLAS, but in v5.1 it will take O(1) time and memory to create this vector x, regardless of n.
1
@Tim Davis when you said above C(I,J)=A is up to 1000x faster in GraphBLAS as compare to MATLAB, and C(M)=A where M is a logical matrix the same size as C and A is up to 100,000x faster
, is that for sparse matrices stored in CSC format?
Yes, this is for CSC. The performance for CSR is the same.
If C, M, and A are sparse MATLAB matrices (M is logical), then C(M)=A can take a week in MATLAB, if C and A are big enough (say 100 million entries in C and 50,000 in A), but it takes 7 seconds in GraphBLAS.
For that size of matrix I mean. Of course GraphBLAS is faster for smaller cases.
For C(I,J)=A, MATLAB uses two algorithms. The first one is decent but still slower than GraphBLAS; see https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/0618e95108801304170260bb5d9e[…]75812ff5ef/GraphBLAS/demo/html/DGX_Station/graphblas_demo2.html . When the matrix gets larger than roughly 4 million by 4 million, MATLAB switches to another method which is very slow; see https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/0618e95108801304170260bb5d9e[…]75812ff5ef/GraphBLAS/demo/html/DGX_Station/graphblas_demo2.html
Thanks! Once @Will Kimmerer has more GraphBLAS set up, we should be able to start plugging these as well. This would be really nice tohave.
On my laptop with this script in MATLAB R2020a:
% load GraphBLAS:
clear all
GrB (1) ;for k = [10:22]
n = 2^k ;
nz = 10 * n ;
d = nz / n^2 ;
d2 = d/10 ; tic
C0 = sprand (n, n, d) ;
M = logical (sprand (n, n, d2)) ;
mnz = nnz (M) ;
A = rand (mnz, 1) ;
t1 = toc ;
fprintf ('\nn: %d nnz(C): %d nnz(M) %d\n', n, nnz (C0), mnz) ;
fprintf ('create C: %g sec\n', t1) ; C1 = GrB (C0) ;
tic
C1 (M) = A ;
t3 = toc ;
fprintf ('C(M)=A in GrB : %g sec\n', t3) ; C2 = C0 ;
tic
C2 (M) = A ;
t2 = toc ;
fprintf ('C(M)=A in MATLAB : %g sec GrB speedup: %g\n', t2, t2/t3) ; assert (isequal (C1, C2)) ;
end
(edited)
The output is:
n: 1024 nnz(C): 10186 nnz(M) 1024
create C: 0.031555 sec
C(M)=A in GrB : 0.075309 sec
C(M)=A in MATLAB : 0.005268 sec GrB speedup: 0.0699518n: 2048 nnz(C): 20432 nnz(M) 2048
create C: 0.012005 sec
C(M)=A in GrB : 0.005219 sec
C(M)=A in MATLAB : 0.024435 sec GrB speedup: 4.68193n: 4096 nnz(C): 40908 nnz(M) 4096
create C: 0.034915 sec
C(M)=A in GrB : 0.002945 sec
C(M)=A in MATLAB : 0.114939 sec GrB speedup: 39.0285n: 8192 nnz(C): 81876 nnz(M) 8191
create C: 0.029753 sec
C(M)=A in GrB : 0.008706 sec
C(M)=A in MATLAB : 0.59389 sec GrB speedup: 68.2162n: 16384 nnz(C): 163789 nnz(M) 16384
create C: 0.071706 sec
C(M)=A in GrB : 0.009273 sec
C(M)=A in MATLAB : 2.52994 sec GrB speedup: 272.828n: 32768 nnz(C): 327633 nnz(M) 32767
create C: 0.147284 sec
C(M)=A in GrB : 0.014384 sec
C(M)=A in MATLAB : 12.4234 sec GrB speedup: 863.699n: 65536 nnz(C): 655309 nnz(M) 65536
create C: 0.32691 sec
C(M)=A in GrB : 0.025189 sec
C(M)=A in MATLAB : 65.9221 sec GrB speedup: 2617.1n: 131072 nnz(C): 1310677 nnz(M) 131070
create C: 0.806615 sec
C(M)=A in GrB : 0.055393 sec
(edited)
For the last case, MATLAB is still running but you can see the trend. For one large matrix, on my NVIDIA DGX Station, I ran C(M)=A in MATLAB and had to kill it a week later. GraphBLAS took 7 seconds.The small case for n =1024 is probably “warmup” time where the GrB library is being loaded, or maybe it’s the MATLAB object oriented overhead.
Oh, MATLAB finally finished the last case:
n: 131072 nnz(C): 1310677 nnz(M) 131070
create C: 0.806615 sec
C(M)=A in GrB : 0.055393 sec
C(M)=A in MATLAB : 276.215 sec **GrB speedup: 4986.46**
actually, those timings are conservative; I have another process running that’s taking a few cores. MATLAB is single threaded but GraphBLAS is multi-threaded. So the actual speedup with GraphBLAS will be higher than those numbers. They will be still higher on my 20-core DGX Station.
Basically, essentially all basic computations like A+B, AB, A’, [A B], C(I,J)=A, C=A(I,J), C = sum(A), C=sqrt(A), and so on are many times faster in GraphBLAS vs MATLAB, even when using the same syntax. Except C=AB in MATLAB R2021a, where the speedup is 1x or maybe 5x because MATLAB is using GraphBLAS v3 but my latest GraphBLAS n5 can sometimes be faster than my v3 code.
Marching along … MATLAB finished in 18 minutes what GraphBLAS can do in 0.07 seconds.
n: 262144 nnz(C): 2621396 nnz(M) 262142
create C: 1.69602 sec
C(M)=A in GrB : 0.071006 sec
C(M)=A in MATLAB : 1077.34 sec **GrB speedup: 15172.5**
(edited)
I shudder what Julia will look like- I’ll try it out later today.
That’s why people are filing PRs like: Change getindex algorithms on sparse matrices by bonevbs · Pull Request #40519 · JuliaLang/julia · GitHub
The GraphBLAS multi-threading will interfere with Julia’s multi-threading, but we’ll slowly get to that too.
1
These operations are actually a bit faster with native GraphBLAS. The MATLAB syntax C(M)=A is a little odd in my opinion, but it’s of course of historical importance. It requires A to be a vector of length nnz(M). The GraphBLAS GrB_assign does C=A where C, M, and A all have the same size. So the GraphBLAS native C=A is the same as the MATLAB C(M)=A(M). That means when I do the overloaded syntax of C(M)=A, I have to finagle the input vector A into a sparse matrix A to be able to call GrB_assign. That causes GraphBLAS to do about double the work it would normally do if it were doing the pure GrB_assign to compute C=A.I’d be very interested in see how Julia does. Does it have the same syntax C(I,J)=A for example? and C(M)=A? Or the equivalent?The @GrB MATLAB interface is easy to install in MATLAB; just be aware that you need to compile libgraphblas_renamed.so if you want to use MATLAB R2021a, since that version of MATLAB has its own copy of libmwgraphblas.so, and the 2 libraries conflict. So to use GraphBLAS v5 in MATLAB R2021a, I have to rename all my symbols away from GrB*. The user doesn’t see this; it’s all under the hood.For the C(I,J)=A demo and its timing results, see the GraphBLAS/GraphBLAS/graphblas_demo2.m script. For a more general demo, see GraphBLAS/GraphBLAS/graphblas_demo.m. You can see the results of these demos here, but the html doesn’t render in the github browser: https://github.com/DrTimothyAldenDavis/GraphBLAS/tree/stable/GraphBLAS/demo/html . You’d have to download the package and view the files there.
The PR : Change getindex algorithms on sparse matrices by bonevbs · Pull Request #40519 · JuliaLang/julia · GitHub is for the computation C=A(I,J), right? I use a meta algorithm that analyzes the problem, vector by vector, to choose the right methods. See: https://github.com/DrTimothyAldenDavis/GraphBLAS/blob/stable/Source/Template/GB_subref_template.c . Subreferencing (C=A(I,J)) is a lot simpler than assignment (C(I,J)=A). My internal code for C=A(I,J) is under 2KLOC, but C(I,J)=A takes about 12KLOC to do properly, as a meta algorithm with about 40 different algorithms inside.
Yes 40519 is C=A(I,J)
.
Yes, we do have C(I,J) = A and C(M) = A.
How does the multithreading of GraphBLAS interfere with Julia? I assume you mean that Julia is also parallel? If you’ve got a parallel application (Julia) that wants to call GraphBLAS in parallel, my library gives you control. You can tell me how many threads to use for each call of GraphBLAS.
Yes, the main worry is exactly that - and users will have to know these details on how many threads to give each (and of course having no idea what the right values are). It also interferes with regular BLAS threads for the same reason. For native Julia threading, it is composable -s o different Julia packages that all use threading are able to co-exist. For FFTW, we schedule its threads using the Julia scheduler (basically FFTW splits the work and then instead of launching its own pthreads, it hands them back to us). We hoped to do the same with OpenBLAS at some point.
Something like this: partr thread support for openblas · Issue #32786 · JuliaLang/julia · GitHub
I see. The # of threads I use is limited by what the caller gives me, but for small problems I select fewer threads via an internal heuristic so as not to see a parallel slowdown.
Step 1 is to get this all working together first - and then we’ll get to what to do about multi-threading(and mostl likely we just leave it to users to decide how many threads to give Julia and GraphBLAS).
Sounds like a plan. Threading has been an issue with GraphBLAS as well, in how it plays nice with applications that use it.The C API Committee for GraphBLAS (Tim Mattson, Scott McMillan, Aydin Buluc, Jose Moriera, Benjamin Brock) has recognized this and are working on a v2.0 spec that will create a GraphBLAS GrB_context. That context can be told how many threads to use (or at least mine will). Each GrB object lives in a context and can be moved between them. The idea for the GrB_Context is to simplify threading issues like this. The GrB_Context could also be used to control which GPUs to use (I’m working on the GPU version with Joe Eaton @ NVIDIA). which MPI processes, and so on.Currently I use the GrB_Descriptor to set the # of threads, or a use a global setting. This requires extensions not in the GraphBLAS C API. Some methods don’t have a GrB_Descriptor input so I have to rely on the global setting, but that is awkward if you call 2 methods (like GrB_Matrix_dup(&C,A) which copies a matrix) and if you want one call to use 2 threads and another 4, at the same time. You can’t do that currently, but you will with the GrB_Context.