[ANN] Finch.jl: Sparse and Structured Array Fusion

Finch.jl: Sparse and Structured Array Fusion

Hi everyone, I’m excited to release Finch 1.0.0! Finch is a Julia-to-Julia compiler for sparse or structured multidimensional arrays. Finch empowers users to write high-level array programs which are transformed behind-the-scenes into fast sparse code.

Why Finch.jl?

Finch was built to make sparse and structured array programming easier and more efficient. Finch.jl leverages compiler technology to automatically generate customized, fused sparse kernels for each specific
use case. This allows users to write readable, high-level sparse array programs without worrying about the performance of the generated code. Finch can automatically generate efficient implementations even for unique problems that lack existing library solutions.

Installation

At the Julia REPL, install the latest stable version by running:

julia> using Pkg; Pkg.add("Finch")

Documentation

Read the docs here!

Quickstart

julia> using Finch

# Create a sparse tensor
julia> A = Tensor(CSCFormat(), [1 0 0; 0 2 0; 0 0 3])
3×3 Tensor{DenseLevel{Int64, SparseListLevel{Int64, Vector{Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}:
 1.0  0.0  0.0
 0.0  2.0  0.0
 0.0  0.0  3.0

# Perform a simple operation
julia> B = A + A
3×3 Tensor{DenseLevel{Int64, SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}:
 2.0  0.0  0.0
 0.0  4.0  0.0
 0.0  0.0  6.0

Sparse and Structured Tensors

Finch supports most major sparse formats (CSR, CSC, DCSR, DCSC, CSF, COO, Hash, Bytemap). Finch also allows users to define their own sparse formats with a parameterized format language.

CSC_matrix = Tensor(CSCFormat())
CSR_matrix = swizzle(Tensor(CSCFormat()), 2, 1)
CSF_tensor = Tensor(CSFFormat(3))
custom_tensor = Tensor(Dense(Sparse(Dense(Element(1.0)))), 3, 4, 2)

Finch also supports a wide variety of array structure beyond sparsity. Whether you’re dealing with custom background (zero) values, run-length encoding, or matrices with special structures like banded or triangular matrices, Finch’s compiler can understand and optimize various data patterns and computational rules to adapt to the structure of data.

Examples:

Finch supports many high-level array operations out of the box, such as +, *, maximum, sum, map, broadcast, and reduce.

julia> using Finch

# Define sparse tensor A
julia> A = Tensor(Dense(SparseList(Element(0.0))), [0 1.1 0; 2.2 0 3.3; 4.4 0 0; 0 0 5.5])

# Define sparse tensor B
julia> B = Tensor(Dense(SparseList(Element(0.0))), [0 1 1; 1 0 0; 0 0 1; 0 0 1])

# Element-wise multiplication
julia> C = A .* B

# Element-wise max
julia> C = max.(A, B)

# Sum over rows
julia> D = sum(C, dims=2)

For situations where more complex operations are needed, Finch supports an @einsum syntax on sparse and structured tensors.

julia> @einsum E[i] += A[i, j] * B[i, j]

julia> @einsum F[i, k] <<max>>= A[i, j] + B[j, k]

Finch even allows users to fuse multiple operations into a single kernel with lazy and compute. The lazy function creates a lazy tensor, which is a symbolic representation of the computation. The compute function evaluates the computation.
Different optimizers can be used with compute, such as the state-of-the-art Galley optimizer, which can adapt to the
sparsity patterns of the inputs.

julia> using Finch, BenchmarkTools

julia> A = fsprand(1000, 1000, 0.1); B = fsprand(1000, 1000, 0.1); C = fsprand(1000, 1000, 0.0001);

julia> A = lazy(A); B = lazy(B); C = lazy(C);

julia> sum(A * B * C)

julia> @btime compute(sum(A * B * C));
  263.612 ms (1012 allocations: 185.08 MiB)

julia> @btime compute(sum(A * B * C), ctx=galley_scheduler());
  153.708 μs (667 allocations: 29.02 KiB)

How it Works

Finch first translates high-level array code into FinchLogic, a custom intermediate representation that captures operator fusion and enables loop ordering optimizations. Using advanced schedulers, Finch optimizes FinchLogic and lowers it to FinchNotation, a more refined representation that precisely defines control flow. This optimized FinchNotation is then compiled into highly efficient, sparsity-aware code. Finch can specialize to each combination of sparse formats and algebraic properties, such as x * 0 => 0, eliminating unnecessary computations in sparse code automatically.

Learn More

The following manuscripts provide a good description of the research behind Finch:

Finch: Sparse and Structured Array Programming with Control Flow.
Willow Ahrens, Teodoro Fields Collin, Radha Patel, Kyle Deeds, Changwan Hong, Saman Amarasinghe.

Looplets: A Language for Structured Coiteration. CGO 2023.
Willow Ahrens, Daniel Donenfeld, Fredrik Kjolstad, Saman Amarasinghe.

Beyond Finch

The following research efforts use Finch:

SySTeC: A Symmetric Sparse Tensor Compiler.
Radha Patel, Willow Ahrens, Saman Amarasinghe.

The Continuous Tensor Abstraction: Where Indices are Real.
Jaeyeon Won, Willow Ahrens, Joel S. Emer, Saman Amarasinghe.

Galley: Modern Query Optimization for Sparse Tensor Programs. Galley.jl.
Kyle Deeds, Willow Ahrens, Magda Balazinska, Dan Suciu.

Contributing

Contributions are welcome! Please see our contribution guidelines for more information.

72 Likes

Don’t know the field well, but its very amazing, competing against some of the best funded and smartest teams on the planet and getting such general and SOTA results! Congrats to you and your team!

1 Like

This looks very cool, and congrats on the release!

As another person that doesn’t know the field well, I’m not immediately grokking quite when someone would need Finch. Julia’s stdlib includes SparseArrays with some specialized sparse types and methods. Is the use case for Finch something like: what if you want to write an algorithm over sparse/structured arrays, but your special structure isn’t a type already in SparseArrays? Then Finch can automatically generate efficient methods for functions like + and * specialized to the structure of your arrays?

4 Likes

Hi there, thanks for the question! Here’s three good reasons someone might want to use Finch versus SparseArrays:

  1. Finch supports N-dimensional sparse tensors. SparseArrays only supports matrices. For example, the following kernel would be impossible to write in SparseArrays:
julia> using Finch

julia> N = 100
100

julia> A = Tensor(CSFFormat(3), fsprand(N, N, N, 0.001)); B = rand(N, N); C = rand(N, N);

julia> ndims(A)
3

julia> @einsum D[i, j] += A[i, k, l] * B[j, k] * C[j, l]

  1. Finch can fuse multiple operations together into a single, compiled kernel. Since sparse arrays act like filters in a computation, this can result in asymptotic benefits. For instance:
julia> using BenchmarkTools, SparseArrays, Finch

julia> A = sprand(1000, 1000, 0.0001); B = rand(1000, 1000); C = rand(1000, 1000);

julia> @btime $A .* ($B * $C);
  8.312 ms (21 allocations: 38.16 MiB)

julia> @btime Finch.fused($A, $B, $C) do A, B, C
          A .* (B * C)
       end;
  719.542 μs (656 allocations: 7.65 MiB)

  1. Finch is capable of supporting a much wider range of operations and format specializations than SparseArrays. For example, the following is a single step of a shortest-path algorithm:
julia> using Finch

julia> A = set_fill_value!(fsprand(5, 5, 0.5), Inf)
5×5 Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{Inf, Float64, Int64, Vector{Float64}}}}:
 Inf         0.285503   0.546055  Inf        Inf
 Inf        Inf        Inf        Inf        Inf
 Inf        Inf        Inf        Inf         0.467886
  0.495763  Inf         0.952137   0.323355   0.265936
 Inf        Inf        Inf         0.50558    0.335587

julia> @einsum B[i, j] <<min>>= A[i, k] + A[k, j]
5×5 Tensor{SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, ElementLevel{Inf, Float64, Int64, Vector{Float64}}}}}:
 Inf        Inf        Inf       Inf         1.01394
 Inf        Inf        Inf       Inf        Inf
 Inf        Inf        Inf        0.973466   0.803473
  0.819118   0.781266   1.04182   0.64671    0.589291
  1.00134   Inf         1.45772   0.828935   0.671174

You could also write a format which is specialized to store exactly one nonzero in each row (here’s a permutation matrix):

julia> A = Tensor(Dense(SparsePoint(Element(0.0))), sparse(collect(1:5), randperm(5), ones(5)))
5×5 Tensor{DenseLevel{Int64, SparsePointLevel{Int64, Vector{Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}:
 1.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0

29 Likes

awesome :grinning:

Thank you for this package ! It seems robust and well designed.

Could it be a replacement to SparseSuite ? I have noticed “SparseArrays” in the dependencies (weakdeps, extensions, extras). Does it mean that what Finch.jl does is to translate it to (well designed) intermediate julia code, though still using SparseSuite behind the scenes ?

2 Likes

Finch seems to be a replacement for SparseArrays and thus SparseSuite (superset of capabilities and formats). And if that’s correct, I propose dropping them from Julia, making Julia [binaries] smaller.

I see it allows converting the 2D SparseMatrixCSC and 1D SparseVector from SparseArrays to Finch.Tensor, so in new code you might use just Finch.Tensor for everything.

And it also allows converting back (only failing if using features SparseArrays can’t support, like non-zero fill), though I’m guessing you wouldn’t need this except for compatibility with old code expecting such:

"""
    SparseMatrixCSC(arr::Union{Tensor, SwizzleArray})

Construct a sparse matrix from a tensor or swizzle. May reuse the underlying storage if possible.
"""
function SparseArrays.SparseMatrixCSC(arr::Union{Tensor, SwizzleArray})
..

"""
    sparse(arr::Union{Tensor, SwizzleArray})

Construct a SparseArray from a Tensor or Swizzle. May reuse the underlying storage if possible.
"""

I think, but not sure, that Finch is not just a superset of capabilities, but always faster for the 1- and 2D case too. It probably doesn’t give any advantage when using the same format at least, though possibly some other formats apply to also at least 2D?

I assume it’s always faster, probably not hard to benchmark for those cases, but noticed:
“Julia code is nototoriously [typo for notoriously] fussy to benchmark. …”

Can you also explain what @kwdef mutable struct VirtualSparseMatrixCSC is about?

I noticed you have a weak dependency on SparseArrays and has some SparseArraysExt:

Finch first translates high-level array code into FinchLogic, […] Using advanced schedulers, Finch optimizes FinchLogic and lowers it to FinchNotation, a more refined representation that precisely defines control flow.

Seems very cool, I suppose all transparent to users, they don’t need to klnow or care much about the details.

1 Like

Finch does not have a dependency on SparseArrays, and does not use any other sparse library under the hood. Finch automatically generates the sparse code from scratch, in Julia, meaning that Finch will work well with user-defined Julia structs. We use SparseArrays in the Finch test suite, and we define a compatibility layer in an extension package so that when users have already loaded both packages, they can work together harmoniously.

11 Likes

Sure! This is related to Finch’s staging (compile-and-run) mechanism. Finch uses the Julia type system to communicate the program and array formats through a generated function that compiles it. There’s more on that in this doc

3 Likes

Since you mention shortest path algorithms, is it conceivable that Graphs.jl and similar packages that currently use SparseArrays (without the solvers) will be able to migrate to using Finch.jl to do their sparse array operations? What would be the downsides, if any?

3 Likes

Is this a sort of lazy evaluation, which enables saving temporaries?

As a bystander, this sounds quite promising! I wonder how much of the work is/can be done at precompile time though?

3 Likes

For graph algorithms, Finch has a similar utility to SuiteSparseGraphBlas, as it allows you to express many graph algorithms in the language of linear algebra. This means that algorithms which involve many bulk data-parallel operations on edges (such as triangle counting or bellman-ford) would be good candidates. Algorithms like Dijkstra’s algorithm are currently less suited towards linear algebraic approaches, as they involve many serialized operations on individual vertex neighborhoods.

In general, if SparseArrays works great for your use case (+ and * on matrices and vectors), there is probably no need to use Finch. If you find there are situations where SparseArrays does not have solutions (operations involving more than two matrices, operators beyond + and *, datatypes beyond float, formats beyond CSC), then Finch’s generality might come in handy.

A few improvements are on the way that would improve Finch’s utility:

  1. We’re adding support for precompilation and caching to disk. Finch can currently only precompile kernels in it’s own precompile script. Since Finch uses a diverse set of metaprogramming approaches, which are called in a variety of situations, it has been a challenge to support precompilation in a way that is transparent to downstream packages. It would be easier if we only used generated functions, but Finch also uses eval when it considers the runtime sparsity pattern of inputs to make implementation decisions. Those with expertise are welcome to weigh in on precompilation and caching
  2. We’re working on adding multicore parallelism. This involves modifications to the whole stack, from building concurrent tensor formats to supporting load-balancing in the compiler, all the way up to making the optimizers aware of parallelism, so this has been a long road, but we’re hopeful we can produce an efficient multicore solution soon.
8 Likes

Reusing temporaries is one benefit of lazy evaluation, but there’s a bigger benefit in sparse kernels: zeros allow us to prune computations in other expressions. For example, in C .* (A * B) where A and B are dense, and C is sparse, we don’t need to compute the whole matrix multiply A * B. Instead we can compute only the dot products sum_k(A_ik * B_kj) corresponding to i and j where C_ij is nonzero.

12 Likes

Very interesting package!!!

I’m a non-expert and am having a bit of trouble starting with it. A MWE on how to go from SparseArrays.jl to Finch.jl would be most useful on the GitHub page. Just a suggestion. Something like:

julia> using SparseArrays, Finch
julia> n=10^6;   
julia> x, y = (sprand(n, n, 10/n, k->rand(-1:0.0001:1,k)) for _=1:2);
julia> X, Y = (Finch.Tensor(CSCFormat(),t) for t∈(x,y));
julia> Base.summarysize.((x,X)) ./10^6
(168.144512, 325.111552)
julia> @time x+y;   @time X+Y;   @time x*y;   @time X*Y;   
  0.492094 seconds (10 allocations: 312.750 MiB, 6.12% gc time)
 18.375342 seconds (571 allocations: 3.394 GiB, 1.98% gc time)
 12.074120 seconds (13 allocations: 1.662 GiB, 1.41% gc time)
112.855208 seconds (604 allocations: 20.310 GiB, 0.86% gc time, 0.00% compilation time)

On that note, why are the times for Finch so much higher? Am I using it incorrectly? Shouldn’t operations be faster due to multithreading?

1 Like

Thanks for your feedback! I’ll add a TODO for migrating from SparseArrays.

On the performance: The Finch high-level scheduler (which handles operations like + and *) is experimental, and built for generality. We’re currently working on optimizing the high-level scheduler to use more specialized techniques as well as adding parallelism. In the meantime, it may be better to write Finch IR directly:

julia> function faster_mul(A, B)
           w = Tensor(SparseByteMap(Element(0.0)))
           C = Tensor(CSCFormat())
           @finch begin
               C .= 0
               for j=_
                   w .= 0
                   for k=_, i=_; w[i] += A[i, k] * B[k, j] end
                   for i=_; C[i, j] = w[i] end
               end
               return C
           end
           return C
       end
@time x * y; @time X * Y; @time faster_mul(X, Y);
  2.599353 seconds (13 allocations: 1.663 GiB, 3.37% gc time)
 14.993908 seconds (601 allocations: 20.312 GiB, 3.12% gc time, 0.00% compilation time)
  4.000148 seconds (2.21 M allocations: 5.690 GiB, 2.28% gc time)
5 Likes