I came across this post and, not being an expert in either NumPy or how it would be done in Julia, was curious if anyone here could weigh in with idiomatic Julia versions of some of these examples for comparison.
Note that the author also promises a follow-up piece in which a “better suggestion” will be revealed.
Here is how I would do the first three examples:
Example 1
Say A is a 5×5 matrix, x is a length-5 vector, and you want to find the vector y such that Ay = x.
A = rand(5,5)
x = rand(5)
y = A\x
Example 2
Say A is a stack of 100 5×5 matrices, given as a 100×5×5 array. And say x is a stack of 100 length-5 vectors, given as a 100×5 array. And say you want to solve Aᵢyᵢ = xᵢ for 1 ≤ i ≤ 100.
A = [rand(5,5) for i in 1:100]
x = [rand(5) for i in 1:100]
y = [A[i]\x[i] for i in 1:100]
Example 3
Suppose: A is a K×L×M array, B is an L×N array, and C is a K×M array. And say that for each k and n, you’d like to compute the mean over the L and M dimensions. That is, you want

D_kn = 1/(LM) × ∑_lm A_klm B_ln C_km.
K = 2
L = 3
M = 4
N = 5
A = rand(K,L,M)
B = rand(L,N)
C = rand(K,M)
D = [1/(L * M) * sum(A[k,l,m] * B[l,n] * C[k,m] for l in 1:L, m in 1:M) for k in 1:K, n in 1:N]  # K×N matrix, matching D_kn
I don’t like this last one much, either.
I was actually curious about the best equivalent for this part, because they say:
But nowadays, everything is GPU and if you’ve got big arrays, you probably don’t want to use loops in any language.
That made me curious if Julia breaks the “loops are bad” mold here too. I’m not very familiar with the KernelAbstractions.jl / CUDA.jl side of things though, so I don’t know how much effort it would be to write this in a GPU-friendly way. (A lot of the complaints in the article are about the ergonomics of doing things, stuff being unnecessarily hard, rather than about being able to do it at all.)
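For what it’s worth, the last example at least can be written without explicit loops, using only broadcasting and a matrix product, which should run unchanged on GPU arrays. This is just my own loop-free rewrite (reusing the A, B, C, K, L, M from the earlier post), not what the article proposes:

E = dropdims(sum(A .* reshape(C, K, 1, M); dims=3); dims=3)  # E[k,l] = ∑_m A[k,l,m] * C[k,m]
D = (E * B) ./ (L * M)                                       # D[k,n] = ∑_l E[k,l] * B[l,n] / (LM)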
They explicitly say “given as a 100×5×5 array”, so a rand(100, 5, 5) would be closer to the spec. (And similarly for x.)
The last example comes out really neat with Tullio.jl or similar, but it’s nice to see that even the native Julia version comes out pretty nice and readable.
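For the curious, a Tullio version of that last example might look roughly like this (a sketch reusing the A, B, C, L, M defined above):

using Tullio
@tullio D[k, n] := A[k, l, m] * B[l, n] * C[k, m]  # sum over the repeated indices l and m
D ./= L * M                                        # turn the sum into a mean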
I guarantee the “better suggestion” is only “better” because it’s less powerful and can be much worse when scrutinized in more ways. Operations on arrays are always going to be less intuitive than explicit iteration, but the complicated stuff is NOT something you ever want to be writing out a loop for anyway. Taking the example of the (10, 2, 3, 40) shape resulting from broadcasting 3-wide and 1×2 indices, I prefer how Julia forces us to acknowledge that they are broadcast together, size(A[:, CartesianIndex.([1 2 3], [1, 2]), :]), because arrays of indices aren’t treated differently from ranges. But let’s be real: do we want to write out the 4-level loop, preallocate the correct array size, and order the indices correctly in either Python or Julia? We almost always don’t need to anyway, and array syntax can intuitively handle the simpler cases just fine. The last example is just a mockery of someone using Einstein summation instead of the “obvious” suboptimal loops.
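A concrete sketch of that shape, with a made-up A whose middle dimensions are chosen to fit those indices:

A = rand(10, 3, 2, 40)                           # hypothetical sizes, picked so the indices are in bounds
size(A[:, CartesianIndex.([1 2 3], [1, 2]), :])  # (10, 2, 3, 40)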
This can be simplified to just y = A .\ x. But as @digital_carver pointed out, I think the real challenge here is to come up with a syntax for batched linsolves that would dispatch to batched GPU kernels.
One option here is to use a BlockDiagonalMatrix
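A rough sketch of that idea using plain SparseArrays (blockdiag) rather than a dedicated block-diagonal type, with the same A and x as in Example 2:

using SparseArrays
Abig = blockdiag(sparse.(A)...)      # 500×500 sparse block-diagonal matrix
xbig = reduce(vcat, x)               # stacked right-hand sides, length 500
ybig = Abig \ xbig                   # one sparse solve for the whole batch
y = [ybig[5i-4:5i] for i in 1:100]   # split back into 100 length-5 solutions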
Actually, taking a quick look at cuBLAS, it looks like an array of pointers to arrays is the expected data structure for batched operations: 1. Introduction — cuBLAS 12.9 documentation. So broadcasting over arrays of arrays may in fact be exactly the right syntax.
That said, I wonder how relevant batching actually is for linsolves. From a quick scroll through the cuSOLVER docs, I don’t see a lot of support for it.
EDIT: There’s the gemvStridedBatched function for strided batches (rand(5, 5, 100) et cetera), and the only batched function for dense matrices in cuSOLVER (syev) also expects strided batches, so the above is probably too optimistic about arrays of arrays and broadcasting. The ideal syntax would have first-class support for strided batches.
Maybe JAX / Reactant.jl are able to optimize this kind of thing?
EDIT: at the moment, it seems Reactant.jl can’t do linear solves:
julia> using Reactant
julia> A = Reactant.to_rarray(rand(10, 10));
julia> b = Reactant.to_rarray(rand(10));
julia> f = @compile \(A, b)
ERROR: Scalar indexing is disallowed.
Invocation of getindex(::TracedRArray, ::Union{Int, TracedRNumber{Int}}) resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.
If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] errorscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:151
[3] _assertscalar(op::String, behavior::GPUArraysCore.ScalarIndexing)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:124
[4] assertscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:112
[5] getindex(a::Reactant.TracedRArray{Float64, 2}, index::Int64)
@ Reactant.TracedRArrayOverrides ~/.julia/packages/Reactant/cTiTU/src/TracedRArray.jl:114
[6] getindex
@ ./subarray.jl:343 [inlined]
[7] _mapreduce(f::typeof(iszero), op::typeof(&), ::IndexLinear, A::SubArray{…})
@ Base ./reduce.jl:431
[8] _mapreduce_dim(f::Function, op::Function, ::Base._InitialValue, A::SubArray{…}, ::Colon)
@ Base ./reducedim.jl:337
[9] mapreduce
@ ./reducedim.jl:329 [inlined]
[10] overloaded_all
@ ~/.julia/packages/Reactant/cTiTU/src/TracedRArray.jl:821 [inlined]
[11] _all(f::Function, x::SubArray{Reactant.TracedRNumber{…}, 1, Reactant.TracedRArray{…}, Tuple{…}, true}, dims::Function)
@ Reactant ~/.julia/packages/Reactant/cTiTU/src/Overlay.jl:168
[12] all
@ ./reducedim.jl:995 [inlined]
[13] _istril
@ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/generic.jl:1362 [inlined]
[14] istril
@ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/generic.jl:1355 [inlined]
[15] istril
@ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/generic.jl:1354 [inlined]
[16] \
@ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/generic.jl:1122 [inlined]
[17] \(none::Reactant.TracedRArray{Float64, 2}, none::Reactant.TracedRArray{Float64, 1})
@ Reactant ./<missing>:0
[18] getproperty
@ ./Base.jl:49 [inlined]
[19] size
@ ~/.julia/packages/Reactant/cTiTU/src/TracedRArray.jl:425 [inlined]
[20] \
@ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/generic.jl:1120 [inlined]
[21] call_with_reactant(::typeof(\), ::Reactant.TracedRArray{Float64, 2}, ::Reactant.TracedRArray{Float64, 1})
@ Reactant ~/.julia/packages/Reactant/cTiTU/src/utils.jl:0
[22] make_mlir_fn(f::typeof(\), args::Tuple{…}, kwargs::@NamedTuple{}, name::String, concretein::Bool; toscalar::Bool, return_dialect::Symbol, args_in_result::Symbol, construct_function_without_args::Bool, do_transpose::Bool, input_shardings::Nothing, output_shardings::Nothing, runtime::Val{…}, verify_arg_names::Nothing, argprefix::Symbol, resprefix::Symbol, resargprefix::Symbol, num_replicas::Int64, optimize_then_pad::Bool)
@ Reactant.TracedUtils ~/.julia/packages/Reactant/cTiTU/src/TracedUtils.jl:307
[23] make_mlir_fn
@ ~/.julia/packages/Reactant/cTiTU/src/TracedUtils.jl:237 [inlined]
[24] compile_mlir!(mod::Reactant.MLIR.IR.Module, f::typeof(\), args::Tuple{…}, callcache::Dict{…}, sdycache::Dict{…}; fn_kwargs::@NamedTuple{}, optimize::Bool, cudnn_hlo_optimize::Bool, shardy_passes::Symbol, no_nan::Bool, transpose_propagate::Symbol, reshape_propagate::Symbol, optimize_communications::Bool, assert_nonallocating::Bool, backend::String, raise::Bool, raise_first::Bool, donated_args::Symbol, optimize_then_pad::Bool, runtime::Val{…}, kwargs::@Kwargs{})
@ Reactant.Compiler ~/.julia/packages/Reactant/cTiTU/src/Compiler.jl:1156
[25] compile_mlir!
@ ~/.julia/packages/Reactant/cTiTU/src/Compiler.jl:1110 [inlined]
[26] compile_xla(f::Function, args::Tuple{…}; client::Nothing, serializable::Bool, kwargs::@Kwargs{…})
@ Reactant.Compiler ~/.julia/packages/Reactant/cTiTU/src/Compiler.jl:2929
[27] compile_xla
@ ~/.julia/packages/Reactant/cTiTU/src/Compiler.jl:2911 [inlined]
[28] compile(f::Function, args::Tuple{…}; sync::Bool, kwargs::@Kwargs{…})
@ Reactant.Compiler ~/.julia/packages/Reactant/cTiTU/src/Compiler.jl:2982
[29] top-level scope
@ ~/.julia/packages/Reactant/cTiTU/src/Compiler.jl:2135
Some type information was truncated. Use `show(err)` to see complete types.
I assume Tullio.jl (GitHub - mcabbott/Tullio.jl: ⅀) would address the issues mentioned in the blog post.
Finch.jl also comes to mind; like Tullio.jl, it has an @einsum macro.
Einsum syntax only addresses forward BLAS-type operations. The blog post opens with a gripe about the lack of a sane interface for batched linsolve, which is an inverse, LAPACK-type operation. Granted, this is quite academic because batched linear solvers don’t seem very common (as mentioned above, the only batched dense solver in cuSOLVER is the symmetric/hermitian eigensolver), but it’s what the author seems to want.
Given the array shape the author uses here, I would probably write this using eachslice if I wanted to avoid loops:
julia> let A = rand(5, 5, 100)
           x = rand(5, 100)
           y = similar(x)
           Ab = eachslice(A, dims=3)
           xb = eachslice(x, dims=2)
           yb = eachslice(y, dims=2)
           yb .= Ab .\ xb
           y
       end
5×100 Matrix{Float64}:
1.26329 0.846035 … 2.02272
0.408383 -1.11563 2.82224
0.180778 0.364129 4.49043
0.239752 0.345266 -4.85347
-0.749327 0.263876 -0.945626
For completeness, since you’re preallocating, the in-place version would be
ldiv!.(yb, lu!.(Ab), xb) # explicit factorization required
Could this not simply be a classic case of “numpy has no way of distinguishing vectors of arrays from higher-dimensional arrays”? You cannot express this in numpy, so it must be 3d.
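In Julia, at least, the two notions are distinct types, e.g.:

Avec = [rand(5, 5) for _ in 1:100]   # a vector of 100 separate 5×5 matrices
Aarr = rand(100, 5, 5)               # a single 100×5×5 array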
We landed it on the EnzymeJax side just 2 days back. I am working on adding it to Reactant: feat: factorizations by avik-pal · Pull Request #1234 · EnzymeAD/Reactant.jl · GitHub.
But once the frontend code lands, it will support all forms of fancy batching. For GPU and TPU we already use nice batched solvers. On CPU we currently use a loop, but eventually the plan is to lower to MKL, which has batched CPU solvers.