There is so much great stuff happening in RISC-V land

Ok, let’s make things a bit more concrete. On a Banana Pi BPI-F3, which, unlike the processor you linked, supports the vector v1.0 extension, a 4096x4096 matrix-matrix multiplication yields 7.3 GFLOPS with OpenBLAS using the specialised vector kernel (riscv64_zvl256b):

julia> versioninfo()
Julia Version 1.12.0-DEV.1628
Commit aa05c98998* (2024-11-13 19:09 UTC)
Platform Info:
  OS: Linux (riscv64-unknown-linux-gnu)
  CPU: 8 × Spacemit(R) X60
  WORD_SIZE: 64
  LLVM: libLLVM-19.1.1 (ORCJIT, generic-rv64)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

julia> @time LinearAlgebra.peakflops(4096; ntrials=3)
 57.756208 seconds (33 allocations: 768.024 MiB, 0.61% gc time)
7.275193854554525e9

julia> using OpenBLAS_jll

julia> strip(unsafe_string(ccall(((:openblas_get_config64_), libopenblas), Ptr{UInt8}, () )))
"OpenBLAS 0.3.28  USE64BITINT DYNAMIC_ARCH NO_AFFINITY riscv64_zvl256b MAX_THREADS=512"

For comparison, on my 10-year-old laptop with an x86_64 Haswell CPU I get

julia> @time LinearAlgebra.peakflops(4096; ntrials=3)
  2.947837 seconds (33 allocations: 768.001 MiB, 0.27% gc time)
1.4948372787025372e11

Note the ~20x difference both in the total time to finish the 3 matrix multiplications and in the FLOPS delivered.
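As a back-of-the-envelope check of what these numbers mean: an n×n matrix multiplication does roughly 2n³ floating-point operations, and peakflops reports that divided by the time of one multiplication, so dividing by a third of the total wall time above (which also includes allocating the matrices) lands in the same ballpark:

julia> n = 4096; t_avg = 57.756208 / 3;   # average wall time per trial on the BPI-F3

julia> 2 * n^3 / t_avg                    # close to the 7.275e9 reported by peakflops
7.1389...e9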

Having played with some riscv64 boards lately I can’t say I’m very impressed by their performance at the moment for numerical workloads (and let’s not talk about compilation latency: I started a new build of LLVM 20 hours ago and it’s still at 80%), and I wouldn’t expect a major leap from the laptop you found.

With this I’m not saying things can’t improve; I’m very much hopeful they will, because I find this architecture interesting, also for its openness. But the reality at the moment is very bleak, however much vendors sprinkle “high performance” around. Current riscv64 boards are fine to play around with, but for heavy numerical workloads they are still very disappointing, unless you have loads of cores and use them as cheap and efficient accelerators (competitive with high-powered GPUs once you factor in cost and/or energy usage) rather than as the main CPU, which is what a few vendors are doing already.

What’s the latency of vfmacc on this CPU? That’s a bad enough result that something is likely going very wrong. Also, how does it perform with a single BLAS thread? I wonder if the chip just has incredibly slow inter-core communication.
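One crude way to get at the first question from Julia (just a sketch: it exercises scalar fused multiply-add via muladd rather than vfmacc itself, and it assumes BenchmarkTools is available) is to time a dependent FMA chain, whose per-iteration time is bounded below by the FMA latency:

using BenchmarkTools

# Dependent chain of fused multiply-adds: each muladd needs the previous x,
# so the time per iteration roughly tracks the latency of one FMA.
function fma_chain(x, n)
    for _ in 1:n
        x = muladd(x, 0.9999999, 1.0e-9)
    end
    return x
end

@btime fma_chain($(rand()), 10^7)   # total time / 10^7 ≈ ns per dependent fma

Dividing the reported time by 10^7 gives a rough latency in nanoseconds; multiplying by the clock frequency turns it into cycles.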

Oh, I’ve got access to one

Let’s try:

julia> @time LinearAlgebra.peakflops(4096; ntrials=3)
153.067245 seconds (33 allocations: 768.024 MiB, 0.43% gc time)
2.72432232993256e9

julia> versioninfo()
Julia Version 1.12.0-DEV.1628
Commit aa05c98998* (2024-11-13 19:09 UTC)
Platform Info:
  OS: Linux (riscv64-unknown-linux-gnu)
  CPU: 64 × unknown
  WORD_SIZE: 64
  LLVM: libLLVM-19.1.1 (ORCJIT, generic-rv64)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)

MacBook M1:

julia> @time LinearAlgebra.peakflops(4096; ntrials=3)
  2.432995 seconds (33 allocations: 768.001 MiB, 0.46% gc time)
1.7754504829368015e11

julia> versioninfo()
Julia Version 1.12.0-DEV.1874
Commit 49428e8fa2* (2025-01-10 23:19 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin23.4.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m1)
  GC: Built with stock GC
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Very much not competitive.

Back to the Banana Pi BPI-F3:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> BLAS.get_num_threads() # check the setting for good measure
1

julia> @time LinearAlgebra.peakflops(4096; ntrials=3)
192.494952 seconds (33 allocations: 768.024 MiB, 0.18% gc time)
2.15950419971436e9

It’s 3.3x slower. ~The CPU has 8 cores though~ OpenBLAS was using 4 threads above, so there’s indeed a bit of degradation from ideal scaling (this system also has 2 cores kept permanently busy by the root user, so scaling up to all 8 cores is basically never possible).

What about threads=1 with @time LinearAlgebra.peakflops(512; ntrials=20)? For me on a Ryzen 3600, I get 5.1e10 with OpenBLAS and 1.3e11 with MKL. I suspect this may close the difference, since the chip on the Banana Pi has a pretty small (512 KB) L2 cache, doesn’t have an L3 cache at all, and only has a 32-bit memory bus (compared to 128-bit for a normal desktop with dual-channel RAM).
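For scale, the Float64 working sets at the two sizes (counting A, B and the result matrix):

julia> 3 * 4096^2 * 8 / 2^20   # MiB for three 4096×4096 Float64 matrices
384.0

julia> 3 * 512^2 * 8 / 2^20    # MiB for three 512×512 matrices
6.0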

julia> @time LinearAlgebra.peakflops(512;ntrials=20)
  4.088654 seconds (518 allocations: 80.038 MiB, 31.34% gc time, 4.41% compilation time: 100% of which was recompilation)
2.162252751085078e9

Similar FLOPS to the above.

That’s fascinating… I’m pretty sure this means that they only have a single scalar fma unit and that their vector instructions just get scalarized…

julia> 1000000000/(2.162252751085078e9) # assumed 1 GHz clock / measured flop/s
0.4624806232750415 # ≈ 0.46 cycles per flop, i.e. roughly 2 flops (one FMA) per cycle

I guess that’s probably the cheapest and easiest way to support the vector instruction set… OTOH, if they actually get a 256-bit fma unit (along with a 256-bit load/store unit), there’s a free 4x speedup.
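Spelling that out (keeping the same 1 GHz clock assumption as the estimate above):

julia> lanes = 256 ÷ 64              # Float64 lanes in a 256-bit vector register
4

julia> peak = lanes * 2 * 1.0e9      # flop/s if one full-width FMA retired per cycle
8.0e9

julia> peak / 2.162252751085078e9    # vs. the measured single-thread Float64 rate
3.6998...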

The vendor claims 256-bit vector bandwidth.

Side note: when going to single precision (still with a single BLAS thread) the FLOPS more than double, which is a bit surprising:

julia> @time LinearAlgebra.peakflops(512; ntrials=20, eltype=Float32)
  0.947790 seconds (203 allocations: 40.005 MiB, 3.42% gc time)
6.151364582705779e9
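For reference, the single-thread ratio works out to almost 3x, where a fixed-width vector unit with twice as many Float32 lanes would naively buy only 2x:

julia> 6.151364582705779e9 / 2.162252751085078e9   # Float32 vs Float64, single thread
2.8448...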

Oh, I wonder if they have a 256-bit data bus but only 32-bit multipliers, so they have to run fp64 through the multiplier 3 or 4 times?

One other question: Does Octavian.jl work on this? I forget whether it’s x86 only or not…

It does work, but it’s ~2x slower than OpenBLAS:

$ OPENBLAS_NUM_THREADS=4 ./julia -t4 -q
julia> using LinearAlgebra, Octavian

julia> T = Float32; N = 512; A = randn(T, N, N); B = randn(T, N, N); C = similar(A); @time matmul!(C, A, B); @time mul!(C, A, B); # this is slower due to large compilation latency
102.407832 seconds (11.59 M allocations: 560.755 MiB, 0.84% gc time, 100.20% compilation time)
  2.659912 seconds (634.12 k allocations: 30.912 MiB, 99.31% compilation time)

julia> T = Float32; N = 512; A = randn(T, N, N); B = randn(T, N, N); C = similar(A); @time matmul!(C, A, B); @time mul!(C, A, B);
  0.024793 seconds
  0.013732 seconds

julia> BLAS.set_num_threads(1)

julia> T = Float32; N = 512; A = randn(T, N, N); B = randn(T, N, N); C = similar(A); @time matmul_serial!(C, A, B); @time mul!(C, A, B); # this is slower due to large compilation latency
  0.228132 seconds (31.38 k allocations: 1.537 MiB, 62.78% compilation time)
  0.044581 seconds

julia> T = Float32; N = 512; A = randn(T, N, N); B = randn(T, N, N); C = similar(A); @time matmul_serial!(C, A, B); @time mul!(C, A, B);
  0.091448 seconds
  0.044410 seconds

Edit: Octavian.jl becomes more competitive at larger sizes, but still ~30% slower:

julia> T = Float32; N = 6144; A = randn(T, N, N); B = randn(T, N, N); C = similar(A); @time matmul!(C, A, B); @time mul!(C, A, B);
 38.011257 seconds
 29.625035 seconds
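(That figure is just the ratio of the two timings:)

julia> 38.011257 / 29.625035   # Octavian vs OpenBLAS wall time at N = 6144
1.2830...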

Btw this branch should work with LLVM 20: Commits · Zentrik/julia · GitHub.

Oh, I see how my message was confusing, but 20 was the number of hours, not the version of LLVM.

There are many things at play: a) memory bandwidth, b) cache sizes, c) the actual CPU throughput.

The BPI-F3 only has 1 MB of cache. That fits a 350 × 350 Float64 matrix, but then you also need to store the results, temporaries…
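(For reference, that 350 is roughly the largest n for which an n×n Float64 matrix fits in 1 MB:)

julia> sqrt(1e6 / 8)   # n² Float64 values × 8 bytes ≈ 1 MB
353.55...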

Haswell has 68.3 GB/s of peak memory bandwidth; the BPI-F3 has 10.6 GB/s.

I am writing this on a Windows machine so I won’t copy and paste the output, but I tried the

julia> @time LinearAlgebra.peakflops(4096; ntrials=3)

command on a Raspberry Pi 4 and a Raspberry Pi 5, both with 8 GB of RAM and Julia 1.11.2 on Raspberry Pi OS.

The timings were:
RPi 5 - 16.1 s
RPi 4 - 37.9 s
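Converting those wall times to a rough flop rate the same way as earlier in the thread (assuming the 3 multiplications dominate the timing):

julia> 2 * 4096^3 / (16.1 / 3)   # RPi 5
2.5609...e10

julia> 2 * 4096^3 / (37.9 / 3)   # RPi 4
1.0879...e10

So with this crude conversion the Pi 5 lands around 25 GFLOPS, roughly 3.5x the BPI-F3 figure at the top of the thread.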

:astonished: So RISC-V processors have good performance considering their low price.

I mean, a Raspberry Pi isn’t exactly my reference for a powerful processor for heavy numerical workloads; if it is for you, good, but then we wildly disagree on the definition of “powerful” and that’s it. You can do lots of useful things on a Raspberry Pi (I once used one to set up an NTP server in a private network and it did the job great), but I wouldn’t personally use one as my main workstation for, say, solving differential equations, if I need to get results back in a finite amount of time.
