Ok, let’s make things a bit more concrete. On a Banana Pi BPI-F3, which, unlike the processor you linked, supports the RISC-V vector extension v1.0, a 4096x4096 matrix-matrix multiplication yields about 7.3 GFLOPS with OpenBLAS using the specialised vector kernel (riscv64_zvl256b):
```julia
julia> versioninfo()
Julia Version 1.12.0-DEV.1628
Commit aa05c98998* (2024-11-13 19:09 UTC)
Platform Info:
  OS: Linux (riscv64-unknown-linux-gnu)
  CPU: 8 × Spacemit(R) X60
  WORD_SIZE: 64
  LLVM: libLLVM-19.1.1 (ORCJIT, generic-rv64)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

julia> using LinearAlgebra

julia> @time LinearAlgebra.peakflops(4096; ntrials=3)
 57.756208 seconds (33 allocations: 768.024 MiB, 0.61% gc time)
7.275193854554525e9

julia> using OpenBLAS_jll

julia> strip(unsafe_string(ccall((:openblas_get_config64_, libopenblas), Ptr{UInt8}, ())))
"OpenBLAS 0.3.28 USE64BITINT DYNAMIC_ARCH NO_AFFINITY riscv64_zvl256b MAX_THREADS=512"
```
On my 10-year-old laptop with an x86_64 Haswell CPU I instead get:
```julia
julia> @time LinearAlgebra.peakflops(4096; ntrials=3)
  2.947837 seconds (33 allocations: 768.001 MiB, 0.27% gc time)
1.4948372787025372e11
```
Note the roughly 20x difference, both in the total time to finish the three matmuls and in the FLOPS delivered.
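For reference, peakflops essentially times a double-precision matmul and divides the nominal operation count by the elapsed time. A rough back-of-the-envelope sketch of where the number comes from (not the actual implementation, which adds warm-up and multiple trials):

```julia
# A dense n×n double-precision matmul costs about 2n^3 floating-point
# operations, so the delivered rate is roughly 2n^3 / elapsed time.
using LinearAlgebra
n = 4096
A, B = rand(n, n), rand(n, n)
t = @elapsed A * B
@show 2 * n^3 / t   # FLOPS estimate, e.g. ~7.3e9 on the BPI-F3
```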
Having played with some riscv64 boards lately, I can’t say I’m very impressed by their performance for numerical workloads at the moment (and let’s not even talk about compilation latency: I started a new build of LLVM 20 hours ago and it’s still at 80%), and I wouldn’t expect a major leap from the laptop you found.
With this I’m not saying things can’t improve; I’m very much hopeful they will, because I find this architecture interesting, also for its openness. But reality at the moment is very bleak, however much vendors sprinkle “high performance” around. Current riscv64 boards are usable to play around with, but for heavy numerical workloads they are still very disappointing, unless you have loads of cores and use them as cheap and efficient accelerators (competitive with high-powered GPUs when you factor in cost and/or energy usage) rather than as the main CPU, which is what a few vendors are doing already.