Building OpenBLAS for PowerPC Defaults to Power8

I am trying to get OpenBLAS to compile for Power10. However, OpenBLAS seems to default to a Power8 build, which is ~2x slower than the Power10 implementation, since the latter has MMA support for BLAS 3 routines like DGEMM.

I believe I have identified a prime suspect in julia/Make.inc which forces Power8 without any checks for higher processors:

# If we are running on powerpc64le or ppc64le, set certain options automatically
ifneq (,$(filter $(ARCH), powerpc64le ppc64le))
JCFLAGS += -fsigned-char
OPENBLAS_DYNAMIC_ARCH:=0
OPENBLAS_TARGET_ARCH:=POWER8
BINARY:=64
# GCC doesn't do -march= on ppc64le
MARCH=
endif

I have tried overriding this with OPENBLAS_TARGET_ARCH=POWER10, both in Make.user and on the make command line, without luck.

Even with that override attempted, the log generated for OpenBLAS (at juliabuild_dir/usr/logs/OpenBLAS/OpenBLAS.log.gz) still says it is using POWER8:

 ---> flags+=(TARGET=POWER8)
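A quick way to confirm which target the build recorded is to search the compressed log for TARGET= assignments. This is only a sketch: it assumes the log is gzip-compressed and that the flag appears in the `TARGET=NAME` form shown above, and the function name `openblas_targets_in_log` is made up for illustration.

```shell
# Print each distinct TARGET=<NAME> value found in a gzip-compressed build log.
# Assumption: the flag is written as TARGET=<NAME>, as in the log line above.
openblas_targets_in_log() {
    gzip -dc -- "$1" | grep -oE 'TARGET=[A-Za-z0-9_]+' | sort -u
}

# Example (path taken from the post; adjust to your build directory):
# openblas_targets_in_log juliabuild_dir/usr/logs/OpenBLAS/OpenBLAS.log.gz
```

If the override had been picked up, the output would be expected to show TARGET=POWER10 rather than TARGET=POWER8.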

The following code snippet appears in the log file (from the OpenBLAS source? I can’t find it), and it also seems to be another problem, as the last elif sets TARGET=POWER8 without checking for newer processors:

# On Intel and most aarch64 architectures, engage DYNAMIC_ARCH.
# When using DYNAMIC_ARCH the TARGET specifies the minimum architecture requirement.
if [[ ${proc_family} == intel ]]; then
    flags+=(DYNAMIC_ARCH=1)
    # Before OpenBLAS 0.3.13, there appears to be a miscompilation bug with `clang` on setting `TARGET=GENERIC`
    # As that is the case, we're just going to be safe and only use `TARGET=GENERIC` on 0.3.13+
    if [ ${version_patch} -gt 12 ]; then
        flags+=(TARGET=GENERIC)
    else
        flags+=(TARGET=)
    fi
elif [[ ${target} == aarch64-* ]] && [[ ${bb_full_target} != *-libgfortran3* ]]; then
    flags+=(TARGET=ARMV8 DYNAMIC_ARCH=1)
# Otherwise, engage a specific target
elif [[ ${bb_full_target} == aarch64*-libgfortran3* ]]; then
    # Old GCC versions, with libgfortran3, can't build for newer
    # microarchitectures, let's just use the generic one
    flags+=(TARGET=ARMV8)
elif [[ ${target} == arm-* ]]; then
    flags+=(TARGET=ARMV7)
elif [[ ${target} == powerpc64le-* ]]; then
    flags+=(TARGET=POWER8)
fi

Grabbing one of the compiled BLAS routines from the log, we can see that -mcpu=power8 -mtune=power8 was used.

cc -O2 -DMAX_STACK_ALLOC=2048 -fopenmp -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DUSE_OPENMP -DNO_WARMUP -DMAX_CPU_NUMBER=512 -DMAX_PARALLEL_NUMBER=1 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.20\" -mcpu=power8 -mtune=power8 -mvsx  -fno-fast-math -DUSE_OPENMP -fopenmp -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME=saxpy -DASMFNAME=saxpy_ -DNAME=saxpy_ -DCNAME=saxpy -DCHAR_NAME=\"saxpy_\" -DCHAR_CNAME=\"saxpy\" -DNO_AFFINITY -I.. -I. -UDOUBLE  -UCOMPLEX -c axpy.c -o saxpy.o
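Rather than reading compile lines by eye, the CPU flags can be pulled out of the whole log at once. The helper below is a minimal sketch; `cpu_flags_used` is a hypothetical name, and it assumes the flags are spelled `-mcpu=...` and `-mtune=...` as in the command above.

```shell
# Print the distinct -mcpu=/-mtune= values found in compiler command lines
# read from standard input.
cpu_flags_used() {
    grep -oE -- '-m(cpu|tune)=[A-Za-z0-9_.+-]+' | sort -u
}

# Example: gzip -dc OpenBLAS.log.gz | cpu_flags_used
```

A build that mixes targets would show more than one -mcpu= value here.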

Another option that looked promising was using MCPU=power10 in Make.user because of this snippet in julia/Make.inc

# Set MCPU-specific flags
ifneq ($(MCPU),)
CC += -mcpu=$(MCPU)
CXX += -mcpu=$(MCPU)
FC += -mcpu=$(MCPU)
JULIA_CPU_TARGET ?= $(MCPU)
endif

However, this caused an error with corecompiler.jl at build time: power10 was not recognized as a valid CPU option, because JULIA_CPU_TARGET is used in julia/sysimage.mk. MARCH=power10 isn’t possible on PowerPC either, as GCC doesn’t use -march= there for some reason.

Lastly, there is one patch that I found that may do something for OpenBLAS’ makefile, but I have no idea how to use it or whether it took effect: julia/deps/patches/openblas-ofast-power.patch

Am I missing anything, or is this doomed to fail?

See

CC @vchuravy for powerpc stuff.

Great to see interest in Power10! I would recommend reaching out to IBM as well and maybe get their help on some of this.

Our problem is that we are building binaries for distribution, so we need to target the lowest common denominator. You could use libblastrampoline to switch to a more optimized version of BLAS after startup.

Best would be if we could get OPENBLAS_DYNAMIC_ARCH to work and then have support for Power8/9/10.

You might have to force JULIA_CPU_TARGET = pwr10 iff LLVM already supports that, see julia --cpu-target=help.
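Whether a given CPU name is known to the bundled LLVM can be checked before touching Make.user. This is a sketch under the assumption that the help output lists each CPU name as a separate whitespace-delimited word; `cpu_target_listed` is a hypothetical helper name.

```shell
# Succeed if the CPU name given as $1 appears as a whole word in the text
# read from standard input (for example the output of
# `julia --cpu-target=help`).
cpu_target_listed() {
    grep -qwF -- "$1"
}

# Example (stderr is merged in case the list is printed there):
# if julia --cpu-target=help 2>&1 | cpu_target_listed pwr10; then
#     echo "pwr10 is available as a CPU target"
# fi
```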

There is a #power channel on Slack if you want to chat about these issues.

@vchuravy Thank you for the response. I will reach out to IBM for help on this.

As for the OPENBLAS_DYNAMIC_ARCH, what would be required to make it work for PowerPC? I see it is passed directly to OpenBLAS’ DYNAMIC_ARCH build variable, and that seems to set some additional things.

Also, OpenBLAS/drivers/other contains dynamic_power.c, which seems to have all the guts available, and the Makefile in that directory references it. I wonder why they don’t use it…

ifeq ($(DYNAMIC_ARCH), 1)
ifeq ($(ARCH),arm64)
COMMONOBJS	+=  dynamic_arm64.$(SUFFIX)
else
ifeq ($(ARCH),power)
COMMONOBJS	+=  dynamic_power.$(SUFFIX)
else
ifeq ($(ARCH),zarch)
COMMONOBJS += dynamic_zarch.$(SUFFIX)
else
ifeq ($(ARCH),mips64)
COMMONOBJS += dynamic_mips64.$(SUFFIX)
else
ifeq ($(ARCH),loongarch64)
COMMONOBJS += dynamic_loongarch64.$(SUFFIX)
else
COMMONOBJS	+=  dynamic.$(SUFFIX)

According to Git, it was added in 2016 in "Improve PowerPC64 Makefile support" (JuliaLang/julia@5be57c7), and probably no one has checked since whether it is still necessary.

We would also need to adjust it here https://github.com/JuliaPackaging/Yggdrasil/blob/a36ece2d73a1c8b3121d1a6d24bf301ec5d30c12/O/OpenBLAS/common.jl#L145
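As a rough sketch only, the powerpc64le case of the build script quoted earlier could mirror the aarch64 case: keep POWER8 as the minimum target and turn on DYNAMIC_ARCH. The function `select_openblas_flags` is hypothetical and is not the actual Yggdrasil code. Whether this single flag is sufficient for the OpenBLAS version and compilers used there is an assumption that would need testing.

```shell
# Hypothetical helper, not the actual Yggdrasil build script: print the
# OpenBLAS make flags for a given target triplet.
select_openblas_flags() {
    case "$1" in
        aarch64-*)
            echo "TARGET=ARMV8 DYNAMIC_ARCH=1" ;;
        arm-*)
            echo "TARGET=ARMV7" ;;
        powerpc64le-*)
            # With DYNAMIC_ARCH, TARGET is the minimum supported CPU, so
            # POWER8 stays the baseline for the distributed binary.
            echo "TARGET=POWER8 DYNAMIC_ARCH=1" ;;
        *)
            # Other targets are outside the scope of this sketch.
            ;;
    esac
}

# Example:
# select_openblas_flags powerpc64le-linux-gnu
```

Keeping TARGET=POWER8 preserves the current baseline, so the same binary would still be expected to load on Power8 machines.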

Here’s what seems to be missing per one of our libblas experts:

Based on the links shared by Brandon, I think DYNAMIC_ARCH=1 is not set for powerpc64 here Yggdrasil/common.jl at a36ece2d73a1c8b3121d1a6d24bf301ec5d30c12 · JuliaPackaging/Yggdrasil · GitHub

Valentin, I think the goal was to have the same binary work for all Power8/9/10 architectures, and enabling this would allow it to do the CPU type check and execute the appropriate code.

As I understand it, this is the only flag needed in the build to get libblas to generate the right code on the right CPU arch.

thanks,

gerrit

Hi Gerrit, yes, if there are no contraindications to using DYNAMIC_ARCH for PowerPC as well, we can definitely enable that in our builds of OpenBLAS.

Note that if you have a local build of ILP64 OpenBLAS optimised for Power10, then you can use

using LinearAlgebra
LinearAlgebra.BLAS.lbt_forward("/path/to/your/libopenblas.so")

right now, without waiting for the next version of julia which will come with the optimised OpenBLAS.

@bmgroth and @Gerrit_Huizenga could you give [OpenBLAS_jll] Upgrade to new build optimised for PowerPC by giordano · Pull Request #48689 · JuliaLang/julia · GitHub a try?

I built julia with the following Make.user

USE_BINARYBUILDER_OPENBLAS=0

and then

make -C deps compile-openblas

but it fails during the self tests.

Thank you for all the input. We recently built Julia from the top of the tree with https://github.com/JuliaLang/julia/pull/48689, which includes the patch for the OpenBLAS library. With this, we see only minimal differences between the CI tests and the tests executed with local builds on Power9 and Power8.

FYI, the latest results. Test execution logs on Buildkite:
Test Summary: |     Pass  Error  Broken     Total      Time
Overall       | 40968190      3     401  40968594  61m50.5s

Power9, test cases run with Julia's Base.runtests():
Test Summary: |     Pass  Error  Broken     Total      Time
Overall       | 40964014      2     401  40964417  45m28.7s

Power8, test cases run with Julia's Base.runtests():
Test Summary: |     Pass  Error  Broken     Total      Time
Overall       | 40994264      2     401  40994667  42m19.0s

Hi Pranav, welcome!

Just to clarify, when you talk about minimal differences in CI tests, is that a good or bad thing? :smiley: Are you referring to the total execution time of the test suite? If so, I’m not sure that’s a good performance metric, OpenBLAS plays a small role there and there’s usually lots of variability due to network and I/O tasks.

Hi @giordano

When we started working on the test case differences between CI and local Power builds, the differences were far larger than what we observe after testing with the patch.
(We created a document with a detailed analysis, but as it’s a text document I cannot upload it.)
For your reference, I am attaching a screenshot of a brief analysis.