IMHO:
The first step has already started: reverse engineering.
Apple Matrix coprocessor - Reverse Engineering
- https://chowdera.com/2021/02/20210201073032221f.html
- aarch64_amx.py · GitHub
- https://news.ycombinator.com/item?id=25559145 (10 months ago)
...
# AMX: Apple Matrix coprocessor
#
# This is an undocumented arm64 ISA extension present on the Apple M1. These
# instructions have been reversed from Accelerate (vImage, libBLAS, libBNNS,
# libvDSP and libLAPACK all use them), and by experimenting with their
# behaviour on the M1. Apple has not published a compiler, assembler, or
# disassembler, but by calling into the public Accelerate framework
# APIs you can get the performance benefits (fast multiplication of big
# matrices). This is separate from the Apple Neural Engine.
#
# Warning: This is a work in progress, some of this is going to be incorrect.
#
# This may actually be very similar to Intel Advanced Matrix Extension (AMX),
# making the name collision even more confusing, but it's not a bad place to
# look for some idea of what's probably going on.
...
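As the quoted comment says, the supported way to benefit from AMX is to go through the public Accelerate APIs rather than issuing the undocumented instructions yourself. Below is a minimal sketch of that route, assuming Python with `ctypes` and the standard CBLAS `cblas_sgemm` entry point exported by Accelerate. The helper names (`load_accelerate`, `matmul_fallback`, `sgemm`) and the framework path constant are my own for illustration, and whether a given call actually runs on the AMX unit is decided inside Accelerate, not by this code. Off macOS it falls back to a plain Python loop so the snippet still runs.

```python
import ctypes
import sys

# Enum values from the CBLAS interface (cblas.h).
CBLAS_ROW_MAJOR = 101
CBLAS_NO_TRANS = 111

# Assumed framework path; on recent macOS the binary is served from the
# dyld shared cache, so dlopen works even though no file is visible on disk.
ACCELERATE_PATH = "/System/Library/Frameworks/Accelerate.framework/Accelerate"


def load_accelerate():
    """Return a ctypes handle to Accelerate, or None when it is unavailable."""
    if sys.platform != "darwin":
        return None
    try:
        return ctypes.CDLL(ACCELERATE_PATH)
    except OSError:
        return None


def matmul_fallback(a, b, m, n, k):
    """Plain-Python row-major product, used when Accelerate cannot be loaded."""
    return [
        sum(a[i * k + p] * b[p * n + j] for p in range(k))
        for i in range(m)
        for j in range(n)
    ]


def sgemm(a, b, m, n, k):
    """Compute C = A @ B for row-major matrices given as flat lists.

    A is m x k, B is k x n, and the result C is m x n.
    """
    lib = load_accelerate()
    if lib is None:
        return matmul_fallback(a, b, m, n, k)

    c_a = (ctypes.c_float * (m * k))(*a)
    c_b = (ctypes.c_float * (k * n))(*b)
    c_c = (ctypes.c_float * (m * n))()

    # Declare the C signature so floats and pointers are passed correctly.
    lib.cblas_sgemm.restype = None
    lib.cblas_sgemm.argtypes = [
        ctypes.c_int, ctypes.c_int, ctypes.c_int,        # order, transA, transB
        ctypes.c_int, ctypes.c_int, ctypes.c_int,        # M, N, K
        ctypes.c_float,                                  # alpha
        ctypes.POINTER(ctypes.c_float), ctypes.c_int,    # A, lda
        ctypes.POINTER(ctypes.c_float), ctypes.c_int,    # B, ldb
        ctypes.c_float,                                  # beta
        ctypes.POINTER(ctypes.c_float), ctypes.c_int,    # C, ldc
    ]
    lib.cblas_sgemm(
        CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS,
        m, n, k,
        1.0, c_a, k,   # row-major A has leading dimension k
        c_b, n,        # row-major B has leading dimension n
        0.0, c_c, n,   # row-major C has leading dimension n
    )
    return list(c_c)


if __name__ == "__main__":
    print(sgemm([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 2, 2, 2))
```

In practice you would more likely reach Accelerate through a higher-level BLAS binding than through raw `ctypes`; the sketch only shows that nothing beyond the public API is required.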
Apple M1 Neural Engine - Reverse Engineering
Apple M1 GPU - Reverse Engineering
And as usual, you can check the latest status by adding “reverse engineering” to your search keywords.
Related thread: