Quaternion-, and up to, sedenion-valued neural networks: Parallelizing Hamilton product on GPUs/CUDA

Palli · October 15, 2020, 4:59pm

[Edit: See my next comment on August 2020 Sedenion-networks, and there and here a bit on Octonion-networks. I changed the title yet again replacing Octonion with Sedenion as to not make it too long.]

The specific four dimensional algebra of quaternion numbers, including the Hamilton product allows quaternion-valued models to consistently outperform equivalent real-valued neural networks.

Last time I checked complex-valued ANNs weren’t mainstream (people I asked didn’t even know of), and I first now read about quaternion-valued, neither brand-new, in fact very old idea, and it seems it could be a killer application for Julia. I think Julia would be ideal to explore this gap:

The Hamilton product is a powerful but expensive operation. In current implementations, this operation is not fully parallelized on GPUs, implying a longer training time for quaternion neural networks. A proper CUDA implementation of this product would drastically reduce this computation time and makes of QNNs a mandatory alternative to real-valued NNs.

All-else-equal, it seems quternion-valued would be better, but everything else isn’t equal, architectures are quckly evolivng with e.g. transformer networks (not mentioned in the survey article), and capsule-networks, it may not help only extending outdated convolutional to QCNN:

New architectures Despite a recent QCNN and QRNN, new neural networks architectures are still missing. For example, capsule networks, or generative adversarial neural networks could benefits from the introduction of quaternion numbers.

Deep Octonion Networks

Quaternion-valued multi-layer per-
ceptrons (QMLP), and autoencoders (QAE) have been intro-
duced to capture such latent dependencies, alongside to rep-
resent multidimensional data. Nonetheless, a three-layered
neural network does not benefit from the high abstraction
capability of DNNs. The paper proposes first to extend the
hyper-complex algebra to deep neural networks (QDNN) and,
then, introduces pre-trained deep quaternion neural networks
(QDNN-AE) with dedicated quaternion encoder-decoders
(QAE). The experiments conduced on a theme identification
task of spoken dialogues from the DECODA data set show,
inter alia, that the QDNN-AE reaches a promising gain of
2.2% compared to the standard real-valued DNN-AE.

Similarly to capsules, quaternions allow the QRNN to code internal dependencies by composing and processing multidimensional features as single entities, while the recurrent operation reveals correlations between the elements composing the sequence. We show that both QRNN and QLSTM achieve better performances than RNN and LSTM in a realistic application of automatic speech recognition. Finally, we show that QRNN and QLSTM reduce by a maximum factor of 3.3x the number of free parameters needed, compared to real-valued RNNs and LSTMs to reach better results, leading to a more compact representation of the relevant information.

Palli · October 15, 2020, 7:07pm

Add info here on Sedenion networks from August 2020 paper. First what I wrote before seeing that:

have a look at octernions. It seems like they could outperform quaternions by a tiny bit.

Octonion networks are probably worth a better look. From that paper, fig. 5, yes, accuracy curve is best, while only a little better than quaterions, but both much better than for real, and complex comes close at epoch 20.

However, number of parameters is 481,150 half of quaterions, and real-valued has 7.5x as many. It seems to roughly half for each step from real->complex->quaternions->octanions.

So I thought, since there are even higher-order complex number, lets look up sedenions.

There are also other ways to squeeze networks, where the original is e.g. 50x larger, while unclear why you shouldn’t be able to get 50*7.5=375 factor from such methods plus above, and maybe an additional 2x factor with below:

https://www.researchgate.net/publication/343255151_Metacognitive_Sedenion-Valued_Neural_Network_and_Its_Learning_Algorithm

In this paper, a metacognitive sedenion-valued neural network (Mc-SVNN) and its learning algorithm are proposed. Its application to diverse time-series prediction problems is presented. The Mc-SVNN contains two components: a sedenion-valued neural network that represents the cognitive component, and a metacognitive component, which serves to self-regulate the learning algorithm. At each epoch, the metacognitive component decides what, how, and when learning occurs. The algorithm deletes unnecessary samples and stores only those that are used. This decision is determined by the sedenion magnitude and the 15 sedenion phases. The Mc-SVNN is applied to four real-world forecasting problems: USD-to-euro currency exchange rate forecasting, the sunspot number time series, power demand forecasting, and daily temperature prediction in Abu Dhabi. Compared to existing methods, the Mc-SVNN demonstrates superior performance in time-series forecasting while using a smaller number of parameters.

This is the first time I’ve heard of a real application of (let alone higher order, that I only first now hear about):

Applying the Cayley–Dickson construction to the sedenions yields a 32-dimensional algebra, sometimes called the 32-ions or trigintaduonions.[1] It is possible to apply the Cayley–Dickson construction to the sedenions arbitrarily many times.

So I had to look up trigintaduonions, which do not have their own Wikipedia page and I couldn’t find anything relating them to neural networks, there’s however:

2014 paper:
https://www.researchgate.net/publication/267097033_An_algorithm_for_multiplication_of_trigintaduonions

In this paper we introduce efficient algorithm for the multiplication of trigintaduonions. The direct multiplication of two trigintaduonions requires 1024 real multiplications and 992 real additions. We show how to compute a trigintaduonion product with 498 real multiplications and 943 real additions. During synthesis of the discussed algorithm we use a fact that trigintaduonion multipli-cation may be represented by a vector-matrix product. Such representation provides a possibility to discover repeating elements in the matrix structure and to use specific properties of their mutual placement to decrease the number of real multiplications needed to compute the product of two trigintaduonions.

2013 paper:
https://www.researchgate.net/publication/256720850_An_algorithm_for_fast_multiplication_of_sedenions

In this work a rationalized algorithm for calculating the product of sedenions is presented which reduces the number of underlying multiplications. Therefore, reducing the number of multiplications in VLSI processor design is usually a desirable task. The computation of a sedenion product using the naive method takes 256 multiplications and 240 additions, while the proposed algorithm can compute the same result in only 122 multiplications (or multipliers – in hardware implementation case) and 298 additions.

@chakravala, I thought you might know anything about this stuff, or at least be interested, and @Elrod, an idea if we can make these calculations fast in Julia (faster then competing neural network libraries can, I doubt any of them even have this implemented yet, except for naively).

A sedenion is a 16-dimensional hypercomplex
number that is obtained by applying the Cayley–Dickson
construction to octonion complex numbers. Its algebra is
non-commutative, non-associative, and non-alternative, but
power-associative

This is from memory what I remembered, you loose one more each time, here no longer alternative (I’m familiar with preceding), I guess with trigintaduonions you next lose power-associative, but I’m curious what happens for the next higher, 64-ions, that do not even have a name. What propertiy more is there to lose?

Even if these higher-order are slower, they might be worthwhile, as you pay the training cost once. And accuracy is very important. I’m not sure how this affects inference, same order of slowdown for calculations, but since your network is smaller, you may be limited by memory bandwidth thee too anyway (but less so).

Elrod · October 15, 2020, 7:48pm

I’d take a struct of arrays approach, and just spill some registers on CPUs without AVX512*.
Preferably, arrange the data of S = size(A) array as either
(W, 16, cld(S[1], W), S[2:end]...) in one array, where W is the SIMD vector width of the host CPU, and the 16 corresponds to e_0, e_1, ..., e_15, or
(S[1], 16, S[2:end]...) if the former is too complicated.
You don’t literally want a struct of arrays, because that’d be 16 arrays, which would eat a lot of integer registers, and involve accessing memory spread out through the cache. It also will allow for the possibility of aliasing, which will force LLVM to be more conservative when optimizing.
The advantage of the former (where you split the first dim into pieces of length W) is that now each vector load is a constant offset from one another, rather than a stride multiple apart, which will require more indexing calculations and again occupy more integer registers.

*Although you may be able to look at the operations and see an order which doesn’t require spilling. The lazy way would be to just generate the code, and see if LLVM can figure it out.

Palli · October 15, 2020, 9:33pm

A sedenion is “only” 16 numbers, so to multiply (or add or whatever) two of those you need to access 32 numbers, so not all can be in registers (one of, i.e. 16 sub-numbers, could be in registers on x86-64). With SIMD maybe, with many sub-number sharing one register. I actually have an idea related to that before anyone starts implementing stuff.

Did you have the naive method in mind?

The computation of a sedenion product using the naive method takes 256 multiplications and 240 additions, while the proposed algorithm can compute the same result in only 122 multiplications (or multipliers – in hardware implementation case) and 298 additions.

It seems to be like matrix multiply, could use loops, while the other way isn’t regular (I guess, haven’t looked at that paper yet), so might be more complicated to implement (potentially code generatein in Julia could help).

I’ve now however read the Sedations-neural network paper, e.g.:

An SVNN that takes one sedenion input and predicts one
sedenion output uses 2 × N 1 sedenion multiplications, which
is equivalent to 16 × 16 × 2 × N 1 = 512 × N 1 real-
valued multiplications. The comparable real-valued network
with 16 inputs and outputs requires 32 × N 1 real-valued
multiplications. The use of sedenion numbers increases the
number of multiplications by a factor of 16. We would like to
note that despite the increase in multiplication costs, SVNNs
converge faster than real-valued networks.

But for inference of a network, you would pay 16x for sedenions (and I guess lower) multiply tax for the forward phase. That should be ok, with the best alternative, LSTM and LSTM-SARIMA having 211200/496 = 425x more parameters.

And that 16x is 512/32 seemingly assuming the naive approach, that could be improved.

To answer my own question here:

I’m curious what happens for the next higher, 64-ions, that do not even have a name. What propertiy more is there to lose?

indicates to me in the table there, that nothing changes after sedenions, 64-ions, 128-ions, etc. Could those higher order help even more in neural networks, why stop at 16-dimensional?

4-dimensional numbers were said to help e.g. because colors are 3-d, RGB, so maybe more than quaternions doesn’t help depending on applications. That might be wrong thinking, as in some sense a 2D picture is e.g. 1024 dimensions even for a tiny greyscale 32x32 icon. In color, times three, requring at least 12 hidden layers:

I guess many think the universal approximation theory requires only one hidden layer. Actually it’s only for scalar functions, for 2 and 3 dimensional it requires 2 hidden layers, and in general log2(dimensions) hidden layers. I saw this in a recent paper proving it, clearing up for me why deep learning is a thing, when networks shouldn’t have needed that.

The proof is likely affected by such hyper-complex numbers, assuming such or complex not used.

Elrod · October 15, 2020, 9:54pm

AVX512 has 32 vector registers, which would help.
However, I would prefer to have each register contain multiple of the same piece from different sedenions. E.g.,
v_1 = <e0_1, e0_2, e0_3, e0_4, e0_5, e0_6, e0_7, e0_8>, the e0 value from sedenions 1 through 8.
You could search for a pattern of minimum shuffles to actually get SIMD on your adds/multiplies, but I’d prefer to look for a pattern that may minimize reads/writes.

What operations are being performed?
If we have matrix multiplication, without AVX512 I’d look into calculating e0-e7 in one dot product, and then getting the remaining e8-e15 in a second dot product.
You then consume 8 registers in the accumulation. Note that this would mean each dot product calculates half of W sedenions, where W is the vector width.
You’re loading up to 32 registers total for x[i] * y[i], but then I’d look for patterns of updating e0-e7 (and similarly for e8-e15) while minimizing the number of live values you need at a time. E.g., maybe you can load e0_x and use it for everything it is needed for, and then replace it with e1_x.

Did you have the naive method in mind?

I’m not actually familiar with sedenions, so yes.

The computation of a sedenion product using the naive method takes 256 multiplications and 240 additions, while the proposed algorithm can compute the same result in only 122 multiplications (or multipliers – in hardware implementation case) and 298 additions.

How many fmas can we use? 240 fmas + 16 multiplications is faster than 122 fmas and 186 additions.

Palli · October 15, 2020, 10:03pm

I was actually just thinking of multiplying two such “numbers”, but you’re right, it would be matrix-multiply of those, similar for other neural networks. In their case they have 10 neurons in hidden layer and only 1 input, but it could be a fully connected layer, and I’m just not up-to-speed. I think it’s matrix-vector? At least not matrix-matrix.

And while you have AVX512 (wasn’t aware it’s 32 registers there, assuming less for regular AVX), not all do, and you likely want this on a GPU. I’m just not sure about the size of the register files there.

Did you see about future Intel chips, with an extra 8KB register file for matrix multiply, some special instructions? If I recall, coming next year.

Elrod · October 15, 2020, 10:15pm

Ideally, we could also carry some of the loaded numbers (e.g. from the vector) from one iteration to the next. If you really want to optimize this, I’d use a pen and paper to try and compare strategies, maybe even going so far as to build a jump model of the problem and architecture to try and minimize the number of loads and stores.

I’m not either.

For reference, AVX(2) has 0.5 KiB (16 registers * 32 bytes/register) and AVX512 has 2 KiB (32 registers * 64 bytes/register).
This 8 KiB (8 matrix registers; each register is 16 rows of 64 bytes for 1 KiB/register) is in addition to AVX512; they’re separate registers.
The matrix multiply on these tiles will only support (U)Int8 and BFloat16, at least on the first CPUs supporting it.

Palli · October 17, 2020, 11:34pm

FYI: If there’s anyone interested in sedenions (or the other hyper-complex numbers), and/or networks based on them, there’s a package for them. Not a package for networks using those, but Julia wouldn’t surprise me if I could just use this package and NN code and extended types just work.

The package is just outdated, and I’m trying to upgrade a package for the first time, maybe I’m doing it wrong (I made the file) and/or the package was to old for the upgrade process:

pharaldsson_sym:~/myjulia/julia/CayleyDickson.jl$ ~/julia-1.5.1/bin/julia gen_project.jl 
ERROR: LoadError: UndefVarError: METADATA_compatible_uuid not defined
Stacktrace:
 [1] uuid(::SubString{String}) at /home/pharaldsson_sym/myjulia/julia/CayleyDickson.jl/gen_project.jl:45
 [2] top-level scope at /home/pharaldsson_sym/myjulia/julia/CayleyDickson.jl/gen_project.jl:118
 [3] include(::Function, ::Module, ::String) at ./Base.jl:380
 [4] include(::Module, ::String) at ./Base.jl:368
 [5] exec_options(::Base.JLOptions) at ./client.jl:296
 [6] _start() at ./client.jl:506
in expression starting at /home/pharaldsson_sym/myjulia/julia/CayleyDickson.jl/gen_project.jl:98

Topic		Replies	Views
Knet vs MXNet for programmer new to ML Machine Learning knet	25	6774	October 6, 2018
How much faster is GPU compare to CPU GPU	16	26943	November 24, 2018
GPU simulation time of neural ODE program Machine Learning diffeq , flux	8	903	August 16, 2019
Basics steps to build an ANN? Machine Learning	24	3186	March 18, 2019
On Machine Learning and Programming Languages Machine Learning	48	8820	January 25, 2018

There is more …

Deep Octonion Networks

Related topics