Julia is said to be an expressive language for ML: you do the obvious thing, write down the math expression, and it works. But how does that work in practice? To try it out, I want to implement my own (potentially stupid) idea for a language model.

This idea is based on three basic assumptions.

- Parallel computation across all nodes is good (like the Transformer).
- Low complexity is good (like RNN to LSTM).
- Attention is good (from GRU to Transformer and so on).

So I propose a (potentially stupid) idea based on binary attention.

The main connections are inspired by WaveNet: the stride doubles from layer to layer (1, 2, 4, 8, …).
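As a quick sanity check on that stride schedule, here is a small sketch in plain Julia. `strides_for` is a hypothetical helper name, not part of any library. It assumes the stride doubles at every layer and that connections run in both directions, so a layer with stride `s` extends each token's reach by `s` positions on each side.

```julia
# Hypothetical helper: list the strides needed so that every token can
# reach every other token in a sequence of n tokens.
# Assumption: strides double (1, 2, 4, ...) and each layer with stride s
# extends a token's reach by s positions to the left and to the right.
function strides_for(n::Integer)
    strides = Int[]
    reach = 0          # farthest distance a token can see so far
    s = 1
    while reach < n - 1
        push!(strides, s)
        reach += s
        s *= 2
    end
    return strides
end

strides_for(8)    # [1, 2, 4]
strides_for(10)   # [1, 2, 4, 8]
```

Under these assumptions the number of layers grows roughly like log2 of the sequence length, which is where the "low complexity" hope comes from.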

Each connection is a binary attention.

● If the left side is empty, set the sigmoid result to 1, meaning attending fully. If the right side is empty, set the sigmoid result to 0.

● Go in both directions, and use this set of binary attention blocks to weight the result’s value. This means that all tokens have attended both forward and backward!

● Share weights at each layer. Keep adding layers until every token has received input from every other token through the previous layers.

● Repeat this process N times.
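To pin down what the steps above mean before reaching for an ML framework, here is a minimal sketch in plain Julia (forward pass only, no training). Several things in it are my own assumptions rather than settled parts of the idea:

- the sigmoid gate `g` is the weight on the right input, so the output is `g * right + (1 - g) * left`
- `w` and `b` are hypothetical gate parameters
- the forward and backward results are simply averaged as a placeholder
- one `(w, b)` pair is shared across all strides within a repetition, and each of the N repetitions gets its own pair

```julia
using LinearAlgebra: dot

sigmoid(x) = 1 / (1 + exp(-x))

# Binary attention between two token vectors.
# Assumption: the gate g weights the right input.
#   left empty  -> g = 1 -> output is `right` unchanged
#   right empty -> g = 0 -> output is `left` unchanged
# `w` (length 2d) and `b` are hypothetical learned parameters.
function binary_attention(left, right, w, b)
    left === nothing && return right
    right === nothing && return left
    g = sigmoid(dot(w, vcat(left, right)) + b)
    return g .* right .+ (1 - g) .* left
end

# One layer at stride s over a vector of token vectors `xs`.
function layer(xs, s, w, b)
    n = length(xs)
    at(i) = 1 <= i <= n ? xs[i] : nothing   # `nothing` marks an empty side
    fwd = [binary_attention(at(i - s), xs[i], w, b) for i in 1:n]  # look left
    bwd = [binary_attention(xs[i], at(i + s), w, b) for i in 1:n]  # look right
    # Placeholder combination (assumption): average the two directions.
    return [(f .+ r) ./ 2 for (f, r) in zip(fwd, bwd)]
end

# One repetition: strides 1, 2, 4, ... with shared (w, b),
# until every token can reach every other token.
function block(xs, w, b)
    s, reach = 1, 0
    while reach < length(xs) - 1
        xs = layer(xs, s, w, b)
        reach += s
        s *= 2
    end
    return xs
end

# N repetitions; `params` holds one (w, b) tuple per repetition.
function model(xs, params)
    for (w, b) in params
        xs = block(xs, w, b)
    end
    return xs
end

# Usage: 6 tokens of dimension 4, N = 2 repetitions, random parameters.
d = 4
xs = [randn(d) for _ in 1:6]
params = [(randn(2d), 0.0) for _ in 1:2]
ys = model(xs, params)   # 6 output vectors, each of length d
```

Every position in a layer is computed independently of the others, so a layer could run in parallel across tokens. The comprehension form here is only the most readable way to write it down.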

So that is the (potentially stupid) idea, and I want to try implementing it. How do I go about it?