Speed Comparison Python v Julia for custom layers

I would not expect Dense to be faster than doing the matrix multiplication yourself. In the end, it's doing the same thing under the hood anyway, as you can check with @less Wq_dense(xi) from the REPL. I just thought that it's closer to using a linear layer in Python, since that should include a bias as well.
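As a quick sanity check of the "same thing under the hood" point, here is a minimal sketch. It assumes a recent Flux version in which Dense accepts the `in => out` form and exposes its parameters as the fields `weight` and `bias` (older releases used different field names). The sizes and the names `d`, `x`, `manual` and `matches` are just for illustration.

```julia
using Flux

# A Dense layer with the default (identity) activation and a bias.
d = Dense(4 => 3)
x = rand(Float32, 4)

# The same computation written out by hand: matrix multiply plus bias.
manual = d.weight * x .+ d.bias

# Compare the layer's output with the hand-written version.
matches = d(x) ≈ manual
```

If `matches` is true, any speed difference between the two should come from small overheads rather than from a different algorithm.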

Also, be careful when using a single type parameter in your struct CustomDense{Q}, as this requires the fields Wq, Wk and Wv to all have the same type. Dense also stores the activation function in its type, so you cannot use different activations across your fields. That is of course fine if you never need or want that.
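To make the type parameter point concrete, here is a small sketch in plain Julia without Flux. `ToyDense`, `SharedParam` and `SeparateParams` are hypothetical stand-ins, not your actual CustomDense or Flux's Dense; `ToyDense` only mimics the idea that the activation function is part of the layer's type.

```julia
# Hypothetical stand-in for a layer whose activation is part of its type.
struct ToyDense{F,M}
    weight::M
    σ::F
end

(d::ToyDense)(x) = d.σ.(d.weight * x)

# One shared parameter: all three fields must have exactly the same type.
struct SharedParam{Q}
    Wq::Q
    Wk::Q
    Wv::Q
end

# One parameter per field: each field may have its own concrete type.
struct SeparateParams{Q,K,V}
    Wq::Q
    Wk::K
    Wv::V
end

W = ones(Float32, 2, 2)
a = ToyDense(W, identity)
b = ToyDense(W, tanh)   # different activation, hence a different type

same  = SharedParam(a, a, a)      # fine: identical field types
mixed = SeparateParams(a, b, a)   # fine: field types may differ

# Mixing activations under a single type parameter has no matching constructor.
shared_mixed_fails = try
    SharedParam(a, b, a)
    false
catch err
    err isa MethodError
end
```

With separate parameters you keep fully concrete field types (so no performance penalty) while still being free to mix activations later.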