I've implemented a Transformer model using Flux 0.8.2, based on The Annotated Transformer.
If this is of interest, I'm happy to contribute it (though I don't know where or how best to do that). In the meantime it's up on GitHub, so have a look.
I'm brand new to Julia, so I could use advice on code style and on improving performance. Both the token-processing speed and the convergence rate seem slower than the Annotated Transformer, which is written in Python: the trivial demo (included in the source) converges after roughly 30 epochs of 20 batches with 30 example sequences each, whereas the reference implementation converges after about 6 comparable epochs. I must be missing something, but I'm even less experienced with Python, and although I've scoured the code, I haven't found any meaningful differences from the reference.
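Schematically, the demo's training loop is shaped roughly like this (identifiers simplified, not the exact code from the repo; `transformer`, `make_demo_batch`, and `loss` stand in for the real model, batching, and loss functions), in case the problem is something obvious about how I'm driving Flux:

```julia
# Rough sketch of the demo loop using Flux 0.8-era APIs; names are placeholders.
using Flux

opt = ADAM()                         # plain Adam here; the real code may use a warmup schedule
ps  = Flux.params(transformer)       # trainable parameters of the (placeholder) model

for epoch in 1:30                                     # ~30 epochs before the demo converges
    data = [make_demo_batch(30) for _ in 1:20]        # 20 batches of 30 example sequences each
    Flux.train!((src, trg) -> loss(transformer, src, trg), ps, data, opt)
end
```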