Julia Implementation of Transformer Neural Network Model

I developed an implementation of a Transformer model using Flux 0.8.2. The implementation is based on The Annotated Transformer.

If this is of interest, I am happy to contribute it, though I don’t know how or where best to do that. It’s up on GitHub, so have a look.

I’m brand new to Julia, so I could use advice on Julia code style and on improving performance. Both token processing speed and convergence seem slower than in the Annotated Transformer, which is written in Python. The trivial demo (included in the source) converges after roughly 30 epochs of 20 batches with 30 example sequences each, whereas the reference implementation converged after 6 similar epochs. I must be missing something, but I’m even less experienced with Python, and although I’ve scoured the code, I haven’t found any meaningful differences from the reference.
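To put the convergence gap in numbers, here is a rough sketch in plain Julia. It assumes the reference run's "similar" epochs also use 20 batches of 30 sequences, which is only an approximation; the name sequences_seen is made up for this illustration and is not part of the repository.

```julia
# Training volume seen before convergence, using the figures quoted above.
# Assumption: the reference run's "similar" epochs also use
# 20 batches of 30 example sequences each.
const BATCHES_PER_EPOCH = 20
const SEQUENCES_PER_BATCH = 30

# Hypothetical helper: number of example sequences processed over `epochs` epochs.
sequences_seen(epochs) = epochs * BATCHES_PER_EPOCH * SEQUENCES_PER_BATCH

julia_total = sequences_seen(30)      # this Julia implementation
reference_total = sequences_seen(6)   # the Annotated Transformer

println("Julia implementation: ", julia_total, " sequences")
println("Reference:            ", reference_total, " sequences")
println("Ratio:                ", julia_total / reference_total)
```

Under that assumption, the Julia version needs about five times as many example sequences as the reference to converge on the same toy task.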


Also, for a working demo of the Transformer, see the function transformer_demo() in src/transformer_demo.jl.

Thanks for this. I almost did exactly this about a month back but had other priorities.

@jekbradbury might be able to help.