Hi, I am new to Transformers.jl and trying to follow the tutorial (Tutorial · Transformers.jl). I wonder where I can find more details about this call:
t = decoder_trf(e, m, attention_mask, cross_attention_mask)
In particular, how do I modify the above so that a causal mask is applied to the decoder input (to avoid peeking ahead)? Many thanks!
You don’t need to do that manually. The TransformerDecoderBlock constructor creates a CausalMultiheadQKVAttenOp for the self attention, which already does the causal masking. The main purpose of attention_mask in the decoder is to pass something like a LengthMask so that padding does not affect the computation.
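A minimal sketch of what that looks like (the sequence lengths are made up, e / m / decoder_trf are the tutorial's variables, and I'm assuming LengthMask is imported from NeuralAttentionlib as below):

```julia
using NeuralAttentionlib: LengthMask

# e: decoder input embeddings, m: encoder output, as in the tutorial call above.
# Causal masking of the self attention is applied internally by
# CausalMultiheadQKVAttenOp, so attention_mask only needs to mark the padding.
attention_mask = LengthMask([7, 5, 3])        # un-padded length of each decoder sequence in the batch
cross_attention_mask = LengthMask([9, 6, 4])  # un-padded length of each encoder sequence

t = decoder_trf(e, m, attention_mask, cross_attention_mask)
```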
Yes, but to pass your own attention mask from the input, you would probably need to call the inner-most constructor with MultiheadQKVAttenOp. You can look at NeuralAttentionlib for more kinds of masks.
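The mask part might look roughly like this (a sketch, not tested; it assumes the decoder block was built with a plain MultiheadQKVAttenOp as described above, so the causal behaviour has to come from the mask you pass in, and that CausalMask / LengthMask are imported from NeuralAttentionlib):

```julia
using NeuralAttentionlib: CausalMask, LengthMask

# combine a causal mask with a padding (length) mask;
# NeuralAttentionlib masks compose with & and |
my_attention_mask = CausalMask() & LengthMask([7, 5, 3])

# pass the combined mask through the decoder call as before
# (no causal mask is applied internally when using MultiheadQKVAttenOp)
t = decoder_trf(e, m, my_attention_mask, cross_attention_mask)
```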