Based on @ChrisRackauckas’s comment here, fusing the matrix multiplication and bias add in `Dense` should allow the existing vectorized `tanh` broadcast to kick in. It seems like IntelVectorMath.jl also accomplishes this, but AIUI using a fused `mul!` should additionally remove the intermediate allocation incurred when adding the bias vector (PyTorch does the same thing).
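
For illustration, here is a minimal sketch of the idea, not the actual diff in this PR. The function names `dense_unfused` and `dense_fused` are hypothetical, and the sketch assumes `x` is a matrix with one sample per column. It relies on the 5-argument `mul!(C, A, B, α, β)` from `LinearAlgebra` (Julia 1.3+), which computes `C = A*B*α + C*β` in place:

```julia
using LinearAlgebra

# Unfused: `W * x` allocates a temporary, then the broadcast allocates the output.
dense_unfused(W, b, x) = tanh.(W * x .+ b)

# Fused (hypothetical helper): write the bias into the output buffer first,
# then accumulate W*x into it, so no separate temporary for W*x is needed.
function dense_fused(W, b, x)
    y = similar(x, size(W, 1), size(x, 2))  # single output allocation
    y .= b                                  # broadcast the bias across columns
    mul!(y, W, x, true, true)               # y = W*x + y
    y .= tanh.(y)                           # in-place activation over a plain array
    return y
end
```

Both versions should agree up to floating-point rounding; the fused one allocates only the output array, and the activation then runs as a plain broadcast over that array.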