The dominant sequence models rely on complex recurrent or convolutional neural
networks comprising an encoder and a decoder. The most effective models also incorporate
an attention mechanism to connect the encoder and decoder. We present the Transformer, 
a new, simplified architecture built entirely on attention mechanisms, eliminating the 
need for recurrence and convolutions. Experiments on two machine translation tasks
demonstrate that the Transformer outperforms previous approaches in translation quality,
while also being more parallelizable and requiring considerably less training time.
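The attention mechanism the abstract refers to can be illustrated with scaled dot-product attention, the core operation of the Transformer. The sketch below is a minimal NumPy illustration of that single operation, not the paper's full multi-head implementation; the function name and array shapes are chosen here for exposition.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Returns an (n_q, d_v) array: each output row is a weighted
    average of the rows of V, with weights given by query-key similarity.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V
```

Because every output row depends only on matrix products over the whole sequence, all positions can be computed in parallel, which is the source of the parallelizability advantage over step-by-step recurrent models.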