The other answers cover the math well, but I think the “why do you need attention?” statement is worth making (and answers the more engineering-y question of “how/when?”):
DNNs typically operate on fixed-size tensors (often with a variable batch size, which you can safely ignore). To incorporate an input of non-fixed size, you need some way of converting it into a fixed size; for example, reducing a sentence of variable length to a single prediction value. You have many choices for combining the tensors from each token in the sentence: max, min, mean, median, sum, etc. Attention is a weighted mean, where the weights are computed from the similarity between a query and each key. The query might represent something you know about the sentence or the context (“this is a sentence from a toaster review”), the key represents something you know about each token (“this is the word embedding tensor”), and the value is the tensor you actually average using those weights.
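To make the "weighted mean" framing concrete, here's a minimal sketch of single-query dot-product attention in NumPy (the function and variable names are mine, not from any particular library; scaling by the square root of the dimension follows the usual convention):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis; output sums to 1.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(query, keys, values):
    """Weighted mean of `values`, weighted by query-key similarity.

    query:  (d,)    one vector for what we're looking for
    keys:   (n, d)  one vector per token
    values: (n, dv) one tensor per token to be averaged
    """
    scores = keys @ query / np.sqrt(query.shape[0])  # (n,) similarities
    weights = softmax(scores)                         # (n,) sums to 1
    return weights @ values                           # (dv,) fixed size

# A 3-token "sentence" collapses to one fixed-size vector;
# the output shape would be the same for 5 or 500 tokens.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([1.0, 0.0])
pooled = attention(query, keys, values)  # shape (1,)
```

Because the softmax weights sum to 1, the result always lies between the min and max of the values, which is exactly what makes it a (weighted) mean rather than a sum.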