Technically, softmax is not implemented as presented but via exp(x_i - max(x)), summing those shifted terms in the denominator. But maybe I am missing something.
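For concreteness, a minimal sketch of that numerically stable formulation (plain NumPy, 1-D input assumed):

```python
import numpy as np

def softmax(x):
    # Shift by the max so the largest exponent is exp(0) = 1,
    # which avoids overflow; the shift cancels in the ratio.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)
```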
Furthermore, the residuals are used exactly because the networks can't learn the identity function; but they can learn zero, at which point the residual is `f(x) = x + g(x)` with `g(x) ≈ 0` (i.e. approximately 0).
It is also the case that `f(x) = x + g(x)` makes it easier for gradients to flow through.
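As a toy illustration of that form (the block `g` here is just a stand-in for whatever sub-network the residual wraps):

```python
import numpy as np

def residual(x, g):
    # f(x) = x + g(x): if g learns to output ~0, f reduces to the
    # identity, and the additive skip gives gradients a direct path.
    return x + g(x)

x = np.array([0.5, -1.0, 2.0])
print(residual(x, lambda v: np.zeros_like(v)))  # ~identity: [ 0.5 -1.   2. ]
```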
How can `x -> -inf` occur in the first place when nearly everything is within [-2, 2], it's just a dot product, and there's normalization before that too?
The "nearly" in your comment is exactly what obscures the issue as presented.
Enough weights don't fall under that "nearly" that we require more bits per weight to cover those edge cases. If we were able to delete the "nearly" we would need fewer bits (smaller models).
The idea is that if your range of values is small enough, you need fewer bits to distinguish between meaningfully different values. The problem is that exp(x) << exp(y) for sufficiently wide ranges [x, y], so when you normalize in the softmax and subsequently quantize, you don't get the fidelity you need and too much information is lost between layers. The proposed solution is that modifying the softmax step slightly brings x and y close enough to zero that exp(x) and exp(y) end up close together, so more compact quantizations become useful instead of useless.
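A rough sketch of that fidelity argument, using a made-up uniform 8-bit quantizer over [0, 1] purely for illustration (real quantization schemes are more involved):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def quantize(p, bits=8):
    # Toy uniform quantizer: snap probabilities to a (2^bits - 1)-level grid.
    levels = 2**bits - 1
    return np.round(p * levels) / levels

wide   = np.array([-12.0, -10.0, -8.0, 0.0])  # wide logit range
narrow = np.array([-1.2, -1.0, -0.8, 0.0])    # values clustered near zero

for logits in (wide, narrow):
    print(quantize(softmax(logits)))
# Wide range: the three small probabilities all quantize to 0, so the
# distinction between them is lost. Narrow range: they stay distinct.
```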