I could not find a rigorous definition of the feature norms in the paper. Which layer or block do the tokens originate from?
Regarding the attention maps, I assume the norms are computed from the linearly transformed tokens used to calculate the attention matrices. However, after LayerNorm every token should have a norm of √d (where d is the feature dimension, assuming identity affine parameters), so I would expect the norms to be essentially identical across tokens.
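For concreteness, here is a minimal PyTorch sketch of what I mean (the dimension 768 and the shapes are just examples, and I am assuming LayerNorm without learnable affine parameters):

```python
import torch

d = 768                              # hypothetical feature dimension
tokens = torch.randn(4, 16, d)       # (batch, sequence, dim), random example

# LayerNorm without learnable affine parameters
ln = torch.nn.LayerNorm(d, elementwise_affine=False)
normed = ln(tokens)

# Each normalized token has zero mean and unit variance across its d
# features, so the sum of squares is d and the L2 norm is sqrt(d).
norms = normed.norm(dim=-1)
print(norms)        # every entry ≈ sqrt(768) ≈ 27.71
print(d ** 0.5)
```

With learnable affine parameters the norms can of course differ per token, which is why I am asking where exactly the feature norms are taken from.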
Am I misunderstanding something?