SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Paper: arXiv:2501.17161
Good write-up, though it misses the dominant attention sink in current decoder-only models (the first token soaks up a disproportionate share of attention mass); a demonstration is in this Colab, and a minimal sketch follows below:
https://colab.research.google.com/drive/1Fcgug4a6rv9F-Wej0rNveiM_SMNZOtrr?usp=sharing
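For readers unfamiliar with the phenomenon: the "attention sink" refers to queries across many layers placing heavy attention on the very first token, regardless of its content. The Colab above is the full demonstration; the snippet below is only a minimal sketch of one way to measure it, assuming `gpt2` as a stand-in decoder-only model and the Hugging Face `transformers` API (both are my choices for illustration, not taken from the comment):

```python
# Minimal sketch: measure the first-token attention sink in a decoder-only model.
# Assumption: "sink mass" is defined here as the average attention weight that
# all later query positions place on token 0, per layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; any decoder-only checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple of (batch, heads, seq, seq) tensors, one per layer.
for layer, attn in enumerate(out.attentions):
    # Attention from every query after position 0 onto key 0,
    # averaged over heads and query positions.
    sink_mass = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer:2d}: mean attention on first token = {sink_mass:.3f}")
```

In most decoder-only checkpoints the printed sink mass rises sharply after the first few layers, which is the behavior the comment is pointing at and why streaming-attention schemes keep the initial tokens in the KV cache.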