Papers
arxiv:2410.24050

Clustering Head: A Visual Case Study of the Training Dynamics in Transformers

Published on Oct 31, 2024
Authors:
,

Abstract

This paper introduces the sparse modular addition task and examines how transformers learn it. We focus on transformers with embeddings in R^2 and introduce a visual sandbox that provides comprehensive visualizations of each layer throughout the training process. We reveal a type of circuit, called "clustering heads," which learns the problem's invariants. We analyze the training dynamics of these circuits, highlighting two-stage learning, loss spikes due to high curvature or normalization layers, and the effects of initialization and curriculum learning.

Community

Your need to confirm your account before you can post a new comment.

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2410.24050 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2410.24050 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.24050 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.