What do tokens know about their characters and how do they know it?
Abstract
Pre-trained language models (PLMs) that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information, despite lacking explicit access to the character composition of tokens. Here, studying a range of models (e.g., GPT-J, BERT, RoBERTa, GloVe), we probe what word pieces encode about character-level information by training classifiers to predict the presence or absence of a particular alphabetical character in a token, based on its embedding (e.g., probing whether the model embedding for "cat" encodes that it contains the character "a"). We find that these models robustly encode character-level information and, in general, larger models perform better at the task. We show that these results generalize to characters from non-Latin alphabets (Arabic, Devanagari, and Cyrillic). Then, through a series of experiments and analyses, we investigate the mechanisms through which PLMs acquire English-language character information during training and argue that this knowledge is acquired through multiple phenomena, including a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings.
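The probing setup described above can be illustrated with a minimal sketch: train a linear classifier to predict, from a token's embedding alone, whether a given character occurs in that token. The sketch below assumes GloVe-style static embeddings loaded from a plain-text file and uses scikit-learn's LogisticRegression as the probe; the file path, function names, and choice of probe are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a character-presence probe (illustrative, not the paper's code).
# Assumptions: static word embeddings in a whitespace-separated text file,
# a logistic-regression probe, and the target character "a".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def load_embeddings(path):
    """Load word -> vector pairs from a whitespace-separated text file."""
    vocab, vectors = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vocab.append(parts[0])
            vectors.append(np.asarray(parts[1:], dtype=np.float32))
    return vocab, np.stack(vectors)

def probe_character(vocab, vectors, char="a"):
    """Train a linear probe to predict whether `char` occurs in each token."""
    labels = np.array([int(char in token) for token in vocab])
    X_train, X_test, y_train, y_test = train_test_split(
        vectors, labels, test_size=0.2, random_state=0, stratify=labels
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)  # held-out accuracy for this character

if __name__ == "__main__":
    # Hypothetical embedding file; substitute any token-embedding matrix,
    # e.g., a PLM's input embedding table paired with its tokenizer vocabulary.
    vocab, vectors = load_embeddings("glove.6B.300d.txt")
    print(f"Probe accuracy for 'a': {probe_character(vocab, vectors, 'a'):.3f}")
```

Above-chance held-out accuracy from such a probe is the kind of evidence the abstract refers to when it says embeddings "robustly encode character-level information."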