|
<!--Copyright 2020 The HuggingFace Team. All rights reserved. |
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with |
|
the License. You may obtain a copy of the License at |
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on |
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the |
|
specific language governing permissions and limitations under the License. |
|
|
|
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
rendered properly in your Markdown viewer. |
|
|
|
--> |
|
|
|
# ALBERT |
|
|
|
<div class="flex flex-wrap space-x-1"> |
|
<a href="https://huggingface.co/models?filter=albert"> |
|
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet"> |
|
</a> |
|
<a href="https://huggingface.co/spaces/docs-demos/albert-base-v2"> |
|
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"> |
|
</a> |
|
</div> |
|
|
|
## Overview |
|
|
|
The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, |
|
Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training |
|
speed of BERT: |
|
|
|
- Splitting the embedding matrix into two smaller matrices. |
|
- Using repeating layers split among groups. |
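
Both choices are visible directly in the model configuration. Below is a minimal sketch (the values shown are those of the `albert-base-v2` checkpoint):

```python
from transformers import AlbertConfig

# Load the configuration of a pretrained ALBERT checkpoint.
config = AlbertConfig.from_pretrained("albert-base-v2")

# Factorized embedding parameterization: the embedding size is much smaller
# than the hidden size (128 vs. 768 for albert-base-v2).
print(config.embedding_size, config.hidden_size)

# Cross-layer parameter sharing: the transformer layers are split into
# `num_hidden_groups` groups (1 for albert-base-v2, i.e. all 12 layers share weights).
print(config.num_hidden_layers, config.num_hidden_groups)
```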
|
|
|
The abstract from the paper is the following: |
|
|
|
*Increasing model size when pretraining natural language representations often results in improved performance on |
|
downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, |
|
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction |
|
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows |
|
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a |
|
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks |
|
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and |
|
SQuAD benchmarks while having fewer parameters compared to BERT-large.* |
|
|
|
Tips: |
|
|
|
- ALBERT is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather
  than the left.
- ALBERT uses repeating layers, which results in a small memory footprint; however, the computational cost remains
  similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same
  number of (repeating) layers.
- The embedding size E is kept different from the hidden size H because the embeddings are context independent (one
  embedding vector represents one token), whereas the hidden states are context dependent (one hidden state represents
  a sequence of tokens), so it makes sense to have H >> E. The embedding matrix is also large since it is V x E (V
  being the vocab size); with E < H, it has fewer parameters.
- Layers are split in groups that share parameters (to save memory).
- Next sentence prediction is replaced by a sentence-order prediction: the inputs are two consecutive sentences A and
  B, fed either as A followed by B or as B followed by A, and the model must predict whether they have been swapped
  (see the sketch after this list).
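
As an illustration of the last tip, here is a minimal sketch of sentence-order prediction at inference time with [`AlbertForPreTraining`]; the point is the shape and meaning of `sop_logits`, not the particular values:

```python
import torch
from transformers import AlbertTokenizer, AlbertForPreTraining

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForPreTraining.from_pretrained("albert-base-v2")

# Two consecutive sentences, fed as a sentence pair (A, B). If padding is needed,
# it should go on the right, since ALBERT uses absolute position embeddings.
inputs = tokenizer("The cat sat on the mat.", "Then it fell asleep.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# `prediction_logits` are the masked-LM logits; `sop_logits` are the
# sentence-order prediction logits: index 0 = original order (A then B),
# index 1 = swapped order (B then A).
print(outputs.prediction_logits.shape)  # (batch_size, sequence_length, vocab_size)
print(outputs.sop_logits.shape)  # (batch_size, 2)
```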
|
|
|
|
|
This model was contributed by [lysandre](https://huggingface.co/lysandre). The JAX version of this model was contributed by
|
[kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT). |
|
|
|
## Documentation resources |
|
|
|
- [Text classification task guide](../tasks/sequence_classification) |
|
- [Token classification task guide](../tasks/token_classification) |
|
- [Question answering task guide](../tasks/question_answering) |
|
- [Masked language modeling task guide](../tasks/masked_language_modeling) |
|
- [Multiple choice task guide](../tasks/multiple_choice) |
|
|
|
## AlbertConfig |
|
|
|
[[autodoc]] AlbertConfig |
|
|
|
## AlbertTokenizer |
|
|
|
[[autodoc]] AlbertTokenizer |
|
- build_inputs_with_special_tokens |
|
- get_special_tokens_mask |
|
- create_token_type_ids_from_sequences |
|
- save_vocabulary |
|
|
|
## AlbertTokenizerFast |
|
|
|
[[autodoc]] AlbertTokenizerFast |
|
|
|
## Albert specific outputs |
|
|
|
[[autodoc]] models.albert.modeling_albert.AlbertForPreTrainingOutput |
|
|
|
[[autodoc]] models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput |
|
|
|
## AlbertModel |
|
|
|
[[autodoc]] AlbertModel |
|
- forward |
|
|
|
## AlbertForPreTraining |
|
|
|
[[autodoc]] AlbertForPreTraining |
|
- forward |
|
|
|
## AlbertForMaskedLM |
|
|
|
[[autodoc]] AlbertForMaskedLM |
|
- forward |
|
|
|
## AlbertForSequenceClassification |
|
|
|
[[autodoc]] AlbertForSequenceClassification |
|
- forward |
|
|
|
## AlbertForMultipleChoice |
|
|
|
[[autodoc]] AlbertForMultipleChoice

- forward
|
|
|
## AlbertForTokenClassification |
|
|
|
[[autodoc]] AlbertForTokenClassification |
|
- forward |
|
|
|
## AlbertForQuestionAnswering |
|
|
|
[[autodoc]] AlbertForQuestionAnswering |
|
- forward |
|
|
|
## TFAlbertModel |
|
|
|
[[autodoc]] TFAlbertModel |
|
- call |
|
|
|
## TFAlbertForPreTraining |
|
|
|
[[autodoc]] TFAlbertForPreTraining |
|
- call |
|
|
|
## TFAlbertForMaskedLM |
|
|
|
[[autodoc]] TFAlbertForMaskedLM |
|
- call |
|
|
|
## TFAlbertForSequenceClassification |
|
|
|
[[autodoc]] TFAlbertForSequenceClassification |
|
- call |
|
|
|
## TFAlbertForMultipleChoice |
|
|
|
[[autodoc]] TFAlbertForMultipleChoice |
|
- call |
|
|
|
## TFAlbertForTokenClassification |
|
|
|
[[autodoc]] TFAlbertForTokenClassification |
|
- call |
|
|
|
## TFAlbertForQuestionAnswering |
|
|
|
[[autodoc]] TFAlbertForQuestionAnswering |
|
- call |
|
|
|
## FlaxAlbertModel |
|
|
|
[[autodoc]] FlaxAlbertModel |
|
- __call__ |
|
|
|
## FlaxAlbertForPreTraining |
|
|
|
[[autodoc]] FlaxAlbertForPreTraining |
|
- __call__ |
|
|
|
## FlaxAlbertForMaskedLM |
|
|
|
[[autodoc]] FlaxAlbertForMaskedLM |
|
- __call__ |
|
|
|
## FlaxAlbertForSequenceClassification |
|
|
|
[[autodoc]] FlaxAlbertForSequenceClassification |
|
- __call__ |
|
|
|
## FlaxAlbertForMultipleChoice |
|
|
|
[[autodoc]] FlaxAlbertForMultipleChoice |
|
- __call__ |
|
|
|
## FlaxAlbertForTokenClassification |
|
|
|
[[autodoc]] FlaxAlbertForTokenClassification |
|
- __call__ |
|
|
|
## FlaxAlbertForQuestionAnswering |
|
|
|
[[autodoc]] FlaxAlbertForQuestionAnswering |
|
- __call__ |
|
|