4 fixes
1. Original: "the dataset might contains offensive content"
Correction: "the dataset might contain offensive content"
Reason: The verb "contain" should be in its base form following "might."
2. Original: "collected form internet"
Correction: "collected from the internet"
Reason: The word "form" should be "from," and "the" should precede "internet."
3. Original: "and went through classic data processing algorithms and re-formatting practices"
Correction: "and went through classic data processing algorithms and reformatting practices"
Reason: "Re-formatting" should be written as "reformatting" without the hyphen.
4. Original: "the self-supervised causal language modedling objective"
Correction: "the self-supervised causal language modelling objective"
Reason: "modedling" is wrong spelling. It can be written as "modelling" or "modeling" instead.
The rest of the text appears free of spelling and grammar errors, so these four changes are the key fixes.
@@ -37,7 +37,7 @@ To quote the first two paragraphs of the [official paper](https://arxiv.org/abs/
 ## Model description
 
 OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective.
-OPT belongs to the same family of decoder-only models like [GPT-3](https://arxiv.org/abs/2005.14165). As such, it was pretrained using the self-supervised causal language modedling objective.
+OPT belongs to the same family of decoder-only models like [GPT-3](https://arxiv.org/abs/2005.14165). As such, it was pretrained using the self-supervised causal language modelling objective.
 
 For evaluation, OPT follows [GPT-3](https://arxiv.org/abs/2005.14165) by using their prompts and overall experimental setup. For more details, please read
 the [official paper](https://arxiv.org/abs/2205.01068).
@@ -128,14 +128,14 @@ dataset that was used in RoBERTa (Liu et al., 2019b)
 The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally
 to each dataset’s size in the pretraining corpus.
 
-The dataset might contains offensive content as parts of the dataset are a subset of
+The dataset might contain offensive content as parts of the dataset are a subset of
 public Common Crawl data, along with a subset of public Reddit data, which could contain sentences
 that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety.
 
 ### Collection process
 
-The dataset was collected form internet, and went through classic data processing algorithms and
-re-formatting practices, including removing repetitive/non-informative text like *Chapter One* or
+The dataset was collected from the internet, and went through classic data processing algorithms and
+reformatting practices, including removing repetitive/non-informative text like *Chapter One* or
 *This ebook by Project Gutenberg.*
 
 ## Training procedure
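As background on the corrected sentence: under the self-supervised causal language modelling (CLM) objective, the model learns to predict each token from the tokens that precede it. A minimal sketch of computing that objective with the Hugging Face `transformers` library is shown below; `facebook/opt-350m` is used only as an illustrative checkpoint choice, and any OPT checkpoint works the same way.

```python
# Minimal sketch: scoring text under the causal language modelling (CLM) objective
# with an OPT checkpoint. When labels equal the input ids, transformers shifts them
# internally so each token is predicted from its prefix (next-token cross-entropy).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-350m"  # illustrative choice; any OPT size behaves the same
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "OPT was predominantly pretrained with English text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"CLM loss: {outputs.loss.item():.3f}")            # average next-token cross-entropy
print(f"Perplexity: {torch.exp(outputs.loss).item():.1f}")
```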