Commit 01108db by icognito · verified · 1 Parent(s): 3f5c25d

1. Original: "the dataset might contains offensive content"
Correction: "the dataset might contain offensive content"
Reason: The verb "contain" should be in its base form following "might."

2. Original: "collected form internet"
Correction: "collected from the internet"
Reason: The word "form" should be "from," and "the" should precede "internet."

3. Original: "and went through classic data processing algorithms and re-formatting practices"
Correction: "and went through classic data processing algorithms and reformatting practices"
Reason: "Re-formatting" should be written as "reformatting" without the hyphen.

4. Original: "the self-supervised causal language modedling objective"
Correction: "the self-supervised causal language modelling objective"
Reason: "modedling" is wrong spelling. It can be written as "modelling" or "modeling" instead.

The rest of the text appears free of spelling and grammatical errors, so these four changes are the key fixes.

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -37,7 +37,7 @@ To quote the first two paragraphs of the [official paper](https://arxiv.org/abs/
 ## Model description
 
 OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective.
-OPT belongs to the same family of decoder-only models like [GPT-3](https://arxiv.org/abs/2005.14165). As such, it was pretrained using the self-supervised causal language modedling objective.
+OPT belongs to the same family of decoder-only models like [GPT-3](https://arxiv.org/abs/2005.14165). As such, it was pretrained using the self-supervised causal language modelling objective.
 
 For evaluation, OPT follows [GPT-3](https://arxiv.org/abs/2005.14165) by using their prompts and overall experimental setup. For more details, please read
 the [official paper](https://arxiv.org/abs/2205.01068).
@@ -128,14 +128,14 @@ dataset that was used in RoBERTa (Liu et al., 2019b)
 The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally
 to each dataset’s size in the pretraining corpus.
 
-The dataset might contains offensive content as parts of the dataset are a subset of
+The dataset might contain offensive content as parts of the dataset are a subset of
 public Common Crawl data, along with a subset of public Reddit data, which could contain sentences
 that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety.
 
 ### Collection process
 
-The dataset was collected form internet, and went through classic data processing algorithms and
-re-formatting practices, including removing repetitive/non-informative text like *Chapter One* or
+The dataset was collected from the internet, and went through classic data processing algorithms and
+reformatting practices, including removing repetitive/non-informative text like *Chapter One* or
 *This ebook by Project Gutenberg.*
 
 ## Training procedure
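For readers of the corrected model description above, here is a minimal sketch of what the causal language modelling objective looks like in use with the `transformers` library. The `facebook/opt-350m` checkpoint name is an assumption for illustration only; this commit does not say which OPT checkpoint the README belongs to.

```python
# Minimal sketch (not part of this commit): using an OPT checkpoint as a causal language model.
# "facebook/opt-350m" is an assumed checkpoint name; substitute the one this README describes.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Causal LM objective: each token is predicted from the tokens to its left,
# so generation simply continues the prompt left to right.
inputs = tokenizer("A curious human asks an AI assistant:", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```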