Update README.md
README.md CHANGED
@@ -32,11 +32,12 @@ set a seed for reproducibility:
 >>> # the previous text.
 >>> generator = pipeline('text-generation', model='olm/olm-gpt2-dec-2022', bad_words_ids=[[0,2]])
 >>> set_seed(42)
->>> # This example also illustrates that sometimes our model generates
->>> # bloggy/spammy/webb-y things, even though it gets higher evaluation results
->>> # than the original GPT-2 accross a variety of benchmarks. See the first output.
 >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
-
+[{'generated_text': "Hello, I'm a language model, but you want to know if I have a language in that language. Is this possible? Please explain"},
+ {'generated_text': "Hello, I'm a language model, and here's some useful news for you all: The C++ API is becoming more and more popular for"},
+ {'generated_text': "Hello, I'm a language model, I'm not trying to learn or understand a new tool, my job is to be as happy as"},
+ {'generated_text': "Hello, I'm a language model, a language analyst, and a language system designer. I'm just a curious guy.\n"},
+ {'generated_text': "Hello, I'm a language model, I'm not doing anything that needs to be done for the current time (or previous)."}]
 ```
 
 Here is how to use this model to get the features of a given text in PyTorch:
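For reference, the generation snippet touched by this hunk is only partially visible (its imports sit above the hunk). A minimal self-contained sketch, assuming the standard `transformers` `pipeline` and `set_seed` imports used in GPT-2-style model cards, would be:

```python
from transformers import pipeline, set_seed

# Same call as in the README snippet above; bad_words_ids=[[0, 2]] blocks token ids 0 and 2.
generator = pipeline('text-generation', model='olm/olm-gpt2-dec-2022', bad_words_ids=[[0, 2]])
set_seed(42)

# Returns five dicts, each with a 'generated_text' key, like the added output lines in this diff.
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
```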
@@ -52,7 +53,7 @@ output = model(**encoded_input)
 
 ## Dataset
 
-The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](
+The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547) plus this [December 2022 cleaned Wikipedia dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220).\
 The tokenized version of these concatenated datasets is [here](https://huggingface.co/datasets/olm/olm-december-2022-tokenized-1024).\
 The datasets were created with this [repo](https://github.com/huggingface/olm-datasets).
 
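The feature-extraction code referenced by this hunk's context line (`output = model(**encoded_input)`) is not part of the changed lines. A typical sketch for a GPT-2-style checkpoint, assuming the generic `Auto*` classes (the card may use the GPT-2 classes directly), is:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# The Auto* classes are an assumption here; the card may use GPT2Tokenizer/GPT2Model instead.
tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-dec-2022')
model = AutoModel.from_pretrained('olm/olm-gpt2-dec-2022')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input)  # output.last_hidden_state holds the per-token features
```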
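As a usage note that is not part of this diff: the dataset repositories linked in the Dataset section are regular Hub datasets, so they can presumably be loaded with the `datasets` library; the `train` split name and streaming mode below are assumptions:

```python
from datasets import load_dataset

# Streaming avoids downloading the full corpus up front; the "train" split name is an assumption.
tokenized = load_dataset("olm/olm-december-2022-tokenized-1024", split="train", streaming=True)
print(next(iter(tokenized)))
```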