Update README.md
README.md CHANGED
@@ -32,11 +32,12 @@ set a seed for reproducibility:
 >>> # the previous text.
 >>> generator = pipeline('text-generation', model='olm/olm-gpt2-dec-2022', bad_words_ids=[[0,2]])
 >>> set_seed(42)
->>> # This example also illustrates that sometimes our model generates
->>> # bloggy/spammy/webb-y things, even though it gets higher evaluation results
->>> # than the original GPT-2 accross a variety of benchmarks. See the first output.
 >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
-
+[{'generated_text': "Hello, I'm a language model, but you want to know if I have a language in that language. Is this possible? Please explain"},
+ {'generated_text': "Hello, I'm a language model, and here's some useful news for you all: The C++ API is becoming more and more popular for"},
+ {'generated_text': "Hello, I'm a language model, I'm not trying to learn or understand a new tool, my job is to be as happy as"},
+ {'generated_text': "Hello, I'm a language model, a language analyst, and a language system designer. I'm just a curious guy.\n"},
+ {'generated_text': "Hello, I'm a language model, I'm not doing anything that needs to be done for the current time (or previous)."}]
 ```
 
 Here is how to use this model to get the features of a given text in PyTorch:
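For reference, the generation snippet touched by this hunk is only partially visible (its imports sit above the hunk). A minimal self-contained sketch, assuming the standard `transformers` `pipeline` and `set_seed` imports used in GPT-2-style model cards, would be:

```python
from transformers import pipeline, set_seed

# Same call as in the README snippet above; bad_words_ids=[[0, 2]] blocks token ids 0 and 2.
generator = pipeline('text-generation', model='olm/olm-gpt2-dec-2022', bad_words_ids=[[0, 2]])
set_seed(42)

# Returns five dicts, each with a 'generated_text' key, like the added output lines in this diff.
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
```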
@@ -52,7 +53,7 @@ output = model(**encoded_input)
 
 ## Dataset
 
-The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](
+The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547) plus this [December 2022 cleaned Wikipedia dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220).\
 The tokenized version of these concatenated datasets is [here](https://huggingface.co/datasets/olm/olm-december-2022-tokenized-1024).\
 The datasets were created with this [repo](https://github.com/huggingface/olm-datasets).
 
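The feature-extraction code referenced by this hunk's context line (`output = model(**encoded_input)`) is not part of the changed lines. A typical sketch for a GPT-2-style checkpoint, assuming the generic `Auto*` classes (the card may use the GPT-2 classes directly), is:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# The Auto* classes are an assumption here; the card may use GPT2Tokenizer/GPT2Model instead.
tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-dec-2022')
model = AutoModel.from_pretrained('olm/olm-gpt2-dec-2022')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input)  # output.last_hidden_state holds the per-token features
```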
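As a usage note that is not part of this diff: the dataset repositories linked in the Dataset section are regular Hub datasets, so they can presumably be loaded with the `datasets` library; the `train` split name and streaming mode below are assumptions:

```python
from datasets import load_dataset

# Streaming avoids downloading the full corpus up front; the "train" split name is an assumption.
tokenized = load_dataset("olm/olm-december-2022-tokenized-1024", split="train", streaming=True)
print(next(iter(tokenized)))
```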