mebubo committed
Commit bcf7cd8 · 1 Parent(s): e135ffb
Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -205,7 +205,7 @@ The main limitation of using decoder-only models like GPT or Llama for this task
 
 ## A digression on encoder vs decoder, unidirectional vs bidirectional attention, and whether we could use bidirectional attention for text generation
 
-It is a common misconception that autoregressive text generation _requires_ unidirectional attention, whereas in reality it is only a matter of efficiency (at both training and inference time). It is possible to use models with bidirectional attention autoregressively, and arguably it would give better quality than unidirectional attention (the bidirectional flow of information between tokens in the current prefix can only be beneficial, e.g. if we are generating the next token in "the quick brown fox jumped over", there is no benefit in not letting "fox" see "jumped"). However, bidirectional attention would mean that we cannot learn from every token in a text by passing only one instance of it through the model; we would have to pass every token individually. And at inference time, it would rule out techniques such as KV caches, which are used ubiquitously in modern LLM deployments, because all attention would need to be recomputed for every prefix.
+It is a common misconception that autoregressive text generation _requires_ unidirectional attention, whereas in reality it is only a matter of efficiency (at both training and inference time). It is possible to train models with bidirectional attention on next token prediction and to use them autoregressively at inference, and arguably it would give better quality than unidirectional attention (the bidirectional flow of information between tokens in the current prefix can only be beneficial, e.g. if we are generating the next token in "the quick brown fox jumped over", there is no benefit in not letting "fox" see "jumped"). However, bidirectional attention would mean that we cannot learn from every token in a text by passing only one instance of it through the model; we would have to pass every prefix individually. And at inference time, it would rule out techniques such as KV caches, which are used ubiquitously in modern LLM deployments, because all attention would need to be recomputed for every prefix.
 
 ## Part 2
 
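To make the efficiency argument in the changed paragraph concrete, below is a minimal sketch, assuming PyTorch; the tiny single-layer attention model and every name in it are illustrative, not code from this repository. With a causal mask, one forward pass over a sequence yields next-token logits at every position, so every token contributes a training signal; with bidirectional attention, the representation at position i already sees the token at position i + 1, so obtaining the same signal requires a separate forward pass per prefix.

```python
# Illustrative sketch (assumed PyTorch), not code from this repository:
# contrasts the training cost of causal vs bidirectional attention.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d = 100, 32
emb = torch.nn.Embedding(vocab, d)          # toy embedding table
qkv = torch.nn.Linear(d, 3 * d)             # fused query/key/value projection
out = torch.nn.Linear(d, vocab)             # projection back to vocabulary logits

def attend(x, causal: bool):
    """One attention layer; causal=True masks out future positions."""
    q, k, v = qkv(x).chunk(3, dim=-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    if causal:
        t = x.size(-2)
        future = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    return out(torch.softmax(scores, dim=-1) @ v)

tokens = torch.randint(0, vocab, (1, 8))    # one sequence of 8 token ids
x = emb(tokens)

# Unidirectional: a single forward pass gives next-token logits at all 8
# positions, so 7 (position, next-token) training pairs come from one pass.
logits = attend(x, causal=True)                              # (1, 8, vocab)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                       tokens[:, 1:].reshape(-1))

# Bidirectional: position i attends to positions > i, so its logits already
# "see" the answer; to get the same 7 training pairs we must run one forward
# pass per prefix.
prefix_logits = [attend(emb(tokens[:, :i]), causal=False)[:, -1]
                 for i in range(1, tokens.size(1))]          # 7 separate passes

print(loss.item(), len(prefix_logits))
```

The same observation explains the KV-cache point in the paragraph: with a causal mask, appending a token never changes the keys and values already computed for the prefix, which is exactly the property KV caching relies on; with bidirectional attention, every prefix extension would force a full recomputation.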