Update README.md
README.md
CHANGED
@@ -17,9 +17,9 @@ are not at the output of the network](https://ai.meta.com/research/publications/
 
 
 ## Perception Encoder: Language
-PE lang takes the strong language performance from the intermediate layers of PE core and aligns …
+PE lang takes the strong language performance from the intermediate layers of PE core and further aligns it for language modeling following [PLM](https://huggingface.co/papers/2504.13180). We specifically tuned PE lang to be versatile for any multimodal language modeling use case, including different language model decoders (e.g., Llama / Qwen) and different eval settings (e.g., native res / tiling). PE lang performs particularly well on OCR and document tasks.
 
-We release two PE Lang checkpoints. Here are their results …
+We release two PE Lang checkpoints, L14-448 and G14-448. Here are their results in our benchmark setting with a frozen encoder and a 2.6M SFT data mix, using 448px _only_ (i.e., _with no tiling_) and Llama 3.1 8B as the decoder:
 
 | Encoder | Checkpoint | Doc VQA (val) | InfoQA (val) | TextVQA | MVBench | PerceptionTest (val) | EgoSchema (val) |
 |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
@@ -28,13 +28,13 @@ We release two PE Lang checkpoints. Here are their results benchmarked in the fr
 
 
 
-Here is a sample of the performance obtainable by using PE …
+Here is a sample of the performance obtainable by using PE Core G aligned further with [PLM-8B](https://huggingface.co/facebook/Perception-LM-8B) (*stage 3*), using 36+1 image tiles / 32 video frames, with Llama 3.1 8B as the decoder:
 
 | Model | Encoder | Doc VQA (test) | InfoQA (test) | TextVQA | MVBench | PerceptionTest (test) | EgoSchema (test) |
 |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| PLM-8B | [PE- …
+| PLM-8B | [PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448)* | 94.6 | 78.8 | 86.5 | 77.1 | 82.7 | 68.8 |
 
-\* …
+\* The PE-Core-G14-448 checkpoint was further trained using tiling. We will release the tiling-aligned checkpoint soon.
 
 See the paper for full performance evaluations and fair comparisons to other models.
 
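The updated benchmark text pairs a *frozen* PE Lang encoder with a Llama 3.1 8B decoder. As a rough orientation only, here is a minimal sketch of fetching one of the released checkpoints from the Hugging Face Hub, inspecting its weights, and freezing them; the repo id `facebook/PE-Lang-L14-448` and the checkpoint file handling are our assumptions, not part of this commit — the official loading API lives in Meta's perception_models repo and on the model card.

```python
# Sketch only: fetch a PE Lang checkpoint, inspect it, and freeze it, mirroring
# the frozen-encoder benchmark setting described above. The repo id and file
# naming are assumptions; see the model card for the supported loading API.
import os

import torch
from huggingface_hub import snapshot_download

local_dir = snapshot_download("facebook/PE-Lang-L14-448")  # assumed repo id
print("files:", os.listdir(local_dir))

# Load the first PyTorch weight file we find and list a few parameter names.
weights = [f for f in os.listdir(local_dir) if f.endswith((".pt", ".pth", ".bin"))]
if weights:
    state = torch.load(os.path.join(local_dir, weights[0]), map_location="cpu")
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    for name in list(state)[:5]:
        print(name, tuple(state[name].shape))

# Freezing the encoder module that wraps these weights is then just:
# for p in encoder.parameters():
#     p.requires_grad = False
```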
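The PLM-8B row is evaluated with "36+1 image tiles" at 448px: the input image is cut into 448x448 crops (6 x 6 = 36 tiles) plus one 448x448 thumbnail of the whole image as a global view. The exact tiling logic is not part of this commit; the sketch below is an illustrative fixed-grid version of that scheme, with all names our own (the real PLM preprocessing may pick the grid adaptively per aspect ratio).

```python
# Illustrative sketch of a "36+1" tiling scheme: a fixed 6x6 grid of 448px
# tiles plus one 448px thumbnail as a global view. Not the official code.
from PIL import Image

TILE = 448
GRID = 6  # 6 * 6 = 36 local tiles


def tile_image(img: Image.Image, grid: int = GRID, tile: int = TILE) -> list[Image.Image]:
    # Resize so the image divides evenly into a grid x grid mosaic of tiles.
    img = img.convert("RGB").resize((grid * tile, grid * tile))
    tiles = [
        img.crop((col * tile, row * tile, (col + 1) * tile, (row + 1) * tile))
        for row in range(grid)
        for col in range(grid)
    ]
    thumbnail = img.resize((tile, tile))  # the "+1" global view
    return tiles + [thumbnail]


views = tile_image(Image.new("RGB", (1920, 1080)))
print(len(views))  # 37 views = 36 tiles + 1 thumbnail
```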