Update README.md
README.md
CHANGED
@@ -17,9 +17,9 @@ are not at the output of the network](https://ai.meta.com/research/publications/
 
 
 ## Perception Encoder: Language
-PE lang takes the strong language performance from the intermediate layers of PE core and aligns …
+PE lang takes the strong language performance from the intermediate layers of PE core and further aligns it for language modeling following [PLM](https://huggingface.co/papers/2504.13180). We specifically tuned PE lang to be versatile for any multimodal language modeling use case, including different language model decoders (e.g., Llama / Qwen) and different eval settings (e.g., native res / tiling). PE lang performs particularly well on OCR and document tasks.
 
-We release two PE Lang checkpoints. Here are their results …
+We release two PE Lang checkpoints, L14-448 and G14-448. Here are their results in our benchmark setting with a frozen encoder and a 2.6M SFT data mix, using 448px _only_ (i.e., _with no tiling_) and Llama 3.1 8B as the decoder:
 
 | Encoder | Checkpoint | Doc VQA (val) | InfoQA (val) | TextVQA | MVBench | PerceptionTest (val) | EgoSchema (val) |
 |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
@@ -28,13 +28,13 @@ We release two PE Lang checkpoints. Here are their results benchmarked in the fr
 
 
 
-Here is a sample of the performance obtainable by using PE …
+Here is a sample of the performance obtainable by using PE Core G aligned further with [PLM-8B](https://huggingface.co/facebook/Perception-LM-8B) (*stage 3*), using 36+1 image tiles / 32 video frames, with Llama 3.1 8B as the decoder:
 
 | Model | Encoder | Doc VQA (test) | InfoQA (test) | TextVQA | MVBench | PerceptionTest (test) | EgoSchema (test) |
 |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| PLM-8B | [PE- …
+| PLM-8B | [PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448)* | 94.6 | 78.8 | 86.5 | 77.1 | 82.7 | 68.8 |
 
-\* …
+\* The PE-Core-G14-448 checkpoint was further trained using tiling. We will release the tiling-aligned checkpoint soon.
 
 See the paper for full performance evaluations and fair comparisons to other models.
 
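The updated benchmark text pairs a *frozen* PE Lang encoder with a Llama 3.1 8B decoder. As a rough orientation only, here is a minimal sketch of fetching one of the released checkpoints from the Hugging Face Hub, inspecting its weights, and freezing them; the repo id `facebook/PE-Lang-L14-448` and the checkpoint file handling are our assumptions, not part of this commit — the official loading API lives in Meta's perception_models repo and on the model card.

```python
# Sketch only: fetch a PE Lang checkpoint, inspect it, and freeze it, mirroring
# the frozen-encoder benchmark setting described above. The repo id and file
# naming are assumptions; see the model card for the supported loading API.
import os

import torch
from huggingface_hub import snapshot_download

local_dir = snapshot_download("facebook/PE-Lang-L14-448")  # assumed repo id
print("files:", os.listdir(local_dir))

# Load the first PyTorch weight file we find and list a few parameter names.
weights = [f for f in os.listdir(local_dir) if f.endswith((".pt", ".pth", ".bin"))]
if weights:
    state = torch.load(os.path.join(local_dir, weights[0]), map_location="cpu")
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    for name in list(state)[:5]:
        print(name, tuple(state[name].shape))

# Freezing the encoder module that wraps these weights is then just:
# for p in encoder.parameters():
#     p.requires_grad = False
```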
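The PLM-8B row is evaluated with "36+1 image tiles" at 448px: the input image is cut into 448x448 crops (6 x 6 = 36 tiles) plus one 448x448 thumbnail of the whole image as a global view. The exact tiling logic is not part of this commit; the sketch below is an illustrative fixed-grid version of that scheme, with all names our own (the real PLM preprocessing may pick the grid adaptively per aspect ratio).

```python
# Illustrative sketch of a "36+1" tiling scheme: a fixed 6x6 grid of 448px
# tiles plus one 448px thumbnail as a global view. Not the official code.
from PIL import Image

TILE = 448
GRID = 6  # 6 * 6 = 36 local tiles


def tile_image(img: Image.Image, grid: int = GRID, tile: int = TILE) -> list[Image.Image]:
    # Resize so the image divides evenly into a grid x grid mosaic of tiles.
    img = img.convert("RGB").resize((grid * tile, grid * tile))
    tiles = [
        img.crop((col * tile, row * tile, (col + 1) * tile, (row + 1) * tile))
        for row in range(grid)
        for col in range(grid)
    ]
    thumbnail = img.resize((tile, tile))  # the "+1" global view
    return tiles + [thumbnail]


views = tile_image(Image.new("RGB", (1920, 1080)))
print(len(views))  # 37 views = 36 tiles + 1 thumbnail
```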