Image Feature Extraction
PerceptionEncoder
janghyuncho7 committed · Commit 59f4b4f · verified · 1 parent: fb60305

Update README.md

Files changed (1):
  1. README.md +5 -5
README.md CHANGED
@@ -17,9 +17,9 @@ are not at the output of the network](https://ai.meta.com/research/publications/
 
 
 ## Perception Encoder: Language
-PE lang takes the strong language performance from the intermediate layers of PE core and aligns it to the end for use with large language models. We specifically tuned PE lang to be versatile for any multimodal langugage modeling use case, including using different language model decoders (e.g., Llama / Qwen) and using different eval settings (e.g., native res / tiling). PE lang performs particularly well on OCR and document tasks.
+PE lang takes the strong language performance from the intermediate layers of PE core and further aligns it for language modeling following [PLM](https://huggingface.co/papers/2504.13180). We specifically tuned PE lang to be versatile for any multimodal language modeling use case, including using different language model decoders (e.g., Llama / Qwen) and different eval settings (e.g., native res / tiling). PE lang performs particularly well on OCR and document tasks.
 
-We release two PE Lang checkpoints. Here are their results benchmarked in the frozen encoder [PLM-8B](../plm/README.md) benchmark SFT using 448px _only_ (i.e., _with no tiling_) and Llama 3.1 8B as the decoder:
+We release two PE Lang checkpoints, L14-448 and G14-448. Here are their results in our benchmark setting with a frozen encoder and a 2.6M SFT data mix, using 448px _only_ (i.e., _with no tiling_) and Llama 3.1 8B as the decoder:
 
 | Encoder | Checkpoint | Doc VQA (val) | InfoQA (val) | TextVQA | MVBench | PerceptionTest (val) | EgoSchema (val) |
 |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
@@ -28,13 +28,13 @@ We release two PE Lang checkpoints. Here are their results benchmarked in the fr
 
 
-Here is a sample of the performance obtainable by using PE lang G tuned further with [PLM-8B](../plm/README.md) using 36+1 image tiles / 32 video frames and Llama 3.1 8B as the decoder:
+Here is a sample of the performance obtainable by using PE Core G aligned further with [PLM-8B](https://huggingface.co/facebook/Perception-LM-8B) (*stage 3*) using 36+1 image tiles / 32 video frames with Llama 3.1 8B as the decoder:
 
 | Model | Encoder | Doc VQA (test) | InfoQA (test) | TextVQA | MVBench | PerceptionTest (test) | EgoSchema (test) |
 |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| PLM-8B | [PE-Lang-G14-448](https://huggingface.co/facebook/PE-Core-G14-448)* | 94.6 | 78.8 | 86.5 | 77.1 | 82.7 | 68.8 |
+| PLM-8B | [PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448)* | 94.6 | 78.8 | 86.5 | 77.1 | 82.7 | 68.8 |
 
-\* This checkpoint was further aligned using tiling. We will release the tiling aligned checkpoint soon.
+\* The PE-Core-G14-448 checkpoint was further trained using tiling. We will release the tiling aligned checkpoint soon.
 
 See the paper for full performance evaluations and fair comparisons to other models.
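As a side note on the "36+1 image tiles" eval setting mentioned in the diff: this style of tiling splits a high-resolution image into fixed-size local crops plus one global view. The sketch below is a minimal illustration only — the 6×6 grid of 448px tiles with a single downscaled thumbnail as the "+1" is an assumption about the layout, not something this commit specifies; the PLM codebase defines the actual scheme.

```python
# Hypothetical sketch of a 36+1 tiling layout: a 6x6 grid of 448px local
# tiles plus one global thumbnail view. Grid shape is an assumption for
# illustration, not taken from this README.
TILE = 448  # tile side length in pixels (matches the 448px eval setting)
GRID = 6    # 6 x 6 = 36 local tiles

def tile_boxes():
    """Return (left, top, right, bottom) boxes over an image assumed
    pre-resized to GRID*TILE on each side: 36 local tiles in row-major
    order, then one full-image box standing in for the global view."""
    boxes = [(c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
             for r in range(GRID) for c in range(GRID)]
    boxes.append((0, 0, GRID * TILE, GRID * TILE))  # the "+1" global view
    return boxes

boxes = tile_boxes()
print(len(boxes))  # 37 tiles total: 36 local + 1 global
```

Each local box would be cropped and fed to the encoder at its native 448px, while the global box is downscaled to 448px, giving the model both detail and full-image context.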