jz2023 committed · Commit 923ce8d · verified · 1 Parent(s): ee1fdb4

Update README.md

Files changed (1): README.md (+17, -25)

README.md CHANGED
 
---
license: apache-2.0
---

# Model Details

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings are not at the output of the network](https://ai.meta.com/research/publications/perception-encoder-the-best-visual-embeddings-are-not-at-the-output-of-the-network/)".

**Model Developer**: Meta

| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution | Patch Size | Text Context Length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **B** | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 | 224 | 16 | 32 |
| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 224 | 16 | 32 |
| **L** | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
| **G** | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 448 | 14 | 72 |
| | Text | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 448 | 14 | 72 |
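
To make the table concrete, here is a minimal sketch of how the different scales could be loaded through the same factory call used in the usage example further down. Only `PEv1-L14-336` appears in this card; the `B` and `G` model names and checkpoint paths are assumptions inferred from the table and from the PE-Core-G14-448 repo name, so check the GitHub repo for the exact identifiers.

```python
# A minimal sketch (not from this card): loading encoders at the scales listed above
# via the factory used in the usage example below. The B and G names/paths are
# assumptions; only 'PEv1-L14-336' is confirmed by this README.
from occhi.vision_encoder.factory import create_model_and_transforms

configs = {
    "B": ("PEv1-B16-224", "PATH_TO_PE_Core_B16_224"),  # assumed: 224 px, patch 16
    "L": ("PEv1-L14-336", "PATH_TO_PE_Core_L14_336"),  # name used in the example below
    "G": ("PEv1-G14-448", "PATH_TO_PE_Core_G14_448"),  # assumed: 448 px, patch 14
}

model_name, pretrained = configs["L"]
model, _, preprocess = create_model_and_transforms(model_name, pretrained=pretrained)

# Cross-check against the Params column of the table (vision + text towers combined).
n_params = sum(p.numel() for p in model.parameters())
print(f"{model_name}: {n_params / 1e9:.2f}B parameters")
```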
 
# How to use

## PE codebase

We provide the pretraining code at https://github.com/facebookresearch/perception_models.

```shell
git clone https://github.com/facebookresearch/perception_models.git
cd perception_models
conda create --name occhi-env python=3.12
conda activate occhi-env
# Install PyTorch for your platform, then install the package:
pip install -e .
```
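
After the editable install, a quick sanity check can be useful; this short sketch only assumes that the package is importable as `occhi` (as in the example below) and that a CUDA GPU is available for the `.cuda()` calls.

```python
# Optional sanity check after `pip install -e .`: the usage example below assumes
# the `occhi` package is importable and that a CUDA device is present.
import torch
import occhi

print("occhi installed at:", occhi.__file__)
print("CUDA available:", torch.cuda.is_available())
```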
The following example loads a pretrained checkpoint and computes image-text similarity scores:

```python
import torch
from occhi.vision_encoder.factory import create_model_and_transforms, get_tokenizer
from PIL import Image

model_name = 'PEv1-L14-336'
pretrained = 'PATH_TO_PE_Core_L14_336'

model, _, preprocess = create_model_and_transforms(
    model_name,
    pretrained=pretrained,
)
model = model.cuda()
tokenizer = get_tokenizer(model_name)

image = preprocess(Image.open("docs/cat.png")).unsqueeze(0).cuda()
text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0.0, 0.0, 1.0]]
```
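
If you want a predicted label rather than the raw probabilities, a short follow-up using only standard PyTorch operations:

```python
# Continuing from the example above: pick the text with the highest probability.
labels = ["a diagram", "a dog", "a cat"]
pred = text_probs.argmax(dim=-1).item()
print("Predicted label:", labels[pred])  # expected: "a cat" for docs/cat.png
```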
 
You can find more details in the GitHub repo.

# Evaluation

We evaluate the pretrained PE models on zero-shot image and video benchmarks; the results are summarized below.
 
## Zero-Shot Image Results

<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_zeroshot_image.png" style="width: 100%; margin: 0;" />

## Zero-Shot Video Results

<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_zeroshot_video.png" style="width: 90%; margin: 0;" />
 
 
# Citation

If you find our code useful for your research, please consider citing:

@article{PE,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={},
  journal={arXiv:xxx.xxxxx},
  year={2025}
}