reach-vb (HF Staff) committed on
Commit 8e093cf · verified · 1 Parent(s): fbe8feb

Update README.md

Files changed (1)
  1. README.md +55 -0
README.md CHANGED
@@ -51,6 +51,61 @@ python predict.py --model-path /path/to/checkpoint-dir \
 --prompt "Describe the image."
 ```
 
+### Run inference with Transformers (Remote Code)
+To run inference with Transformers, load the model with `trust_remote_code=True` and use the following snippet:
+
+```python
+import torch
+from PIL import Image
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MID = "apple/FastVLM-0.5B"
+IMAGE_TOKEN_INDEX = -200  # what the model code looks for
+
+# Load
+tok = AutoTokenizer.from_pretrained(MID, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    MID,
+    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+    device_map="auto",
+    trust_remote_code=True,
+)
+
+# Build chat -> render to string (not tokens) so we can place <image> exactly
+messages = [
+    {"role": "user", "content": "<image>\nDescribe this image in detail."}
+]
+rendered = tok.apply_chat_template(
+    messages, add_generation_prompt=True, tokenize=False
+)
+
+pre, post = rendered.split("<image>", 1)
+
+# Tokenize the text *around* the image token (no extra specials!)
+pre_ids = tok(pre, return_tensors="pt", add_special_tokens=False).input_ids
+post_ids = tok(post, return_tensors="pt", add_special_tokens=False).input_ids
+
+# Splice in the IMAGE token id (-200) at the placeholder position
+img_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
+input_ids = torch.cat([pre_ids, img_tok, post_ids], dim=1).to(model.device)
+attention_mask = torch.ones_like(input_ids, device=model.device)
+
+# Preprocess image via the model's own processor
+img = Image.open("test-2.jpg").convert("RGB")
+px = model.get_vision_tower().image_processor(images=img, return_tensors="pt")["pixel_values"]
+px = px.to(model.device, dtype=model.dtype)
+
+# Generate
+with torch.no_grad():
+    out = model.generate(
+        inputs=input_ids,
+        attention_mask=attention_mask,
+        images=px,
+        max_new_tokens=128,
+    )
+
+print(tok.decode(out[0], skip_special_tokens=True))
+```
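
The placeholder-splicing step in the snippet above can be tried without downloading the model. The sketch below is a minimal, library-free illustration and is not part of the commit: `toy_tokenize` and `splice_image_token` are hypothetical helpers, the whitespace "tokenizer" is only a stand-in for the real one, and plain lists replace tensors. Only the `-200` sentinel and the split-then-concatenate idea come from the snippet.

```python
IMAGE_TOKEN_INDEX = -200  # sentinel id taken from the snippet above


def toy_tokenize(text, vocab):
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def splice_image_token(rendered, vocab, placeholder="<image>"):
    """Tokenize the text around the first placeholder and put the sentinel between the halves."""
    if placeholder not in rendered:
        raise ValueError(f"prompt does not contain {placeholder!r}")
    pre, post = rendered.split(placeholder, 1)
    return toy_tokenize(pre, vocab) + [IMAGE_TOKEN_INDEX] + toy_tokenize(post, vocab)


vocab = {}
ids = splice_image_token("user: <image>\nDescribe this image in detail.", vocab)
print(ids)  # exactly one -200 entry, at the position where <image> appeared
```

Checking for the placeholder first gives a clearer error than the unpacking failure that `pre, post = rendered.split("<image>", 1)` would raise if the rendered prompt lacked `<image>`.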
 
 ## Citation
 If you found this model useful, please cite the following paper: