qwen2-vl-2b: Image features and image tokens do not match
How do I solve this problem?
I am also getting this, from python3.12/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1688, in forward
[rank2]: raise ValueError(
[rank2]: ValueError: Video features and video tokens do not match: tokens: 0, features 936
and my user data contains videos.
unsloth_compiled_cache/unsloth_compiled_module_qwen2_vl.py:930, in Qwen2VLForConditionalGeneration_forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, pixel_values, pixel_values_videos, image_grid_thw, video_grid_thw, rope_deltas, cache_position, **loss_kwargs)
928 n_image_features = image_embeds.shape[0]
929 if n_image_tokens != n_image_features:
--> 930 raise ValueError(
931 f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
932 )
933 image_mask = (
934 (input_ids == self.config.image_token_id)
935 .unsqueeze(-1)
936 .expand_as(inputs_embeds)
937 .to(inputs_embeds.device)
938 )
939 image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
ValueError: Image features and image tokens do not match: tokens: 568, features 600
I solved this problem by significantly increasing cutoff_len; I set cutoff_len to 100000.
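The mechanism, as I understand it: if the tokenized prompt is longer than cutoff_len, truncation can cut off the `<|image_pad|>` tokens, so the model counts 0 (or too few) image tokens while the vision tower still produces features. A minimal sketch of that mismatch (the token id value and lengths below are illustrative, not taken from a real tokenizer run):

```python
# Simulate the truncation that produces "tokens: 0, features N".
IMAGE_PAD_ID = 151655  # qwen2-vl's <|image_pad|> id; value here is illustrative

def count_image_tokens(input_ids, cutoff_len):
    """Count <|image_pad|> tokens that survive truncation to cutoff_len."""
    truncated = input_ids[:cutoff_len]
    return sum(1 for t in truncated if t == IMAGE_PAD_ID)

# A 5000-token text prefix followed by 600 image-pad tokens.
input_ids = [1] * 5000 + [IMAGE_PAD_ID] * 600
n_features = 600  # what the vision encoder produced

print(count_image_tokens(input_ids, 4096))    # 0   -> mismatch, ValueError
print(count_image_tokens(input_ids, 100000))  # 600 -> matches n_features
```

With a large enough cutoff_len, the placeholders survive and the counts agree again.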
Because the base model's chat_template does not support Hugging Face's instruct-version example, here is my version of the chat_template:
{%- for message in messages -%}
{%- if message['content'] is sequence and message['content'][0]['type'] is defined -%}
{%- for content in message['content'] -%}
{%- if content['type'] == 'image' -%}
{{- '<|vision_start|><|image_pad|><|vision_end|>' -}}
{%- elif content['type'] == 'video' -%}
{{- '<|vision_start|><|video_pad|><|vision_end|>' -}}
{%- elif content['type'] == 'text' -%}
{{- content['text'] -}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{%- endfor -%}
HOW TO USE:
# Load processor
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    model_name,
    padding_side="left",
    use_fast=False,
)
processor.chat_template = '''{%- for message in messages -%}
{%- if message['content'] is sequence and message['content'][0]['type'] is defined -%}
{%- for content in message['content'] -%}
{%- if content['type'] == 'image' -%}
{{- '<|vision_start|><|image_pad|><|vision_end|>' -}}
{%- elif content['type'] == 'video' -%}
{{- '<|vision_start|><|video_pad|><|vision_end|>' -}}
{%- elif content['type'] == 'text' -%}
{{- content['text'] -}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{%- endfor -%}'''
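To sanity-check the template without loading the model, you can render it directly with jinja2 (which is what transformers uses under the hood); the example messages here are made up:

```python
from jinja2 import Template

# Same template as above, as one string.
chat_template = (
    "{%- for message in messages -%}"
    "{%- if message['content'] is sequence and message['content'][0]['type'] is defined -%}"
    "{%- for content in message['content'] -%}"
    "{%- if content['type'] == 'image' -%}"
    "{{- '<|vision_start|><|image_pad|><|vision_end|>' -}}"
    "{%- elif content['type'] == 'video' -%}"
    "{{- '<|vision_start|><|video_pad|><|vision_end|>' -}}"
    "{%- elif content['type'] == 'text' -%}"
    "{{- content['text'] -}}"
    "{%- endif -%}"
    "{%- endfor -%}"
    "{%- endif -%}"
    "{%- endfor -%}"
)

messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image."}]},
]

rendered = Template(chat_template).render(messages=messages)
print(rendered)
# <|vision_start|><|image_pad|><|vision_end|>Describe this image.
```

If the rendered prompt is missing the `<|image_pad|>` placeholder for some sample, that sample will trigger the tokens/features mismatch.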
I think you can set the token limit higher; the placeholder may be getting truncated. For example, if the token limit for content is 4096 but your text is 5024 tokens long (with the image placeholder falling after position 4096), then it reports image tokens: 0. There is another problem, though: when I use the system + user + assistant format, it filters out all my data. I don't know why, but if you just use user + assistant, it works.
I deleted one bad sample, and then the code ran successfully. I don't know why; I checked the bad sample and there is no difference between it and the other good samples.
Just expand the token length.
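If deleting bad samples by hand isn't practical, one way to hunt for them is to check, per sample, whether one placeholder per image survives the length limit. The helper name, the character-based limit, and the sample fields below are all assumptions for illustration (a real check would count tokens, not characters):

```python
# Flag samples whose image placeholders would be lost to truncation.
IMAGE_PLACEHOLDER = "<|vision_start|><|image_pad|><|vision_end|>"

def find_bad_samples(samples, char_limit=4096):
    """Return indices of samples whose prompt, once cut to char_limit
    characters, no longer contains one placeholder per image.
    (Illustrative: a real check would truncate in token space.)"""
    bad = []
    for i, sample in enumerate(samples):
        truncated = sample["prompt"][:char_limit]
        if truncated.count(IMAGE_PLACEHOLDER) != sample["n_images"]:
            bad.append(i)
    return bad

samples = [
    {"prompt": IMAGE_PLACEHOLDER + "short caption", "n_images": 1},
    {"prompt": "x" * 5000 + IMAGE_PLACEHOLDER, "n_images": 1},  # placeholder cut off
]
print(find_bad_samples(samples))  # [1]
```

Samples flagged this way are the ones that would raise the "tokens do not match" ValueError during training.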