qwen2-vl-2b Image features and image tokens do not match

#22
by novicetyro - opened

How do I solve this problem?

I am also getting this error, raised from python3.12/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py, line 1688, in forward:

[rank2]:     raise ValueError(
[rank2]: ValueError: Video features and video tokens do not match: tokens: 0, features 936

and my user data does contain video.
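A quick way to check whether this is a chat-template problem: render one sample and look for the video placeholder. Below is a minimal sketch (the model name, messages, and video path are illustrative, not from this thread; the call may also fail outright if the base model ships no chat template):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B")
messages = [{"role": "user", "content": [
    {"type": "video", "video": "clip.mp4"},  # hypothetical path
    {"type": "text", "text": "Describe the video."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# If the template never emits <|video_pad|>, the model counts 0 video tokens
# even though the batch still carries the video features.
print("<|video_pad|>" in text)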

unsloth_compiled_cache/unsloth_compiled_module_qwen2_vl.py:930, in Qwen2VLForConditionalGeneration_forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, pixel_values, pixel_values_videos, image_grid_thw, video_grid_thw, rope_deltas, cache_position, **loss_kwargs)
    928 n_image_features = image_embeds.shape[0]
    929 if n_image_tokens != n_image_features:
--> 930     raise ValueError(
    931         f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
    932     )
    933 image_mask = (
    934     (input_ids == self.config.image_token_id)
    935     .unsqueeze(-1)
    936     .expand_as(inputs_embeds)
    937     .to(inputs_embeds.device)
    938 )
    939 image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)

ValueError: Image features and image tokens do not match: tokens: 568, features 600
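The check compares the number of <|image_pad|> tokens in input_ids against the number of merged vision features. For Qwen2-VL the feature count is prod(image_grid_thw) divided by merge_size**2 (merge_size is 2 for this model), so a sketch of the accounting looks like this (the grid values are illustrative, chosen only to reproduce the 600 above):

import torch

image_grid_thw = torch.tensor([[1, 60, 40]])  # hypothetical (t, h, w) patch grid
merge_size = 2                                # Qwen2-VL's spatial merge factor
n_image_features = int(image_grid_thw.prod().item()) // merge_size**2
print(n_image_features)  # 600 -> the prompt must contain exactly 600 <|image_pad|> tokens

Here 600 - 568 = 32 placeholder tokens are missing on the text side, which typically means the end of the prompt was truncated away.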


I solved this problem by significantly increasing cutoff_len: I set cutoff_len to 100000.
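Before training, you can estimate how large cutoff_len needs to be by tokenizing the rendered prompts. A minimal sketch, assuming `texts` is a list of fully rendered training prompts (the variable and model name are illustrative):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B")
cutoff_len = 100000

longest = max(len(processor.tokenizer(t)["input_ids"]) for t in texts)
print(f"longest sample: {longest} tokens (cutoff_len = {cutoff_len})")
# Any sample longer than cutoff_len gets truncated, and the vision
# placeholders near its end are silently dropped.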

Because the chat_template of the base model does not support Hugging Face's instruct-version examples, here is my version of the chat_template:

{%- for message in messages -%}
    {%- if message['content'] is sequence and message['content'][0]['type'] is defined -%}
        {%- for content in message['content'] -%}
            {%- if content['type'] == 'image' -%}
                {{- '<|vision_start|><|image_pad|><|vision_end|>' -}}
            {%- elif content['type'] == 'video' -%}
                {{- '<|vision_start|><|video_pad|><|vision_end|>' -}}
            {%- elif content['type'] == 'text' -%}
                {{- content['text'] -}}
            {%- endif -%}
        {%- endfor -%}
    {%- endif -%}
{%- endfor -%}

HOW TO USE:

# Load the processor and override its chat template
from transformers import AutoProcessor

model_name = "Qwen/Qwen2-VL-2B"  # e.g. the base model discussed in this thread

processor = AutoProcessor.from_pretrained(
    model_name,
    padding_side="left",
    use_fast=False,
)
processor.chat_template = '''{%- for message in messages -%}
    {%- if message['content'] is sequence and message['content'][0]['type'] is defined -%}
        {%- for content in message['content'] -%}
            {%- if content['type'] == 'image' -%}
                {{- '<|vision_start|><|image_pad|><|vision_end|>' -}}
            {%- elif content['type'] == 'video' -%}
                {{- '<|vision_start|><|video_pad|><|vision_end|>' -}}
            {%- elif content['type'] == 'text' -%}
                {{- content['text'] -}}
            {%- endif -%}
        {%- endfor -%}
    {%- endif -%}
{%- endfor -%}'''
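For reference, here is a minimal usage sketch of the template above (message content and image path are hypothetical):

messages = [{"role": "user", "content": [
    {"type": "image", "image": "example.jpg"},
    {"type": "text", "text": "What is in this image?"},
]}]
text = processor.apply_chat_template(messages, tokenize=False)
print(text)  # <|vision_start|><|image_pad|><|vision_end|>What is in this image?

Note that this template only expands message content; it does not add role headers or a generation prompt.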

I think you can set the token limit higher. The placeholder may be getting truncated: for example, if the token limit for the content is 4096 but your text is 5024 tokens long (so the <|image_pad|> placeholder falls after position 4096), then the model sees 0 image tokens. There is another problem: when I use the system + user + assistant format, all of my data gets filtered out, and I don't know why. If you just use user + assistant, it is OK.
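This truncation effect is easy to reproduce. A minimal sketch, assuming the 4096-token limit above and reusing `processor` from the earlier snippet (the padding text is illustrative):

# Build a prompt whose placeholder sits past the token limit.
text = "x " * 5024 + "<|vision_start|><|image_pad|><|vision_end|>"
ids = processor.tokenizer(text, truncation=True, max_length=4096)["input_ids"]
image_pad_id = processor.tokenizer.convert_tokens_to_ids("<|image_pad|>")
print(ids.count(image_pad_id))  # 0 -> reproduces "tokens: 0"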

ValueError: Image features and image tokens do not match: tokens: 568, features 600

I deleted one bad sample, and then the code ran successfully. I don't know why; I checked the bad sample and could find no difference between it and the other good samples.
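Rather than eyeballing samples, you can scan the dataset for the one whose placeholder count disagrees with its feature count. A minimal sketch, assuming `samples` is a list of (messages, images) pairs and reusing the processor from above (all names are illustrative):

image_pad_id = processor.tokenizer.convert_tokens_to_ids("<|image_pad|>")

for i, (messages, images) in enumerate(samples):
    text = processor.apply_chat_template(messages, tokenize=False)
    batch = processor(text=[text], images=images, return_tensors="pt")
    if "image_grid_thw" not in batch:
        print(f"sample {i}: no image features produced")
        continue
    n_tokens = int((batch["input_ids"] == image_pad_id).sum())
    n_features = int(batch["image_grid_thw"].prod(dim=-1).sum()) // 4  # merge_size**2
    if n_tokens != n_features:
        print(f"sample {i}: tokens {n_tokens} != features {n_features}")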

Just expand the token length.
