qwen2-vl-2b Image features and image tokens do not match

#22
by novicetyro - opened

How do I solve this problem?

I am also getting this error, raised from python3.12/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py, line 1688, in forward:

[rank2]:     raise ValueError(
[rank2]: ValueError: Video features and video tokens do not match: tokens: 0, features 936

and my user data does contain video.
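A quick way to check whether this is a chat-template problem: render one sample and look for the video placeholder. Below is a minimal sketch (the model name, messages, and video path are illustrative, not from this thread; the call may also fail outright if the base model ships no chat template):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B")
messages = [{"role": "user", "content": [
    {"type": "video", "video": "clip.mp4"},  # hypothetical path
    {"type": "text", "text": "Describe the video."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# If the template never emits <|video_pad|>, the model counts 0 video tokens
# even though the batch still carries the video features.
print("<|video_pad|>" in text)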

unsloth_compiled_cache/unsloth_compiled_module_qwen2_vl.py:930, in Qwen2VLForConditionalGeneration_forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, pixel_values, pixel_values_videos, image_grid_thw, video_grid_thw, rope_deltas, cache_position, **loss_kwargs)
    928 n_image_features = image_embeds.shape[0]
    929 if n_image_tokens != n_image_features:
--> 930     raise ValueError(
    931         f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
    932     )
    933 image_mask = (
    934     (input_ids == self.config.image_token_id)
    935     .unsqueeze(-1)
    936     .expand_as(inputs_embeds)
    937     .to(inputs_embeds.device)
    938 )
    939 image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)

ValueError: Image features and image tokens do not match: tokens: 568, features 600
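The check compares the number of <|image_pad|> tokens in input_ids against the number of merged vision features. For Qwen2-VL the feature count is prod(image_grid_thw) divided by merge_size**2 (merge_size is 2 for this model), so a sketch of the accounting looks like this (the grid values are illustrative, chosen only to reproduce the 600 above):

import torch

image_grid_thw = torch.tensor([[1, 60, 40]])  # hypothetical (t, h, w) patch grid
merge_size = 2                                # Qwen2-VL's spatial merge factor
n_image_features = int(image_grid_thw.prod().item()) // merge_size**2
print(n_image_features)  # 600 -> the prompt must contain exactly 600 <|image_pad|> tokens

Here 600 - 568 = 32 placeholder tokens are missing on the text side, which typically means the end of the prompt was truncated away.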


I solved this problem by significantly increasing cutoff_len: I set cutoff_len to 100000.
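Before training, you can estimate how large cutoff_len needs to be by tokenizing the rendered prompts. A minimal sketch, assuming `texts` is a list of fully rendered training prompts (the variable and model name are illustrative):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B")
cutoff_len = 100000

longest = max(len(processor.tokenizer(t)["input_ids"]) for t in texts)
print(f"longest sample: {longest} tokens (cutoff_len = {cutoff_len})")
# Any sample longer than cutoff_len gets truncated, and the vision
# placeholders near its end are silently dropped.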

Because the chat_template of the base model does not support Hugging Face's instruct-version examples, here is my version of the chat_template:

{%- for message in messages -%}
    {%- if message['content'] is sequence and message['content'][0]['type'] is defined -%}
        {%- for content in message['content'] -%}
            {%- if content['type'] == 'image' -%}
                {{- '<|vision_start|><|image_pad|><|vision_end|>' -}}
            {%- elif content['type'] == 'video' -%}
                {{- '<|vision_start|><|video_pad|><|vision_end|>' -}}
            {%- elif content['type'] == 'text' -%}
                {{- content['text'] -}}
            {%- endif -%}
        {%- endfor -%}
    {%- endif -%}
{%- endfor -%}

HOW TO USE:

# Load the processor and override its chat template
from transformers import AutoProcessor

model_name = "Qwen/Qwen2-VL-2B"  # e.g. the base model discussed in this thread

processor = AutoProcessor.from_pretrained(
    model_name,
    padding_side="left",
    use_fast=False,
)
processor.chat_template = '''{%- for message in messages -%}
    {%- if message['content'] is sequence and message['content'][0]['type'] is defined -%}
        {%- for content in message['content'] -%}
            {%- if content['type'] == 'image' -%}
                {{- '<|vision_start|><|image_pad|><|vision_end|>' -}}
            {%- elif content['type'] == 'video' -%}
                {{- '<|vision_start|><|video_pad|><|vision_end|>' -}}
            {%- elif content['type'] == 'text' -%}
                {{- content['text'] -}}
            {%- endif -%}
        {%- endfor -%}
    {%- endif -%}
{%- endfor -%}'''
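For reference, here is a minimal usage sketch of the template above (message content and image path are hypothetical):

messages = [{"role": "user", "content": [
    {"type": "image", "image": "example.jpg"},
    {"type": "text", "text": "What is in this image?"},
]}]
text = processor.apply_chat_template(messages, tokenize=False)
print(text)  # <|vision_start|><|image_pad|><|vision_end|>What is in this image?

Note that this template only expands message content; it does not add role headers or a generation prompt.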

I think you can set the token limit higher. The placeholder may be getting truncated: for example, if the token limit for the content is 4096 but your text is 5024 tokens long (so the <|image_pad|> placeholder falls after position 4096), then the model sees 0 image tokens. There is another problem: when I use the system + user + assistant format, all of my data gets filtered out, and I don't know why. If you just use user + assistant, it is OK.
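This truncation effect is easy to reproduce. A minimal sketch, assuming the 4096-token limit above and reusing `processor` from the earlier snippet (the padding text is illustrative):

# Build a prompt whose placeholder sits past the token limit.
text = "x " * 5024 + "<|vision_start|><|image_pad|><|vision_end|>"
ids = processor.tokenizer(text, truncation=True, max_length=4096)["input_ids"]
image_pad_id = processor.tokenizer.convert_tokens_to_ids("<|image_pad|>")
print(ids.count(image_pad_id))  # 0 -> reproduces "tokens: 0"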

ValueError: Image features and image tokens do not match: tokens: 568, features 600

I deleted one bad sample, and then the code ran successfully. I don't know why; I checked the bad sample and could find no difference between it and the other good samples.
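Rather than eyeballing samples, you can scan the dataset for the one whose placeholder count disagrees with its feature count. A minimal sketch, assuming `samples` is a list of (messages, images) pairs and reusing the processor from above (all names are illustrative):

image_pad_id = processor.tokenizer.convert_tokens_to_ids("<|image_pad|>")

for i, (messages, images) in enumerate(samples):
    text = processor.apply_chat_template(messages, tokenize=False)
    batch = processor(text=[text], images=images, return_tensors="pt")
    if "image_grid_thw" not in batch:
        print(f"sample {i}: no image features produced")
        continue
    n_tokens = int((batch["input_ids"] == image_pad_id).sum())
    n_features = int(batch["image_grid_thw"].prod(dim=-1).sum()) // 4  # merge_size**2
    if n_tokens != n_features:
        print(f"sample {i}: tokens {n_tokens} != features {n_features}")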

Just expand the token length.
