## Gemma 3 For Video Understanding

Did you know that you can use Gemma 3 for video understanding?

Gemma 3 accepts interleaved image and text inputs, which makes this possible. You can interleave timestamps with sampled frames and ask the model to summarize the events in the video. Here's a small notebook that does so.

Install the Gemma 3 release of transformers and log in to access the model.

In [None]:
!pip install -q git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

In [None]:
!huggingface-cli login

Let's load the model.

In [None]:
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

ckpt = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    ckpt, device_map="auto", torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(ckpt)



Download the video and sample 10 evenly spaced frames from it.

In [None]:
!wget https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_1.mp4

In [None]:
import cv2
from PIL import Image
import numpy as np

def downsample_video(video_path, num_frames=10):
    """Return `num_frames` evenly spaced (PIL image, timestamp in seconds) pairs."""
    vidcap = cv2.VideoCapture(video_path)
    total_frames = int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = vidcap.get(cv2.CAP_PROP_FPS)

    frames = []
    # Evenly spaced frame indices from the first frame to the last
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for i in frame_indices:
        # Seek to frame i, then decode it
        vidcap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        success, image = vidcap.read()
        if success:
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # OpenCV decodes to BGR; PIL expects RGB
            pil_image = Image.fromarray(image)
            timestamp = round(float(i / fps), 2)  # seconds from the start of the video
            frames.append((pil_image, timestamp))

    vidcap.release()
    return frames
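
To see where the timestamps come from, the index arithmetic can be isolated from OpenCV. The sketch below is only an illustration: the helper name `sample_timestamps` and the values of 177 frames at 29.97 fps are assumptions chosen for the example, not properties read from `cats_1.mp4`.

```python
import numpy as np

def sample_timestamps(total_frames, fps, num_frames=10):
    """Hypothetical helper: evenly spaced frame indices and their timestamps in seconds."""
    # Same spacing rule as downsample_video: first frame to last frame, inclusive
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    # A frame's timestamp is its index divided by the frame rate
    return [(int(i), round(int(i) / fps, 2)) for i in indices]

# Assumed values for illustration only
pairs = sample_timestamps(total_frames=177, fps=29.97)
for index, seconds in pairs:
    print(f"frame {index} -> {seconds}s")
```

Because `np.linspace` includes both endpoints, the first and last frames of the video are always sampled, and the gap between samples grows with the length of the video.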


In [None]:
frames = downsample_video("cats_1.mp4")

In [None]:
frames

[(<PIL.Image.Image image mode=RGB size=1920x1080>, 0.0),
 (<PIL.Image.Image image mode=RGB size=1920x1080>, 0.63),
 (<PIL.Image.Image image mode=RGB size=1920x1080>, 1.3),
 (<PIL.Image.Image image mode=RGB size=1920x1080>, 1.94),
 (<PIL.Image.Image image mode=RGB size=1920x1080>, 2.6),
 (<PIL.Image.Image image mode=RGB size=1920x1080>, 3.24),
 (<PIL.Image.Image image mode=RGB size=1920x1080>, 3.9),
 (<PIL.Image.Image image mode=RGB size=1920x1080>, 4.54),
 (<PIL.Image.Image image mode=RGB size=1920x1080>, 5.21),
 (<PIL.Image.Image image mode=RGB size=1920x1080>, 5.87)]

Here are our system prompt and the instruction. We will append the timestamps and frames to the user message.

In [None]:
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },

    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this video? Summarize the events."}
        ]
    }
]

In [None]:
messages[1]["content"][0]

{'type': 'text',
 'text': 'What is happening in this video? Summarize the events.'}

In [None]:
for image, timestamp in frames:
    # Label each frame with its timestamp, then attach the frame itself
    messages[1]["content"].append({"type": "text", "text": f"Frame {timestamp}:"})
    # Save the frame to disk so the message can reference it by path
    image.save(f"image_{timestamp}.png")
    messages[1]["content"].append({"type": "image", "url": f"image_{timestamp}.png"})
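
Note that the loop above appends to `messages` in place, so running the cell twice adds every frame twice. One way around this is to build the user content from scratch each time. The following is a minimal sketch, assuming the frames were already saved to disk; `build_video_content` and the example file names are hypothetical and not part of the original notebook.

```python
def build_video_content(prompt, frame_paths_with_timestamps):
    """Hypothetical helper: return a fresh interleaved content list for the user turn."""
    # Start with the instruction, as in the notebook
    content = [{"type": "text", "text": prompt}]
    for path, timestamp in frame_paths_with_timestamps:
        # A text label with the timestamp, followed by the frame it refers to
        content.append({"type": "text", "text": f"Frame {timestamp}:"})
        content.append({"type": "image", "url": path})
    return content

# Example with two assumed frame files
content = build_video_content(
    "What is happening in this video? Summarize the events.",
    [("image_0.0.png", 0.0), ("image_0.63.png", 0.63)],
)
user_message = {"role": "user", "content": content}
```

Since the function returns a new list on every call, the resulting message can be rebuilt safely with a different prompt or a different set of frames.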

In [None]:
messages

[{'role': 'system',
  'content': [{'type': 'text', 'text': 'You are a helpful assistant.'}]},
 {'role': 'user',
  'content': [{'type': 'text',
    'text': 'What is happening in this video? Summarize the events.'},
   {'type': 'text', 'text': 'Frame 0.0:'},
   {'type': 'image', 'url': 'image_0.0.png'},
   {'type': 'text', 'text': 'Frame 0.63:'},
   {'type': 'image', 'url': 'image_0.63.png'},
   {'type': 'text', 'text': 'Frame 1.3:'},
   {'type': 'image', 'url': 'image_1.3.png'},
   {'type': 'text', 'text': 'Frame 1.94:'},
   {'type': 'image', 'url': 'image_1.94.png'},
   {'type': 'text', 'text': 'Frame 2.6:'},
   {'type': 'image', 'url': 'image_2.6.png'},
   {'type': 'text', 'text': 'Frame 3.24:'},
   {'type': 'image', 'url': 'image_3.24.png'},
   {'type': 'text', 'text': 'Frame 3.9:'},
   {'type': 'image', 'url': 'image_3.9.png'},
   {'type': 'text', 'text': 'Frame 4.54:'},
   {'type': 'image', 'url': 'image_4.54.png'},
   {'type': 'text', 'text': 'Frame 5.21:'},
   {'type': 'image', '

Preprocess the input and run inference.

In [None]:
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

In [None]:
input_len = inputs["input_ids"].shape[-1]

generation = model.generate(**inputs, max_new_tokens=500, do_sample=False)
generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

Here's a summary of what's happening in the video:

The video features a beautiful, fluffy cat. Throughout the sequence, the cat is lying down, mostly looking upwards. It appears to be grooming itself, licking its nose and paws repeatedly. The cat has a relaxed and content demeanor, enjoying a moment of self-care.
