Evaluate and generate text based on images and videos
A tiny vision-language model
Generate text based on input prompts