Generate text in response to user input
Instruction-tuned model for a range of vision-language tasks