Engage in multi-modal conversations with images and videos
Crop and align video with audio
Identify key entities in text
Convert text to speech in multiple languages
Media understanding