Real-time video captioning powered by FastVLM
Transform images based on text instructions
Visualize patch similarity with DINOv3 feature maps
Similarity, Classification