Visualize VLM evaluations across datasets
Visualize model outputs for the AITW benchmark
Visualize benchmark datasets with samples and descriptions