| Image processors preprocess vision inputs, feature extractors preprocess audio inputs, and a processor handles multimodal inputs. | 
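The snippet below is a minimal sketch of the first two roles. It assumes the terminology refers to the Hugging Face Transformers library, and that `transformers`, `numpy`, and `Pillow` are installed. It builds `ViTImageProcessor` and `Wav2Vec2FeatureExtractor` with their default settings so that nothing has to be downloaded. The dummy image and silent waveform are illustrative stand-ins, not real data. A multimodal processor, for example one loaded with `AutoProcessor.from_pretrained`, wraps an image processor or feature extractor together with a tokenizer. It is left out here because loading one requires pretrained files.

```python
# Sketch assuming Hugging Face Transformers (plus numpy and Pillow) is installed.
import numpy as np
from transformers import ViTImageProcessor, Wav2Vec2FeatureExtractor

# Vision input: a dummy 64x64 RGB image laid out as (height, width, channels).
image = np.zeros((64, 64, 3), dtype=np.uint8)

# Default ViT settings resize to 224x224, rescale pixels to [0, 1],
# normalize each channel, and return channels-first arrays.
image_processor = ViTImageProcessor()
pixel_values = image_processor(images=image, return_tensors="np")["pixel_values"]
print(pixel_values.shape)  # (batch, channels, height, width)

# Audio input: one second of silence sampled at 16 kHz (illustrative only).
waveform = np.zeros(16000, dtype=np.float32)

# The feature extractor batches the raw waveform and applies its default
# zero-mean, unit-variance normalization.
feature_extractor = Wav2Vec2FeatureExtractor()
input_values = feature_extractor(
    waveform, sampling_rate=16000, return_tensors="np"
)["input_values"]
print(input_values.shape)  # (batch, samples)
```

Both objects return a dictionary-like batch keyed by the model input name (`pixel_values` for vision, `input_values` for this audio model), so the result can be passed directly to the matching model.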