# LLMEyeCap: Giving Eyes to Large Language Models

## Model Description

LLMEyeCap is a Novel Object Captioning model aimed at giving Large Language Models (LLMs) vision capabilities. It combines several models and techniques to detect novel objects in images, predict their bounding boxes, and generate captions for them.

One of the core innovations is the replacement of traditional classification layers with a text-generation mechanism (a minimal sketch of this idea is given in the appendix at the end of this card). This addresses catastrophic forgetting: the model can learn new objects without unlearning previous ones, because there is no fixed-size class head to retrain.

Furthermore, the model connects the latent space of the visual features to the hidden dimensions of an LLM's decoder. This makes it possible to train the model on unsupervised video datasets, opening up a wide range of applications.

### Features

- **Novel object captioning + bounding boxes**
- **ResNet-50 backbone**
- **Customized DETR model for bounding box detection**
- **BERT tokenizer and GPT-2 for text generation**
- **Classification layers replaced by transformer-decoder object-captioning layers**

## Training Data

The model was trained on the following datasets:

- VOC
- COCO 80
- COCO 91

Training was carried out for 30 epochs.

## Usage

Here's how to use this model for object captioning:

```python
import torch
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

# LLMEyeCapModel, tokenizer, NUM_QUERIES, PAD_TOKEN, PAD_SOS, vocab_size and
# device are defined in the repository code (see tuto.ipynb).
model = LLMEyeCapModel(num_queries=NUM_QUERIES, vocab_size=vocab_size, pad_token=PAD_TOKEN)
model = model.to(device)

state_dict = torch.load("LLMEyeCap_01.bin", map_location=device)
model.load_state_dict(state_dict)

def display_image_ds(image_path, bb, ll):
    """Draw the predicted boxes and decoded captions on top of the image."""
    image = Image.open(image_path).convert('RGB')
    fig, ax = plt.subplots(1, 1, figsize=(12, 20))
    ax.imshow(image)

    for box, label in zip(bb[0], ll[0]):
        (x, y, w, h) = box
        if x == 0 and y == 0 and w == 0 and h == 0:
            continue  # empty query slot
        # Boxes are normalized (cx, cy, w, h); scale them to pixel coordinates.
        x *= image.width
        y *= image.height
        w *= image.width
        h *= image.height
        rect = patches.Rectangle((x - w / 2, y - h / 2), w, h,
                                 linewidth=2, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        label_str = tokenizer.decode(label, skip_special_tokens=True)
        if label_str != 'na':
            ax.text(x - w / 2, y - h / 2, label_str, color='r',
                    bbox=dict(facecolor='white', edgecolor='r', pad=2), fontsize=18)

image_paths = ["../data/coco91/train2017/000000291557.jpg",
               "../data/coco91/train2017/000000436027.jpg"]

for im in image_paths:
    bb, cc = model.generate_caption(im, tokenizer, max_length=20, pad_sos=PAD_SOS)
    display_image_ds(im, bb.to('cpu'), cc.to('cpu'))
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/645364cbf666f76551f93111/D-0KXDrBzuRCjeF3WcLY3.png)

### Results

See the tuto.ipynb notebook.

## Limitations and Future Work

This 0.1 version is a standalone model for captioning objects in images. It can be used as is, or trained on new objects without catastrophic forgetting. Version 0.2, which will connect the latent space to the hidden dimensions of LLMs, is coming next.

This model is still in development, and we're actively seeking contributions and ideas to enhance its capabilities. If you're interested in contributing, whether through code, ideas, or data, we'd love to hear from you.

## Authors

Imed MAGROUNE
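
## Appendix: Captioning Head Sketch

The sketch below is not the repository's code. It is a minimal, self-contained illustration, under assumed names and dimensions (`CaptioningDETRSketch`, `hidden_dim`, a toy convolutional backbone standing in for ResNet-50), of the idea described in the Model Description: a DETR-style detector whose per-query classification head is replaced by a small transformer decoder that generates caption tokens instead of class logits.

```python
# Hypothetical sketch only; names and sizes are illustrative, not the LLMEyeCap API.
import torch
import torch.nn as nn

class CaptioningDETRSketch(nn.Module):
    def __init__(self, num_queries=25, hidden_dim=256, vocab_size=30522):
        super().__init__()
        # Tiny CNN backbone standing in for ResNet-50.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, hidden_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        # Learned object queries, as in DETR.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.detr = nn.Transformer(d_model=hidden_dim, nhead=8,
                                   num_encoder_layers=2, num_decoder_layers=2,
                                   batch_first=True)
        # Usual DETR box head: 4 normalized (cx, cy, w, h) values per query.
        self.box_head = nn.Linear(hidden_dim, 4)
        # Instead of `nn.Linear(hidden_dim, num_classes)`, each query embedding
        # conditions a small text decoder that emits caption tokens.
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)
        cap_layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.caption_decoder = nn.TransformerDecoder(cap_layer, num_layers=2)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, caption_tokens):
        # images: (B, 3, H, W); caption_tokens: (B, Q, T) teacher-forced token ids.
        B = images.size(0)
        feats = self.backbone(images).flatten(2).transpose(1, 2)   # (B, 64, hidden_dim)
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)      # (B, Q, hidden_dim)
        obj_embs = self.detr(feats, queries)                       # (B, Q, hidden_dim)

        boxes = self.box_head(obj_embs).sigmoid()                  # (B, Q, 4)

        # Text-generation head: decode each query's caption, conditioned on
        # that query's embedding as the decoder "memory".
        Q, T = caption_tokens.shape[1], caption_tokens.shape[2]
        tok = self.token_embed(caption_tokens.reshape(B * Q, T))   # (B*Q, T, hidden_dim)
        memory = obj_embs.reshape(B * Q, 1, -1)                    # (B*Q, 1, hidden_dim)
        mask = self.detr.generate_square_subsequent_mask(T)
        dec = self.caption_decoder(tok, memory, tgt_mask=mask)
        logits = self.lm_head(dec).reshape(B, Q, T, -1)            # (B, Q, T, vocab)
        return boxes, logits

# Toy usage with random inputs.
model = CaptioningDETRSketch()
imgs = torch.randn(2, 3, 256, 256)
caps = torch.randint(0, 30522, (2, 25, 20))
boxes, logits = model(imgs, caps)
print(boxes.shape, logits.shape)  # (2, 25, 4), (2, 25, 20, 30522)
```

Because the output head is a token decoder over a fixed vocabulary rather than a class-logit layer, adding new object categories only requires new caption text during training, which is the property the model card relies on to avoid catastrophic forgetting.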