Papers
arxiv:2312.05349

PixLore: A Dataset-driven Approach to Rich Image Captioning

Published on Dec 8, 2023
Authors:

Abstract

In the domain of vision-language integration, generating detailed image captions poses a significant challenge due to the lack of a curated and rich dataset. This study introduces PixLore, a novel method that leverages Querying Transformers through the fine-tuning of the BLIP-2 model using the LoRa method on a standard commercial GPU. Our approach, which involves training on a carefully assembled dataset from state-of-the-art Computer Vision models combined and augmented by ChatGPT, addresses the question of whether intricate image understanding can be achieved with an ensemble of smaller-scale models. Comparative evaluations against major models such as GPT-4 and Google Bard demonstrate that PixLore-2.7B, despite having considerably fewer parameters, is rated higher than the existing State-of-the-Art models in over half of the assessments. This research not only presents a groundbreaking approach but also highlights the importance of well-curated datasets in enhancing the performance of smaller models.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2312.05349 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2312.05349 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.