|
--- |
|
title: README |
|
emoji: 🌐 |
|
colorFrom: gray |
|
colorTo: yellow |
|
sdk: static |
|
pinned: true |
|
license: apache-2.0 |
|
short_description: Developing foundation models for low-resource languages. |
|
thumbnail: >- |
|
https://cdn-uploads.huggingface.co/production/uploads/62e1cc43926f4892a4ca2ff9/m3CZCMRAEqPpoPiZFE1zB.png |
|
--- |
|
|
|
Polyglot is an initiative to close the linguistic divide in NLP by developing efficient and accessible foundation models for low-resource languages. |
|
|
|
While recent breakthroughs in generative AI have been driven by large-scale foundation models, these advances have largely benefited high-resource languages, leaving many underrepresented languages behind. The current deep learning paradigm, which relies heavily on massive datasets and computing power, has unintentionally widened this gap, making it harder for speakers of low-resource languages to access and shape AI technologies that reflect their linguistic and cultural identities. |
|
|
|
Polyglot addresses this imbalance by creating tools, models, and datasets that support open, sustainable, and inclusive AI development. We aim to empower researchers and communities working with low-resource languages through high-quality open-source resources, enabling them to build and fine-tune language models tailored to their needs. |
|
|
|
## Recent Publications 📚 |
|
|
|
- **ViTucano: A Portuguese Vision Assistant** | [GitHub](https://github.com/Nkluge-correa/TinyLLaVA_Factory) | [Collection](https://huggingface.co/collections/TucanoBR/vitucano-v1-67804623a92cd2fabcafa0a3) | |
|
- **Tucano: Advancing Neural Text Generation for Portuguese** | [GitHub](https://github.com/Nkluge-correa/Tucano) | [Collection](https://huggingface.co/collections/TucanoBR/tucano-670565e8c5325fb7f2da4361) | [Paper](https://www.sciencedirect.com/science/article/pii/S2666389925001734) | |
|
- **TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese** | [GitHub](https://github.com/Nkluge-correa/TeenyTinyLlama) | [Collection](https://huggingface.co/collections/nicholasKluge/teenytinyllama-6582ea8129e72d1ea4d384f1) | [Paper](https://www.sciencedirect.com/science/article/pii/S2666827024000343) | |
|
|
|
## News 🚀 |
|
|
|
- [24/07/2025] The peer-reviewed article "[Tucano: Advancing Neural Text Generation for Portuguese](https://doi.org/10.1016/j.patter.2025.101325)" is published in Patterns, with all models and datasets released on [Hugging Face](https://huggingface.co/TucanoBR). |
|
- [13/01/2025] We release ViTucano, a pair of vision assistants natively pretrained in Portuguese ([ViTucano-1b5-v1](https://huggingface.co/TucanoBR/ViTucano-1b5-v1), [ViTucano-2b8-v1](https://huggingface.co/TucanoBR/ViTucano-2b8-v1)). |
|
- [13/01/2025] We release the datasets used to pretrain and fine-tune the ViTucano models: [ViTucano-Pretrain](https://huggingface.co/datasets/TucanoBR/ViTucano-Pretrain) and [ViTucano-SFT](https://huggingface.co/datasets/TucanoBR/ViTucano-SFT). |
|
- [29/11/2024] Tucano is mentioned on Deutsche Welle: "[Cientistas criam maior banco de dados em português para IA](https://www.dw.com/pt-br/pesquisadores-da-alemanha-criam-maior-banco-de-dados-p%C3%BAblico-em-portugu%C3%AAs-para-ia/a-70917082)". |
|
- [27/11/2024] A video presentation of Tucano at the C4AI (USP) is available on [YouTube](https://www.youtube.com/watch?v=BscOHn54ld8). |
|
- [12/11/2024] "[Tucano: Advancing Neural Text Generation for Portuguese](https://arxiv.org/abs/2411.07854)" is published as a preprint on arXiv, with all models and datasets released on [Hugging Face](https://huggingface.co/TucanoBR). |
|
|
|
## Community Contributions 🤝 |
|
|
|
- Demo on how to [run inference on ViTucano](https://colab.research.google.com/drive/110_Gtjgu4pldRQP864_Y-rSm2VhyW7Li). |
|
- Demo on how to [run inference on Tucano](https://colab.research.google.com/drive/1Qf2DsFOFDA7RKkamI-tH3OregtOlZ8Cz). |
|
- Demo on how to create a simple [Chat UI for Tucano](https://colab.research.google.com/drive/1fEW10CXksMfMv1veLr22OESwDs6e-W1b) using Gradio. |
|
- [Tucano OpenVINO](https://huggingface.co/cabelo/Tucano-2b4-Instruct-fp16-ov) is a port of Tucano-2b4-Instruct optimized for inference with Intel's OpenVINO toolkit. |
|
|
|
**Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.** |