---
title: README
emoji: 🌐
colorFrom: gray
colorTo: yellow
sdk: static
pinned: true
license: apache-2.0
short_description: Developing foundation models for low-resource languages.
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/62e1cc43926f4892a4ca2ff9/m3CZCMRAEqPpoPiZFE1zB.png
---
Polyglot is an initiative to close the linguistic divide in NLP by developing efficient and accessible foundation models for low-resource languages.
While recent breakthroughs in generative AI have been driven by large-scale foundation models, these advances have largely benefited high-resource languages, leaving many underrepresented languages behind. The current deep learning paradigm, which relies heavily on massive datasets and computing power, has unintentionally widened this gap, making it harder for speakers of low-resource languages to access and shape AI technologies that reflect their linguistic and cultural identities.
Polyglot addresses this imbalance by creating tools, models, and datasets that support open, sustainable, and inclusive AI development. We aim to empower researchers and communities working with low-resource languages through high-quality open-source resources, enabling them to build and fine-tune language models tailored to their needs.
## Recent Publications 📚
- **ViTucano: A Portuguese Vision Assistant** | [GitHub](https://github.com/Nkluge-correa/TinyLLaVA_Factory) | [Collection](https://huggingface.co/collections/TucanoBR/vitucano-v1-67804623a92cd2fabcafa0a3) |
- **Tucano: Advancing Neural Text Generation for Portuguese** | [GitHub](https://github.com/Nkluge-correa/Tucano) | [Collection](https://huggingface.co/collections/TucanoBR/tucano-670565e8c5325fb7f2da4361) | [Paper](https://www.sciencedirect.com/science/article/pii/S2666389925001734) |
- **TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese** | [GitHub](https://github.com/Nkluge-correa/TeenyTinyLlama) | [Collection](https://huggingface.co/collections/nicholasKluge/teenytinyllama-6582ea8129e72d1ea4d384f1) | [Paper](https://www.sciencedirect.com/science/article/pii/S2666827024000343) |
## News 🚀
- [24/07/2025] Peer-reviewed article "[Tucano: Advancing Neural Text Generation for Portuguese](https://doi.org/10.1016/j.patter.2025.101325)" is published in Patterns, with all models and datasets released on [Hugging Face](https://huggingface.co/TucanoBR).
- [13/01/2025] We release ViTucano, a pair of vision assistants natively pretrained in Portuguese ([ViTucano-1b5-v1](https://huggingface.co/TucanoBR/ViTucano-1b5-v1), [ViTucano-2b8-v1](https://huggingface.co/TucanoBR/ViTucano-2b8-v1)).
- [13/01/2025] We release the datasets used to pretrain and fine-tune the ViTucano models: [ViTucano-Pretrain](https://huggingface.co/datasets/TucanoBR/ViTucano-Pretrain) and [ViTucano-SFT](https://huggingface.co/datasets/TucanoBR/ViTucano-SFT).
- [29/11/2024] Tucano is mentioned on Deutsche Welle: "[Cientistas criam maior banco de dados em português para IA](https://www.dw.com/pt-br/pesquisadores-da-alemanha-criam-maior-banco-de-dados-p%C3%BAblico-em-portugu%C3%AAs-para-ia/a-70917082)".
- [27/11/2024] A video presentation of Tucano at the C4AI (USP) is available on [YouTube](https://www.youtube.com/watch?v=BscOHn54ld8).
- [12/11/2024] "[Tucano: Advancing Neural Text Generation for Portuguese](https://arxiv.org/abs/2411.07854)" is published as a preprint on arXiv, with all models and datasets released on [Hugging Face](https://huggingface.co/TucanoBR).
## Community Contributions 🤝
- Demo on how to [run inference on ViTucano](https://colab.research.google.com/drive/110_Gtjgu4pldRQP864_Y-rSm2VhyW7Li).
- Demo on how to [run inference on Tucano](https://colab.research.google.com/drive/1Qf2DsFOFDA7RKkamI-tH3OregtOlZ8Cz).
- Demo on how to create a simple [Chat UI for Tucano](https://colab.research.google.com/drive/1fEW10CXksMfMv1veLr22OESwDs6e-W1b) using Gradio.
- [Tucano OpenVINO](https://huggingface.co/cabelo/Tucano-2b4-Instruct-fp16-ov) is a port of Tucano-2b4-Instruct optimized for inference with Intel's OpenVINO toolkit.
**Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.**