Collections
Discover the best community collections!
Collections including paper arxiv:2309.09400
-
We Care: Multimodal Depression Detection and Knowledge Infused Mental Health Therapeutic Response Generation
Paper • 2406.10561 • Published • 1 -
AtomGPT: Atomistic Generative Pre-trained Transformer for Forward and Inverse Materials Design
Paper • 2405.03680 • Published • 1 -
ChemNLP: A Natural Language Processing based Library for Materials Chemistry Text Data
Paper • 2209.08203 • Published • 1 -
SeaLLMs -- Large Language Models for Southeast Asia
Paper • 2312.00738 • Published • 26
-
SeaLLMs/SeaExam
Viewer • Updated • 23.8k • 717 • 7 -
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Paper • 2407.19672 • Published • 56 -
SeaLLMs -- Large Language Models for Southeast Asia
Paper • 2312.00738 • Published • 26 -
ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation
Paper • 2407.19835 • Published • 21
-
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Paper • 2309.09400 • Published • 85 -
ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline
Paper • 2404.02893 • Published • 22 -
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 30 -
OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
Paper • 2404.12195 • Published • 12
-
A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese
Paper • 2304.08999 • Published • 2 -
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Paper • 2309.09400 • Published • 85 -
Robust Open-Vocabulary Translation from Visual Text Representations
Paper • 2104.08211 • Published • 1 -
Poro 34B and the Blessing of Multilinguality
Paper • 2404.01856 • Published • 15
-
AlpaGasus: Training A Better Alpaca with Fewer Data
Paper • 2307.08701 • Published • 23 -
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Paper • 2303.03915 • Published • 7 -
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Paper • 2309.04662 • Published • 23 -
SlimPajama-DC: Understanding Data Combinations for LLM Training
Paper • 2309.10818 • Published • 11