Pre-v4 readme and support files update

Files changed (3)

README.md +27 -8
src/data/generate_clean_corpus.sh +1 -0
src/data/generate_cyr_lat_pairs.py +29 -1

README.md

@@ -82,7 +82,7 @@ This project:
 ---
-## 💻 Try it out
 Құшақтап тұрған бет арқылы тікелей пайдаланыңыз 🤗 Трансформерлер / Use directly via Hugging Face 🤗 Transformers:
@@ -100,26 +100,45 @@ print(output)
 ---
-## 📊 Data sources
-DalaT5 екі өте маңызды деректер жиынын пайдаланады / DalaT5 makes use of two very important datasets:
-- The first ~1.5 million records of the Kazakh subset of the CC100 dataset by [Conneau et al.](https://paperswithcode.com/paper/unsupervised-cross-lingual-representation-1)
-- The Wikipedia dump of articles in the Kazakh language
 Деректер жиынының жалпы өлшемін ескере отырып, олар осы үлгінің репозиторийіне қосылмаған. Дегенмен, DalaT5-ті өзіңіз дәл баптағыңыз келсе, келесі әрекеттерді орындаңыз / Given the total size of the datasets, they haven't been included in this model's repository. However, should you wish to fine-tune DalaT5 yourself, please do the following:
 1. `get_data.sh` қабық сценарий файлын «src/data» қалтасында іске қосыңыз / Run the `get_data.sh` shell script file in the "src/data" folder
 2. Сол қалтадағы `generate_cyr_lat_pairs.py` файлын іске қосыңыз / Run the `generate_cyr_lat_pairs.py` file in the same folder
-3. Қазақ корпус файлын тазалау үшін `generate_clean_corpus.sh` іске қосыңыз / Run `generate_clean_corpus.sh` to clean the Kazakh corpus file
 Егер сіз Windows жүйесінде болсаңыз, `get_data.sh` сценарийі жұмыс істемеуі мүмкін. Дегенмен, файлдағы сілтемелерді орындап, ондағы қадамдарды қолмен орындау арқылы әлі де деректерді алуға болады. Сол сияқты, `generate_clean_corpus.sh` файлында да қате пайда болады, бұл сізге `kazakh_latin_corpus.jsonl` файлындағы бос немесе бос жолдарды сүзу үшін баламалы Windows функциясын табуды талап етеді. Оған қоса, `wikiextractor` бумасын алдын ала орнатқаныңызға сенімді болыңыз (нақты пайдаланылған нұсқаны `requirements.txt` файлынан табуға болады) / If you're on Windows, the `get_data.sh` script likely won't work. However, you can still get the data by following the links in the file and performing its steps manually. Likewise, `generate_clean_corpus.sh` will error out, so you'll need an equivalent Windows tool to filter blank or empty lines out of the `kazakh_latin_corpus.jsonl` file. Additionally, be sure to install the `wikiextractor` package beforehand (the exact version used can be found in the `requirements.txt` file).
 ---
-## 📚 Credits
-Егер сіз DalaT5-ті туынды жұмыстарды зерттеуде қолдансаңыз, мыналарды келтіріңіз / If you use DalaT5 in research of derivative works, feel free to cite:
 ```
 @misc{crossroderick_dalat5_2025,

 ---
+## 💻 Байқап көріңіз / Try it out
 Құшақтап тұрған бет арқылы тікелей пайдаланыңыз 🤗 Трансформерлер / Use directly via Hugging Face 🤗 Transformers:
 ---
+## 🙏 Алғыс / Acknowledgements
+Тәуелсіз жоба болғанына қарамастан, DalaT5 өте маңызды үш деректер жиынтығын пайдаланады / Despite being an independent project, DalaT5 makes use of three very important datasets:
+- The first ~1.5 million records of the Kazakh subset of the CC100 dataset by [Conneau et al. (2020)](https://paperswithcode.com/paper/unsupervised-cross-lingual-representation-1)
+- The raw, Kazakh-focused part of the [Kazakh Parallel Corpus (KazParC)](https://huggingface.co/datasets/issai/kazparc) from Nazarbayev University's Institute of Smart Systems and Artificial Intelligence (ISSAI), graciously made available on Hugging Face
+- The Wikipedia dump of articles in the Kazakh language, obtained via the `wikiextractor` Python package
+---
+## 📊 Нақты баптау нұсқаулары / Fine-tuning instructions
 Деректер жиынының жалпы өлшемін ескере отырып, олар осы үлгінің репозиторийіне қосылмаған. Дегенмен, DalaT5-ті өзіңіз дәл баптағыңыз келсе, келесі әрекеттерді орындаңыз / Given the total size of the datasets, they haven't been included in this model's repository. However, should you wish to fine-tune DalaT5 yourself, please do the following:
 1. `get_data.sh` қабық сценарий файлын «src/data» қалтасында іске қосыңыз / Run the `get_data.sh` shell script file in the "src/data" folder
 2. Сол қалтадағы `generate_cyr_lat_pairs.py` файлын іске қосыңыз / Run the `generate_cyr_lat_pairs.py` file in the same folder
+3. Қазақ корпус файлын тазалау және деректер жинағын араластыру үшін `generate_clean_corpus.sh` іске қосыңыз / Run `generate_clean_corpus.sh` to clean the Kazakh corpus file and shuffle the dataset
+KazParC деректер жинағын жүктеп алу үшін сізге Hugging Face есептік жазбасы қажет екенін ескеріңіз. Бұған қоса, жүктеп алуды бастау үшін өзіңізді аутентификациялау үшін «huggingface-cli» орнатуыңыз қажет. Бұл туралы толығырақ [мына жерден](https://huggingface.co/docs/huggingface_hub/en/guides/cli) оқыңыз. / Please note that you'll need a Hugging Face account to download the KazParC dataset. Additionally, you'll need to install `huggingface-cli` to authenticate yourself for the download to commence. Read more about it [here](https://huggingface.co/docs/huggingface_hub/en/guides/cli).
 Егер сіз Windows жүйесінде болсаңыз, `get_data.sh` сценарийі жұмыс істемеуі мүмкін. Дегенмен, файлдағы сілтемелерді орындап, ондағы қадамдарды қолмен орындау арқылы әлі де деректерді алуға болады. Сол сияқты, `generate_clean_corpus.sh` файлында да қате пайда болады, бұл сізге `kazakh_latin_corpus.jsonl` файлындағы бос немесе бос жолдарды сүзу үшін баламалы Windows функциясын табуды талап етеді. Оған қоса, `wikiextractor` бумасын алдын ала орнатқаныңызға сенімді болыңыз (нақты пайдаланылған нұсқаны `requirements.txt` файлынан табуға болады) / If you're on Windows, the `get_data.sh` script likely won't work. However, you can still get the data by following the links in the file and performing its steps manually. Likewise, `generate_clean_corpus.sh` will error out, so you'll need an equivalent Windows tool to filter blank or empty lines out of the `kazakh_latin_corpus.jsonl` file. Additionally, be sure to install the `wikiextractor` package beforehand (the exact version used can be found in the `requirements.txt` file).
 ---
+## 📋 Өзгеріс журналы / Changelog
+- **DalaT5 v1:** 13 сәуірде дәл реттелген, 13 сәуірде қолжетімді болды. Жаттығу үшін ~38 мың деректер жазбасы пайдаланылды. Дисперсиясы жоғары және үлгі сенімділігі төмен бастапқы нұсқа / Fine-tuned on April 13, made available on April 13. Used ~38k data records for training. Initial version with high variance and low model confidence
+- **DalaT5 v2:** 18 сәуірде дәл реттелген, 18 сәуірде қолжетімді болды. Жаттығу үшін ~1 миллион деректер жазбасы пайдаланылды. Деректердің көп болуының арқасында әлдеқайда жақсы өнімділікті көрсеткен екінші итерация / Fine-tuned on April 18, made available on April 18. Used ~1 million data records for training. Second iteration that exhibited much better performance owing to more data availability
+- **DalaT5 v3**: 20 сәуірде дәл реттелген, 20 сәуірде қолжетімді болды. Жаттығу үшін ~1,6 миллион деректер жазбасы пайдаланылды. Үшінші итерация одан әрі жақсартуларды, сондай-ақ белгілі бір дәрежеде семантикалық түсінуді көрсетті / Fine-tuned on April 20, made available on April 20. Used ~1.6 million data records for training. Third iteration that showed further improvements, as well as some degree of semantic understanding
+- **DalaT5 v4**: Нақты баптау 23 сәуірде басталады және сол күні қолжетімді болады. ~1,8 миллион жазбаны пайдалануға орнату (Wikipedia dump + CC100 + KazParC) / Fine-tuning to commence on April 23 and will be made available on the same day. Set to use ~1.8 million records (Wikipedia dump + CC100 + KazParC)
+---
+## 📚 Несиелер / Credits
+Егер сіз DalaT5-ті туынды жұмыстарды зерттеуде қолдансаңыз - біріншіден, рахмет. Екіншіден, егер сіз қаласаңыз, дәйексөз келтіріңіз / If you use DalaT5 in research or derivative works - first off, thank you. Secondly, should you be willing, feel free to cite:
 ```
 @misc{crossroderick_dalat5_2025,

src/data/generate_clean_corpus.sh

@@ -1 +1,2 @@
+shuf kazakh_latin_corpus.jsonl -o kazakh_latin_corpus.jsonl
 grep '\S' kazakh_latin_corpus.jsonl > clean_corpus.jsonl
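
For Windows users without `shuf` and `grep`, the two commands in this script can be approximated in Python. This is a minimal sketch, not part of the commit: the function name `shuffle_and_clean` and its optional `seed` parameter are hypothetical, and only the two file names are taken from the script. Unlike the shell version, it leaves the source file untouched instead of shuffling it in place, and it loads the whole corpus into memory.

```python
import random
from pathlib import Path


def shuffle_and_clean(src_path, dst_path, seed=None):
    """Shuffle the lines of src_path and write the non-blank ones to dst_path.

    Hypothetical helper; returns the number of lines written.
    """
    with open(src_path, encoding="utf-8") as in_file:
        lines = [line.rstrip("\n") for line in in_file]

    # Rough equivalent of `shuf`: reorder the lines randomly
    random.Random(seed).shuffle(lines)

    # Rough equivalent of `grep '\S'`: keep lines that contain
    # at least one non-whitespace character
    kept = [line for line in lines if line.strip()]

    # newline="\n" keeps Unix line endings even on Windows
    with open(dst_path, "w", encoding="utf-8", newline="\n") as out_file:
        for line in kept:
            out_file.write(line + "\n")
    return len(kept)


if __name__ == "__main__":
    source = Path("kazakh_latin_corpus.jsonl")
    if source.exists():
        written = shuffle_and_clean(source, "clean_corpus.jsonl")
        print(f"Wrote {written} lines to clean_corpus.jsonl")
```

Passing a fixed `seed` makes the shuffle reproducible, which the `shuf` call in the script does not attempt.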

src/data/generate_cyr_lat_pairs.py

@@ -2,6 +2,7 @@ import os
 import json
 from tqdm import tqdm
 from itertools import islice
 # Kazakh Cyrillic character to the Kazakh Latin character mapping from 2021 onwards
@@ -127,4 +128,31 @@ with open(output_path, 'w', encoding = "utf-8") as out_file:
                 except Exception as e:
                     tqdm.write(f"Skipping due to: {e}")
-                    continue

 import json
 from tqdm import tqdm
 from itertools import islice
+from datasets import load_dataset
 # Kazakh Cyrillic character to the Kazakh Latin character mapping from 2021 onwards
                 except Exception as e:
                     tqdm.write(f"Skipping due to: {e}")
+                    continue
+    # Third step: process the raw, Kazakh-centred part of the "KazParC" dataset
+    print("Loading 'KazParC' dataset...")
+    kazparc = load_dataset("issai/kazparc", "kazparc_raw", split = "train")
+    with open(output_path, 'a', encoding = "utf-8") as out_file:
+        for entry in tqdm(kazparc, desc = "Entries in KazParC"):
+            try:
+                if "kk" in entry and isinstance(entry["kk"], str):
+                    cyr_text = entry["kk"].strip()
+                    lat_text = convert_to_latin(cyr_text).strip()
+                    if cyr_text and lat_text:
+                        obj = {
+                            "transliteration": {
+                                "src": cyr_text,
+                                "tgt": lat_text
+                            }
+                        }
+                        out_file.write(json.dumps(obj, ensure_ascii = False) + "\n")
+            except Exception as e:
+                tqdm.write(f"Skipping due to: {e}")
+                continue
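
The record-writing logic added in this hunk can be tried without downloading KazParC. The sketch below mirrors the filtering and the JSON layout from the diff, but the function `write_transliteration_pairs`, the stand-in converter `toy_convert`, and the sample entries are all hypothetical and exist only for illustration. The script's real `convert_to_latin` is not reproduced here.

```python
import io
import json


def write_transliteration_pairs(entries, out_file, convert):
    """Write one JSON line per usable entry; return how many were written.

    Hypothetical helper that mirrors the loop in the diff above.
    """
    written = 0
    for entry in entries:
        text = entry.get("kk") if isinstance(entry, dict) else None
        if not isinstance(text, str):
            continue  # skip entries without a string "kk" field
        cyr_text = text.strip()
        lat_text = convert(cyr_text).strip()
        if cyr_text and lat_text:
            obj = {"transliteration": {"src": cyr_text, "tgt": lat_text}}
            out_file.write(json.dumps(obj, ensure_ascii=False) + "\n")
            written += 1
    return written


def toy_convert(text):
    """Hypothetical stand-in for convert_to_latin; maps only two letters."""
    return text.replace("а", "a").replace("т", "t")


# In-memory buffer instead of the real output file
buffer = io.StringIO()
sample = [{"kk": " ата "}, {"kk": ""}, {"en": "father"}, {"kk": None}]
count = write_transliteration_pairs(sample, buffer, toy_convert)
print(count)
print(buffer.getvalue(), end="")
```

Taking the file handle as a parameter is a design choice of this sketch: the caller decides whether to pass an already open handle or to open the output file in append mode, as the diff does.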