---
license: apache-2.0
language:
- hu
- en
- zh
tags:
- puli
---

# PULI Trio Q 7B base (7.62 billion parameters)


  - Trained with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
  - The [Qwen2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) model was continually pretrained on a Hungarian dataset

## Dataset for continued pretraining

- Hungarian (8.08 billion words): 763K documents, each longer than 5,000 words, plus Hungarian Wikipedia
- English: Long Context QA (2 billion words) and BookSum (78 million words)
- Chinese (3 billion Chinese characters): Wudao

- Training was completed on a Hungarian-only dataset:
  - 626 million Hungarian words (**1 epoch**): Hungarian Wikipedia and news articles

## Limitations

- max_seq_length = 32768
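
The sequence-length limit above can be turned into a simple pre-flight check before generation. The following is a minimal sketch, not part of the model's tooling: the helper name `max_new_tokens_budget` is hypothetical, and it assumes that the 32,768-token limit covers the prompt and the generated tokens together.

```python
# Limit stated in this model card.
MAX_SEQ_LENGTH = 32_768


def max_new_tokens_budget(
    prompt_token_count: int,
    requested_new_tokens: int,
    max_seq_length: int = MAX_SEQ_LENGTH,
) -> int:
    """Return how many new tokens can be generated without exceeding the limit.

    Hypothetical helper. Assumes the limit applies to prompt tokens plus
    generated tokens combined.
    """
    if prompt_token_count < 0 or requested_new_tokens < 0:
        raise ValueError("token counts must be non-negative")

    remaining = max_seq_length - prompt_token_count
    if remaining <= 0:
        raise ValueError("the prompt alone fills or exceeds the sequence limit")

    # Cap the request at whatever room is left in the window.
    return min(requested_new_tokens, remaining)
```

The prompt token count can be obtained from the tokenizer loaded in the usage section, for example `len(tokenizer(prompt)["input_ids"])`, and the returned value passed as `max_new_tokens`.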


## Usage with pipeline

```python
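from transformers import pipeline, Qwen2ForCausalLM, AutoTokenizer

# Load the model weights and the matching tokenizer from the Hugging Face Hub.
model = Qwen2ForCausalLM.from_pretrained("NYTK/PULI-Trio-Q")
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-Trio-Q")

# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."

# device=0 selects the first GPU; use device=-1 to run on CPU.
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device=0)

# Generate up to 30 new tokens and print the prompt plus its continuation.
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])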
```

## Citation
If you use this model, please cite the following paper:

```
@inproceedings{yang-llumix-llama,
    title = {PULI Chat: Our First Hungarian Conversational Model},
    booktitle = {International Conference on Formal Methods and Foundations of Artificial Intelligence},
    year = {2025},
    publisher = {Eszterházy Károly Catholic University},
    address = {Eger, Hungary},
    author = {Yang, Zijian Győző and Bánfi, Ágnes and Dodé, Réka and Ferenczi, Gergő and Földesi, Flóra and Hatvani, Péter and Héja, Enikő and Lengyel, Mariann and Madarász, Gábor and Osváth, Mátyás and Sárossy, Bence and Varga, Kristóf and Váradi, Tamás and Prószéky, Gábor and Ligeti-Nagy, Noémi},
    pages = {1--3},
    pubstate = {accepted abstract},
    url = {https://uni-eszterhazy.hu/api/media/file/7f9158bd443acc29dbd2a211971fe8677768257c}
}
```