---
license: apache-2.0
datasets:
- Nikity/Kyoto-Corpus
language:
- en
base_model:
- Nikity/lille-130m-base
base_model_relation: finetune
model-index:
- name: lille-130m-instruct
  results:
  - task:
      type: text-generation
    dataset:
      name: arc_challenge
      type: arc_challenge
    metrics:
    - name: ARC (Challenge)
      type: Accuracy
      value: 15.05
  - task:
      type: text-generation
    dataset:
      name: arc_easy
      type: arc_easy
    metrics:
    - name: ARC (Easy)
      type: Accuracy
      value: 21.4
  - task:
      type: text-generation
    dataset:
      name: gpqa
      type: gpqa
    metrics:
    - name: GPQA
      type: Accuracy
      value: 12.73
  - task:
      type: text-generation
    dataset:
      name: gsm8k
      type: gsm8k
    metrics:
    - name: GSM8K
      type: Accuracy
      value: 7.73
  - task:
      type: text-generation
    dataset:
      name: ifeval
      type: ifeval
    metrics:
    - name: IFEVAL
      type: Accuracy
      value: 9.01
  - task:
      type: text-generation
    dataset:
      name: math
      type: math
    metrics:
    - name: MATH (Level 5)
      type: Accuracy
      value: 1.91
  - task:
      type: text-generation
    dataset:
      name: mmlu
      type: mmlu
    metrics:
    - name: MMLU
      type: Accuracy
      value: 22.76
  - task:
      type: text-generation
    dataset:
      name: mt_bench
      type: mt_bench
    metrics:
    - name: MT-Bench
      type: Accuracy
      value: 8.2
  - task:
      type: text-generation
    dataset:
      name: truthful_qa
      type: truthful_qa
    metrics:
    - name: TruthfulQA
      type: Accuracy
      value: 9.06
---

# Lille 130M Instruct

![Lille-Header](assets/lille-header.png)

> **You are currently viewing the `lille-130m-instruct` model card.**
>
> View the base model here: **[Nikity/lille-130m-base](https://huggingface.co/Nikity/lille-130m-base)**

## Table of Contents
1. [Model Summary](#-model-summary)
2. [Evaluation](#-evaluation)
3. [How to Use](#-how-to-use)
4. [Training and Finetuning](#-training-and-finetuning)
5. [Training Details](#-training-details)
6. [Limitations](#limitations)
7. [The Truly Open-Source Repos](#-the-truly-open-source-repos)
8. [License](#-license)
9. [Citation](#citation)

## ✨ Model Summary

**Lille** is a 130-million-parameter language model built from the ground up as a core component of a completely open-source deep learning stack. The name Lille reflects both its compact size and strong capabilities - capturing the idea that less can be more. It draws on the Norwegian word lille ('small' or 'little') as well as the French city Lille, giving it both meaning and place. It was trained using a custom tokenizer, a curated dataset, and a memory-efficient optimizer, all of which are publicly available.

The model comes in two versions:
*   **`Lille-130M-Base`**: The foundational model, pretrained on 4.27 billion tokens from the [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset, with a post-processing step that keeps only the highest-quality content. It has strong general knowledge and text completion abilities.
*   **`Lille-130M-Instruct`**: The instruction-tuned version, fine-tuned on the **[Kyoto-Corpus](https://huggingface.co/datasets/Nikity/Kyoto-Corpus)**. It excels at following user commands, engaging in chat, and performing a variety of instruction-based tasks.

The model architecture is a modern Transformer decoder featuring Grouped-Query Attention (GQA), RoPE, and RMSNorm, making it efficient and performant for its size.

*Note on parameter count: While the model name is `130M` for simplicity, the actual parameter count is 127.17 million.*

## 📊 Evaluation

All evaluations were conducted using **[simple-eval](https://github.com/Nikityyy/simple-eval)**, our open-source evaluation framework. Benchmarks are run in a zero-shot setting unless specified otherwise.

#### `Lille-130M-Instruct`

![Evaluations](assets/evaluations.png)

> Evaluations for other LLMs are sourced from the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) or from their respective model cards when benchmark data is unavailable. For Lille 130M Instruct, evaluations are performed with [simple-eval](https://github.com/Nikityyy/simple-eval); ARC-C and ARC-E for SmolLM2 are also evaluated with [simple-eval](https://github.com/Nikityyy/simple-eval).

## 🚀 How to Use

There are several ways to use the Lille models, from easy-to-use graphical interfaces to advanced programmatic control.

### 1. LM Studio (Easiest for Chat)

LM Studio provides a simple graphical interface to run LLMs on your local machine. It's the easiest way to start chatting with Lille.

1.  **Download & Install:** Get [LM Studio](https://lmstudio.ai/) for your operating system (Windows, Mac, or Linux).
2.  **Search for the Model:** Open LM Studio and click the **magnifying glass** icon on the left.
3.  **Find Lille:** In the search bar, type `Lille` or `Nikity`. You will find the models I have uploaded.
4.  **Download a GGUF:** On the right-hand side, you'll see a list of GGUF files. Download a recommended version like `lille-130m-instruct-f16.gguf`.
5.  **Chat:** Click the **speech bubble** icon on the left. At the top, select the model you just downloaded. Now you can start a conversation!

### 2. SimpleAI SDK (Recommended for Programmatic Use)

The easiest way to use Lille programmatically is with the `simpleai-sdk`, which handles all the boilerplate for you and provides a simple, high-level API for both Hugging Face and ONNX backends.

```bash
pip install simpleai-sdk
```

```python
from simple_ai import lille

# This will download and cache the model on first run.
# Specify the model version: "130m-instruct" (default) or "130m-base"
# Specify the backend: "huggingface" (default) or "onnx"
model = lille("huggingface", "130m-instruct") 

# --- For Chat (with instruct model) ---
print("--- Chat Example ---")
response1 = model.chat("What is the capital of France?", max_new_tokens=50)
print(f"Bot: {response1}")

response2 = model.chat("And what is its population?", max_new_tokens=50, top_p=0.90)
print(f"Bot: {response2}")

# This resets the chat history
model.reset_chat()

# --- For Text Completion (with base or instruct model) ---
prompt = "Artificial Intelligence is"
response = model.generate(prompt, max_new_tokens=50, temperature=0.9)
print(f"\n--- Completion Example ---\n{prompt}{response}")
```

### 3. Standard Hugging Face Transformers (currently also requires `simpleai-sdk`)

You can also use the model directly with the `transformers` library for more advanced use cases.

```bash
pip install transformers torch simpleai-sdk
```

```python
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from simple_ai.model_hf import LilleConfig, LilleForCausalLM

# 1. Register the custom model architecture with Hugging Face
AutoConfig.register("lille-130m", LilleConfig)
AutoModelForCausalLM.register(LilleConfig, LilleForCausalLM)

# 2. Define constants and setup device
MODEL = "Nikity/lille-130m-instruct"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# 3. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype="auto",
    device_map=DEVICE,
)

# 4. Prepare chat prompt and tokenize it
chat = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    chat,
    add_generation_prompt=True,
    return_tensors="pt"
).to(DEVICE)

# 5. Generate a response
with torch.inference_mode():
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
    )

# 6. Decode and print the response
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
```

## 🚀 Training and Finetuning

You can replicate the pretraining of `Lille-130M-Base` or fine-tune it on your own dataset using the provided scripts.

#### 1. Setup

First, clone the repository and install the required dependencies:

```bash
git clone https://github.com/Nikityyy/lille
cd lille
pip install -r requirements.txt
```

**Note on the Optimizer:** The default `Sophia-Triton` optimizer requires the [Triton](https://triton-lang.org/main/getting-started/installation.html) library. Triton is officially supported on Linux with NVIDIA GPUs. While experimental installation on Windows is possible, it can be a complex and difficult process. For a much simpler setup on **Windows and macOS**, or if you prefer not to install Triton, it is highly recommended to use a pure PyTorch implementation of Sophia instead:

1.  Replace the contents of the `sophia_triton.py` file with the code from [this link](https://github.com/Liuhong99/Sophia/blob/main/sophia.py).
2.  The `train.py` script should work without any import changes, as the class name `SophiaG` is the same.
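
For reference, here is a minimal sketch of how the swapped-in optimizer might be constructed inside `train.py`; the hyperparameter values and the `model` variable are illustrative assumptions, not the repository's defaults:

```python
# Minimal sketch, assuming sophia_triton.py now holds the pure PyTorch SophiaG from
# https://github.com/Liuhong99/Sophia (same class name, so no import changes are needed).
from sophia_triton import SophiaG

optimizer = SophiaG(
    model.parameters(),   # `model` is assumed to be the network built in train.py
    lr=1e-4,              # placeholder; use the learning rate configured at the top of train.py
    betas=(0.965, 0.99),
    rho=0.04,             # Sophia's clipping threshold
    weight_decay=0.1,
)
```

The linked implementation also exposes an `update_hessian()` method for Sophia's periodic Hessian estimate; assuming the interfaces match, as the note above implies, `train.py` should drive it without further changes.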

#### 2. Data Preparation

The training script expects data in a specific `.npz` format containing tokenized documents and their offsets.
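
As a rough illustration of that layout (the array names `tokens` and `offsets` are assumptions here; the real keys and dtypes are whatever the preparation scripts write), a flat token array plus per-document start offsets might look like this:

```python
import numpy as np

# Hypothetical layout: all documents concatenated into one token-id array,
# with offsets[i] marking where document i starts.
tokens = np.array([12, 845, 33, 7, 901, 4, 88, 2], dtype=np.uint16)
offsets = np.array([0, 3, 5], dtype=np.int64)

np.savez("train.npz", tokens=tokens, offsets=offsets)

data = np.load("train.npz")
first_doc = data["tokens"][data["offsets"][0]:data["offsets"][1]]
print(first_doc)  # [ 12 845  33]
```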

**For Pretraining (like FineWeb-Edu):**

Use the `prepare_dataset_fineweb.py` script. It will stream the dataset from Hugging Face, apply filters, tokenize the text, and save it in the required format.

```bash
python prepare_dataset_fineweb.py
```
This will create `data/fineweb_edu_sample_10BT/train.npz` and `val.npz`.

**For Finetuning (Instruction Datasets):**

Use the `prepare_dataset.py` script. Your input data should be a single `.txt` file where each example is separated by the `<|endoftext|>` token.

1.  Place your data file, for example, at `data/my_dataset/train.txt`.
2.  Modify the `input_file_path` and `output_dir` variables in `prepare_dataset.py`.
3.  Run the script:

```bash
python prepare_dataset.py
```
This will create `train.npz` and `val.npz` in your specified output directory.
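
If you need to assemble such a `.txt` file yourself, here is a minimal sketch; the example strings and their formatting are placeholders (format them however your chat template expects), while the `<|endoftext|>` separator and the `data/my_dataset/train.txt` location come from the steps above:

```python
from pathlib import Path

# Placeholder examples; the actual conversation formatting is up to you.
examples = [
    "User: What is the capital of France?\nAssistant: Paris.",
    "User: Name a prime number below 10.\nAssistant: 7.",
]

out_path = Path("data/my_dataset/train.txt")
out_path.parent.mkdir(parents=True, exist_ok=True)

# prepare_dataset.py expects each example to be separated by the <|endoftext|> token.
out_path.write_text("<|endoftext|>".join(examples), encoding="utf-8")
```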

#### 3. Running the Training Script

All training logic is handled by `train.py`. You can configure hyperparameters directly at the top of this file.

**To Pretrain from Scratch:**

1.  Ensure you have prepared a pretraining dataset.
2.  In `train.py`, set `finetune = False`.
3.  Configure pretraining parameters like `data_dir`, `batch_size`, etc.
4.  Run the script:

```bash
python train.py
```

**To Fine-tune a Pretrained Model:**

1.  Ensure you have prepared a fine-tuning dataset.
2.  In `train.py`, set `finetune = True`.
3.  Set `resume_checkpoint` to the path of the pretrained model checkpoint (e.g., `checkpoints/best_model.pt`).
4.  Configure fine-tuning parameters like `finetune_data_dir` and `finetune_learning_rate`.
5.  Run the script:

```bash
python train.py
```

Checkpoints will be saved in the directory specified by `out_dir` (for pretraining) or `finetune_out_dir` (for fine-tuning). The best model based on validation loss will be saved as `best_model.pt`.
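
For orientation, here is a hedged sketch of what the configuration block at the top of `train.py` might look like. The variable names are the ones referenced in the steps above; the values are placeholders, not the repository's defaults:

```python
# Illustrative configuration block; names follow the steps above, values are placeholders.

# -- pretraining --
finetune = False
data_dir = "data/fineweb_edu_sample_10BT"
out_dir = "checkpoints"
batch_size = 32

# -- fine-tuning (used when finetune = True) --
resume_checkpoint = "checkpoints/best_model.pt"
finetune_data_dir = "data/my_dataset"
finetune_out_dir = "checkpoints-finetune"
finetune_learning_rate = 3e-5
```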

## 🛠️ Training Details

### Pretraining (`Lille-130M-Base`)
*   **Dataset:** Pretrained on **4.27 billion tokens** from the `sample-10BT` configuration of the [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset.
*   **Tokenizer:** The custom **[Hastings](https://github.com/Nikityyy/Hastings)** tokenizer with a vocabulary size of 32,768.
*   **Optimizer:** The memory-efficient **[Sophia-Triton](https://github.com/Nikityyy/Sophia-Triton)** optimizer.
*   **Hardware:** Trained on a single NVIDIA RTX 4070 Ti.
*   **Precision:** bfloat16.

### Instruction Tuning (`Lille-130M-Instruct`)
*   **Dataset:** Supervised Fine-Tuning (SFT) was performed on the **[Kyoto-Corpus](https://github.com/Nikityyy/Kyoto-Corpus)**, a high-quality, curated collection of conversational and instructional data.

### Model Architecture
*   **Type:** Transformer Decoder
*   **Layers:** 24
*   **Embedding Size:** 640
*   **Attention Heads:** 10
*   **KV Heads (GQA):** 2
*   **Context Length:** 512 tokens
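
To make the GQA configuration above concrete, here is a minimal, self-contained sketch of how 10 query heads can share 2 key/value heads at a hidden size of 640. It illustrates only the head bookkeeping (via `repeat_interleave`) and is not the repository's attention implementation, which also applies RoPE and RMSNorm:

```python
import torch
import torch.nn.functional as F

# Shapes from the table above: 640-dim embeddings, 10 query heads, 2 KV heads.
embed_dim, n_heads, n_kv_heads, seq_len = 640, 10, 2, 16
head_dim = embed_dim // n_heads   # 64
groups = n_heads // n_kv_heads    # 5 query heads share each KV head

x = torch.randn(1, seq_len, embed_dim)

# Separate projections: full-width queries, narrower keys/values (the GQA saving).
q_proj = torch.nn.Linear(embed_dim, n_heads * head_dim, bias=False)
k_proj = torch.nn.Linear(embed_dim, n_kv_heads * head_dim, bias=False)
v_proj = torch.nn.Linear(embed_dim, n_kv_heads * head_dim, bias=False)

q = q_proj(x).view(1, seq_len, n_heads, head_dim).transpose(1, 2)     # (1, 10, T, 64)
k = k_proj(x).view(1, seq_len, n_kv_heads, head_dim).transpose(1, 2)  # (1, 2, T, 64)
v = v_proj(x).view(1, seq_len, n_kv_heads, head_dim).transpose(1, 2)

# Repeat each KV head for its group of query heads, then attend causally.
k = k.repeat_interleave(groups, dim=1)                                # (1, 10, T, 64)
v = v.repeat_interleave(groups, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 10, 16, 64])
```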

## Limitations

Lille models primarily understand and generate content in English. While powerful for their size, they can produce text that may not always be factually accurate, logically consistent, or free from biases present in the training data. These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.

## 🛠️ The Truly Open-Source Repos

Lille is a key component of my initiative to build and release a complete, truly open-source stack for language modeling. All components are designed to work together seamlessly.

*   **Tokenizer:** **[Hastings](https://github.com/Nikityyy/Hastings)** - A modern, efficient tokenizer with a 32k vocabulary.
*   **Dataset:** **[Kyoto-Corpus](https://github.com/Nikityyy/Kyoto-Corpus)** - A high-quality, small-scale dataset for instruction tuning.
*   **Model:** **[lille](https://github.com/Nikityyy/lille)** (this model) - A powerful 130-million-parameter model trained from scratch.
*   **Optimizer:** **[Sophia-Triton](https://github.com/Nikityyy/Sophia-Triton)** - A memory-efficient, Triton-based implementation of the SophiaG optimizer.
*   **Evaluations:** **[simple-eval](https://github.com/Nikityyy/simple-eval)** - A straightforward framework for evaluating model performance using an LLM as a Judge.

## 📜 License

This project is licensed under the Apache-2.0 License.

## Citation

If you use Lille or any part of this open-source stack in your work, please consider citing it:

```bibtex
@misc{lille-130m,
  author       = {Nikita Berger},
  title        = {Lille: A Truly Open-Source 130M Language Model},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/Nikityyy/lille}}
}
```