Lyte committed on
Commit 07ddffa · verified · 1 Parent(s): b072a18

Update README.md

Files changed (1)
  1. README.md +123 -27
README.md CHANGED
@@ -1,50 +1,148 @@
  ---
  base_model: unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit
- library_name: transformers
  model_name: QuadConnect2.5-1.5B-v0.1.0b
  tags:
- - generated_from_trainer
  - unsloth
  - trl
  - grpo
  licence: license
  ---

- # Model Card for QuadConnect2.5-1.5B-v0.1.0b

- This model is a fine-tuned version of [unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit](https://huggingface.co/unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit).
- It has been trained using [TRL](https://github.com/huggingface/trl).

- ## Quick start

  ```python
  from transformers import pipeline

- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
  generator = pipeline("text-generation", model="Lyte/QuadConnect2.5-1.5B-v0.1.0b", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
  print(output["generated_text"])
  ```

- ## Training procedure

-

- This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).

- ### Framework versions

- - TRL: 0.15.2
  - Transformers: 4.49.0
- - Pytorch: 2.5.1+cu121
- - Datasets: 3.3.1
  - Tokenizers: 0.21.0

- ## Citations
-
- Cite GRPO as:

  ```bibtex
  @article{zhihong2024deepseekmath,
  title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
@@ -52,18 +150,16 @@ Cite GRPO as:
  year = 2024,
  eprint = {arXiv:2402.03300},
  }
-
  ```

- Cite TRL as:
-
  ```bibtex
  @misc{vonwerra2022trl,
- title = {{TRL: Transformer Reinforcement Learning}},
- author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
- year = 2020,
- journal = {GitHub repository},
- publisher = {GitHub},
- howpublished = {\url{https://github.com/huggingface/trl}}
  }
  ```
 
  ---
  base_model: unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit
  model_name: QuadConnect2.5-1.5B-v0.1.0b
+ library_name: transformers
+ pipeline_tag: text-generation
  tags:
  - unsloth
  - trl
  - grpo
+ - connect4
+ - qwen
+ - RL
  licence: license
+ datasets:
+ - Lyte/ConnectFour-T10
+ language:
+ - en
  ---

+ # QuadConnect2.5-1.5B-v0.1.0b - A Strategic Connect Four AI
+
+ ![Connect Four Demo](https://cdn-uploads.huggingface.co/production/uploads/62f847d692950415b63c6011/QiDstnBXlVVz6dGrx3uus.png)
+
+ ## 🎮 Overview
+
+ QuadConnect2.5-1.5B is a specialized language model trained to master the game of Connect Four. Built on Qwen 2.5 (1.5B parameter base), this model uses GRPO (Group Relative Policy Optimization) to learn the strategic intricacies of Connect Four gameplay.
+
+ **Status**: Early training experiments (v0.0.9b) - Reward functions still evolving

+ ## 🔍 Model Details
+
+ - **Developed by:** [Lyte](https://hf.co/Lyte)
+ - **Model type:** Small Language Model (SLM)
+ - **Language:** English
+ - **Base model:** [unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit](https://huggingface.co/unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit)
+ - **Training method:** [TRL](https://github.com/huggingface/trl)'s GRPO
+ - **Training data:** [Lyte/ConnectFour-T10](https://huggingface.co/datasets/Lyte/ConnectFour-T10)
+
+ ## 🚀 Quick Start
+
+ ### Option 1: Using Transformers

  ```python
  from transformers import pipeline

+ SYSTEM_PROMPT = """You are a master Connect Four strategist whose goal is to win while preventing your opponent from winning. The game is played on a 6x7 grid (columns a–g, rows 1–6 with 1 at the bottom) where pieces drop to the lowest available spot.
+
+ Board:
+ - Represented as a list of occupied cells in the format: <column><row>(<piece>), e.g., 'a1(O)'.
+ - For example: 'a1(O), a2(X), b1(O)' indicates that cell a1 has an O, a2 has an X, and b1 has an O.
+ - An empty board is shown as 'Empty Board'.
+ - Win by connecting 4 pieces in any direction (horizontal, vertical, or diagonal).
+
+ Strategy:
+ 1. Identify taken and empty positions.
+ 2. Find and execute winning moves.
+ 3. If there isn't a winning move, block your opponent's potential wins.
+ 4. Control the center and set up future moves.
+
+ Respond in XML:
+ <reasoning>
+ Explain your thought process, focusing on your winning move, how you block your opponent, and your strategic plans.
+ </reasoning>
+ <move>
+ Specify the column letter (a–g) for your next move.
+ </move>
+ """
+
+ board = {
+     "empty": "Game State:\n- You are playing as: X\n- Your previous moves: \n- Opponent's moves: \n- Current board state: Empty Board\n- Next available position per column: \nColumn a: a1, a2, a3, a4, a5, a6 \nColumn b: b1, b2, b3, b4, b5, b6 \nColumn c: c1, c2, c3, c4, c5, c6 \nColumn d: d1, d2, d3, d4, d5, d6 \nColumn e: e1, e2, e3, e4, e5, e6 \nColumn f: f1, f2, f3, f4, f5, f6 \nColumn g: g1, g2, g3, g4, g5, g6\n\nMake your move.",
+     "one_move": "Game State:\n- You are playing as: X\n- Your previous moves: \n- Opponent's moves: b1\n- Current board state: b1(O)\n- Next available position per column: \nColumn a: a1, a2, a3, a4, a5, a6 \nColumn b: b2, b3, b4, b5, b6 \nColumn c: c1, c2, c3, c4, c5, c6 \nColumn d: d1, d2, d3, d4, d5, d6 \nColumn e: e1, e2, e3, e4, e5, e6 \nColumn f: f1, f2, f3, f4, f5, f6 \nColumn g: g1, g2, g3, g4, g5, g6\n\nMake your move.",
+     "four_moves": "Game State:\n- You are playing as: X\n- Your previous moves: a1, a2\n- Opponent's moves: d1, a3\n- Current board state: a1(X), d1(O), a2(X), a3(O)\n- Next available position per column: \nColumn a: a4, a5, a6 \nColumn b: b1, b2, b3, b4, b5, b6 \nColumn c: c1, c2, c3, c4, c5, c6 \nColumn d: d2, d3, d4, d5, d6 \nColumn e: e1, e2, e3, e4, e5, e6 \nColumn f: f1, f2, f3, f4, f5, f6 \nColumn g: g1, g2, g3, g4, g5, g6\n\nMake your move.",
+ }
+
  generator = pipeline("text-generation", model="Lyte/QuadConnect2.5-1.5B-v0.1.0b", device="cuda")
+
+ # pick one of the prepared states: 'empty', 'one_move', or 'four_moves'
+ output = generator([
+     {"role": "system", "content": SYSTEM_PROMPT},
+     {"role": "user", "content": board['empty']}
+ ], max_new_tokens=1024, return_full_text=False)[0]
+
  print(output["generated_text"])
  ```
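
The script above prints the raw generation; in practice you will usually want to pull the chosen column out of the XML the system prompt requests. A minimal sketch, assuming the model follows the `<reasoning>`/`<move>` format (the `extract_move` helper is illustrative, not shipped with the model):

```python
import re
from typing import Optional

def extract_move(generated_text: str) -> Optional[str]:
    """Pull the column letter (a-g) out of the model's <move>...</move> tag.

    Returns None when no well-formed move is found, so the caller can
    retry generation or fall back to a default column.
    """
    match = re.search(r"<move>\s*([a-g])\s*</move>", generated_text, re.IGNORECASE)
    return match.group(1).lower() if match else None

sample = "<reasoning>Center control is strongest on an empty board.</reasoning>\n<move>d</move>"
print(extract_move(sample))  # -> d
```

Validating the extracted letter against the columns that still have room (per the "Next available position" listing) is a sensible extra guard before applying the move.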

+ ### Option 2: Using GGUF

+ Download the [Quantized GGUF (Q8_0)](https://huggingface.co/Lyte/QuadConnect2.5-1.5B-v0.1.0b/blob/main/unsloth.Q8_0.gguf) and use it in your favorite GGUF inference engine (e.g., LM Studio).

+ ### Option 3: Using Hugging Face Space

+ Visit the [QuadConnect Space](https://huggingface.co/spaces/Lyte/QuadConnect) to interact with the model directly. You can also duplicate the Space or download its code for local use.
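
The hand-written game states in Option 1 follow a fixed template, so they can also be derived from a raw move list. A sketch under stated assumptions (the `next_available` helper and its output shape are illustrative, not part of this repo):

```python
def next_available(occupied):
    """Compute the open cells per column (a-g, rows 1-6 from the bottom),
    given occupied cells like ["a1", "d1", "a2"]. Mirrors the
    'Next available position per column' listing in the prompts
    (illustrative helper, not shipped with the model)."""
    heights = {col: 0 for col in "abcdefg"}
    for cell in occupied:
        heights[cell[0]] = max(heights[cell[0]], int(cell[1]))
    return {
        col: [f"{col}{row}" for row in range(heights[col] + 1, 7)]
        for col in "abcdefg"
    }

# Matches the 'four_moves' example: a1, a2 (X) and d1, a3 (O) are occupied.
free = next_available(["a1", "d1", "a2", "a3"])
print(free["a"])  # -> ['a4', 'a5', 'a6']
print(free["d"])  # -> ['d2', 'd3', 'd4', 'd5', 'd6']
```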

+ ## 📊 Evaluation Results

+ Model performance was evaluated on the [Lyte/ConnectFour-T10](https://huggingface.co/datasets/Lyte/ConnectFour-T10) validation split at several temperature settings.
+
+ **Note:** A temperature of 0.8 appears to work best, and reward functions for game strategy have proven harder to balance than expected.
+
+ ### Summary Metrics Comparison
+
+ | Metric | v0.0.6b (Temp 0.6) | v0.0.8b (Temp 0.6) | v0.0.9b (Temp 0.6) | v0.0.9b (Temp 0.8) | v0.0.9b (Temp 1.0) | v0.1.0b (Temp 0.8) |
+ |--------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
+ | Total games evaluated | 5082 | 5082 | 5082 | 5082 | 5082 | 5082 |
+ | Correct predictions | 518 | 394 | 516 | **713** | 677 | **809** |
+ | Accuracy | 10.19% | 7.75% | 10.15% | **14.03%** | 13.32% | **15.92%** |
+ | Most common move | d (41.14%) | d (67.61%) | a (38.72%) | a (31.01%) | a (26.99%) | e (20.89%) |
+ | Middle column usage | 75.05% | 99.53% | 29.08% | 35.43% | 39.49% | 12.92% |
+
+ ### Move Distribution by Column
+
+ | Column | v0.0.6b (Temp 0.6) | v0.0.8b (Temp 0.6) | v0.0.9b (Temp 0.6) | v0.0.9b (Temp 0.8) | v0.0.9b (Temp 1.0) | v0.1.0b (Temp 0.8) |
+ |--------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
+ | a | 603 (19.02%) | 3 (0.12%) | 1447 (38.72%) | 1547 (31.01%) | 1351 (26.99%) | 803 (16.71%) |
+ | b | 111 (3.50%) | 4 (0.16%) | 644 (17.23%) | 924 (18.52%) | 997 (19.92%) | 223 (4.64%) |
+ | c | 785 (24.76%) | 463 (17.96%) | 648 (17.34%) | 1003 (20.11%) | 985 (19.68%) | 844 (17.57%) |
+ | d | 1304 (41.14%) | 1743 (67.61%) | 101 (2.70%) | 202 (4.05%) | 306 (6.11%) | 621 (12.92%) |
+ | e | 290 (9.15%) | 360 (13.96%) | 338 (9.04%) | 562 (11.27%) | 686 (13.70%) | **1004 (20.89%)** |
+ | f | 50 (1.58%) | 3 (0.12%) | 310 (8.30%) | 408 (8.18%) | 354 (7.07%) | 459 (9.55%) |
+ | g | 27 (0.85%) | 2 (0.08%) | 249 (6.66%) | 342 (6.86%) | 327 (6.53%) | 851 (17.71%) |
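
The accuracy rows follow directly from correct predictions over total games evaluated; a quick sanity check of three of the cells above:

```python
# Accuracy = correct predictions / total games evaluated (5082 for every run).
total = 5082
runs = {
    "v0.0.6b (Temp 0.6)": 518,
    "v0.0.9b (Temp 0.8)": 713,
    "v0.1.0b (Temp 0.8)": 809,
}
for name, correct in runs.items():
    print(f"{name}: {correct / total:.2%}")
# -> 10.19%, 14.03%, 15.92%, matching the Accuracy row
```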
+
+
+ ## 🔧 Training Details
+
+ ### Data Preparation
+ 1. Started with [Leon-LLM/Connect-Four-Datasets-Collection](https://huggingface.co/datasets/Leon-LLM/Connect-Four-Datasets-Collection)
+ 2. Filtered for clean, complete entries
+ 3. Further filtered to include only games with 10 or fewer turns
+ 4. Split into train and validation sets
+ 5. Final dataset: [Lyte/ConnectFour-T10](https://huggingface.co/datasets/Lyte/ConnectFour-T10)
+
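
Step 3 above amounts to a simple length filter over each game's move sequence. A sketch of that filter (the `moves` field name and space-separated encoding are assumptions, since the actual dataset schema may differ):

```python
# Keep only games decided in 10 or fewer turns (illustrative sketch;
# the 'moves' field and its encoding are assumed, not the real schema).
games = [
    {"moves": "d1 d2 c1 e1 b1 f1 a1"},                    # 7 turns -> kept
    {"moves": " ".join(f"d{i}" for i in range(1, 26))},   # 25 turns -> dropped
]

short_games = [g for g in games if len(g["moves"].split()) <= 10]
print(len(short_games))  # -> 1
```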
+ ### Evaluation Parameters
+ - Temperature: 0.6, 0.8, 1.0 (compared)
+ - Top-p: 0.95
+ - Max tokens: 1024
+
+ ### Framework Versions
+ - TRL: 0.15.1
  - Transformers: 4.49.0
+ - PyTorch: 2.5.1+cu121
+ - Datasets: 3.2.0
  - Tokenizers: 0.21.0

+ ## 📚 Citations
 
 
+ For GRPO:
  ```bibtex
  @article{zhihong2024deepseekmath,
  title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},

  year = 2024,
  eprint = {arXiv:2402.03300},
  }
  ```

+ For TRL:
  ```bibtex
  @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
  }
  ```