zaiffi committed on
Commit 2480fdf · verified · 1 Parent(s): d22616b

Upload 7 files

Files changed (7)
  1. LICENSE +201 -0
  2. all_texts.txt +0 -0
  3. model_weights.pth +3 -0
  4. poetry_generation.ipynb +736 -0
  5. requirements.txt +2 -0
  6. urdu_sp.model +3 -0
  7. urdu_sp.vocab +0 -0
LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
all_texts.txt ADDED
The diff for this file is too large to render. See raw diff
 
model_weights.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:97172d5f7c44fa2488ade1fa4cc373ad800f7d2a779d732a361b3688a78769b4
+ size 243207232
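The three lines added for model_weights.pth are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. The sketch below shows one way to check a downloaded copy against such a pointer. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of this commit, and the check assumes the pointer uses a hash algorithm that `hashlib` knows by the same name (true for `sha256`).

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(pointer_text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in pointer_text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid field has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the size and hash in the pointer."""
    path = Path(path)
    # Cheap check first: a size mismatch means the hash cannot match either.
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as fh:
        # Read in chunks so a ~243 MB file is never held in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer contents copied from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:97172d5f7c44fa2488ade1fa4cc373ad800f7d2a779d732a361b3688a78769b4
size 243207232
"""

pointer = parse_lfs_pointer(POINTER)
```

Once the real file has been fetched, `matches_pointer("model_weights.pth", pointer)` should return `True` for an intact download. A `False` result usually means the download was truncated or the local file is still the pointer text.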
poetry_generation.ipynb ADDED
@@ -0,0 +1,736 @@
1
+ {
2
+ "nbformat": 4,
3
+ "nbformat_minor": 0,
4
+ "metadata": {
5
+ "colab": {
6
+ "provenance": [],
7
+ "gpuType": "T4"
8
+ },
9
+ "kernelspec": {
10
+ "name": "python3",
11
+ "display_name": "Python 3"
12
+ },
13
+ "language_info": {
14
+ "name": "python"
15
+ },
16
+ "accelerator": "GPU"
17
+ },
18
+ "cells": [
19
+ {
20
+ "cell_type": "code",
21
+ "source": [],
22
+ "metadata": {
23
+ "id": "I9Z5guQ6CDt8"
24
+ },
25
+ "execution_count": null,
26
+ "outputs": []
27
+ },
28
+ {
29
+ "cell_type": "code",
30
+ "source": [],
31
+ "metadata": {
32
+ "id": "PGeicEbzCDw9"
33
+ },
34
+ "execution_count": null,
35
+ "outputs": []
36
+ },
37
+ {
38
+ "cell_type": "code",
39
+ "source": [
40
+ "pip install sentencepiece torch torchvision torchaudio pandas scikit-learn\n"
41
+ ],
42
+ "metadata": {
43
+ "colab": {
44
+ "base_uri": "https://localhost:8080/"
45
+ },
46
+ "id": "zQFdKxIICD0H",
47
+ "outputId": "5d35d6a1-a876-4c7f-fee8-4f04888f3854"
48
+ },
49
+ "execution_count": 1,
50
+ "outputs": [
51
+ {
52
+ "output_type": "stream",
53
+ "name": "stdout",
54
+ "text": [
55
+ "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.11/dist-packages (0.2.0)\n",
56
+ "Requirement already satisfied: torch in /usr/local/lib/python3.11/dist-packages (2.5.1+cu124)\n",
57
+ "Requirement already satisfied: torchvision in /usr/local/lib/python3.11/dist-packages (0.20.1+cu124)\n",
58
+ "Requirement already satisfied: torchaudio in /usr/local/lib/python3.11/dist-packages (2.5.1+cu124)\n",
59
+ "Requirement already satisfied: pandas in /usr/local/lib/python3.11/dist-packages (2.2.2)\n",
60
+ "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.11/dist-packages (1.6.1)\n",
61
+ "Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from torch) (3.17.0)\n",
62
+ "Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.11/dist-packages (from torch) (4.12.2)\n",
63
+ "Requirement already satisfied: networkx in /usr/local/lib/python3.11/dist-packages (from torch) (3.4.2)\n",
64
+ "Requirement already satisfied: jinja2 in /usr/local/lib/python3.11/dist-packages (from torch) (3.1.5)\n",
65
+ "Requirement already satisfied: fsspec in /usr/local/lib/python3.11/dist-packages (from torch) (2024.10.0)\n",
66
+ "Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)\n",
67
+ " Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
68
+ "Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)\n",
69
+ " Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
70
+ "Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)\n",
71
+ " Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\n",
72
+ "Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)\n",
73
+ " Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\n",
74
+ "Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)\n",
75
+ " Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
76
+ "Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)\n",
77
+ " Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
78
+ "Collecting nvidia-curand-cu12==10.3.5.147 (from torch)\n",
79
+ " Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
80
+ "Collecting nvidia-cusolver-cu12==11.6.1.9 (from torch)\n",
81
+ " Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\n",
82
+ "Collecting nvidia-cusparse-cu12==12.3.1.170 (from torch)\n",
83
+ " Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\n",
84
+ "Requirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.11/dist-packages (from torch) (2.21.5)\n",
85
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch) (12.4.127)\n",
86
+ "Collecting nvidia-nvjitlink-cu12==12.4.127 (from torch)\n",
87
+ " Downloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
88
+ "Requirement already satisfied: triton==3.1.0 in /usr/local/lib/python3.11/dist-packages (from torch) (3.1.0)\n",
89
+ "Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.11/dist-packages (from torch) (1.13.1)\n",
90
+ "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from sympy==1.13.1->torch) (1.3.0)\n",
91
+ "Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from torchvision) (1.26.4)\n",
92
+ "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.11/dist-packages (from torchvision) (11.1.0)\n",
93
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas) (2.8.2)\n",
94
+ "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas) (2025.1)\n",
95
+ "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas) (2025.1)\n",
96
+ "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (1.13.1)\n",
97
+ "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (1.4.2)\n",
98
+ "Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn) (3.5.0)\n",
99
+ "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n",
100
+ "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2->torch) (3.0.2)\n",
101
+ "Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl (363.4 MB)\n",
102
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m363.4/363.4 MB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
103
+ "\u001b[?25hDownloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (13.8 MB)\n",
104
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.8/13.8 MB\u001b[0m \u001b[31m110.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
105
+ "\u001b[?25hDownloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (24.6 MB)\n",
106
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m24.6/24.6 MB\u001b[0m \u001b[31m89.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
107
+ "\u001b[?25hDownloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (883 kB)\n",
108
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m883.7/883.7 kB\u001b[0m \u001b[31m58.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
109
+ "\u001b[?25hDownloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)\n",
110
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m664.8/664.8 MB\u001b[0m \u001b[31m1.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
111
+ "\u001b[?25hDownloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl (211.5 MB)\n",
112
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m211.5/211.5 MB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
113
+ "\u001b[?25hDownloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl (56.3 MB)\n",
114
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.3/56.3 MB\u001b[0m \u001b[31m12.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
115
+ "\u001b[?25hDownloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl (127.9 MB)\n",
116
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m127.9/127.9 MB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
117
+ "\u001b[?25hDownloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl (207.5 MB)\n",
118
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.5/207.5 MB\u001b[0m \u001b[31m5.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
119
+ "\u001b[?25hDownloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)\n",
120
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.1/21.1 MB\u001b[0m \u001b[31m85.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
121
+ "\u001b[?25hInstalling collected packages: nvidia-nvjitlink-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, nvidia-cusparse-cu12, nvidia-cudnn-cu12, nvidia-cusolver-cu12\n",
122
+ " Attempting uninstall: nvidia-nvjitlink-cu12\n",
123
+ " Found existing installation: nvidia-nvjitlink-cu12 12.5.82\n",
124
+ " Uninstalling nvidia-nvjitlink-cu12-12.5.82:\n",
125
+ " Successfully uninstalled nvidia-nvjitlink-cu12-12.5.82\n",
126
+ " Attempting uninstall: nvidia-curand-cu12\n",
127
+ " Found existing installation: nvidia-curand-cu12 10.3.6.82\n",
128
+ " Uninstalling nvidia-curand-cu12-10.3.6.82:\n",
129
+ " Successfully uninstalled nvidia-curand-cu12-10.3.6.82\n",
130
+ " Attempting uninstall: nvidia-cufft-cu12\n",
131
+ " Found existing installation: nvidia-cufft-cu12 11.2.3.61\n",
132
+ " Uninstalling nvidia-cufft-cu12-11.2.3.61:\n",
133
+ " Successfully uninstalled nvidia-cufft-cu12-11.2.3.61\n",
134
+ " Attempting uninstall: nvidia-cuda-runtime-cu12\n",
135
+ " Found existing installation: nvidia-cuda-runtime-cu12 12.5.82\n",
136
+ " Uninstalling nvidia-cuda-runtime-cu12-12.5.82:\n",
137
+ " Successfully uninstalled nvidia-cuda-runtime-cu12-12.5.82\n",
138
+ " Attempting uninstall: nvidia-cuda-nvrtc-cu12\n",
139
+ " Found existing installation: nvidia-cuda-nvrtc-cu12 12.5.82\n",
140
+ " Uninstalling nvidia-cuda-nvrtc-cu12-12.5.82:\n",
141
+ " Successfully uninstalled nvidia-cuda-nvrtc-cu12-12.5.82\n",
142
+ " Attempting uninstall: nvidia-cuda-cupti-cu12\n",
143
+ " Found existing installation: nvidia-cuda-cupti-cu12 12.5.82\n",
144
+ " Uninstalling nvidia-cuda-cupti-cu12-12.5.82:\n",
145
+ " Successfully uninstalled nvidia-cuda-cupti-cu12-12.5.82\n",
146
+ " Attempting uninstall: nvidia-cublas-cu12\n",
147
+ " Found existing installation: nvidia-cublas-cu12 12.5.3.2\n",
148
+ " Uninstalling nvidia-cublas-cu12-12.5.3.2:\n",
149
+ " Successfully uninstalled nvidia-cublas-cu12-12.5.3.2\n",
150
+ " Attempting uninstall: nvidia-cusparse-cu12\n",
151
+ " Found existing installation: nvidia-cusparse-cu12 12.5.1.3\n",
152
+ " Uninstalling nvidia-cusparse-cu12-12.5.1.3:\n",
153
+ " Successfully uninstalled nvidia-cusparse-cu12-12.5.1.3\n",
154
+ " Attempting uninstall: nvidia-cudnn-cu12\n",
155
+ " Found existing installation: nvidia-cudnn-cu12 9.3.0.75\n",
156
+ " Uninstalling nvidia-cudnn-cu12-9.3.0.75:\n",
157
+ " Successfully uninstalled nvidia-cudnn-cu12-9.3.0.75\n",
158
+ " Attempting uninstall: nvidia-cusolver-cu12\n",
159
+ " Found existing installation: nvidia-cusolver-cu12 11.6.3.83\n",
160
+ " Uninstalling nvidia-cusolver-cu12-11.6.3.83:\n",
161
+ " Successfully uninstalled nvidia-cusolver-cu12-11.6.3.83\n",
162
+ "Successfully installed nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-nvjitlink-cu12-12.4.127\n"
163
+ ]
164
+ }
165
+ ]
166
+ },
167
+ {
168
+ "cell_type": "code",
169
+ "source": [
170
+ "\n",
171
+ "\n",
172
+ "!pip install sentencepiece --quiet"
173
+ ],
174
+ "metadata": {
175
+ "id": "6DbIAMlqNDRK"
176
+ },
177
+ "execution_count": 2,
178
+ "outputs": []
179
+ },
180
+ {
181
+ "cell_type": "code",
182
+ "source": [
183
+ "\n",
184
+ "\"\"\"\n",
185
+ "Model File for Roman Urdu Poetry Generation\n",
186
+ "\n",
187
+ "This file contains the complete code for:\n",
188
+ " - Data loading, cleaning, and tokenization using SentencePiece\n",
189
+ " - Train/Test/Validation split creation\n",
190
+ " - Dataset and DataLoader creation\n",
191
+ " - Definition of a BiLSTM Language Model (with 3 layers, dropout, etc.)\n",
192
+ " - Training, validation, and testing routines\n",
193
+ " - Saving the trained model weights\n",
194
+ " - A poetry generation function using nucleus (top-p) sampling with formatted output\n",
195
+ "\n",
196
+ "Run this file to train and test the model. The trained weights will be saved to a file and loaded on subsequent runs.\n",
197
+ "\"\"\""
198
+ ],
199
+ "metadata": {
200
+ "colab": {
201
+ "base_uri": "https://localhost:8080/",
202
+ "height": 157
203
+ },
204
+ "id": "DjB6rAwz-D3Q",
205
+ "outputId": "817edbf7-6063-4c8c-fb49-30b18dd386b5"
206
+ },
207
+ "execution_count": 3,
208
+ "outputs": [
209
+ {
210
+ "output_type": "execute_result",
211
+ "data": {
212
+ "text/plain": [
213
+ "'\\nModel File for Roman Urdu Poetry Generation\\n\\nThis file contains the complete code for:\\n - Data loading, cleaning, and tokenization using SentencePiece\\n - Train/Test/Validation split creation\\n - Dataset and DataLoader creation\\n - Definition of a BiLSTM Language Model (with 3 layers, dropout, etc.)\\n - Training, validation, and testing routines\\n - Saving the trained model weights\\n - A poetry generation function using nucleus (top-p) sampling with formatted output\\n\\nRun this file to train and test the model. The trained weights will be saved to a file and loaded on subsequent runs.\\n'"
214
+ ],
215
+ "application/vnd.google.colaboratory.intrinsic+json": {
216
+ "type": "string"
217
+ }
218
+ },
219
+ "metadata": {},
220
+ "execution_count": 3
221
+ }
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "code",
226
+ "source": [
227
+ "# -------------------------\n",
228
+ "# 1. Import Libraries\n",
229
+ "# -------------------------\n",
230
+ "import os\n",
231
+ "import random\n",
232
+ "import numpy as np\n",
233
+ "import pandas as pd\n",
234
+ "import sentencepiece as spm\n",
235
+ "import re\n",
236
+ "import torch\n",
237
+ "import torch.nn as nn\n",
238
+ "from torch.utils.data import Dataset, DataLoader\n",
239
+ "import torch.nn.functional as F\n",
240
+ "import unicodedata\n",
241
+ "from sklearn.model_selection import train_test_split"
242
+ ],
243
+ "metadata": {
244
+ "id": "HoqaPLEq-Ega"
245
+ },
246
+ "execution_count": 4,
247
+ "outputs": []
248
+ },
249
+ {
250
+ "cell_type": "code",
251
+ "source": [
252
+ "\n",
253
+ "# -------------------------\n",
254
+ "# 2. Set Random Seeds and Device\n",
255
+ "# -------------------------\n",
256
+ "SEED = 42\n",
257
+ "random.seed(SEED)\n",
258
+ "np.random.seed(SEED)\n",
259
+ "torch.manual_seed(SEED)\n",
260
+ "\n",
261
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
262
+ "print(\"Using device:\", device)"
263
+ ],
264
+ "metadata": {
265
+ "colab": {
266
+ "base_uri": "https://localhost:8080/"
267
+ },
268
+ "id": "u4Xf1Ck6-H-M",
269
+ "outputId": "f171c4a8-4e30-4873-ebf9-2782aa3e9bdc"
270
+ },
271
+ "execution_count": 5,
272
+ "outputs": [
273
+ {
274
+ "output_type": "stream",
275
+ "name": "stdout",
276
+ "text": [
277
+ "Using device: cuda\n"
278
+ ]
279
+ }
280
+ ]
281
+ },
282
+ {
283
+ "cell_type": "code",
284
+ "source": [
285
+ "# -------------------------\n",
286
+ "# 3. Load and Clean Dataset\n",
287
+ "# -------------------------\n",
288
+ "DATA_PATH = \"Roman-Urdu-Poetry.csv\" # Make sure this file exists in your working directory\n",
289
+ "df = pd.read_csv(DATA_PATH)\n",
290
+ "\n",
291
+ "def remove_diacritics(text: str) -> str:\n",
292
+ " \"\"\"\n",
293
+ " Removes Unicode diacritical marks from the text.\n",
294
+ " \"\"\"\n",
295
+ " return ''.join(ch for ch in unicodedata.normalize('NFD', text)\n",
296
+ " if not unicodedata.combining(ch))\n",
297
+ "\n",
298
+ "def clean_text(text):\n",
299
+ " \"\"\"\n",
300
+ " Cleans the input text by removing diacritics, extra spaces, and unwanted punctuation.\n",
301
+ " \"\"\"\n",
302
+ " text = remove_diacritics(text)\n",
303
+ " text = re.sub(r\"\\s+\", \" \", text)\n",
304
+ " text = re.sub(r\"[^\\w\\s\\.\\,\\;\\:\\'\\?\\!\\-]+\", \"\", text)\n",
305
+ " return text.strip()\n",
306
+ "\n",
307
+ "df[\"Poetry\"] = df[\"Poetry\"].astype(str).apply(clean_text)\n",
308
+ "texts = df[\"Poetry\"].tolist()\n",
309
+ "print(f\"Total number of poetry lines: {len(texts)}\")"
310
+ ],
311
+ "metadata": {
312
+ "colab": {
313
+ "base_uri": "https://localhost:8080/"
314
+ },
315
+ "id": "MYJTunkz-LDb",
316
+ "outputId": "82609d66-3e91-4795-eac5-251bf9bf8dd1"
317
+ },
318
+ "execution_count": 6,
319
+ "outputs": [
320
+ {
321
+ "output_type": "stream",
322
+ "name": "stdout",
323
+ "text": [
324
+ "Total number of poetry lines: 1314\n"
325
+ ]
326
+ }
327
+ ]
328
+ },
329
+ {
330
+ "cell_type": "code",
331
+ "source": [
332
+ "# -------------------------\n",
333
+ "# 4. Train/Test/Validation Split (80/10/10)\n",
334
+ "# -------------------------\n",
335
+ "train_texts, test_texts = train_test_split(texts, test_size=0.1, random_state=SEED)\n",
336
+ "train_texts, val_texts = train_test_split(train_texts, test_size=0.1111, random_state=SEED)\n",
337
+ "print(f\"Train samples: {len(train_texts)}\")\n",
338
+ "print(f\"Validation samples: {len(val_texts)}\")\n",
339
+ "print(f\"Test samples: {len(test_texts)}\")"
340
+ ],
341
+ "metadata": {
342
+ "colab": {
343
+ "base_uri": "https://localhost:8080/"
344
+ },
345
+ "id": "_VvgUa3L-MAR",
346
+ "outputId": "d045fd71-3f09-4d6c-eea9-34c3e444db59"
347
+ },
348
+ "execution_count": 7,
349
+ "outputs": [
350
+ {
351
+ "output_type": "stream",
352
+ "name": "stdout",
353
+ "text": [
354
+ "Train samples: 1050\n",
355
+ "Validation samples: 132\n",
356
+ "Test samples: 132\n"
357
+ ]
358
+ }
359
+ ]
360
+ },
361
+ {
362
+ "cell_type": "code",
363
+ "source": [
364
+ "# -------------------------\n",
365
+ "# 5. Train a SentencePiece BPE Tokenizer\n",
366
+ "# -------------------------\n",
367
+ "all_texts_file = \"all_texts.txt\"\n",
368
+ "if not os.path.exists(all_texts_file):\n",
369
+ " with open(all_texts_file, \"w\", encoding=\"utf-8\") as f:\n",
370
+ " for line in texts:\n",
371
+ " f.write(line.strip() + \"\\n\")\n",
372
+ "else:\n",
373
+ " print(f\"{all_texts_file} already exists; skipping file creation.\")\n",
374
+ "\n",
375
+ "\n",
376
+ "sp_model_prefix = \"urdu_sp\"\n",
377
+ "model_file = f\"{sp_model_prefix}.model\"\n",
378
+ "vocab_file = f\"{sp_model_prefix}.vocab\"\n",
379
+ "\n",
380
+ "vocab_size = 12000 # Adjust as needed\n",
381
+ "model_type = \"bpe\"\n",
382
+ "\n",
383
+ "if not (os.path.exists(model_file) and os.path.exists(vocab_file)):\n",
384
+ " print(\"SentencePiece model or vocab not found. Training...\")\n",
385
+ " spm.SentencePieceTrainer.Train(\n",
386
+ " f\"--input={all_texts_file} \"\n",
387
+ " f\"--model_prefix={sp_model_prefix} \"\n",
388
+ " f\"--vocab_size={vocab_size} \"\n",
389
+ " f\"--model_type={model_type} \"\n",
390
+ " \"--character_coverage=1.0 \"\n",
391
+ " \"--pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3\"\n",
392
+ " )\n",
393
+ "else:\n",
394
+ " print(\"SentencePiece model & vocab found; skipping training.\")\n",
395
+ "\n",
396
+ "# Load the SentencePiece model\n",
397
+ "sp = spm.SentencePieceProcessor()\n",
398
+ "sp.load(model_file)\n",
399
+ "print(\"Loaded SentencePiece model with vocab size:\", sp.get_piece_size())\n"
400
+ ],
401
+ "metadata": {
402
+ "colab": {
403
+ "base_uri": "https://localhost:8080/"
404
+ },
405
+ "id": "2L1JgC02-OBW",
406
+ "outputId": "d6ea06cf-8f54-47d8-fada-a016ca1df4c9"
407
+ },
408
+ "execution_count": 8,
409
+ "outputs": [
410
+ {
411
+ "output_type": "stream",
412
+ "name": "stdout",
413
+ "text": [
414
+ "Loaded SentencePiece model with vocab size: 12000\n"
415
+ ]
416
+ }
417
+ ]
418
+ },
419
+ {
420
+ "cell_type": "code",
421
+ "source": [
422
+ "# -------------------------\n",
423
+ "# 6. Tokenize Data\n",
424
+ "# -------------------------\n",
425
+ "train_ids = [sp.encode_as_ids(t) for t in train_texts]\n",
426
+ "val_ids = [sp.encode_as_ids(t) for t in val_texts]\n",
427
+ "test_ids = [sp.encode_as_ids(t) for t in test_texts]"
428
+ ],
429
+ "metadata": {
430
+ "id": "lq7lbUcu-RDU"
431
+ },
432
+ "execution_count": 9,
433
+ "outputs": []
434
+ },
435
+ {
436
+ "cell_type": "code",
437
+ "source": [
438
+ "# -------------------------\n",
439
+ "# 7. Create Dataset and DataLoader\n",
440
+ "# -------------------------\n",
441
+ "class PoetryDataset(Dataset):\n",
442
+ " def __init__(self, token_ids_list, max_length=250):\n",
443
+ " self.data = token_ids_list\n",
444
+ " self.max_length = max_length\n",
445
+ "\n",
446
+ " def __len__(self):\n",
447
+ " return len(self.data)\n",
448
+ "\n",
449
+ " def __getitem__(self, idx):\n",
450
+ " # Truncate tokens to max_length\n",
451
+ " token_ids = self.data[idx][:self.max_length]\n",
452
+ " # Create input by adding BOS token (2) at the beginning\n",
453
+ " input_ids = [2] + token_ids\n",
454
+ " # Create target by appending EOS token (3) at the end\n",
455
+ " target_ids = token_ids + [3]\n",
456
+ " return torch.tensor(input_ids, dtype=torch.long), torch.tensor(target_ids, dtype=torch.long)\n",
457
+ "\n",
458
+ "def collate_fn(batch):\n",
459
+ " inputs, targets = zip(*batch)\n",
460
+ " max_len = max(len(x) for x in inputs)\n",
461
+ " padded_inputs = [torch.cat([x, torch.zeros(max_len - len(x), dtype=torch.long)]) for x in inputs]\n",
462
+ " padded_targets = [torch.cat([t, torch.zeros(max_len - len(t), dtype=torch.long)]) for t in targets]\n",
463
+ " return torch.stack(padded_inputs), torch.stack(padded_targets)"
464
+ ],
465
+ "metadata": {
466
+ "id": "OZ9_kG0M-TOF"
467
+ },
468
+ "execution_count": 10,
469
+ "outputs": []
470
+ },
471
+ {
472
+ "cell_type": "code",
473
+ "source": [
474
+ "train_dataset = PoetryDataset(train_ids, max_length=250)\n",
475
+ "val_dataset = PoetryDataset(val_ids, max_length=250)\n",
476
+ "test_dataset = PoetryDataset(test_ids, max_length=250)\n",
477
+ "\n",
478
+ "batch_size = 64\n",
479
+ "train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn, drop_last=True)\n",
480
+ "val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn, drop_last=False)\n",
481
+ "test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn, drop_last=False)"
482
+ ],
483
+ "metadata": {
484
+ "id": "z1aGUj-w-Xh9"
485
+ },
486
+ "execution_count": 11,
487
+ "outputs": []
488
+ },
489
+ {
490
+ "cell_type": "code",
491
+ "source": [
492
+ "# -------------------------\n",
493
+ "# 8. Define the BiLSTM Language Model\n",
494
+ "# -------------------------\n",
495
+ "class BiLSTMLanguageModel(nn.Module):\n",
496
+ " def __init__(self, vocab_size, embed_dim=512, hidden_dim=768, num_layers=3, dropout=0.2):\n",
497
+ " super(BiLSTMLanguageModel, self).__init__()\n",
498
+ " self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)\n",
499
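+ " # Stacked Bi-LSTM layers. Note: the backward direction reads the tokens after position t, so the logits at t already depend on the target at t.\n",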
500
+ " self.lstm = nn.LSTM(\n",
501
+ " input_size=embed_dim,\n",
502
+ " hidden_size=hidden_dim,\n",
503
+ " num_layers=num_layers,\n",
504
+ " batch_first=True,\n",
505
+ " bidirectional=True,\n",
506
+ " dropout=dropout\n",
507
+ " )\n",
508
+ " # Linear layer to project LSTM outputs to vocabulary size\n",
509
+ " self.fc = nn.Linear(hidden_dim * 2, vocab_size)\n",
510
+ "\n",
511
+ " def forward(self, x, hidden=None):\n",
512
+ " emb = self.embed(x)\n",
513
+ " out, hidden = self.lstm(emb, hidden)\n",
514
+ " logits = self.fc(out)\n",
515
+ " return logits, hidden"
516
+ ],
517
+ "metadata": {
518
+ "id": "YD8F_0WM-apV"
519
+ },
520
+ "execution_count": 12,
521
+ "outputs": []
522
+ },
523
+ {
524
+ "cell_type": "code",
525
+ "source": [
526
+ "vocab_size = sp.get_piece_size()\n",
527
+ "model = BiLSTMLanguageModel(vocab_size, embed_dim=512, hidden_dim=768, num_layers=3, dropout=0.2)\n",
528
+ "model = model.to(device)"
529
+ ],
530
+ "metadata": {
531
+ "id": "aKWTogmN-gaq"
532
+ },
533
+ "execution_count": 13,
534
+ "outputs": []
535
+ },
536
+ {
537
+ "cell_type": "code",
538
+ "source": [
539
+ "# -------------------------\n",
540
+ "# 9. Training Setup (Loss, Optimizer, Scheduler)\n",
541
+ "# -------------------------\n",
542
+ "criterion = nn.CrossEntropyLoss(ignore_index=0)\n",
543
+ "optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)\n",
544
+ "scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)\n",
545
+ "\n",
546
+ "def evaluate(model, data_loader):\n",
547
+ " model.eval()\n",
548
+ " total_loss, total_tokens = 0, 0\n",
549
+ " with torch.no_grad():\n",
550
+ " for inputs, targets in data_loader:\n",
551
+ " inputs = inputs.to(device)\n",
552
+ " targets = targets.to(device)\n",
553
+ " logits, _ = model(inputs)\n",
554
+ " logits = logits.view(-1, vocab_size)\n",
555
+ " targets = targets.view(-1)\n",
556
+ " loss = criterion(logits, targets)\n",
557
+ " total_loss += loss.item() * (targets != 0).sum().item()\n",
558
+ " total_tokens += (targets != 0).sum().item()\n",
559
+ " return total_loss / total_tokens"
560
+ ],
561
+ "metadata": {
562
+ "id": "9W5USllq-i83"
563
+ },
564
+ "execution_count": 14,
565
+ "outputs": []
566
+ },
567
+ {
568
+ "cell_type": "code",
569
+ "source": [
570
+ "# -------------------------\n",
571
+ "# 10. Training Loop with Test Evaluation and Weight Saving\n",
572
+ "# -------------------------\n",
573
+ "num_epochs = 10\n",
574
+ "weights_path = \"model_weights.pth\"\n",
575
+ "\n",
576
+ "if not os.path.exists(weights_path):\n",
577
+ " for epoch in range(num_epochs):\n",
578
+ " model.train()\n",
579
+ " total_loss, total_tokens = 0, 0\n",
580
+ " for inputs, targets in train_loader:\n",
581
+ " inputs = inputs.to(device)\n",
582
+ " targets = targets.to(device)\n",
583
+ " optimizer.zero_grad()\n",
584
+ " logits, _ = model(inputs)\n",
585
+ " logits = logits.view(-1, vocab_size)\n",
586
+ " targets = targets.view(-1)\n",
587
+ " loss = criterion(logits, targets)\n",
588
+ " loss.backward()\n",
589
+ " torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)\n",
590
+ " optimizer.step()\n",
591
+ " total_loss += loss.item() * (targets != 0).sum().item()\n",
592
+ " total_tokens += (targets != 0).sum().item()\n",
593
+ " train_loss = total_loss / total_tokens\n",
594
+ " val_loss = evaluate(model, val_loader)\n",
595
+ " scheduler.step()\n",
596
+ " print(f\"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}\")\n",
597
+ " test_loss = evaluate(model, test_loader)\n",
598
+ " print(f\"Test Loss: {test_loss:.4f}\")\n",
599
+ " torch.save(model.state_dict(), weights_path)\n",
600
+ "else:\n",
601
+ " print(\"Loading pre-trained model weights...\")\n",
602
+ " model.load_state_dict(torch.load(weights_path, map_location=device))"
603
+ ],
604
+ "metadata": {
605
+ "colab": {
606
+ "base_uri": "https://localhost:8080/"
607
+ },
608
+ "id": "B0nDauKT-nQC",
609
+ "outputId": "c082b8a8-70fb-4375-8b89-6deb72b31f6f"
610
+ },
611
+ "execution_count": 15,
612
+ "outputs": [
613
+ {
614
+ "output_type": "stream",
615
+ "name": "stdout",
616
+ "text": [
617
+ "Epoch [1/10], Train Loss: 7.1034, Val Loss: 6.2269\n",
618
+ "Epoch [2/10], Train Loss: 5.7528, Val Loss: 5.4652\n",
619
+ "Epoch [3/10], Train Loss: 5.0948, Val Loss: 4.9459\n",
620
+ "Epoch [4/10], Train Loss: 4.4997, Val Loss: 4.2981\n",
621
+ "Epoch [5/10], Train Loss: 3.9654, Val Loss: 3.9398\n",
622
+ "Epoch [6/10], Train Loss: 3.6264, Val Loss: 3.6214\n",
623
+ "Epoch [7/10], Train Loss: 3.3671, Val Loss: 3.4665\n",
624
+ "Epoch [8/10], Train Loss: 3.2082, Val Loss: 3.3188\n",
625
+ "Epoch [9/10], Train Loss: 3.0880, Val Loss: 3.2478\n",
626
+ "Epoch [10/10], Train Loss: 3.0126, Val Loss: 3.1772\n",
627
+ "Test Loss: 3.1696\n"
628
+ ]
629
+ }
630
+ ]
631
+ },
632
+ {
633
+ "cell_type": "code",
634
+ "source": [
635
+ "\n",
636
+ "# 11. Poetry Generation with Nucleus (Top-p) Sampling\n",
637
+ "def generate_poetry_nucleus(model, sp, start_word, num_words=12, temperature=1.2, top_p=0.85):\n",
638
+ " \"\"\"\n",
639
+ " Generate a poetry sequence using nucleus (top-p) sampling.\n",
640
+ " The output is formatted with 6 words per line.\n",
641
+ " num_words counts the start word plus (num_words - 1) sampled tokens; tokens are BPE pieces, so the decoded word count can differ.\n",
642
+ " \"\"\"\n",
643
+ " model.eval()\n",
644
+ " start_ids = sp.encode_as_ids(start_word)\n",
645
+ " input_ids = [2] + start_ids # Insert BOS (token 2)\n",
646
+ " input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)\n",
647
+ " hidden = None\n",
648
+ "\n",
649
+ " with torch.no_grad():\n",
650
+ " logits, hidden = model(input_tensor, hidden)\n",
651
+ "\n",
652
+ " generated_ids = input_ids[:] # Copy initial tokens\n",
653
+ "\n",
654
+ " for _ in range(num_words - 1): # Generate one less token\n",
655
+ " # Get the logits of the last generated token\n",
656
+ " last_logits = logits[:, -1, :] # Shape: (1, vocab_size)\n",
657
+ " scaled_logits = last_logits / temperature\n",
658
+ "\n",
659
+ " # Sort the logits in descending order\n",
660
+ " sorted_logits, sorted_indices = torch.sort(scaled_logits, descending=True)\n",
661
+ " cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)\n",
662
+ "\n",
663
+ " # Filter out tokens with cumulative probability above top_p\n",
664
+ " filtered_indices = cumulative_probs > top_p\n",
665
+ " filtered_indices[..., 1:] = filtered_indices[..., :-1].clone() # Shift right so the token that crosses top_p is kept\n",
665
+ " filtered_indices[..., 0] = False # Ensure at least one token remains\n",
667
+ " sorted_indices = sorted_indices[~filtered_indices]\n",
668
+ " sorted_logits = sorted_logits[~filtered_indices]\n",
669
+ "\n",
670
+ " # Sample the next token from the filtered distribution\n",
671
+ " if len(sorted_indices) > 0:\n",
672
+ " next_token_id = sorted_indices[torch.multinomial(F.softmax(sorted_logits, dim=-1), 1).item()].item()\n",
673
+ " else:\n",
674
+ " next_token_id = torch.argmax(last_logits).item()\n",
675
+ " generated_ids.append(next_token_id)\n",
676
+ "\n",
677
+ " # Prepare next input and update hidden state\n",
678
+ " next_input = torch.tensor([[next_token_id]], dtype=torch.long, device=device)\n",
679
+ " logits, hidden = model(next_input, hidden)\n",
680
+ "\n",
681
+ " # Decode generated tokens (skip BOS) and format output: 6 words per line\n",
682
+ " generated_text = sp.decode_ids(generated_ids[1:])\n",
683
+ " words = generated_text.split()\n",
684
+ " formatted_text = \"\\n\".join([\" \".join(words[i:i+6]) for i in range(0, len(words), 6)])\n",
685
+ " return formatted_text\n"
686
+ ],
687
+ "metadata": {
688
+ "id": "kmsILzIh_0um"
689
+ },
690
+ "execution_count": 16,
691
+ "outputs": []
692
+ },
693
+ {
694
+ "cell_type": "code",
695
+ "source": [
696
+ "\n",
697
+ "\n",
698
+ "# -------------------------\n",
699
+ "# 12. Example Usage for Testing (Optional)\n",
700
+ "# -------------------------\n",
701
+ "if __name__ == \"__main__\":\n",
702
+ " # Test the generation function in the notebook/script\n",
703
+ " start_word = \"ishq\"\n",
704
+ " print(\"Generated Poetry:\\n\", generate_poetry_nucleus(model, sp, start_word, num_words=12, temperature=1.2, top_p=0.85))\n"
705
+ ],
706
+ "metadata": {
707
+ "colab": {
708
+ "base_uri": "https://localhost:8080/"
709
+ },
710
+ "id": "a3WKAKtJ_8YU",
711
+ "outputId": "9571d2a7-97a4-4b1d-d106-3b7ccd0da43f"
712
+ },
713
+ "execution_count": 18,
714
+ "outputs": [
715
+ {
716
+ "output_type": "stream",
717
+ "name": "stdout",
718
+ "text": [
719
+ "Generated Poetry:\n",
720
+ " ishq nishan tum phir kar phir\n",
721
+ "ik baat aur phir ye phir\n"
722
+ ]
723
+ }
724
+ ]
725
+ },
726
+ {
727
+ "cell_type": "code",
728
+ "source": [],
729
+ "metadata": {
730
+ "id": "hK3-OgKI98Ia"
731
+ },
732
+ "execution_count": 17,
733
+ "outputs": []
734
+ }
735
+ ]
736
+ }
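The top-p step inside `generate_poetry_nucleus` can be sketched without PyTorch, which makes the filtering rule easier to check by hand. This is a minimal illustration using only the Python standard library, not the notebook's implementation: the helper names `top_p_filter` and `sample_top_p` are hypothetical, and the defaults simply mirror the `temperature=1.2` and `top_p=0.85` values used in the notebook.

```python
import math
import random


def top_p_filter(logits, top_p=0.85, temperature=1.2):
    """Return (token_ids, probs) for the smallest high-probability prefix
    of the sorted distribution whose cumulative mass reaches top_p."""
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax over the temperature-scaled logits.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids ordered by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        # The token that crosses the threshold is included, so the
        # nucleus always contains at least one token.
        if cumulative >= top_p:
            break

    # Renormalise the surviving probabilities so they sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return kept, [probs[i] / kept_total for i in kept]


def sample_top_p(logits, top_p=0.85, temperature=1.2, rng=random):
    """Draw one token id from the nucleus."""
    token_ids, weights = top_p_filter(logits, top_p, temperature)
    return rng.choices(token_ids, weights=weights, k=1)[0]
```

Because the token that crosses the threshold is kept, a single dominant token (probability above `top_p`) yields a nucleus of exactly one token rather than an empty set, so no separate fallback branch is needed in this sketch.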
requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ torch==2.6.0
2
+ sentencepiece==0.2.0
urdu_sp.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:81ccdc84bc97783bd3b3ae632ec37ebd85124be7dd75650f5512824df6a413e2
3
+ size 429486
urdu_sp.vocab ADDED
The diff for this file is too large to render. See raw diff