guillermoruiz committed (verified)
Commit 3a6db32 · Parent(s): 0a13215

Update README.md

Files changed (1): README.md +148 −163

README.md CHANGED
@@ -1,59 +1,136 @@
  ---
  library_name: transformers
- tags: []
  ---

- # Model Card for Model ID

  <!-- Provide a quick summary of what the model is/does. -->



- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]

  ## Bias, Risks, and Limitations

@@ -61,113 +138,39 @@ This is the model card of a 🤗 transformers model that has been pushed on the
  [More Information Needed]

- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
  ## Citation [optional]

  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
@@ -179,21 +182,3 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
  **APA:**

  [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
 
  ---
+ language:
+ - es
+ license: mit
  library_name: transformers
  ---

+ # Language Model for Mexican Spanish

  <!-- Provide a quick summary of what the model is/does. -->
+ This RoBERTa-based model was trained on more than 140 million Spanish-language tweets from Mexico, collected between December 2015 and February 2023.
+ A regional-information tag was added to each message as follows:
+
+ *state* _GEO *message*
+
+ Some examples are:
+
+ - Coahuila _GEO Cómo estás amiga, nos conocemos? Soy soltero busco soltera. #PiedrasNegras #nava #allende #zaragoza
+ - Tamaulipas _GEO Ando de buenas que ya les devolví sus unfollows y métanselos por el culo ☺.
+ - BCS _GEO Ésa canción que cantas en silencio y la otra persona tmb. Bn raro.
+ - Tamaulipas _GEO Hoy es la primera vez que manejo en estado de ebriedad 😞🙃
+
+ As these examples show, the original casing, emoticons, and misspelled words were preserved.
+ For privacy reasons, user mentions were replaced with the token _USR and web addresses with the token _URL.
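+
+ A minimal sketch of this preprocessing (the helper name and the regular expressions below are illustrative assumptions, not the exact pipeline used):
+
+ ```
+ import re
+
+ def tag_tweet(state_token, text):
+     # Privacy: replace user mentions and web addresses with special tokens.
+     text = re.sub(r"@\w+", "_USR", text)
+     text = re.sub(r"https?://\S+", "_URL", text)
+     # Regional tag: *state* _GEO *message*
+     return f"{state_token} _GEO {text}"
+
+ print(tag_tweet("Sonora", "Saludos @amigo desde Hermosillo https://example.com"))
+ # Sonora _GEO Saludos _USR desde Hermosillo _URL
+ ```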
+
+ The tokens that indicate each state of the republic are:
+
+ |State|Token|
+ |----------|----------|
+ |Aguascalientes|Aguascalientes|
+ |Baja California|BC|
+ |Baja California Sur|BCS|
+ |Campeche|Campeche|
+ |Chiapas|Chiapas|
+ |Chihuahua|Chihuahua|
+ |Ciudad de México|Mexico_City|
+ |Coahuila de Zaragoza|Coahuila|
+ |Colima|Colima|
+ |Durango|Durango|
+ |Guanajuato|Guanajuato|
+ |Guerrero|Guerrero|
+ |Hidalgo|Hidalgo|
+ |Jalisco|Jalisco|
+ |Michoacán de Ocampo|Michoacán|
+ |Morelos|Morelos|
+ |México|Mexico|
+ |Nayarit|Nayarit|
+ |Nuevo León|NL|
+ |Oaxaca|Oaxaca|
+ |Puebla|Puebla|
+ |Querétaro|Querétaro|
+ |Quintana Roo|QR|
+ |San Luis Potosí|SLP|
+ |Sinaloa|Sinaloa|
+ |Sonora|Sonora|
+ |Tabasco|Tabasco|
+ |Tamaulipas|Tamaulipas|
+ |Tlaxcala|Tlaxcala|
+ |Veracruz de Ignacio de la Llave|Veracruz|
+ |Yucatán|Yucatán|
+ |Zacatecas|Zacatecas|
+
+ A vocabulary of 30k tokens was built with WordPiece. The model was trained with masked language modeling, masking tokens with probability 0.15.
+ Training used the AdamW optimizer with a learning rate of 0.00002 for one epoch.
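+
+ A minimal sketch of an equivalent pretraining setup with the 🤗 `Trainer` (the toy dataset and batch size are placeholders; the masking probability, optimizer, learning rate, and epoch count come from the description above):
+
+ ```
+ from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
+                           RobertaForMaskedLM, Trainer, TrainingArguments)
+
+ tokenizer = AutoTokenizer.from_pretrained("guillermoruiz/mex_state")
+ model = RobertaForMaskedLM.from_pretrained("guillermoruiz/mex_state")
+
+ # Mask tokens with probability 0.15, as in the original training.
+ collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
+
+ # Toy stand-in for the 140M tagged tweets.
+ texts = ["Sonora _GEO Saludos _USR desde Hermosillo _URL"]
+ train_dataset = [tokenizer(t) for t in texts]
+
+ args = TrainingArguments(
+     output_dir="mlm_out",
+     learning_rate=2e-5,   # AdamW is the Trainer's default optimizer
+     num_train_epochs=1,
+     per_device_train_batch_size=32,  # placeholder value
+ )
+
+ Trainer(model=model, args=args, data_collator=collator,
+         train_dataset=train_dataset).train()
+ ```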
+
+ ## Usage
+
+ The model can be used with a `pipeline`:
+ ```
+ from transformers import pipeline
+
+ unmasker = pipeline('fill-mask', model="guillermoruiz/mex_state")
+ ```
+
+ ```
+ for p in unmasker("<mask> _GEO Van a ganar los Tigres."):
+     print(p['token_str'], p['score'])
+ ```
+ This produces the output:
+ ```
+ NL 0.2888392508029938
+ Coahuila 0.08982843905687332
+ Tamaulipas 0.0630788803100586
+ Mexico_City 0.06246586889028549
+ Jalisco 0.06113814190030098
+ ```
+ This indicates that the most likely region is Nuevo León. Other examples:
+ ```
+ for p in unmasker("<mask> _GEO Van a ganar los Xolos."):
+     print(p['token_str'], p['score'])
+ ```
+ ```
+ BC 0.23284225165843964
+ Jalisco 0.07845071703195572
+ Mexico_City 0.0761856958270073
+ Sinaloa 0.06842593103647232
+ Mexico 0.06353132426738739
+ ```
+ ```
+ for p in unmasker("<mask> _GEO Vamos para Pátzcuaro."):
+     print(p['token_str'], p['score'])
+ ```
+ ```
+ Michoacán 0.6461890339851379
+ Guanajuato 0.0919179916381836
+ Jalisco 0.07710094749927521
+ Sonora 0.022813264280557632
+ Yucatán 0.02254747971892357
+ ```
+ ```
+ for p in unmasker("<mask> _GEO Vamos para Mérida."):
+     print(p['token_str'], p['score'])
+ ```
+ ```
+ Yucatán 0.9046052694320679
+ QR 0.01990741863846779
+ Mexico_City 0.009980794973671436
+ Tabasco 0.009980794973671436
+ Jalisco 0.007273637689650059
+ ```
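+ A small convenience wrapper (hypothetical, not part of the model's API) that returns only the most likely state for a text:
+ ```
+ def most_likely_state(text):
+     # The pipeline returns candidates sorted by score; keep the best one.
+     best = unmasker(f"<mask> _GEO {text}")[0]
+     return best['token_str'], best['score']
+
+ print(most_likely_state("Vamos para Mérida."))  # ('Yucatán', 0.90...)
+ ```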
+ ## Regional Information
+
+ Using the attention layers, the words most important for choosing the region token were extracted.
+ These words formed the vocabulary associated with each region.
+ The vocabularies could then be compared to form the following similarity matrix (a sketch of this probing follows the figure).
+
+ ![State similarity matrix](state_sim_full_small.png)
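+
+ A minimal sketch of this kind of attention probing (averaging over all layers and heads, and reading the attention from the masked state position, are illustrative assumptions, not necessarily the procedure used for the figure):
+
+ ```
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("guillermoruiz/mex_state")
+ mdl = AutoModelForMaskedLM.from_pretrained("guillermoruiz/mex_state",
+                                            output_attentions=True)
+
+ inputs = tok(f"{tok.mask_token} _GEO Vamos para Mérida.", return_tensors="pt")
+ with torch.no_grad():
+     out = mdl(**inputs)
+
+ # Stack per-layer attentions and average over layers and heads:
+ # (layers, batch, heads, seq, seq) -> (seq, seq)
+ att = torch.stack(out.attentions).mean(dim=(0, 2))[0]
+
+ # Attention paid by the masked state position to every token.
+ mask_pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero().item()
+ tokens = tok.convert_ids_to_tokens(inputs.input_ids[0])
+ for w, t in sorted(zip(att[mask_pos].tolist(), tokens), reverse=True)[:5]:
+     print(t, round(w, 3))
+ ```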


  ## Bias, Risks, and Limitations

  [More Information Needed]

+ ## Evaluation
+
+ This model is the one reported as MexLarge in the following table.
+ The test sets are tweets written in Mexico, and the models
+ with regional information (MexSmall and MexLarge) have a clear advantage over the
+ alternatives.
+
+ |Dataset | MexSmall | MexLarge | BETO | RoBERTuito | BERTIN | Metric |
+ |----------|----------|----------|----------|----------|----------|----------|
+ |RegTweets | 0.7014 | 0.7244 | 0.6843 | 0.6689 | 0.7083 | macro-F1 |
+ |MexEmojis | 0.5044 | 0.5047 | 0.4223 | 0.4491 | 0.4832 | macro-F1 |
+ |HomoMex | 0.8131 | 0.8266 | 0.8099 | 0.8283 | 0.7934 | macro-F1 |

+ The [RegTweets](https://huggingface.co/datasets/guillermoruiz/RegTweets) and [MexEmojis](https://huggingface.co/datasets/guillermoruiz/MexEmojis) datasets are available on Hugging Face.
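+
+ A minimal sketch of how such a fine-tuning evaluation could be reproduced (the dataset column names, label encoding, and the fine-tuned checkpoint name are assumptions; check the dataset cards for the real schema):
+
+ ```
+ from datasets import load_dataset
+ from sklearn.metrics import f1_score
+ from transformers import pipeline
+
+ ds = load_dataset("guillermoruiz/RegTweets", split="test")  # assumed split
+
+ # Hypothetical checkpoint fine-tuned on the classification task.
+ clf = pipeline("text-classification", model="my-finetuned-mex_state")
+
+ # Map "LABEL_k" strings back to integer ids before scoring.
+ preds = [int(clf(x)[0]["label"].split("_")[-1]) for x in ds["text"]]
+ print(f1_score(ds["label"], preds, average="macro"))
+ ```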
+
+ The following table shows results on generic Spanish texts.
+ The models with regional information remain highly competitive
+ with the alternatives.
+
+ | Dataset | MexSmall | MexLarge | BETO | RoBERTuito | BERTIN | Metric |
+ |----------|----------|----------|----------|----------|----------|----------|
+ | HAHA | 0.8208 | 0.8215 | 0.8238 | 0.8398 | 0.8063 | F1 |
+ | SemEval2018 Anger | 0.6435 | 0.6523 | 0.6148 | 0.6764 | 0.5406 | Pearson |
+ | SemEval2018 Fear | 0.7021 | 0.6993 | 0.6750 | 0.7136 | 0.6809 | Pearson |
+ | SemEval2018 Joy | 0.7220 | 0.7226 | 0.7124 | 0.7468 | 0.7391 | Pearson |
+ | SemEval2018 Sadness | 0.7086 | 0.7072 | 0.6781 | 0.7475 | 0.6548 | Pearson |
+ | SemEval2018 Valence | 0.8015 | 0.7994 | 0.7569 | 0.8017 | 0.6943 | Pearson |
+ | HOPE | 0.7115 | 0.7036 | 0.6852 | 0.7347 | 0.6872 | macro-F1 |
+ | RestMex 3 | 0.7528 | 0.7499 | 0.7629 | 0.7588 | 0.7583 | Special |
+ | HUHU | 0.7849 | 0.7932 | 0.7887 | 0.8169 | 0.7938 | F1 |



  ## Citation [optional]

  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

  **APA:**

  [More Information Needed]