guillermoruiz committed (verified)
Commit 3a6db32 · Parent(s): 0a13215

Update README.md

Files changed (1): README.md +148 −163

README.md CHANGED
@@ -1,59 +1,136 @@
  ---
  library_name: transformers
- tags: []
  ---

- # Model Card for Model ID

  <!-- Provide a quick summary of what the model is/does. -->



- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]

  ## Bias, Risks, and Limitations

@@ -61,113 +138,39 @@ This is the model card of a 🤗 transformers model that has been pushed on the
  [More Information Needed]

- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
  ## Citation [optional]

  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
@@ -179,21 +182,3 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
  **APA:**

  [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
 
  ---
+ language:
+ - es
+ license: mit
  library_name: transformers
  ---

+ # Language Model for Mexican Spanish

  <!-- Provide a quick summary of what the model is/does. -->
+ This RoBERTa-based model was trained on more than 140 million Spanish-language tweets from Mexico, collected between December 2015 and February 2023.
+ A regional-information tag was added to each message as follows:
+
+ *state* _GEO *message*
+
+ Some examples are:
+
+ - Coahuila _GEO Cómo estás amiga, nos conocemos? Soy soltero busco soltera. #PiedrasNegras #nava #allende #zaragoza
+ - Tamaulipas _GEO Ando de buenas que ya les devolví sus unfollows y métanselos por el culo ☺.
+ - BCS _GEO Ésa canción que cantas en silencio y la otra persona tmb. Bn raro.
+ - Tamaulipas _GEO Hoy es la primera vez que manejo en estado de ebriedad 😞🙃
+
+ As these examples show, the original casing, emoticons, and misspelled words were preserved.
+ For privacy reasons, user mentions were replaced with the token _USR and web addresses with the token _URL.
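+
+ A minimal sketch of this preprocessing (the helper name and the regular expressions below are illustrative assumptions, not the exact pipeline used):
+
+ ```
+ import re
+
+ def tag_tweet(state_token, text):
+     # Privacy: replace user mentions and web addresses with special tokens.
+     text = re.sub(r"@\w+", "_USR", text)
+     text = re.sub(r"https?://\S+", "_URL", text)
+     # Regional tag: *state* _GEO *message*
+     return f"{state_token} _GEO {text}"
+
+ print(tag_tweet("Sonora", "Saludos @amigo desde Hermosillo https://example.com"))
+ # Sonora _GEO Saludos _USR desde Hermosillo _URL
+ ```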
+
+ The tokens that indicate each state of the republic are:
+
+ |State|Token|
+ |----------|----------|
+ |Aguascalientes|Aguascalientes|
+ |Baja California|BC|
+ |Baja California Sur|BCS|
+ |Campeche|Campeche|
+ |Chiapas|Chiapas|
+ |Chihuahua|Chihuahua|
+ |Ciudad de México|Mexico_City|
+ |Coahuila de Zaragoza|Coahuila|
+ |Colima|Colima|
+ |Durango|Durango|
+ |Guanajuato|Guanajuato|
+ |Guerrero|Guerrero|
+ |Hidalgo|Hidalgo|
+ |Jalisco|Jalisco|
+ |Michoacán de Ocampo|Michoacán|
+ |Morelos|Morelos|
+ |México|Mexico|
+ |Nayarit|Nayarit|
+ |Nuevo León|NL|
+ |Oaxaca|Oaxaca|
+ |Puebla|Puebla|
+ |Querétaro|Querétaro|
+ |Quintana Roo|QR|
+ |San Luis Potosí|SLP|
+ |Sinaloa|Sinaloa|
+ |Sonora|Sonora|
+ |Tabasco|Tabasco|
+ |Tamaulipas|Tamaulipas|
+ |Tlaxcala|Tlaxcala|
+ |Veracruz de Ignacio de la Llave|Veracruz|
+ |Yucatán|Yucatán|
+ |Zacatecas|Zacatecas|
+
+ A vocabulary of 30k tokens was built with WordPiece. The model was trained with masked language modeling, masking tokens with probability 0.15.
+ Training used the AdamW optimizer with a learning rate of 0.00002 for one epoch.
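+
+ A minimal sketch of an equivalent pretraining setup with the 🤗 `Trainer` (the toy dataset and batch size are placeholders; the masking probability, optimizer, learning rate, and epoch count come from the description above):
+
+ ```
+ from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
+                           RobertaForMaskedLM, Trainer, TrainingArguments)
+
+ tokenizer = AutoTokenizer.from_pretrained("guillermoruiz/mex_state")
+ model = RobertaForMaskedLM.from_pretrained("guillermoruiz/mex_state")
+
+ # Mask tokens with probability 0.15, as in the original training.
+ collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
+
+ # Toy stand-in for the 140M tagged tweets.
+ texts = ["Sonora _GEO Saludos _USR desde Hermosillo _URL"]
+ train_dataset = [tokenizer(t) for t in texts]
+
+ args = TrainingArguments(
+     output_dir="mlm_out",
+     learning_rate=2e-5,   # AdamW is the Trainer's default optimizer
+     num_train_epochs=1,
+     per_device_train_batch_size=32,  # placeholder value
+ )
+
+ Trainer(model=model, args=args, data_collator=collator,
+         train_dataset=train_dataset).train()
+ ```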
+
+ ## Usage
+
+ The model can be used with a `pipeline`:
+ ```
+ from transformers import pipeline
+
+ unmasker = pipeline('fill-mask', model="guillermoruiz/mex_state")
+ ```
+
+ ```
+ for p in unmasker("<mask> _GEO Van a ganar los Tigres."):
+     print(p['token_str'], p['score'])
+ ```
+ This produces the output:
+ ```
+ NL 0.2888392508029938
+ Coahuila 0.08982843905687332
+ Tamaulipas 0.0630788803100586
+ Mexico_City 0.06246586889028549
+ Jalisco 0.06113814190030098
+ ```
+ This indicates that the most likely region is Nuevo León. Other examples:
+ ```
+ for p in unmasker("<mask> _GEO Van a ganar los Xolos."):
+     print(p['token_str'], p['score'])
+ ```
+ ```
+ BC 0.23284225165843964
+ Jalisco 0.07845071703195572
+ Mexico_City 0.0761856958270073
+ Sinaloa 0.06842593103647232
+ Mexico 0.06353132426738739
+ ```
+ ```
+ for p in unmasker("<mask> _GEO Vamos para Pátzcuaro."):
+     print(p['token_str'], p['score'])
+ ```
+ ```
+ Michoacán 0.6461890339851379
+ Guanajuato 0.0919179916381836
+ Jalisco 0.07710094749927521
+ Sonora 0.022813264280557632
+ Yucatán 0.02254747971892357
+ ```
+ ```
+ for p in unmasker("<mask> _GEO Vamos para Mérida."):
+     print(p['token_str'], p['score'])
+ ```
+ ```
+ Yucatán 0.9046052694320679
+ QR 0.01990741863846779
+ Mexico_City 0.009980794973671436
+ Tabasco 0.009980794973671436
+ Jalisco 0.007273637689650059
+ ```
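+ A small convenience wrapper (hypothetical, not part of the model's API) that returns only the most likely state for a text:
+ ```
+ def most_likely_state(text):
+     # The pipeline returns candidates sorted by score; keep the best one.
+     best = unmasker(f"<mask> _GEO {text}")[0]
+     return best['token_str'], best['score']
+
+ print(most_likely_state("Vamos para Mérida."))  # ('Yucatán', 0.90...)
+ ```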
+ ## Regional Information
+
+ Using the attention layers, the words most important for choosing the region token were extracted.
+ These words formed the vocabulary associated with each region.
+ The vocabularies could then be compared to form the following similarity matrix (a sketch of this probing follows the figure).
+
+ ![State similarity matrix](state_sim_full_small.png)
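+
+ A minimal sketch of this kind of attention probing (averaging over all layers and heads, and reading the attention from the masked state position, are illustrative assumptions, not necessarily the procedure used for the figure):
+
+ ```
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("guillermoruiz/mex_state")
+ mdl = AutoModelForMaskedLM.from_pretrained("guillermoruiz/mex_state",
+                                            output_attentions=True)
+
+ inputs = tok(f"{tok.mask_token} _GEO Vamos para Mérida.", return_tensors="pt")
+ with torch.no_grad():
+     out = mdl(**inputs)
+
+ # Stack per-layer attentions and average over layers and heads:
+ # (layers, batch, heads, seq, seq) -> (seq, seq)
+ att = torch.stack(out.attentions).mean(dim=(0, 2))[0]
+
+ # Attention paid by the masked state position to every token.
+ mask_pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero().item()
+ tokens = tok.convert_ids_to_tokens(inputs.input_ids[0])
+ for w, t in sorted(zip(att[mask_pos].tolist(), tokens), reverse=True)[:5]:
+     print(t, round(w, 3))
+ ```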


  ## Bias, Risks, and Limitations

  [More Information Needed]

+ ## Evaluation
+
+ This model is the one reported as MexLarge in the following table.
+ The test sets are tweets written in Mexico, and the models
+ with regional information (MexSmall and MexLarge) have a clear advantage over the
+ alternatives.
+
+ |Dataset | MexSmall | MexLarge | BETO | RoBERTuito | BERTIN | Metric |
+ |----------|----------|----------|----------|----------|----------|----------|
+ |RegTweets | 0.7014 | 0.7244 | 0.6843 | 0.6689 | 0.7083 | macro-F1 |
+ |MexEmojis | 0.5044 | 0.5047 | 0.4223 | 0.4491 | 0.4832 | macro-F1 |
+ |HomoMex | 0.8131 | 0.8266 | 0.8099 | 0.8283 | 0.7934 | macro-F1 |

+ The [RegTweets](https://huggingface.co/datasets/guillermoruiz/RegTweets) and [MexEmojis](https://huggingface.co/datasets/guillermoruiz/MexEmojis) datasets are available on Hugging Face.
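+
+ A minimal sketch of how such a fine-tuning evaluation could be reproduced (the dataset column names, label encoding, and the fine-tuned checkpoint name are assumptions; check the dataset cards for the real schema):
+
+ ```
+ from datasets import load_dataset
+ from sklearn.metrics import f1_score
+ from transformers import pipeline
+
+ ds = load_dataset("guillermoruiz/RegTweets", split="test")  # assumed split
+
+ # Hypothetical checkpoint fine-tuned on the classification task.
+ clf = pipeline("text-classification", model="my-finetuned-mex_state")
+
+ # Map "LABEL_k" strings back to integer ids before scoring.
+ preds = [int(clf(x)[0]["label"].split("_")[-1]) for x in ds["text"]]
+ print(f1_score(ds["label"], preds, average="macro"))
+ ```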
+
+ The following table shows results on generic Spanish texts.
+ The models with regional information remain highly competitive
+ with the alternatives.
+
+ | Dataset | MexSmall | MexLarge | BETO | RoBERTuito | BERTIN | Metric |
+ |----------|----------|----------|----------|----------|----------|----------|
+ | HAHA | 0.8208 | 0.8215 | 0.8238 | 0.8398 | 0.8063 | F1 |
+ | SemEval2018 Anger | 0.6435 | 0.6523 | 0.6148 | 0.6764 | 0.5406 | Pearson |
+ | SemEval2018 Fear | 0.7021 | 0.6993 | 0.6750 | 0.7136 | 0.6809 | Pearson |
+ | SemEval2018 Joy | 0.7220 | 0.7226 | 0.7124 | 0.7468 | 0.7391 | Pearson |
+ | SemEval2018 Sadness | 0.7086 | 0.7072 | 0.6781 | 0.7475 | 0.6548 | Pearson |
+ | SemEval2018 Valence | 0.8015 | 0.7994 | 0.7569 | 0.8017 | 0.6943 | Pearson |
+ | HOPE | 0.7115 | 0.7036 | 0.6852 | 0.7347 | 0.6872 | macro-F1 |
+ | RestMex 3 | 0.7528 | 0.7499 | 0.7629 | 0.7588 | 0.7583 | Special |
+ | HUHU | 0.7849 | 0.7932 | 0.7887 | 0.8169 | 0.7938 | F1 |



  ## Citation [optional]

  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

  **APA:**

  [More Information Needed]