Update README.md

We developed a language model for Telugu using Telugu_books, a Telugu text dataset from the Kaggle platform.
Only a few language models have been developed for regional languages such as Telugu, Hindi, and Kannada,
so we built a model dedicated to the Telugu language.
The model's aim is to predict a masked Telugu word in a given Telugu sentence, using the masked language modeling (MLM) objective introduced by BERT (Bidirectional Encoder Representations from Transformers),
and we report state-of-the-art performance on this task.

Files changed (1)

README.md +48 -3

@@ -5,6 +5,12 @@ tags:
 model-index:
 - name: xlm-roberta-base-finetuned-wikitext2
   results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -18,11 +24,16 @@ It achieves the following results on the evaluation set:
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
 ## Training and evaluation data
@@ -30,6 +41,40 @@ More information needed
 ## Training procedure
 ### Training hyperparameters
 The following hyperparameters were used during training:
@@ -55,4 +100,4 @@ The following hyperparameters were used during training:
 - Transformers 4.24.0
 - Pytorch 1.12.1+cu113
 - Datasets 2.7.1
-- Tokenizers 0.13.2

 model-index:
 - name: xlm-roberta-base-finetuned-wikitext2
   results: []
+language:
+- te
+metrics:
+- accuracy
+pipeline_tag: fill-mask
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 ## Model description
+We developed a language model for Telugu using Telugu_books, a Telugu text dataset from the Kaggle platform.
+Only a few language models have been developed for regional languages such as Telugu, Hindi, and Kannada,
+so we built a model dedicated to the Telugu language.
+The model's aim is to predict a masked Telugu word in a given Telugu sentence, using the masked language modeling (MLM) objective introduced by BERT (Bidirectional Encoder Representations from Transformers),
+and we report state-of-the-art performance on this task.
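As a rough illustration of the masked-language-modeling objective, the sketch below hides a random subset of words in a sentence and records the originals as labels. The name `mask_words`, the 15% masking rate, and whitespace tokenization are assumptions made for this example; the fine-tuned model itself operates on subword tokens from the xlm-roberta-base tokenizer.

```python
import random

MASK_TOKEN = "<mask>"  # mask token used by XLM-RoBERTa tokenizers


def mask_words(sentence, mask_prob=0.15, seed=None):
    """Replace a random subset of whitespace-separated words with MASK_TOKEN.

    Returns (masked_sentence, labels), where labels maps each masked
    position to the original word the model should recover.
    """
    rng = random.Random(seed)
    words = sentence.split()
    labels = {}
    for i, word in enumerate(words):
        if rng.random() < mask_prob:
            labels[i] = word
            words[i] = MASK_TOKEN
    # Guarantee at least one masked word so every non-empty sentence
    # contributes a prediction target.
    if not labels and words:
        i = rng.randrange(len(words))
        labels[i] = words[i]
        words[i] = MASK_TOKEN
    return " ".join(words), labels
```

During training, the model sees only the masked sentence and is scored on how well it recovers the words stored in `labels`.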
 ## Intended uses & limitations
+This model predicts the contextually appropriate word for a masked position in a given Telugu sentence. We report state-of-the-art performance on this task.
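A minimal sketch of how the model might be queried from Python, assuming the Transformers `fill-mask` pipeline and a sentence containing exactly one `<mask>` token. The helper name `predict_masked_word` and the `model_id` argument are hypothetical; the Hub repository id is not stated here and must be supplied by the caller.

```python
def predict_masked_word(masked_sentence, model_id, top_k=5):
    """Return [(predicted_token, probability), ...] for one <mask> token."""
    if masked_sentence.count("<mask>") != 1:
        raise ValueError("expected exactly one <mask> token in the sentence")

    # Imported lazily so the input check above works even when
    # transformers is not installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    predictions = fill_mask(masked_sentence, top_k=top_k)
    # With a single mask, the pipeline returns a list of dicts that
    # include the predicted token string and its score.
    return [(p["token_str"], p["score"]) for p in predictions]
```

Calling it requires network access to download the model and a real repository id in place of `model_id`.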
 ## Training and evaluation data
 ## Training procedure
+Step-1: Collecting Data
+The Telugu dataset is collected from Kaggle. It contains Telugu paragraphs from
+different books.
+Step-2: Pre-processing Data
+The collected data is cleaned using several pre-processing techniques, and
+long Telugu sentences are split into shorter ones.
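The splitting step could look roughly like the sketch below. The exact pre-processing techniques are not listed, so the punctuation set, the 30-word cap, and the name `split_into_short_sentences` are all assumptions for illustration.

```python
import re


def split_into_short_sentences(paragraph, max_words=30):
    """Split a paragraph on sentence-ending punctuation, then cap chunk length."""
    # Split after '.', '?', '!' or the danda when followed by whitespace.
    parts = re.split(r"(?<=[.?!।])\s+", paragraph)
    sentences = [s.strip() for s in parts if s.strip()]

    chunks = []
    for sentence in sentences:
        words = sentence.split()
        # Break overly long sentences into windows of at most max_words words.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```

Capping by word count is a crude stand-in for the tokenizer's own length limit; truncating at tokenization time is an alternative.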
+Step-3: Connecting to Hugging Face
+Hugging Face provides an access token that we use to log in from a notebook
+function, after which the rest of our work is exported to the platform
+automatically.
+Step-4: Loading pre-trained model and tokenizer
+The pre-trained xlm-roberta-base model and tokenizer are loaded for
+training on our Telugu data.
+Step-5: Training the model
+The required classes, Trainer and TrainingArguments, are imported from the
+Transformers library. After supplying the training arguments and our data, we
+train the model with the train() method, which takes 1 to 1½ hours depending on
+the size of the input data.
+Step-6: Pushing model and tokenizer
+The trainer.push_to_hub() and tokenizer.push_to_hub() methods are then used to
+export the trained model and its tokenizer, which maps text to tokens at
+prediction time.
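Steps 4 to 6 might be wired together roughly as follows. This is a sketch under stated assumptions, not the training script used for this model: the function name `fine_tune_masked_lm`, the `hub_repo_id` parameter, the 128-token limit, the 15% masking rate, and the epoch and batch-size values are all illustrative. It also assumes the Hugging Face login from Step-3 has already been done when `hub_repo_id` is set.

```python
def fine_tune_masked_lm(train_texts, eval_texts, output_dir, hub_repo_id=None):
    """Fine-tune xlm-roberta-base with the masked-language-modeling objective."""
    if not train_texts:
        raise ValueError("train_texts must not be empty")

    # Heavy dependencies are imported lazily, inside the function.
    from datasets import Dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    # Step-4: load the pre-trained model and tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=128)

    def to_dataset(texts):
        dataset = Dataset.from_dict({"text": list(texts)})
        return dataset.map(tokenize, batched=True, remove_columns=["text"])

    # The collator pads each batch and masks tokens at random (assumed 15%).
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    # Step-5: configure and run training (hyperparameter values are assumed).
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=8,
        push_to_hub=hub_repo_id is not None,
        hub_model_id=hub_repo_id,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=to_dataset(train_texts),
        eval_dataset=to_dataset(eval_texts),
        data_collator=collator,
    )
    trainer.train()

    # Step-6: export the trained model and its tokenizer to the Hub.
    if hub_repo_id is not None:
        trainer.push_to_hub()
        tokenizer.push_to_hub(hub_repo_id)
    return trainer
```

Leaving `hub_repo_id` as `None` keeps the run local, which is convenient for a dry run on a small sample before a full training pass.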
+Step-7: Testing
+On the model page on Hugging Face there is an inference API widget. We enter a
+Telugu sentence containing the <mask> token and click the Compute button;
+the predicted words are then displayed with their probabilities. We compare
+these predictions with the actual words to evaluate the model.
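The comparison in Step-7 can be scored automatically once predictions are collected. The sketch below computes top-k accuracy; the function name `top_k_accuracy` and the input layout (a ranked list of word and probability pairs per sentence) are assumptions for illustration. Because the model predicts subword tokens, exact word matching may undercount correct predictions for words that span several tokens.

```python
def top_k_accuracy(predictions, actual_words, k=1):
    """Fraction of examples whose actual word is among the first k predictions.

    predictions:  one list per example of (word, probability) pairs,
                  sorted by descending probability.
    actual_words: the words that were masked, in the same order.
    """
    if len(predictions) != len(actual_words):
        raise ValueError("predictions and actual_words must have the same length")
    if not actual_words:
        return 0.0

    hits = 0
    for candidates, actual in zip(predictions, actual_words):
        top_words = [word.strip() for word, _ in candidates[:k]]
        if actual.strip() in top_words:
            hits += 1
    return hits / len(actual_words)
```

With `k=1` this is plain accuracy on the top prediction; a larger `k` credits the model whenever the actual word appears anywhere in its first `k` suggestions.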
 ### Training hyperparameters
 The following hyperparameters were used during training:
 - Transformers 4.24.0
 - Pytorch 1.12.1+cu113
 - Datasets 2.7.1
+- Tokenizers 0.13.2