# Tokenizer
We trained our tokenizer with [sentencepiece](https://github.com/google/sentencepiece)'s unigram model, then loaded the trained tokenizer as `MT5TokenizerFast`.
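
The two steps above could look roughly like the following. This is a minimal sketch, not our exact training script: the corpus path, the `code_tokenizer` prefix, the vocabulary size, and the special-token ids (chosen to follow the T5 convention) are all assumptions, and the `MT5TokenizerFast` import depends on the installed `transformers` version.

```python
from pathlib import Path


def unigram_trainer_args(corpus_path: str, model_prefix: str, vocab_size: int) -> dict:
    """Build keyword arguments for SentencePieceTrainer.train.

    Every value here is an illustrative assumption, not the setting we used.
    """
    return {
        "input": corpus_path,          # plain-text file, one sample per line
        "model_prefix": model_prefix,  # trainer writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "unigram",
        # T5-style special-token ids: pad=0, eos=1, unk=2, no bos token.
        "pad_id": 0,
        "eos_id": 1,
        "unk_id": 2,
        "bos_id": -1,
    }


def train_and_load(corpus_path: str, model_prefix: str = "code_tokenizer", vocab_size: int = 32000):
    """Train a unigram sentencepiece model, then wrap it as MT5TokenizerFast."""
    # Imported lazily so the helper above works without these packages installed.
    import sentencepiece as spm
    from transformers import MT5TokenizerFast

    spm.SentencePieceTrainer.train(**unigram_trainer_args(corpus_path, model_prefix, vocab_size))
    vocab_file = Path(f"{model_prefix}.model")  # written by the trainer call above
    return MT5TokenizerFast(vocab_file=str(vocab_file))
```

Calling `train_and_load("corpus.txt")` (a hypothetical file name) would write `code_tokenizer.model` next to the script and return a fast tokenizer built from it. The vocabulary size has to be small enough for the corpus, otherwise sentencepiece refuses to train.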
## Model
We used the [MT5-base](https://huggingface.co/google/mt5-base) model.
## Datasets
We trained the model on the [Code Search Net](https://huggingface.co/datasets/code_search_net) dataset plus some data scraped from the internet. We maintained a list of datasets in which each dataset contained code in a single language.
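
One way to picture the "list of datasets, one language each" layout is the sketch below. It is only an illustration under assumptions: the function name `group_by_language`, the `(language, code)` input pairs, and the dictionary layout are hypothetical and are not taken from our training code.

```python
from collections import defaultdict


def group_by_language(samples):
    """Split (language, code) pairs into a list with one dataset per language."""
    buckets = defaultdict(list)
    for language, code in samples:
        buckets[language].append(code)
    # One entry per language, in the order each language was first seen.
    return [{"language": lang, "code": snippets} for lang, snippets in buckets.items()]


if __name__ == "__main__":
    mixed = [
        ("python", "def add(a, b): return a + b"),
        ("java", "int add(int a, int b) { return a + b; }"),
        ("python", "print('hello')"),
    ]
    for dataset in group_by_language(mixed):
        print(dataset["language"], len(dataset["code"]))
```

Keeping each language in its own dataset makes it straightforward to sample languages separately or to control how much each language contributes to a batch.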
## Plots
### Train loss
[Figure: train loss]
### Evaluation loss
[Figure: eval loss]
### Evaluation accuracy
[Figure: eval accuracy]
### Learning rate
[Figure: learning rate]
## Fine-tuning (WIP)
We fine-tuned the model on the [CodeXGLUE code-to-code-trans dataset](https://huggingface.co/datasets/code_x_glue_cc_code_to_code_trans) and on scraped data.