|
[tokenizer](#tokenizer) | [model](#model) | [datasets](#datasets) | [plots](#plots) | [fine tuning](#fine-tuning) |
|
|
|
## Tokenizer {#tokenizer} |
|
|
|
We trained the tokenizer with [sentencepiece](https://github.com/google/sentencepiece)'s unigram model, then loaded the result as an `MT5TokenizerFast`. |
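The two steps above could look roughly like the following. This is a minimal sketch, not the exact training script: the corpus path, the `model_prefix`, the vocabulary size, and the helper names `build_spm_args` and `train_and_load` are all assumptions made for illustration.

```python
def build_spm_args(corpus: str, model_prefix: str, vocab_size: int = 32000) -> dict:
    """Collect keyword arguments for a SentencePiece unigram training run.

    All values are illustrative assumptions; the real settings are not
    documented in this README.
    """
    return {
        "input": corpus,              # plain-text file, one sample per line
        "model_prefix": model_prefix, # SentencePiece writes <prefix>.model / <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "unigram",
    }


def train_and_load(corpus: str, model_prefix: str, vocab_size: int = 32000):
    """Train a unigram SentencePiece model, then wrap it as MT5TokenizerFast."""
    # Imported lazily so build_spm_args can be used without these packages.
    import sentencepiece as spm
    from transformers import MT5TokenizerFast

    spm.SentencePieceTrainer.train(**build_spm_args(corpus, model_prefix, vocab_size))
    # Load the trained SentencePiece model file into the fast tokenizer class.
    return MT5TokenizerFast(vocab_file=f"{model_prefix}.model")
```

A call such as `train_and_load("corpus.txt", "code_unigram")` would then return a tokenizer object; both file names here are hypothetical.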
|
|
|
## Model {#model} |
|
|
|
We used the [MT5-base](https://huggingface.co/google/mt5-base) model. |
|
|
|
## Datasets {#datasets} |
|
|
|
We trained the model on the [Code Search Net](https://huggingface.co/datasets/code_search_net) dataset plus some data scraped from the internet. We maintained a list of datasets in which each dataset contained code in a single language. |
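One way to build such a per-language list is sketched below. It assumes each record is a dict with a `language` key, as in the Hugging Face version of Code Search Net; the function name `split_by_language` and the sample records are hypothetical.

```python
from collections import defaultdict


def split_by_language(records):
    """Group records into one list per language, in first-seen language order."""
    groups = defaultdict(list)
    for record in records:
        # Assumption: every record carries a "language" field.
        groups[record["language"]].append(record)
    return list(groups.values())


# Tiny made-up sample standing in for real dataset rows.
sample = [
    {"language": "python", "func_code_string": "def f(): pass"},
    {"language": "go", "func_code_string": "func f() {}"},
    {"language": "python", "func_code_string": "def g(): pass"},
]
per_language = split_by_language(sample)
```

Here `per_language` holds two lists, one with the two Python records and one with the single Go record.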
|
|
|
## Plots {#plots} |
|
|
|
[train loss](#train_loss) | [evaluation loss](#eval_loss) | [evaluation accuracy](#eval_acc) | [learning rate](#lrs) |
|
|
|
### Train loss {#train_loss} |
|
|
|
![train loss](s3://crabby-images/f7b02/f7b02d592f1b32cc443d2aa1446662bff2d02d8a) |
|
|
|
### Evaluation loss {#eval_loss} |
|
|
|
![eval loss](s3://crabby-images/226bc/226bce1358f502510d34229299ce0d0b6b83cdad) |
|
|
|
### Evaluation accuracy {#eval_acc} |
|
|
|
![eval accuracy](s3://crabby-images/02054/02054d0718997fe1bd7d90ead0ef890ff7873864) |
|
|
|
### Learning rate {#lrs} |
|
|
|
![learning rate](s3://crabby-images/559ff/559ff6d4843940bd7459eb5e313dc6d9e6967a14) |
|
|
|
## Fine tuning {#fine-tuning} |
|
|
|
We fine-tuned the model on the [CodeXGLUE code-to-code-trans dataset](https://huggingface.co/datasets/code_x_glue_cc_code_to_code_trans) and on scraped data. |
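For a sequence-to-sequence model, each translation pair has to be turned into an input string and a target string. The sketch below shows one possible way to do that, assuming the `java` and `cs` fields of the Hugging Face version of the dataset. The task prefix and the function name `to_seq2seq_pair` are hypothetical; the preprocessing actually used for this model is not documented here.

```python
def to_seq2seq_pair(example, source="java", target="cs"):
    """Turn one translation record into an input/target text pair."""
    # Hypothetical task prefix; the real prompt format is an assumption.
    prefix = f"translate {source} to {target}: "
    return {
        "input_text": prefix + example[source],
        "target_text": example[target],
    }


# Made-up record in the shape assumed above.
pair = to_seq2seq_pair({"id": 0, "java": "int x = 1;", "cs": "int x = 1;"})
```

The resulting `input_text` and `target_text` strings can then be tokenized and passed to a standard sequence-to-sequence training loop.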
|
|