Update README.md
Browse files
README.md
CHANGED
@@ -1,25 +1,27 @@
|
|
1 |
---
|
2 |
license: apache-2.0
|
3 |
---
|
4 |
-
|
5 |
|
6 |
-
# mRAT-SQL-FIT
|
7 |
-
Code and model from the paper [paper published in Springer-Nature - International Journal of Information Technology](https://doi.org/10.1007/s41870-023-01342-3), [here the SharedIt link](https://rdcu.be/dff19). [here the pre-print in arXiv](https://arxiv.org/abs/2306.14256).
|
8 |
|
|
|
9 |
## A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention
|
10 |
Marcelo Archanjo Jose, Fabio Gagliardi Cozman
|
11 |
|
12 |
Long sequences of text are challenging in the context of transformers, due to quadratic memory increase in the self-attention mechanism. As this issue directly affects the translation from natural language to SQL queries (as techniques usually take as input a concatenated text with the question and the database schema), we present techniques that allow long text sequences to be handled by transformers with up to 512 input tokens. We propose a training process with database schema pruning (removal of tables and columns names that are useless for the query of interest). In addition, we used a multilingual approach with the mT5-large model fine-tuned with a data-augmented Spider dataset in four languages simultaneously: English, Portuguese, Spanish, and French. Our proposed technique used the Spider dataset and increased the exact set match accuracy results from 0.718 to 0.736 in a validation dataset (Dev). Source code, evaluations, and checkpoints are available at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).
|
13 |
|
|
|
|
|
|
|
14 |
# mRAT-SQL+GAP
|
15 |
-
|
16 |
|
17 |
## mRAT-SQL+GAP:A Portuguese Text-to-SQL Transformer
|
18 |
Marcelo Archanjo José, Fabio Gagliardi Cozman
|
19 |
|
20 |
The translation of natural language questions to SQL queries has attracted growing attention, in particular in connection with transformers and similar language models. A large number of techniques are geared towards the English language; in this work, we thus investigated translation to SQL when input questions are given in the Portuguese language. To do so, we properly adapted state-of-the-art tools and resources. We changed the RAT-SQL+GAP system by relying on a multilingual BART model (we report tests with other language models), and we produced a translated version of the Spider dataset. Our experiments expose interesting phenomena that arise when non-English languages are targeted; in particular, it is better to train with original and translated training datasets together, even if a single target language is desired. This multilingual BART model fine-tuned with a double-size training dataset (English and Portuguese) achieved 83% of the baseline, making inferences for the Portuguese test dataset. This investigation can help other researchers to produce results in Machine Learning in a language different from English. Our multilingual ready version of RAT-SQL+GAP and the data are available, open-sourced as mRAT-SQL+GAP at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).
|
21 |
|
22 |
-
[paper published in Springer Lecture Notes in Computer Science](https://link.springer.com/chapter/10.1007%2F978-3-030-91699-2_35), [here the pre-print in arXiv](https://arxiv.org/abs/2110.03546).
|
23 |
|
24 |
Based on: RAT-SQL+GAP: [Github](https://github.com/awslabs/gap-text2sql). Paper: [AAAI 2021 paper](https://arxiv.org/abs/2012.10309)
|
25 |
|
|
|
1 |
---
|
2 |
license: apache-2.0
|
3 |
---
|
4 |
+
Code, model and datasets: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).
|
5 |
|
|
|
|
|
6 |
|
7 |
+
# mRAT-SQL-FIT
|
8 |
## A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention
|
9 |
Marcelo Archanjo Jose, Fabio Gagliardi Cozman
|
10 |
|
11 |
Long sequences of text are challenging in the context of transformers, due to quadratic memory increase in the self-attention mechanism. As this issue directly affects the translation from natural language to SQL queries (as techniques usually take as input a concatenated text with the question and the database schema), we present techniques that allow long text sequences to be handled by transformers with up to 512 input tokens. We propose a training process with database schema pruning (removal of tables and columns names that are useless for the query of interest). In addition, we used a multilingual approach with the mT5-large model fine-tuned with a data-augmented Spider dataset in four languages simultaneously: English, Portuguese, Spanish, and French. Our proposed technique used the Spider dataset and increased the exact set match accuracy results from 0.718 to 0.736 in a validation dataset (Dev). Source code, evaluations, and checkpoints are available at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).
|
12 |
|
13 |
+
[paper published in Springer-Nature - International Journal of Information Technology](https://doi.org/10.1007/s41870-023-01342-3), [here the SharedIt link](https://rdcu.be/dff19). [here the pre-print in arXiv](https://arxiv.org/abs/2306.14256).
|
14 |
+
|
15 |
+
|
16 |
# mRAT-SQL+GAP
|
17 |
+
|
18 |
|
19 |
## mRAT-SQL+GAP:A Portuguese Text-to-SQL Transformer
|
20 |
Marcelo Archanjo José, Fabio Gagliardi Cozman
|
21 |
|
22 |
The translation of natural language questions to SQL queries has attracted growing attention, in particular in connection with transformers and similar language models. A large number of techniques are geared towards the English language; in this work, we thus investigated translation to SQL when input questions are given in the Portuguese language. To do so, we properly adapted state-of-the-art tools and resources. We changed the RAT-SQL+GAP system by relying on a multilingual BART model (we report tests with other language models), and we produced a translated version of the Spider dataset. Our experiments expose interesting phenomena that arise when non-English languages are targeted; in particular, it is better to train with original and translated training datasets together, even if a single target language is desired. This multilingual BART model fine-tuned with a double-size training dataset (English and Portuguese) achieved 83% of the baseline, making inferences for the Portuguese test dataset. This investigation can help other researchers to produce results in Machine Learning in a language different from English. Our multilingual ready version of RAT-SQL+GAP and the data are available, open-sourced as mRAT-SQL+GAP at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).
|
23 |
|
24 |
+
BRACIS 2021: [paper published in Springer Lecture Notes in Computer Science](https://link.springer.com/chapter/10.1007%2F978-3-030-91699-2_35), [here the pre-print in arXiv](https://arxiv.org/abs/2110.03546).
|
25 |
|
26 |
Based on: RAT-SQL+GAP: [Github](https://github.com/awslabs/gap-text2sql). Paper: [AAAI 2021 paper](https://arxiv.org/abs/2012.10309)
|
27 |
|