Marchanjo committed on
Commit 7253e96 · verified · 1 Parent(s): 140ff7f

Update README.md

Files changed (1)
  1. README.md +7 -5
README.md CHANGED
@@ -1,25 +1,27 @@
  ---
  license: apache-2.0
  ---
- more information: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).
+ Code, model and datasets: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).

- # mRAT-SQL-FIT
- Code and model from the paper [paper published in Springer-Nature - International Journal of Information Technology](https://doi.org/10.1007/s41870-023-01342-3), [here the SharedIt link](https://rdcu.be/dff19). [here the pre-print in arXiv](https://arxiv.org/abs/2306.14256).

+ # mRAT-SQL-FIT
  ## A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention
  Marcelo Archanjo Jose, Fabio Gagliardi Cozman

  Long sequences of text are challenging in the context of transformers, due to quadratic memory increase in the self-attention mechanism. As this issue directly affects the translation from natural language to SQL queries (as techniques usually take as input a concatenated text with the question and the database schema), we present techniques that allow long text sequences to be handled by transformers with up to 512 input tokens. We propose a training process with database schema pruning (removal of tables and columns names that are useless for the query of interest). In addition, we used a multilingual approach with the mT5-large model fine-tuned with a data-augmented Spider dataset in four languages simultaneously: English, Portuguese, Spanish, and French. Our proposed technique used the Spider dataset and increased the exact set match accuracy results from 0.718 to 0.736 in a validation dataset (Dev). Source code, evaluations, and checkpoints are available at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).

+ [paper published in Springer-Nature - International Journal of Information Technology](https://doi.org/10.1007/s41870-023-01342-3), [here the SharedIt link](https://rdcu.be/dff19). [here the pre-print in arXiv](https://arxiv.org/abs/2306.14256).
+
+
  # mRAT-SQL+GAP
- Code and model from BRACIS 2021:
+

  ## mRAT-SQL+GAP:A Portuguese Text-to-SQL Transformer
  Marcelo Archanjo José, Fabio Gagliardi Cozman

  The translation of natural language questions to SQL queries has attracted growing attention, in particular in connection with transformers and similar language models. A large number of techniques are geared towards the English language; in this work, we thus investigated translation to SQL when input questions are given in the Portuguese language. To do so, we properly adapted state-of-the-art tools and resources. We changed the RAT-SQL+GAP system by relying on a multilingual BART model (we report tests with other language models), and we produced a translated version of the Spider dataset. Our experiments expose interesting phenomena that arise when non-English languages are targeted; in particular, it is better to train with original and translated training datasets together, even if a single target language is desired. This multilingual BART model fine-tuned with a double-size training dataset (English and Portuguese) achieved 83% of the baseline, making inferences for the Portuguese test dataset. This investigation can help other researchers to produce results in Machine Learning in a language different from English. Our multilingual ready version of RAT-SQL+GAP and the data are available, open-sourced as mRAT-SQL+GAP at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).

- [paper published in Springer Lecture Notes in Computer Science](https://link.springer.com/chapter/10.1007%2F978-3-030-91699-2_35), [here the pre-print in arXiv](https://arxiv.org/abs/2110.03546).
+ BRACIS 2021: [paper published in Springer Lecture Notes in Computer Science](https://link.springer.com/chapter/10.1007%2F978-3-030-91699-2_35), [here the pre-print in arXiv](https://arxiv.org/abs/2110.03546).

  Based on: RAT-SQL+GAP: [Github](https://github.com/awslabs/gap-text2sql). Paper: [AAAI 2021 paper](https://arxiv.org/abs/2012.10309)
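
The mRAT-SQL-FIT abstract in this README describes database schema pruning as removing table and column names that are useless for the query of interest, so that the question plus schema fits within 512 input tokens. The sketch below only illustrates that idea and is not the authors' implementation: the schema format (a dict of table name to column names), the function names `prune_schema` and `serialize`, and the simple identifier-matching rule are all assumptions made for this example.

```python
import re


def prune_schema(schema, sql):
    """Keep only tables and columns whose names occur in the SQL query.

    schema: dict mapping table name -> list of column names (hypothetical format).
    sql: the SQL query associated with a training example.
    """
    # Collect lower-cased identifier-like tokens that appear in the query.
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql.lower()))
    pruned = {}
    for table, columns in schema.items():
        kept = [c for c in columns if c.lower() in tokens]
        # Simplification: keep a table if it, or any of its columns, is named.
        if table.lower() in tokens or kept:
            pruned[table] = kept
    return pruned


def serialize(question, schema):
    """Concatenate the question and the schema into one input string."""
    parts = [f"{table}: {', '.join(cols)}" for table, cols in schema.items()]
    return question + " | " + " | ".join(parts)


# Toy schema and query, invented for illustration.
schema = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "concert_name", "year"],
    "stadium": ["stadium_id", "location", "capacity"],
}
question = "How many singers are from France?"
sql = "SELECT count(*) FROM singer WHERE country = 'France'"

pruned = prune_schema(schema, sql)
full_input = serialize(question, schema)
pruned_input = serialize(question, pruned)
```

Here `pruned` keeps only the `singer` table with its `country` column, so `pruned_input` is shorter than `full_input`. Name matching this naive can keep extra tables when a column name is shared across tables or collides with an SQL keyword; a real pipeline would need a proper SQL parser.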