---
pretty_name: "VEGA"
language: 
  - code
tags:
  - C++/C Code
  - Compiler Backend
license: "cc-by-4.0"
---

# VEGA: Automatically Generating Compiler Backends Using a Pre-Trained Transformer Model

VEGA is an AI-driven system aimed at easing the development of compiler backends for new targets. This repository contains the code and data needed to replicate the experimental results reported in the paper.


## 1. Directory Structure
```
VEGA_AE
|──dataset
|──models
|   |──FT_Model
|   |──New_FT_Model
|   └──UnixCoder
└──Scripts
    |──Exp
    |   |──Acc
    |   |──Correction
    |   |──ForkFlow
    |   |──Perf
    |   └──Time
    └──UnixCoder
```
## 2. Hardware Dependency

- 8 Nvidia Tesla V100 GPUs, each with 16 GB of memory.

## 3. Software Dependency
- CUDA == 11.7.
- Python == 3.8.1.
- Conda (any version that supports installing Python 3.8.1).

## 4. Installation


- Download the artifact from https://huggingface.co/docz1105/VEGA_AE.

```
$ git lfs clone https://huggingface.co/docz1105/VEGA_AE
$ cd VEGA_AE
```

- Set up a Conda virtual environment.

We provide a pre-packaged Conda virtual environment in ```./vega_ae.yml```, which pins specific versions of Python and the required packages. The environment can be created directly with the following command.

```
$ conda env create -f vega_ae.yml
$ conda activate vega_ae
```

Alternatively, the environment can be created manually:
```
$ conda create -n vega_ae python=3.8.1
$ conda activate vega_ae
$ pip install -r requirements.txt
```


## 5. Code Generation

We provide a fine-tuned model in ```./models/FT_Model```, which was fine-tuned with ```./dataset/train.jsonl``` and ```./dataset/valid.jsonl```. The ```train.jsonl``` and ```valid.jsonl``` files contain function templates, feature vectors, and ground truth for the 98 backends in our dataset (excluding RISC-V, RI5CY, and xCORE).

We also provide a script for a quick functionality test, which generates only a single function for RI5CY (recorded as PULP in our dataset) and takes less than 3 minutes on 8 Nvidia Tesla V100 GPUs.

- **Run functionality test with:**

```
$ bash run_function_test.sh
```

When the ```run_function_test.sh``` script begins execution, the command line displays:
```
Start Function Inferencing !
```
Upon completion of the code generation, the script outputs:
```
Finished Function Inferencing.
```

The inference result will be saved in ```./models/FT_Model/result.jsonl```.

Check the generated code with:
```
$ cat ./models/FT_Model/result.jsonl
```

In the `result.jsonl` file, the meaning of each item in an entry can be found in the following table:


| Item | Description | 
| ---- | ----|
| vega_code | The model-generated code. |
| ans_code | The ground truth of the code. |
| vega_pre | The model-generated confidence score. |
| ans_pre | The ground truth of the confidence score. |
| File | The file to which this item belongs. |
| Function | The function to which this item belongs. |
| Module | The function module to which this item belongs. |
| Target | The target to which this item belongs. Note that we use "PULP" to represent "RI5CY" in our dataset. |
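
For a quick programmatic look at the output, the entries can also be inspected directly (a minimal sketch using the fields listed in the table above):

```python
import json

# Print the first entry of the inference result.
with open("./models/FT_Model/result.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        print(f'{entry["Target"]} / {entry["File"]} / {entry["Function"]}')
        print("generated:", entry["vega_code"])
        print("reference:", entry["ans_code"])
        break  # show only the first entry
```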

- **Run code generation with:**


The fine-tuned model will take function templates and feature vectors for RISC-V, RI5CY, and xCORE from ```./dataset/test.jsonl``` as input, generating code and confidence scores automatically.

```
$ bash run_test.sh
```

Customize the code-generation parameters by modifying the following options in ```run_test.sh```:
```
 --model_name_or_path ../../models/UnixCoder \
 --test_filename ../../dataset/test.jsonl \
 --output_dir ../../models/FT_Model \
 --beam_size 1 \
 --train_batch_size 256 \
 --eval_batch_size 256 \
 --learning_rate 6e-5 \
 --gradient_accumulation_steps 2 \
 --num_train_epochs 10 \
 --mse_loss_weight 0.9  \
 --ce_loss_weight 0.1
```
Users can run inference with their own fine-tuned model by changing the ```--output_dir``` option.



When the ```run_test.sh``` script begins execution, the command line displays:
```
Start Inferencing !
```
Upon completion of the code generation, the script outputs:
```
Finished Inferencing.
```

The inference result will be saved in ```./models/FT_Model/result.jsonl```.

Note that if a ```./models/FT_Model/result.jsonl``` file already exists, it will be **overwritten** after the execution of ```run_function_test.sh``` or ```run_test.sh```.
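
If the output of a previous run should be kept, copy it aside before re-running either script (a minimal sketch; the backup file name is arbitrary):

```python
import shutil
from pathlib import Path

result = Path("./models/FT_Model/result.jsonl")
if result.exists():
    # Keep a copy of the previous inference result before it is overwritten.
    shutil.copy2(result, result.with_name("result.jsonl.bak"))
```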

## 6. Fine-Tuning (**Optional**)


We provide the original UnixCoder-base-nine in ```./models/UnixCoder```. It can also be downloaded from HuggingFace: https://huggingface.co/microsoft/unixcoder-base-nine.


The original UnixCoder-base-nine is fine-tuned with the provided ```./dataset/train.jsonl``` and ```./dataset/valid.jsonl``` using the following command.

- **Run fine-tuning with:**
```
$ bash run_fine_tuning.sh
```

Customize the fine-tuning parameters by modifying the following options in ```run_fine_tuning.sh```:
```
  --model_name_or_path ../../models/UnixCoder \
  --train_filename ../../dataset/train.jsonl \
  --dev_filename ../../dataset/valid.jsonl \
  --output_dir ../../models/New_FT_Model \
  --beam_size 4 \
  --train_batch_size 64 \
  --eval_batch_size 48 \
  --learning_rate 6e-5 \
  --num_train_epochs 50 \
  --mse_loss_weight 0.9 \
  --ce_loss_weight 0.1
```
The fine-tuned model will be saved in ```--output_dir```.
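
The two loss weights combine a cross-entropy term for code-token generation with an MSE term for the confidence score. A minimal sketch of such a weighted objective (an illustration of the weighting only, not the exact loss implementation used in the training scripts; all names are hypothetical):

```python
import torch.nn.functional as F

def combined_loss(token_logits, token_targets, pred_conf, true_conf,
                  mse_loss_weight=0.9, ce_loss_weight=0.1):
    # Cross-entropy over the generated code tokens.
    ce = F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                         token_targets.view(-1))
    # MSE between the predicted and ground-truth confidence scores.
    mse = F.mse_loss(pred_conf, true_conf)
    return mse_loss_weight * mse + ce_loss_weight * ce
```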


## 7. Reproducing Results in the Experiment

We provide the scripts to reproduce each Figure/Table from the paper, along with the corresponding output result files, in the following table:


| Script | Description | Output | Figure/Table |
| ---- | ---- | ---- | ---- |
| ./Scripts/Exp/Time/gen_time.py | Calculate the time overhead for VEGA to generate the three backends. | ./Scripts/Exp/Time/Fig7.csv | Fig. 7 |
| ./Scripts/Exp/Acc/gen_accuracy.py | Calculate the function-level accuracy of the three VEGA-generated backends. | ./Scripts/Exp/Acc/Fig8_Acc.csv | Fig. 8 |
| ./Scripts/Exp/Acc/gen_purple.py | Calculate the results of the purple bar in Fig. 8. | ./Scripts/Exp/Acc/Fig8_Purple.csv | Fig. 8 |
| ./Scripts/Exp/Acc/gen_accuracy.py | Calculate the percentage of the three types of errors in the three VEGA-generated backends. | ./Scripts/Exp/Acc/Table2.csv | Table 2 |
| ./Scripts/Exp/ForkFlow/gen_forkflow.py | Calculate the statement-level accuracy of VEGA-generated and ForkFlow-generated backends. | ./Scripts/Exp/ForkFlow/Fig9.csv | Fig. 9 |
| ./Scripts/Exp/ForkFlow/gen_forkflow.py | Calculate the number of statements accurately generated by VEGA and the number requiring manual correction, for the three backends. | ./Scripts/Exp/ForkFlow/Table3.csv | Table 3 |
| ./Scripts/Exp/Correction/gen_correct.py | Calculate the time required by two developers to modify the VEGA-generated RISC-V backend. | ./Scripts/Exp/Correction/Table4.csv | Table 4 |
| ./Scripts/Exp/Perf/gen_perf.py | Calculate the speedup of LLVM-Base (-O3) and LLVM-VEGA (-O3) over LLVM-Base (-O0) on three benchmarks. | ./Scripts/Exp/Perf/Fig10.csv | Fig. 10 |

### 7.1 Results for Fig. 7

In the code generation process, we set a batch size of 256 on 8 Nvidia Tesla V100 GPUs (each with 16 GB of memory), meaning each batch contains 256 statements. Since a batch may include statements from different function modules, we did not directly measure the generation time of each function module of the three targets (RISC-V, RI5CY, xCORE) during execution. Instead, we measured the average inference time per batch (25 seconds) and derived the inference time per statement (25/256 seconds). With the total number of statements in each function module of each target, we then calculated the total inference time required for each function module of each target.
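
As a concrete illustration of this arithmetic (a minimal sketch; the module names and statement counts below are placeholders, not the real figures used by ```gen_time.py```):

```python
# Derive per-module generation time from the measured batch latency.
BATCH_LATENCY_S = 25   # average inference time per batch (seconds)
BATCH_SIZE = 256       # statements per batch
per_statement_s = BATCH_LATENCY_S / BATCH_SIZE  # ~0.098 s per statement

# Hypothetical statement counts per function module for one target.
module_statement_counts = {"ModuleA": 1200, "ModuleB": 800}

for module, n_statements in module_statement_counts.items():
    total_s = n_statements * per_statement_s
    print(f"{module}: {total_s / 60:.1f} minutes")
```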


- Command:
```
$ python ./Scripts/Exp/Time/gen_time.py
```


- Results:
```
$ cat ./Scripts/Exp/Time/Fig7.csv
```

### 7.2 Results for Fig. 8


In our experiment, we employed the Pass@1 evaluation metric, which involves replacing each VEGA-generated function individually within the official LLVM (LLVM-Base), then running regression tests to verify the correctness of the replaced function. This process is highly time-consuming, as a single regression test run generally takes about half an hour. Thus, sequentially testing all 1,454 VEGA-generated functions across three targets would require approximately 727 hours.

To simplify this process, we recorded the ground truth for each statement based on the Pass@1 experiment results. Additionally, we documented a list of functions containing Err-Def errors (i.e., errors due to missing necessary statements in the function template; functions with Err-Def errors cannot pass all regression tests). This allowed us to transform the Pass@1 testing process into an Exact Match evaluation.

In this Exact Match evaluation, each statement is deemed correct if the VEGA-generated code matches the ground truth and the confidence score aligns. A function is considered correct if all statements within it are accurate and it is free from Err-Def errors.
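
A minimal sketch of such an Exact Match check over ```result.jsonl``` (assuming the field names documented in Section 5; the Err-Def function list is a placeholder, and this is an illustration rather than the logic of ```gen_accuracy.py```):

```python
import json
from collections import defaultdict

err_def_functions = set()  # placeholder: (Target, Function) pairs known to contain Err-Def errors
statements_by_function = defaultdict(list)

with open("./models/FT_Model/result.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        ok = (entry["vega_code"] == entry["ans_code"]
              and entry["vega_pre"] == entry["ans_pre"])
        statements_by_function[(entry["Target"], entry["Function"])].append(ok)

# A function is correct if every statement matches and it has no Err-Def error.
correct = sum(
    all(oks) and func not in err_def_functions
    for func, oks in statements_by_function.items()
)
print(f"function-level accuracy: {correct / len(statements_by_function):.2%}")
```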

- Command:
```
$ cp ./models/FT_Model/result.jsonl ./Scripts/Exp/Acc
$ python ./Scripts/Exp/Acc/gen_accuracy.py
```

This script automatically analyzes VEGA's output in ```result.jsonl```, compares the generated code and confidence scores with the ground truth, and determines whether each function is correct.

- Accuracy Results:
```
$ cat ./Scripts/Exp/Acc/Fig8_Acc.csv
```


We also provide a script for calculating the proportion of "Accurate Functions with Integrated Statements Across Multiple Targets", i.e., the value corresponding to the purple bar in Fig. 8.


- Command:
```
$ python ./Scripts/Exp/Acc/gen_purple.py
```


- Results:
```
$ cat ./Scripts/Exp/Acc/Fig8_Purple.csv
```



### 7.3 Results for Table 2

Executing the script in Section 7.2 also yields the proportion of the three types of errors for each target.


- Command:
```
$ python ./Scripts/Exp/Acc/gen_accuracy.py
```


- Results:
```
$ cat ./Scripts/Exp/Acc/Table2.csv
```


### 7.4 Results for Fig. 9

We modified the functions generated by VEGA and the functions in the MIPS backend (ForkFlow) so that they run correctly on the RISC-V, RI5CY, and xCORE backends, respectively. The function code of the MIPS backend is provided in the ```./Scripts/Exp/ForkFlow/Mips_Code``` directory, and the manually fixed code for the RISC-V, RI5CY, and xCORE LLVM backends is in ```./Scripts/Exp/ForkFlow/Std_Code```. Additionally, the script in Section 7.2 automatically writes the VEGA-generated code from ```result.jsonl``` into the ```./Scripts/Exp/ForkFlow/VEGA_Code``` directory for comparison. Executing the following script automatically calculates the proportion of accurate and manually modified statements for both the VEGA-generated functions and the ForkFlow process.
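
For intuition, a minimal sketch of a statement-level comparison between a generated file and its manually fixed counterpart (an illustration only, not the logic of ```gen_forkflow.py```; the file names in the usage comment are hypothetical):

```python
from pathlib import Path

def statement_accuracy(generated: Path, reference: Path) -> float:
    """Fraction of non-empty lines in `generated` that exactly match `reference`."""
    gen_lines = [l.strip() for l in generated.read_text().splitlines() if l.strip()]
    ref_lines = [l.strip() for l in reference.read_text().splitlines() if l.strip()]
    matches = sum(g == r for g, r in zip(gen_lines, ref_lines))
    return matches / max(len(ref_lines), 1)

# Example usage (hypothetical file names):
# acc = statement_accuracy(Path("./Scripts/Exp/ForkFlow/VEGA_Code/SomeFile.cpp"),
#                          Path("./Scripts/Exp/ForkFlow/Std_Code/SomeFile.cpp"))
```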

- Command:
```
$ python ./Scripts/Exp/ForkFlow/gen_forkflow.py
```


- Results:
```
$ cat ./Scripts/Exp/ForkFlow/Fig9.csv
```

### 7.5 Results for Table 3

Executing the script in Section 7.4 also outputs the number of statements accurately generated by VEGA and the number requiring manual correction, across the seven function modules of RISC-V, RI5CY, and xCORE.


- Command:
```
$ python ./Scripts/Exp/ForkFlow/gen_forkflow.py
```


- Results:
```
$ cat ./Scripts/Exp/ForkFlow/Table3.csv
```


### 7.6 Results for Table 4

The data in Table 4 show the time two developers needed to modify the VEGA-generated RISC-V backend. Since this is a human-based experiment, we provide only the recorded modification time for each function.

The following script computes the total time spent by Developers A and B to modify each **function module** in the VEGA-generated RISC-V backend, based on the recorded times for each **function**.
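
A minimal sketch of this aggregation (the function-to-module mapping and times below are placeholders, not the recorded data used by ```gen_correct.py```):

```python
from collections import defaultdict

# Placeholder records: (function module, function name, minutes spent by a developer).
records = [
    ("ModuleA", "FunctionX", 12.0),
    ("ModuleA", "FunctionY", 8.5),
    ("ModuleB", "FunctionZ", 5.0),
]

# Sum per-function times into per-module totals.
per_module = defaultdict(float)
for module, _function, minutes in records:
    per_module[module] += minutes

for module, total in per_module.items():
    print(f"{module}: {total:.1f} minutes")
```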

- Command:
  
```
$ python ./Scripts/Exp/Correction/gen_correct.py
```

- Results:
```
$ cat ./Scripts/Exp/Correction/Table4.csv
```

### 7.7 Results for Fig. 10

Due to commercial licensing restrictions, we cannot provide the source code of the SPEC 2017 CPU benchmark used in this experiment. Additionally, testing all benchmarks, including SPEC 2017 CPU, is time-intensive, requiring around 565 hours in total. To address these constraints, we provide our recorded experimental data.

Running the following script will automatically calculate the speedup of the VEGA-generated LLVM backend (LLVM-VEGA) with the "-O3" optimization over the performance of the official LLVM backend (LLVM-Base) with "-O0", as well as the speedup of LLVM-Base with "-O3" over its own performance with "-O0".
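
The speedups in Fig. 10 are ratios of execution times; a minimal sketch of the arithmetic (the runtimes below are placeholders, not the recorded measurements):

```python
# Placeholder execution times (seconds) for one benchmark.
base_O0 = 120.0   # official LLVM backend (LLVM-Base), -O0
base_O3 = 40.0    # official LLVM backend (LLVM-Base), -O3
vega_O3 = 42.0    # VEGA-generated backend (LLVM-VEGA), -O3

speedup_base_O3 = base_O0 / base_O3  # LLVM-Base (-O3) over LLVM-Base (-O0)
speedup_vega_O3 = base_O0 / vega_O3  # LLVM-VEGA (-O3) over LLVM-Base (-O0)
print(f"LLVM-Base -O3: {speedup_base_O3:.2f}x, LLVM-VEGA -O3: {speedup_vega_O3:.2f}x")
```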


- Command:
```
$ python ./Scripts/Exp/Perf/gen_perf.py
```

- Results:
```
$ cat ./Scripts/Exp/Perf/Fig10.csv
```



## 8. Experiment Customization

Users can run this experiment in different software environments, but they must ensure that the PyTorch version is compatible with the installed CUDA version. The experiment can also be conducted on different hardware, but the batch sizes for fine-tuning and inference must be adjusted to the available GPU memory. We have fixed the random seed and parameters in the provided scripts to ensure consistent code generation accuracy within the same hardware and software environment. However, if the model is re-fine-tuned under a different hardware or software environment, the accuracy of the newly fine-tuned model may exhibit slight variations.
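
As a starting point for such adjustments, a minimal sketch for inspecting per-GPU memory before choosing a batch size (the linear scaling rule here is an assumption for illustration, not a calibrated formula):

```python
import torch

if torch.cuda.is_available():
    gib = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
    # Assumption: the default batch size of 256 was chosen for 8 x 16 GB V100s,
    # i.e. roughly 32 statements per 16 GB GPU; scale linearly with per-GPU memory.
    per_gpu_batch = int(32 * gib / 16)
    print(f"GPU memory: {gib:.0f} GiB, suggested per-GPU batch size: {per_gpu_batch}")
```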

We further conducted code generation tests on a machine with **an Nvidia A100 GPU (80GB memory)** and **CUDA Version == 12.0**. Under the provided Conda virtual environment, the experimental results showed a **25-minute reduction in the time overhead** of the code generation process (Fig. 7). This reduction is due to the A100 GPU's higher computational efficiency compared to the V100, as well as the additional time costs in the previous setup with 8 V100 GPUs caused by synchronization requirements across multiple GPUs. Notably, **code accuracy remained unchanged** (Fig. 8, Fig. 9, Table 2, Table 3). This confirms that our experiment is adaptable across different hardware and software environments.





## Citation
```
@inproceedings{zhong2025vega,
  title={VEGA: Automatically Generating Compiler Backends Using a Pre-Trained Transformer Model},
  author={Ming Zhong and Fang Lv and Lulin Wang and Lei Qiu and Yingying Wang and Ying Liu and Huimin Cui and Xiaobing Feng and Jingling Xue},
  booktitle={2025 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)},
  year={2025}
}
```