In recent years, deep learning (DL) has achieved excellent performance in software engineering (SE) tasks. However, this performance relies on large-scale training sets, which hinders the application of DL techniques in practical tasks. With the release of pre-trained models (PTMs) in the deep learning field, researchers in SE have begun to pay attention to PTMs and have introduced them into SE tasks. PTMs have brought a qualitative leap in SE tasks, ushering intelligent software engineering into a new era. However, no existing study has distilled the successes, failures, and opportunities of pre-trained models in SE. To clarify the work in this cross-field (PTM4SE: pre-trained models for software engineering), we systematically review the current studies related to PTM4SE. Specifically, we first describe the framework of intelligent software engineering methods based on pre-trained models. We then analyze and discuss the pre-trained models commonly used in SE. Meanwhile, we introduce the downstream SE tasks that use pre-trained models in detail, and compare and analyze the performance of pre-trained model techniques on these tasks. We then present the datasets used in SE for training and fine-tuning PTMs. Finally, we discuss the challenges and opportunities for PTM4SE.
To address the problem that DL-based intelligent SE methods require large amounts of labeled data, researchers in SE have proposed many PTM-based methods for SE tasks (i.e., pre-trained model-based intelligent software engineering methods). These methods use a small amount of labeled data from an SE downstream task to fine-tune an intelligent model built on an existing PTM, yielding a PTM4SE method that solves the downstream task (e.g., code generation, program repair, and issue report classification).
The construction process mainly consists of four parts: (1) collecting and processing the SE downstream task data, (2) constructing the intelligent method based on a pre-trained model, (3) training (fine-tuning) the model, and (4) evaluating the model.
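The four parts above can be sketched in plain Python. This is a deliberately tiny, illustrative stand-in: the keyword-scoring "encoder", the toy issue reports, and the threshold "fine-tuning" below are all hypothetical, substituting for a real PTM such as CodeBERT and a real task head.

```python
# Schematic of the four-part PTM4SE pipeline: data processing, method
# construction on a pre-trained component, fine-tuning, and evaluation.
# Everything here is illustrative, not from any cited study.

# Part 1: collect and process a small labeled downstream dataset
# (toy issue-report classification; label 1 = bug report).
raw = [
    ("app crashes on startup", 1),
    ("crash when saving file", 1),
    ("add dark mode please", 0),
    ("feature request: export to csv", 0),
]

def preprocess(text):
    return text.lower().split()

# Part 2: build the method on top of a frozen "pre-trained" component.
# The stand-in encoder maps tokens to one feature: a bug-likelihood score.
PRETRAINED_BUG_WORDS = {"crash", "crashes", "error", "exception"}

def encode(tokens):
    return sum(t in PRETRAINED_BUG_WORDS for t in tokens)

# Part 3: "fine-tune" a lightweight task head (here just a decision
# threshold) on the small labeled dataset, keeping the encoder frozen.
def fine_tune(dataset):
    best_t, best_acc = 0, 0.0
    for t in range(0, 3):
        acc = sum((encode(preprocess(x)) > t) == bool(y)
                  for x, y in dataset) / len(dataset)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Part 4: evaluate the resulting model on unseen inputs.
threshold = fine_tune(raw)

def predict(text):
    return int(encode(preprocess(text)) > threshold)
```

In a real PTM4SE method, `encode` would be a pre-trained Transformer and `fine_tune` would update a classification head (and optionally the encoder) by gradient descent; the control flow, however, is the same.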
Fig.1 Research framework of intelligent software engineering methods based on pre-trained models
Since 2018, researchers in SE have begun to introduce different types of PTMs into SE-related tasks. We collected the intelligent software engineering studies that use PTMs and divided the models into three types: off-the-shelf models, domain-specific models, and source code models.
Fig.2 Distribution of pre-trained models used in software engineering
Off-the-shelf models are pre-trained models trained on general-domain datasets in the DL field, e.g., the BERT, GPT, and XLNet models, which are pre-trained on English Wikipedia and general news datasets in natural language processing (NLP), and the ResNet and VGG models, which are pre-trained on the ImageNet dataset in computer vision (CV). Thus, we divide the off-the-shelf models into two categories: off-the-shelf models in NLP and off-the-shelf models in CV.
Domain-specific models are pre-trained models trained on SE-specific datasets (e.g., GitHub, Stack Overflow, and JIRA). In recent years, researchers in SE have collected a large number of SE-specific datasets to re-train DL models, such as seBERT, the Text-To-Text Transfer Transformer (T5) model, Word2Vec-SO, BERT-reviews, BERT-SO-1M, BERT-SO-1M-Large, and RoBERTa-SO in Fig. 2.
Source code models are pre-trained models trained on source code to capture the syntactic and semantic information it contains. To date, researchers in SE have collected source code in different programming languages to re-train DL models, such as Code2Vec, CodeT5, CodeBERT, GraphCodeBERT, C-BERT, CuBERT, PLBART, OSCAR, InferCode, and DOBF in Fig. 2.
Datasets, as one of the key components of PTMs, affect the performance of PTMs on SE-related tasks. To achieve higher performance with intelligent software engineering methods, researchers in SE have collected different types of SE datasets to train or fine-tune the models. To present and understand the current SE datasets, we summarized and analyzed them, and divided them into PTM datasets and SE-related downstream task datasets.
PTM datasets are datasets used to train a DL model from scratch. The PTM datasets frequently used in SE are listed in the table below, and are also collected in the dataset files in this repository.
| Type | Dataset | Programming Language | Source | Scale | Open Time | PTMs |
|------|---------|----------------------|--------|-------|-----------|------|
| PL | | Ruby/JavaScript/Go/Python/Java/PHP/C/C# | GitHub+BigQuery | 8.35G | 2021 | CodeT5 |
| PL | GitHub C language repositories | C | GitHub | 5.8G | 2020 | C-BERT |
| PL | Java and TypeScript datasets | Java/TypeScript | GitHub | | 2020 | CugLM |
| PL | Java datasets | Java | GitHub | | 2021 | SynFix |
| PL | | Java/C/C++ | AtCoder+CodeJam | 17.6M | 2019 | IR-BERT |
| PL | | Java | GitHub | 32G | 2020 | InferCode, Code2vec |
| PL | | Python | GitHub | 190M | 2020 | CodeTrek |
| PL | unique Python files | Python | GitHub | 159GB | 2021 | CodeX |
| PL | | Java | GitHub | 4.7M | 2020 | Coder |
| PL | Python and Java pre-training corpus | Java/Python | GitHub | 21.3M | 2021 | CuBERT/TreeBERT |
| NL | | English | Stack Overflow+GitHub+Jira | 119.7G | 2021 | seBERT |
| NL | | English | Stack Overflow | 3.16M | 2020 | BERTOverflow, CosSensBERT |
| NL+PL | | Java/English | GitHub+SO | 52.5M | 2022 | T5 |
| NL+PL | | Java/English | GitHub | 1.5M | 2021 | T5 |
| NL+PL | | Java/Python/English | BigQuery+GitHub | 655G | 2021 | PLBART |
| NL+PL | | Ruby/JavaScript/Go/Python/Java/PHP/English | GitHub | 3.5G | 2019 | T-BERT, GraphCodeBERT, CodeBERT |
| NL+PL | Python corpus of CodeSearchNet dataset | Python | GitHub | 1.6M | 2019 | CLAWSAT/CODE-MVP |
| NL+PL | Java corpus of CodeSearchNet dataset | Java | GitHub | 2.0M | 2019 | CLAWSAT |
| NL+PL | AnghaBench | C | GitHub | 0.53M | 2020 | COMBO |
SE-related downstream task datasets are the datasets used to fine-tune intelligent DL models for SE-related downstream tasks. Commonly used SE-related downstream datasets are listed in the following table.
| Type | Tasks | Dataset | Programming Language | Scale | Open Time |
|------|-------|---------|----------------------|-------|-----------|
| PL | Code Classification | | Java | 102041 | 2020 |
| PL | Code Classification | POJ104 | C/C++ | 30815 | 2021 |
| PL | Code Classification | CodeCloneBench | Java | 901028 | 2014 |
| PL | Code Classification | | | | 2016 |
| PL | Code Classification | | | 47398 | 2020 |
| PL | Code Classification | | C | 298917 | 2021 |
| PL | Code Classification | | C/C++ | 13600 | 2005 |
| PL | Code Classification | | C/C++ | 4919 | 2006 |
| PL | Code Classification | | C | 27318 | 2021 |
| PL | Code Classification | Merge Conflicts Dataset | C#/JavaScript/TypeScript/Java | 219934 | 2022 |
| PL | Code Classification | | Java/C#/C++/Python/JavaScript | 2250000 | 2022 |
| PL | Code Classification | | C/C++ | 61638 | 2018 |
| PL | Code Classification | | C/C++ | 1274366 | 2018 |
| PL | Code Classification | | C/C++ | 18169 | 2020 |
| PL | Code Classification | | C/C++ | 181641 | 2019 |
| PL | Code Classification | | C/C++ | 1295623 | 2021 |
| PL | Program Repair | | Java | 63923 | 2021 |
| PL | Program Repair | | Java | 46680 | 2019 |
| PL | Program Repair | | JavaScript | 104804 | 2021 |
| PL | Program Repair | | Python/Java | | 2017 |
| PL | Program Repair | | Java/Python/C/JavaScript | 9675342 | 2020 |
| PL | Program Repair | | JavaScript | | 2016 |
| PL | Program Repair | | C | 10468 | 2015 |
| PL | Code Completion | Java and TypeScript datasets | Java/TypeScript | | 2020 |
| PL | Code Completion | | Python | 74749 | 2020 |
| PL | API Recommendation | Req2Lib-dataset | Java | 5625 | 2020 |
| PL | Code Translation | | Java/C# | 10300 | 2021 |
| PL | Code Translation | | Python | 240000 | 2021 |
| NL | Text Classification | | English | | 2012 |
| NL | Text Classification | | English | 1793 | 2021 |
| NL | Text Classification | | English | | 2021 |
| NL | Text Classification | | | 10096 | 2019 |
| NL | Review Responses Automatic Generation | review-response pairs datasets | English | 570881 | 2020 |
| NL | Link Prediction | | English | 1834 | 2021 |
| PL+NL | Code Generation | | Java | 100000 | 2018 |
| PL+NL | Code Generation | DJANGO | Python | 18805 | 2015 |
| PL+NL | Code Generation | | Python | 13946 | 2019 |
| PL+NL | Code Generation | | Python | 974 | 2021 |
| PL+NL | Code Generation | | SQL | 5693 | 2018 |
| PL+NL | Code Generation | | Python | 232421 | 2021 |
| PL+NL | Code Generation | | C++/Python/Java | 13610 | 2022 |
| PL+NL | Code Generation | | Python | | 2021 |
| PL+NL | Code Summarization | | | 1600 | 2017 |
| PL+NL | Code Summarization | | Java | 1953940 | 2020 |
| PL+NL | Code Summarization | | Ruby/JavaScript/Go/Python/Java/PHP | | 2019 |
| PL+NL | Code Summarization | Java projects from GitHub | Java | 134239 | |
| PL+NL | Code Summarization | | Python | 30000 | 2016 |
| PL+NL | Code Search | | Python | 280634 | 2021 |
| PL+NL | Code Search | | Python/Java | 79809 | 2018 |
| PL+NL | Code Search | | Python | 13250 | 2020 |
| PL+NL | Code Search | | Python | 147546 | 2018 |
| PL+NL | Code Search | | Python | 20604 | 2021 |
| PL+NL | Code Review | | Python/Java/Go/C++/JavaScript/C/C#/PHP/Ruby | | 2022 |
| PL+NL | Synthesis | | | | 2021 |
| CV | UML Diagram Classification | | UML Diagram | 14815 | 2016 |
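Before fine-tuning, a downstream-task dataset such as those listed above is typically partitioned into training, validation, and test portions. The sketch below shows one common way to do this with only the standard library; the 80/10/10 ratios and the toy records are illustrative choices, not prescribed by any study in the survey.

```python
import random

# Shuffle a labeled downstream dataset reproducibly, then split it into
# train/validation/test portions for fine-tuning and evaluation.
def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    records = list(records)
    random.Random(seed).shuffle(records)  # fixed seed -> reproducible split
    n = len(records)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, val, test

# Example: 100 hypothetical (snippet, label) pairs for code classification.
data = [(f"snippet_{i}", i % 2) for i in range(100)]
train, val, test = split_dataset(data)
```

The validation portion is used to select fine-tuning hyperparameters (learning rate, number of epochs), while the test portion is held out for the final metrics reported in comparisons.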
Researchers in SE have applied many PTMs to various SE-related tasks because of the powerful learning ability of PTMs. We summarized and analyzed these SE-related tasks with PTMs. Based on the types of input data, we divided them into four types: programming language (PL) related tasks, natural language (NL) related tasks in the SE domain, tasks involving the interaction between PL and NL, and image-related tasks in the SE domain.
Fig.3 Distribution of downstream tasks with pre-trained models in software engineering
PL-related tasks solve problems by learning syntactic and semantic feature representations of source code. The main current tasks and the reported performance are listed in the following table.
PL

| Tasks | Sub-Tasks | PTMs | Accuracy | Precision | Recall | F1 | MAP | BLEU | EM | CodeBLEU | MCC | EditSIM | Number of fixed bugs |
|-------|-----------|------|----------|-----------|--------|----|-----|------|----|----------|-----|---------|----------------------|
| Code classification | Commit classification | BERT | 0.80 | 0.84 | 0.75 | 0.79 | | | | | | | |
| | | seBERT | | 0.87 | 0.85 | 0.84 | | | | | | | |
| | | BERToverflow | | 0.84 | 0.81 | 0.81 | | | | | | | |
| | | BERT-BASE | | 0.77 | 0.73 | 0.75 | | | | | | | |
| | Algorithm classification | CodeBERT | | 0.85 | | | 0.83 | | | | | | |
| | | RoBERTa | | 0.83 | | | 0.80 | | | | | | |
| | | COMBO | | | | | 0.74 | | | | | | |
| | | ResNet18 | 0.86 | 0.86 | 0.85 | 0.82 | | | | | | | |
| | | ResNet50 | 0.90 | 0.86 | 0.87 | 0.86 | | | | | | | |
| | Technical Debt Detection | BERT | | | | 0.82 | | | | | | | |
| | | BERT-SO-1M | | | | 0.82 | | | | | | | |
| | | StackOBERTflow | | | | 0.81 | | | | | | | |
| | | BERT-comments | | | | 0.81 | | | | | | | |
| | Vulnerability Detection | GraphCodeBERT | | 0.92 | 0.92 | 0.92 | | | | | | | |
| | | CuBERT | 0.72 | | | | | | | | | | |
| | | COMBO | | | | 0.67 | | | | | | | |
| | | C-BERT | 0.62 | | | | | | | | | | |
| | | CodeBERT | 0.64 | | | 0.54 | | | | | | | |
| | | PLBART | 0.63 | | | | | | | | | | |
| | | ResNet18 | | 0.89 | 0.89 | 0.89 | | | | | | | |
| | | ResNet50 | | 0.91 | 0.91 | 0.91 | | | | | | | |
| | Defect Detection | CodeT5-base | 0.66 | | | | | | | | | | |
| | | CodeT5-small | 0.63 | | | | | | | | | | |
| | | CodeT5 | 0.64 | | | 0.60 | | | | | 0.27 | | |
| | | PLBART | 0.63 | | | | | | | | | | |
| | | CuBERT | 0.95 | | | | | | | | | | |
| | | RoBERTa (code) | 0.61 | | | | | | | | | | |
| | | BERT | 0.76 | | | | | | | | | | |
| | | CodeBERT | 0.68 | | | 0.54 | | | | | 0.27 | | |
| | | CodeBERTa | 0.70 | | | 0.59 | | | | | 0.27 | | |
| | | GraphCodeBERT | 0.71 | | | | | | | | | | |
| | | CODE-MVP | 0.89 | | | | | | | | | | |
| | | SynCoBERT | 0.65 | | | | | | | | | | |
| Clone Detection | | RoBERTa | | 0.97 | 0.96 | 0.96 | | | | | | | |
| | | CodeBERT | 0.97 | 0.96 | 0.96 | 0.96 | 0.10 | | | | | | |
| | | GraphCodeBERT | 0.97 | 0.97 | 0.97 | 0.97 | | | | | | | |
| | | CodeT5-small | | | | 0.97 | | | | | | | |
| | | CodeT5-base | | | 0.95 | 0.97 | | | | | | | |
| | | RoBERTa (code) | | | | 0.95 | | | | | | | |
| | | PLBART | | | 0.95 | 0.97 | | | | | | | |
| | | code2vec | | 0.82 | 0.40 | 0.60 | | | | | | | |
| | | T5 | | | | | 0.70 | | | | | | |
| | | OSCAR | | | | | 0.49 | | | | | | |
| | | COMBO | | | 0.64 | | | | | | | | |
| | | InferCode | | 0.90 | 0.56 | 0.75 | | | | | | | |
| | | SynCoBERT | | 0.97 | 0.98 | 0.97 | 0.88 | | | | | | |
| | | SCodeR | | 0.95 | 0.96 | 0.95 | 0.92 | | | | | | |
| | | UniXcoder | | 0.98 | 0.93 | 0.95 | 0.91 | | | | | | |
| Program Repair | | CodeT5-base | | | | | | 0.77 | 0.22 | | | | |
| | | CodeT5-small | | | | | | 0.76 | 0.19 | | | | |
| | | RoBERTa | 0.75 | | | | | | | | | | |
| | | RoBERTa (code) | | | | | | 0.77 | 0.16 | | | | |
| | | CodeBERT | 0.72 | | | | | | | | | | |
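For reference, the classification metrics reported in the table above (accuracy, precision, recall, F1, and MCC) can all be derived from a binary confusion matrix. The sketch below shows the standard formulas in plain Python; the prediction and label vectors are illustrative, not taken from any cited study.

```python
import math

# Compute binary classification metrics from gold labels and predictions:
# accuracy, precision, recall, F1, and Matthews correlation coefficient (MCC).
def classification_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC balances all four confusion-matrix cells, which is why several
    # defect-detection studies report it alongside accuracy and F1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Illustrative labels: 3 true positives, 3 true negatives, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
m = classification_metrics(y_true, y_pred)
```

Generation-oriented rows in the table use different measures: BLEU scores n-gram overlap between generated and reference code or text, and EM (exact match) is the fraction of outputs identical to the reference.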