In recent years, deep learning (DL) has achieved excellent performance in software engineering (SE) tasks. However, this performance relies on large-scale training sets, which hinders the application of DL techniques in practical tasks. With the release of pre-trained models (PTMs) in the deep learning field, researchers in SE have begun to pay attention to PTMs and have introduced them into SE tasks. PTMs have brought a qualitative leap in SE tasks, ushering intelligent software engineering into a new era. However, no existing study has distilled the successes, failures, and opportunities of pre-trained models in SE. To clarify the work in this cross-field (PTM4SE: pre-trained models for software engineering), we systematically review the current studies related to PTM4SE. Specifically, we first describe the framework of intelligent software engineering methods based on pre-trained models. We then analyze and discuss the pre-trained models commonly used in SE. Meanwhile, we introduce the downstream SE tasks that use pre-trained models in detail, and compare and analyze the performance of pre-trained model techniques on these tasks. We then present the datasets used in SE for training and fine-tuning PTMs. Finally, we discuss the challenges and opportunities for PTM4SE.
To address the problem that DL-based intelligent SE methods require large amounts of labeled data, researchers in SE have proposed many PTM-based methods for SE tasks (i.e., pre-trained model-based intelligent software engineering methods). These methods use a small amount of labeled data from an SE downstream task to fine-tune an intelligent model built on an existing PTM, yielding a PTM4SE method that solves the downstream task (e.g., code generation, program repair, and issue report classification).
The construction process mainly consists of four parts: (1) collecting and processing the SE downstream task data, (2) constructing the intelligent method based on a pre-trained model, (3) training (fine-tuning) the model, and (4) evaluating the model.
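The four parts above can be sketched in plain Python. This is a deliberately tiny, illustrative stand-in: the keyword-scoring "encoder", the toy issue reports, and the threshold "fine-tuning" below are all hypothetical, substituting for a real PTM such as CodeBERT and a real task head.

```python
# Schematic of the four-part PTM4SE pipeline: data processing, method
# construction on a pre-trained component, fine-tuning, and evaluation.
# Everything here is illustrative, not from any cited study.

# Part 1: collect and process a small labeled downstream dataset
# (toy issue-report classification; label 1 = bug report).
raw = [
    ("app crashes on startup", 1),
    ("crash when saving file", 1),
    ("add dark mode please", 0),
    ("feature request: export to csv", 0),
]

def preprocess(text):
    return text.lower().split()

# Part 2: build the method on top of a frozen "pre-trained" component.
# The stand-in encoder maps tokens to one feature: a bug-likelihood score.
PRETRAINED_BUG_WORDS = {"crash", "crashes", "error", "exception"}

def encode(tokens):
    return sum(t in PRETRAINED_BUG_WORDS for t in tokens)

# Part 3: "fine-tune" a lightweight task head (here just a decision
# threshold) on the small labeled dataset, keeping the encoder frozen.
def fine_tune(dataset):
    best_t, best_acc = 0, 0.0
    for t in range(0, 3):
        acc = sum((encode(preprocess(x)) > t) == bool(y)
                  for x, y in dataset) / len(dataset)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Part 4: evaluate the resulting model on unseen inputs.
threshold = fine_tune(raw)

def predict(text):
    return int(encode(preprocess(text)) > threshold)
```

In a real PTM4SE method, `encode` would be a pre-trained Transformer and `fine_tune` would update a classification head (and optionally the encoder) by gradient descent; the control flow, however, is the same.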
Fig.1 Research framework of intelligent software engineering methods based on pre-trained models
Since 2018, researchers in SE have begun to introduce different types of PTMs into SE-related tasks. We collected the intelligent software engineering studies that use PTMs and divided the models into three types: off-the-shelf models, domain-specific models, and source code models.
Fig.2 Distribution of pre-trained models used in software engineering
Off-the-shelf models are pre-trained models trained on general-domain datasets in the DL field, e.g., the BERT, GPT, and XLNet models, which are pre-trained on English Wikipedia and general news datasets in natural language processing (NLP), and the ResNet and VGG models, which are pre-trained on the ImageNet dataset in computer vision (CV). Thus, we divide the off-the-shelf models into two categories: off-the-shelf models in NLP and off-the-shelf models in CV.
Domain-specific models are pre-trained models trained on SE-specific datasets (e.g., GitHub, Stack Overflow, and JIRA). In recent years, researchers in SE have collected a large number of SE-specific datasets to re-train DL models, such as seBERT, the Text-To-Text Transfer Transformer (T5) model, Word2Vec-SO, BERT-reviews, BERT-SO-1M, BERT-SO-1M-Large, and RoBERTa-SO in Fig. 2.
Source code models are pre-trained models trained on source code to capture the syntactic and semantic information it contains. To date, researchers in SE have collected source code in different programming languages to re-train DL models, such as Code2Vec, CodeT5, CodeBERT, GraphCodeBERT, C-BERT, CuBERT, PLBART, OSCAR, InferCode, and DOBF in Fig. 2.
Datasets, as one of the key components of PTMs, affect the performance of PTMs on SE-related tasks. To achieve higher performance with intelligent software engineering methods, researchers in SE have collected different types of SE datasets to train or fine-tune the models. To present and understand the current SE datasets, we summarized and analyzed them, and divided them into PTM datasets and SE-related downstream task datasets.
PTM datasets are datasets used to train a DL model from scratch. The PTM datasets frequently used in SE are listed in the table below, and are also collected in the dataset files in this repository.
| Type | Dataset | Programming Language | Source | Scale | Open Time | PTMs |
|------|---------|----------------------|--------|-------|-----------|------|
| PL | | Ruby/JavaScript/Go/Python/Java/PHP/C/C# | GitHub+BigQuery | 8.35G | 2021 | CodeT5 |
| PL | GitHub C language repositories | C | GitHub | 5.8G | 2020 | C-BERT |
| PL | Java and TypeScript datasets | Java/TypeScript | GitHub | | 2020 | CugLM |
| PL | Java datasets | Java | GitHub | | 2021 | SynFix |
| PL | | Java/C/C++ | AtCoder+CodeJam | 17.6M | 2019 | IR-BERT |
| PL | | Java | GitHub | 32G | 2020 | InferCode, Code2vec |
| PL | | Python | GitHub | 190M | 2020 | CodeTrek |
| PL | unique Python files | Python | GitHub | 159GB | 2021 | CodeX |
| PL | | Java | GitHub | 4.7M | 2020 | Coder |
| PL | Python and Java pre-training corpus | Java/Python | GitHub | 21.3M | 2021 | CuBERT/TreeBERT |
| NL | | English | Stack Overflow+GitHub+Jira | 119.7G | 2021 | seBERT |
| NL | | English | Stack Overflow | 3.16M | 2020 | BERTOverflow, CosSensBERT |
| NL+PL | | Java/English | GitHub+SO | 52.5M | 2022 | T5 |
| NL+PL | | Java/English | GitHub | 1.5M | 2021 | T5 |
| NL+PL | | Java/Python/English | BigQuery+GitHub | 655G | 2021 | PLBART |
| NL+PL | | Ruby/JavaScript/Go/Python/Java/PHP/English | GitHub | 3.5G | 2019 | T-BERT, GraphCodeBERT, CodeBERT |
| NL+PL | Python corpus of CodeSearchNet dataset | Python | GitHub | 1.6M | 2019 | CLAWSAT/CODE-MVP |
| NL+PL | Java corpus of CodeSearchNet dataset | Java | GitHub | 2.0M | 2019 | CLAWSAT |
| NL+PL | AnghaBench | C | GitHub | 0.53M | 2020 | COMBO |
SE-related downstream task datasets are the datasets used to fine-tune intelligent DL models for SE-related downstream tasks. Commonly used SE-related downstream datasets are listed in the following table.
| Type | Tasks | Dataset | Programming Language | Scale | Open Time |
|------|-------|---------|----------------------|-------|-----------|
| PL | Code Classification | | Java | 102041 | 2020 |
| PL | Code Classification | POJ104 | C/C++ | 30815 | 2021 |
| PL | Code Classification | CodeCloneBench | Java | 901028 | 2014 |
| PL | Code Classification | | | | 2016 |
| PL | Code Classification | | | 47398 | 2020 |
| PL | Code Classification | | C | 298917 | 2021 |
| PL | Code Classification | | C/C++ | 13600 | 2005 |
| PL | Code Classification | | C/C++ | 4919 | 2006 |
| PL | Code Classification | | C | 27318 | 2021 |
| PL | Code Classification | Merge Conflicts Dataset | C#/JavaScript/TypeScript/Java | 219934 | 2022 |
| PL | Code Classification | | Java/C#/C++/Python/JavaScript | 2250000 | 2022 |
| PL | Code Classification | | C/C++ | 61638 | 2018 |
| PL | Code Classification | | C/C++ | 1274366 | 2018 |
| PL | Code Classification | | C/C++ | 18169 | 2020 |
| PL | Code Classification | | C/C++ | 181641 | 2019 |
| PL | Code Classification | | C/C++ | 1295623 | 2021 |
| PL | Program Repair | | Java | 63923 | 2021 |
| PL | Program Repair | | Java | 46680 | 2019 |
| PL | Program Repair | | JavaScript | 104804 | 2021 |
| PL | Program Repair | | Python/Java | | 2017 |
| PL | Program Repair | | Java/Python/C/JavaScript | 9675342 | 2020 |
| PL | Program Repair | | JavaScript | | 2016 |
| PL | Program Repair | | C | 10468 | 2015 |
| PL | Code Completion | Java and TypeScript datasets | Java/TypeScript | | 2020 |
| PL | Code Completion | | Python | 74749 | 2020 |
| PL | API Recommendation | Req2Lib-dataset | Java | 5625 | 2020 |
| PL | Code Translation | | Java/C# | 10300 | 2021 |
| PL | Code Translation | | Python | 240000 | 2021 |
| NL | Text Classification | | English | | 2012 |
| NL | Text Classification | | English | 1793 | 2021 |
| NL | Text Classification | | English | | 2021 |
| NL | Text Classification | | | 10096 | 2019 |
| NL | Review Responses Automatic Generation | review-response pairs datasets | English | 570881 | 2020 |
| NL | Link Prediction | | English | 1834 | 2021 |
| PL+NL | Code Generation | | Java | 100000 | 2018 |
| PL+NL | Code Generation | DJANGO | Python | 18805 | 2015 |
| PL+NL | Code Generation | | Python | 13946 | 2019 |
| PL+NL | Code Generation | | Python | 974 | 2021 |
| PL+NL | Code Generation | | SQL | 5693 | 2018 |
| PL+NL | Code Generation | | Python | 232421 | 2021 |
| PL+NL | Code Generation | | C++/Python/Java | 13610 | 2022 |
| PL+NL | Code Generation | | Python | | 2021 |
| PL+NL | Code Summarization | | | 1600 | 2017 |
| PL+NL | Code Summarization | | Java | 1953940 | 2020 |
| PL+NL | Code Summarization | | Ruby/JavaScript/Go/Python/Java/PHP | | 2019 |
| PL+NL | Code Summarization | Java projects from GitHub | Java | 134239 | |
| PL+NL | Code Summarization | | Python | 30000 | 2016 |
| PL+NL | Code Search | | Python | 280634 | 2021 |
| PL+NL | Code Search | | Python/Java | 79809 | 2018 |
| PL+NL | Code Search | | Python | 13250 | 2020 |
| PL+NL | Code Search | | Python | 147546 | 2018 |
| PL+NL | Code Search | | Python | 20604 | 2021 |
| PL+NL | Code Review | | Python/Java/Go/C++/JavaScript/C/C#/PHP/Ruby | | 2022 |
| PL+NL | Synthesis | | | | 2021 |
| CV | UML Diagram Classification | | UML Diagram | 14815 | 2016 |
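Before fine-tuning, a downstream-task dataset such as those listed above is typically partitioned into training, validation, and test portions. The sketch below shows one common way to do this with only the standard library; the 80/10/10 ratios and the toy records are illustrative choices, not prescribed by any study in the survey.

```python
import random

# Shuffle a labeled downstream dataset reproducibly, then split it into
# train/validation/test portions for fine-tuning and evaluation.
def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    records = list(records)
    random.Random(seed).shuffle(records)  # fixed seed -> reproducible split
    n = len(records)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, val, test

# Example: 100 hypothetical (snippet, label) pairs for code classification.
data = [(f"snippet_{i}", i % 2) for i in range(100)]
train, val, test = split_dataset(data)
```

The validation portion is used to select fine-tuning hyperparameters (learning rate, number of epochs), while the test portion is held out for the final metrics reported in comparisons.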
Researchers in SE have applied many PTMs to various SE-related tasks because of the powerful learning ability of PTMs. We summarized and analyzed these SE-related tasks with PTMs. Based on the types of input data, we divided them into four types: programming language (PL) related tasks, natural language (NL) related tasks in the SE domain, tasks involving the interaction between PL and NL, and image-related tasks in the SE domain.
Fig.3 Distribution of downstream tasks with pre-trained models in software engineering
PL-related tasks solve problems by learning syntactic and semantic feature representations of source code. The main current tasks and the reported performance are listed in the following table.
PL

| Tasks | Sub-Tasks | PTMs | Accuracy | Precision | Recall | F1 | MAP | BLEU | EM | CodeBLEU | MCC | EditSIM | Number of fixed bugs |
|-------|-----------|------|----------|-----------|--------|----|-----|------|----|----------|-----|---------|----------------------|
| Code classification | Commit classification | BERT | 0.80 | 0.84 | 0.75 | 0.79 | | | | | | | |
| | | seBERT | | 0.87 | 0.85 | 0.84 | | | | | | | |
| | | BERToverflow | | 0.84 | 0.81 | 0.81 | | | | | | | |
| | | BERT-BASE | | 0.77 | 0.73 | 0.75 | | | | | | | |
| | Algorithm classification | CodeBERT | | 0.85 | | | 0.83 | | | | | | |
| | | RoBERTa | | 0.83 | | | 0.80 | | | | | | |
| | | COMBO | | | | | 0.74 | | | | | | |
| | | ResNet18 | 0.86 | 0.86 | 0.85 | 0.82 | | | | | | | |
| | | ResNet50 | 0.90 | 0.86 | 0.87 | 0.86 | | | | | | | |
| | Technical Debt Detection | BERT | | | | 0.82 | | | | | | | |
| | | BERT-SO-1M | | | | 0.82 | | | | | | | |
| | | StackOBERTflow | | | | 0.81 | | | | | | | |
| | | BERT-comments | | | | 0.81 | | | | | | | |
| | Vulnerability Detection | GraphCodeBERT | | 0.92 | 0.92 | 0.92 | | | | | | | |
| | | CuBERT | 0.72 | | | | | | | | | | |
| | | COMBO | | | | 0.67 | | | | | | | |
| | | C-BERT | 0.62 | | | | | | | | | | |
| | | CodeBERT | 0.64 | | | 0.54 | | | | | | | |
| | | PLBART | 0.63 | | | | | | | | | | |
| | | ResNet18 | | 0.89 | 0.89 | 0.89 | | | | | | | |
| | | ResNet50 | | 0.91 | 0.91 | 0.91 | | | | | | | |
| | Defect Detection | CodeT5-base | 0.66 | | | | | | | | | | |
| | | CodeT5-small | 0.63 | | | | | | | | | | |
| | | CodeT5 | 0.64 | | | 0.60 | | | | | 0.27 | | |
| | | PLBART | 0.63 | | | | | | | | | | |
| | | CuBERT | 0.95 | | | | | | | | | | |
| | | RoBERTa (code) | 0.61 | | | | | | | | | | |
| | | BERT | 0.76 | | | | | | | | | | |
| | | CodeBERT | 0.68 | | | 0.54 | | | | | 0.27 | | |
| | | CodeBERTa | 0.70 | | | 0.59 | | | | | 0.27 | | |
| | | GraphCodeBERT | 0.71 | | | | | | | | | | |
| | | CODE-MVP | 0.89 | | | | | | | | | | |
| | | SynCoBERT | 0.65 | | | | | | | | | | |
| Clone Detection | | RoBERTa | | 0.97 | 0.96 | 0.96 | | | | | | | |
| | | CodeBERT | 0.97 | 0.96 | 0.96 | 0.96 | 0.10 | | | | | | |
| | | GraphCodeBERT | 0.97 | 0.97 | 0.97 | 0.97 | | | | | | | |
| | | CodeT5-small | | | | 0.97 | | | | | | | |
| | | CodeT5-base | | | 0.95 | 0.97 | | | | | | | |
| | | RoBERTa (code) | | | | 0.95 | | | | | | | |
| | | PLBART | | | 0.95 | 0.97 | | | | | | | |
| | | code2vec | | 0.82 | 0.40 | 0.60 | | | | | | | |
| | | T5 | | | | | 0.70 | | | | | | |
| | | OSCAR | | | | | 0.49 | | | | | | |
| | | COMBO | | | 0.64 | | | | | | | | |
| | | InferCode | | 0.90 | 0.56 | 0.75 | | | | | | | |
| | | SynCoBERT | | 0.97 | 0.98 | 0.97 | 0.88 | | | | | | |
| | | SCodeR | | 0.95 | 0.96 | 0.95 | 0.92 | | | | | | |
| | | UniXcoder | | 0.98 | 0.93 | 0.95 | 0.91 | | | | | | |
| Program Repair | | CodeT5-base | | | | | | 0.77 | 0.22 | | | | |
| | | CodeT5-small | | | | | | 0.76 | 0.19 | | | | |
| | | RoBERTa | 0.75 | | | | | | | | | | |
| | | RoBERTa (code) | | | | | | 0.77 | 0.16 | | | | |
| | | CodeBERT | 0.72 | | | | | | | | | | |
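For reference, the classification metrics reported in the table above (accuracy, precision, recall, F1, and MCC) can all be derived from a binary confusion matrix. The sketch below shows the standard formulas in plain Python; the prediction and label vectors are illustrative, not taken from any cited study.

```python
import math

# Compute binary classification metrics from gold labels and predictions:
# accuracy, precision, recall, F1, and Matthews correlation coefficient (MCC).
def classification_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC balances all four confusion-matrix cells, which is why several
    # defect-detection studies report it alongside accuracy and F1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Illustrative labels: 3 true positives, 3 true negatives, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
m = classification_metrics(y_true, y_pred)
```

Generation-oriented rows in the table use different measures: BLEU scores n-gram overlap between generated and reference code or text, and EM (exact match) is the fraction of outputs identical to the reference.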