In the Python ecosystem, idiomatic constructs have been widely adopted because of their expressiveness, which increases productivity and, in some cases, efficiency, despite ongoing debate about their familiarity and understandability. Recent research has proposed approaches, based on static code analysis and transformation, to automatically identify refactoring opportunities and turn non-idiomatic code into idiomatic code. Given the potential recently shown by Large Language Models (LLMs) for code-related tasks, in this paper we present the results of a replication study investigating the effectiveness of GPT-4 in recommending idiomatic refactoring actions. Our results reveal that GPT-4 not only identifies idiomatic constructs effectively but frequently exceeds the benchmark, proposing refactoring actions where the existing baseline failed. A manual analysis of a random sample confirms the correctness of the obtained recommendations. Overall, our findings underscore the potential of LLMs for tasks that previously required recommenders built on complex code analyses.
- Unzip both `Data.zip` and `Results.zip`.
- The `Data` directory contains the benchmark files, which can be found at https://github.com/anonymousdouble/PythonicIdiomsRefactoring/.
- The `Results` directory contains the output of the generation process (described below).
- To generate Pythonic idioms using GPT-4, run the `main.py` file (a minimal sketch of this step is shown below).
- For performance evaluation, execute the `metrics.py` file after the generation process is complete.

Note: Refer to the documentation within each file before running it.
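For illustration, the following minimal sketch shows how a non-idiomatic method could be submitted to GPT-4 for idiomatic refactoring through the OpenAI Python SDK. It is not the actual implementation of `main.py`; the prompt wording, the model identifier, and the example method are assumptions.

```python
# Illustrative sketch only: NOT the code in main.py.
# Requires the openai package (>= 1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

# Hypothetical non-idiomatic method to be refactored.
method_content = """
def squares(values):
    result = []
    for v in values:
        result.append(v * v)
    return result
"""

prompt = (
    "Refactor the following Python method so that it uses Pythonic idioms "
    "(e.g., list comprehensions) where applicable, and return the full "
    "refactored method:\n\n" + method_content
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

# The answer contains the refactored code plus an explanation.
print(response.choices[0].message.content)
```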
The results of the generation are stored in the `Results` directory. Its structure is as follows:
The `all_refactorings` directory contains all generation output files. Each file corresponds to a specific Pythonic idiom and is structured as follows:

- `file_html`: URL of the original code.
- `method_content`: Code of the original method.
- `file_name`: Name of the file containing the original method.
- `lineno`: Starting line of the method in the file.
- `old_code`: Code within `method_content` that needs refactoring.
- `bench_code`: Refactored code proposed in the benchmark.
- `count_bench`: Number of refactorings proposed by the benchmark.
- `gpt_code`: Refactored code generated by GPT-4.
- `count_gpt`: Number of refactorings proposed by GPT-4.
- `text`: GPT-4's response excluding the code (text only).
- `answer`: Complete response from GPT-4 (code and text).
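As an example of how these records could be inspected, the sketch below assumes the output files are serialized as JSON; the file name and the format are assumptions, not guaranteed by the package.

```python
import json

# Hypothetical file name and format: the package may store records differently.
with open("Results/all_refactorings/For_loop.json", encoding="utf-8") as fp:
    records = json.load(fp)

for record in records:
    # Compare the benchmark's refactoring with the one produced by GPT-4.
    print("Source file:", record["file_name"], "line", record["lineno"])
    print("Benchmark refactorings:", record["count_bench"])
    print("GPT-4 refactorings:", record["count_gpt"])
    print("GPT-4 code:\n", record["gpt_code"])
```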
A second directory contains filtered output files, categorized by the number of refactorings proposed:

- `bench_more`: Cases where the benchmark proposed more refactorings than GPT-4.
- `equals`: Cases where GPT-4 and the benchmark proposed the same number of refactorings.
- `gpt_more`: Cases where GPT-4 proposed more refactorings than the benchmark.
- `zero`: Cases where GPT-4 did not propose any refactoring.
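This categorization can be derived from the `count_gpt` and `count_bench` fields of each record. The helper below is only a sketch of that logic; the function name and the precedence given to the `zero` category are our assumptions.

```python
def categorize(record: dict) -> str:
    """Assign a record to one of the filtered groups based on the number of
    refactorings proposed by GPT-4 and by the benchmark (illustrative only)."""
    count_gpt = record["count_gpt"]
    count_bench = record["count_bench"]
    if count_gpt == 0:
        return "zero"        # GPT-4 proposed no refactoring
    if count_gpt > count_bench:
        return "gpt_more"    # GPT-4 proposed more refactorings
    if count_gpt < count_bench:
        return "bench_more"  # the benchmark proposed more refactorings
    return "equals"          # both proposed the same number

# Example: GPT-4 found two refactorings, the benchmark one.
print(categorize({"count_gpt": 2, "count_bench": 1}))  # -> "gpt_more"
```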
A third directory contains two `.csv` files, one for GPT-4 and one for the benchmark. Each file contains the same columns as in `all_refactorings`, with the following additional columns:

- `correct`: Number of refactorings that are correct.
- `wrong`: Number of refactorings that are incorrect.
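One way to summarize these files is to relate the `correct` and `wrong` counts, as in the sketch below; the file name is hypothetical, and `metrics.py` may compute its measures differently.

```python
import pandas as pd

# Hypothetical file name: one of the two manual-analysis CSV files.
df = pd.read_csv("gpt4_manual_analysis.csv")

total_correct = df["correct"].sum()
total_wrong = df["wrong"].sum()

# Fraction of proposed refactorings judged correct in the manual analysis.
precision = total_correct / (total_correct + total_wrong)
print(f"Correct: {total_correct}, Wrong: {total_wrong}, Precision: {precision:.2%}")
```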