In the Python ecosystem, idiomatic constructs have been widely adopted because of their expressiveness, which increases productivity and, in some cases, efficiency, despite ongoing debate about their familiarity and understandability. Recent research has proposed approaches, based on static code analysis and transformation, to automatically identify refactoring opportunities and turn non-idiomatic code into idiomatic code. Given the potential recently shown by Large Language Models (LLMs) for code-related tasks, in this paper we present the results of a replication study investigating the effectiveness of GPT-4 in recommending idiomatic refactoring actions. Our results reveal that GPT-4 not only identifies idiomatic constructs effectively but frequently exceeds the benchmark, proposing refactoring actions where the existing baseline failed. A manual analysis of a random sample confirms the correctness of the obtained recommendations. Overall, our findings underscore the potential of LLMs for tasks that previously required recommenders built on complex code analyses.
- Unzip both `Data.zip` and `Results.zip`.
- The `Data` directory contains the benchmark files, which can be found at https://github.com/anonymousdouble/PythonicIdiomsRefactoring/.
- The `Results` directory contains the output of the generation process (described below).
- To generate Pythonic idioms using GPT-4, run the `main.py` file (a minimal sketch of this step is shown below).
- For performance evaluation, execute the `metrics.py` file after the generation process is complete.

Note: Refer to the documentation within each file before running it.
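For illustration, the following minimal sketch shows how a non-idiomatic method could be submitted to GPT-4 for idiomatic refactoring through the OpenAI Python SDK. It is not the actual implementation of `main.py`; the prompt wording, the model identifier, and the example method are assumptions.

```python
# Illustrative sketch only: NOT the code in main.py.
# Requires the openai package (>= 1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

# Hypothetical non-idiomatic method to be refactored.
method_content = """
def squares(values):
    result = []
    for v in values:
        result.append(v * v)
    return result
"""

prompt = (
    "Refactor the following Python method so that it uses Pythonic idioms "
    "(e.g., list comprehensions) where applicable, and return the full "
    "refactored method:\n\n" + method_content
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

# The answer contains the refactored code plus an explanation.
print(response.choices[0].message.content)
```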
The results of the generation are stored in the `Results` directory. Its structure is as follows:
The `all_refactorings` directory contains all generation output files. Each file corresponds to a specific Pythonic idiom and is structured as follows:

- `file_html`: URL of the original code.
- `method_content`: Code of the original method.
- `file_name`: Name of the file containing the original method.
- `lineno`: Starting line of the method in the file.
- `old_code`: Code within `method_content` that needs refactoring.
- `bench_code`: Refactored code proposed in the benchmark.
- `count_bench`: Number of refactorings proposed by the benchmark.
- `gpt_code`: Refactored code generated by GPT-4.
- `count_gpt`: Number of refactorings proposed by GPT-4.
- `text`: GPT-4's response excluding the code (text only).
- `answer`: Complete response from GPT-4 (code and text).
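As an example of how these records could be inspected, the sketch below assumes the output files are serialized as JSON; the file name and the format are assumptions, not guaranteed by the package.

```python
import json

# Hypothetical file name and format: the package may store records differently.
with open("Results/all_refactorings/For_loop.json", encoding="utf-8") as fp:
    records = json.load(fp)

for record in records:
    # Compare the benchmark's refactoring with the one produced by GPT-4.
    print("Source file:", record["file_name"], "line", record["lineno"])
    print("Benchmark refactorings:", record["count_bench"])
    print("GPT-4 refactorings:", record["count_gpt"])
    print("GPT-4 code:\n", record["gpt_code"])
```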
A second directory contains filtered output files, categorized by the number of refactorings proposed:

- `bench_more`: Cases where the benchmark proposed more refactorings than GPT-4.
- `equals`: Cases where GPT-4 and the benchmark proposed the same number of refactorings.
- `gpt_more`: Cases where GPT-4 proposed more refactorings than the benchmark.
- `zero`: Cases where GPT-4 did not propose any refactoring.
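This categorization can be derived from the `count_gpt` and `count_bench` fields of each record. The helper below is only a sketch of that logic; the function name and the precedence given to the `zero` category are our assumptions.

```python
def categorize(record: dict) -> str:
    """Assign a record to one of the filtered groups based on the number of
    refactorings proposed by GPT-4 and by the benchmark (illustrative only)."""
    count_gpt = record["count_gpt"]
    count_bench = record["count_bench"]
    if count_gpt == 0:
        return "zero"        # GPT-4 proposed no refactoring
    if count_gpt > count_bench:
        return "gpt_more"    # GPT-4 proposed more refactorings
    if count_gpt < count_bench:
        return "bench_more"  # the benchmark proposed more refactorings
    return "equals"          # both proposed the same number

# Example: GPT-4 found two refactorings, the benchmark one.
print(categorize({"count_gpt": 2, "count_bench": 1}))  # -> "gpt_more"
```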
A third directory contains two `.csv` files, one for GPT-4 and one for the benchmark. Each file contains the same columns as in `all_refactorings`, with the following additional columns:

- `correct`: Number of refactorings that are correct.
- `wrong`: Number of refactorings that are incorrect.
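One way to summarize these files is to relate the `correct` and `wrong` counts, as in the sketch below; the file name is hypothetical, and `metrics.py` may compute its measures differently.

```python
import pandas as pd

# Hypothetical file name: one of the two manual-analysis CSV files.
df = pd.read_csv("gpt4_manual_analysis.csv")

total_correct = df["correct"].sum()
total_wrong = df["wrong"].sum()

# Fraction of proposed refactorings judged correct in the manual analysis.
precision = total_correct / (total_correct + total_wrong)
print(f"Correct: {total_correct}, Wrong: {total_wrong}, Precision: {precision:.2%}")
```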