Paper ID: 2025EDP7092
As one of the most widely used programming languages in modern software development, Python hosts a vast open-source codebase on GitHub, where code reuse is widespread. This study leverages open-source Python projects from GitHub and applies automated testing to discover pairs of functionally equivalent methods. We collected and processed methods from 5,100 Python repositories; however, Python's lack of static type checking posed unique challenges for grouping these methods by signature. To address this, we performed detailed type inference and organized the methods by their inferred types, providing a structured foundation for subsequent analysis. We then employed automated test generation to produce unit tests for each method and executed each method's tests against the other methods in its group, identifying candidate pairs that produce identical outputs for the same inputs. Through manual verification, we ultimately identified 68 functionally equivalent method pairs and 683 functionally non-equivalent pairs, which we compiled into a comprehensive dataset serving as the basis for further examination. Using this dataset, we evaluated the ability of large language models (LLMs) to recognize functional equivalence, assessing both their accuracy and the challenges posed by diverse implementations, and we also conducted a systematic performance comparison of the equivalent methods, measuring execution times and analyzing the underlying causes of efficiency differences. The findings demonstrate the potential of LLMs to identify functionally equivalent methods and highlight areas requiring further advancement.
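To make the cross-execution step concrete, the following is a minimal Python sketch, not the authors' implementation, of how generated test inputs might be run against every pair of methods within a type-compatible group to flag candidate equivalent pairs; all function and variable names here are hypothetical.

```python
# Minimal sketch of cross-testing methods within a type-compatible group.
# Assumes methods have already been grouped by inferred type signature.
from itertools import combinations

def cross_test(group, test_inputs):
    """Flag method pairs in a group that produce identical outputs
    on every shared test input."""
    candidates = []
    for f, g in combinations(group, 2):
        equivalent = True
        for args in test_inputs:
            try:
                if f(*args) != g(*args):
                    equivalent = False
                    break
            except Exception:
                # This sketch conservatively treats any raised
                # exception as evidence of non-equivalence.
                equivalent = False
                break
        if equivalent:
            candidates.append((f.__name__, g.__name__))
    return candidates

# Toy example: two implementations grouped under the inferred
# signature (list[int]) -> int.
def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

if __name__ == "__main__":
    inputs = [([],), ([1, 2, 3],), ([-5, 5],)]
    print(cross_test([sum_loop, sum_builtin], inputs))
    # -> [('sum_loop', 'sum_builtin')]
```

As in the study's pipeline, pairs flagged this way are only candidates; agreement on a finite set of generated inputs does not prove equivalence, which is why manual verification follows this step.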